Palgol: A High-Level DSL for Vertex-Centric Graph Processing with Remote Data Access
Yongzhe Zhang
National Institute of Informatics, Tokyo, Japan
[email protected]

Hsiang-Shang Ko
National Institute of Informatics, Tokyo, Japan
[email protected]

Zhenjiang Hu
National Institute of Informatics, Tokyo, Japan
[email protected]
ABSTRACT
Pregel is a popular distributed computing model for dealing with large-scale graphs. However, it can be tricky to implement graph algorithms correctly and efficiently in Pregel's vertex-centric model, especially when the algorithm has multiple computation stages, complicated data dependencies, or even communication over dynamic internal data structures. Some domain-specific languages (DSLs) have been proposed to provide more intuitive ways to implement graph algorithms, but due to the lack of support for remote access (reading or writing attributes of other vertices through references), they cannot handle the above-mentioned dynamic communication, making a class of Pregel algorithms with fast convergence impossible to implement.

To address this problem, we design and implement Palgol, a more declarative and powerful DSL which supports remote access. In particular, programmers can use a more declarative syntax called chain access to naturally specify dynamic communication as if directly reading data on arbitrary remote vertices. By analyzing the logic patterns of chain access, we provide a novel algorithm for compiling Palgol programs to efficient Pregel code. We demonstrate the power of Palgol by using it to implement several practical Pregel algorithms, and the evaluation result shows that the efficiency of Palgol is comparable with that of hand-written code.
1. INTRODUCTION
The rapid increase of graph data calls for efficient analysis on massive graphs. Google's Pregel [14] is one of the most popular frameworks for processing large-scale graphs. It is based on the bulk-synchronous parallel (BSP) model [21], and adopts the vertex-centric computing paradigm to achieve high parallelism and scalability. Following the BSP model, a Pregel computation is split into supersteps mediated by message passing.
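To make the superstep-and-barrier structure concrete, the following is a minimal Python sketch of a BSP-style driver in the spirit of Pregel (the names `run_pregel` and `compute` are ours for illustration; the real Pregel API is a C++ `Vertex` class with a `Compute()` method):

```python
from collections import defaultdict

def run_pregel(vertices, compute, max_supersteps=50):
    # Barrier-synchronized driver: messages produced in one superstep
    # are delivered only at the start of the next superstep.
    inbox = {}
    for _ in range(max_supersteps):
        outbox = defaultdict(list)
        for vid, state in vertices.items():
            for dest, msg in compute(vid, state, inbox.get(vid, [])):
                outbox[dest].append(msg)
        if not outbox:           # no messages in flight: computation halts
            break
        inbox = outbox
    return vertices

def compute(vid, state, msgs):
    # Example compute(): propagate the maximum value seen so far, so
    # every vertex ends up with the maximum of its connected component.
    new = max([state["val"]] + msgs)
    if new != state["val"] or state["fresh"]:
        state["val"], state["fresh"] = new, False
        return [(w, new) for w in state["edges"]]
    return []

# A 3-vertex path 0 - 1 - 2 with initial values 3, 1, 2.
graph = {0: {"val": 3, "edges": [1], "fresh": True},
         1: {"val": 1, "edges": [0, 2], "fresh": True},
         2: {"val": 2, "edges": [1], "fresh": True}}
run_pregel(graph, compute)
```

Each iteration of the outer loop plays the role of one superstep; swapping `outbox` into `inbox` models the global barrier that delivers messages to their receivers.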
Within each superstep, all the vertices execute the same user-defined function compute() in parallel, where each vertex can read the messages sent to it in the previous superstep, modify its own state, and send messages to other vertices. Global barrier synchronization happens at the end of each superstep, delivering messages to their designated receivers before the next superstep. Despite its simplicity, Pregel has demonstrated its usefulness in implementing many interesting graph algorithms [14, 15, 18, 22, 24].

Despite the power of Pregel, it is a big challenge to implement a graph algorithm correctly and efficiently in it [24], especially when the algorithm consists of multiple stages and complicated data dependencies. For such algorithms, programmers need to write an exceedingly complicated compute() function as the loop body, which encodes all the stages of the algorithm. Message passing makes the code even harder to maintain, because one has to trace where the messages are from and what information they carry in each superstep. Some attempts have been made to ease Pregel programming by proposing domain-specific languages (DSLs), such as Green-Marl [10] and Fregel [4]. These DSLs allow programmers to write a program in a compositional way to avoid writing a complicated loop body, and provide neighboring data access to avoid explicit message passing. Furthermore, programs written in these DSLs can be automatically translated to Pregel by fusing the components in the programs into a single loop, and mapping neighboring data access to message passing. However, for efficient implementation, the existing DSLs impose a severe restriction on data access: each vertex can only access data on its neighboring vertices.
In other words, they do not support general remote data access: reading or writing attributes of other vertices through references.

Remote data access is, however, important for describing a class of Pregel algorithms that aim to accelerate information propagation (which is a crucial issue in handling graphs with large diameters [24]) by maintaining a dynamic internal structure for communication. For instance, a parallel pointer jumping algorithm maintains a tree (or list) structure in a distributed manner by letting each vertex store a reference to its current parent (or predecessor); during the computation, every vertex constantly exchanges data with the current parent (or predecessor) and modifies the reference to reach the root vertex (or the head of the list). Such computational patterns can be found in algorithms like the Shiloach-Vishkin connected component algorithm [24] (see Section 2.4 for more details), the list ranking algorithm (see Section 2.5) and Chung and Condon's minimum spanning forest (MSF) algorithm [3]. However, these computational patterns cannot be implemented with only neighboring access, and therefore cannot be expressed in any of the existing high-level DSLs.

It is, in fact, hard to equip DSLs with efficient remote reading. First, when translated into Pregel's message passing model, remote reads require multiple rounds of communication to exchange information between the reading vertex and the remote vertex, and it is not obvious how the communication cost can be minimized. Second, remote reads would introduce more involved data dependencies, making it difficult to fuse program components into a single loop. Things become more complicated when there is chain access, where a remote vertex is reached by following a series of references. Furthermore, it is even harder to equip DSLs with remote writes in addition to remote reads.
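The pointer-jumping pattern behind these algorithms is easy to state outside Pregel: every vertex simultaneously replaces its parent pointer with its grandparent's pointer, so the depth of the structure roughly halves each round. A sequential sketch of one round, with a parent array standing in for the distributed references (our own illustration):

```python
def pointer_jump(parent):
    # One round of parallel pointer jumping: every vertex simultaneously
    # replaces its parent with its grandparent, roughly halving its
    # distance to the root.
    return [parent[parent[u]] for u in range(len(parent))]

# A path 0 <- 1 <- 2 <- ... <- 7 rooted at 0 collapses to depth 1 in
# O(log n) rounds.
parent = [0, 0, 1, 2, 3, 4, 5, 6]
rounds = 0
while True:
    nxt = pointer_jump(parent)
    if nxt == parent:
        break
    parent = nxt
    rounds += 1
```

For the depth-7 path above, three rounds suffice, which is the logarithmic convergence that neighborhood-only DSLs cannot express.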
For example, Green-Marl detects read/write conflicts, which complicate its programming model; Fregel has a simpler functional model, which, however, cannot support remote writing without major extension. A more careful design is required to make remote reads and writes efficient and friendly to programmers.

In this paper, we propose a more powerful DSL called Palgol that supports remote data access. In more detail:

• We propose a new high-level model for vertex-centric computation, where the concept of algorithmic supersteps is introduced as the basic computation unit for constructing vertex-centric computation in such a way that remote reads and writes are ordered in a safe way.

• Based on the new model, we design and implement Palgol, a more declarative and powerful DSL, which supports both remote reads and writes, and allows programmers to use a more declarative syntax called chain access to directly read data on remote vertices. For efficient compilation from Palgol to Pregel, we develop a logic system to compile chain access to efficient message passing where the number of supersteps is reduced whenever possible.

• We demonstrate the power of Palgol by working on a set of representative examples, including the Shiloach-Vishkin connected component algorithm and the list ranking algorithm, which use communication over dynamic data structures to achieve fast convergence.

• The result of our evaluation is encouraging. The efficiency of Palgol is comparable with that of hand-written code for many representative graph algorithms on practical big graphs, where execution time varies from a 2.53% speedup to a 6.42% slowdown in ordinary cases, while the worst case is less than a 30% slowdown.

The rest of the paper is organized as follows. Section 2 introduces algorithmic supersteps and the essential parts of Palgol, Section 3 presents the compiling algorithm, and Section 4 presents evaluation results. Related work is discussed in Section 5, and Section 6 concludes this paper with some outlook.
2. THE PALGOL LANGUAGE
This section first introduces a high-level vertex-centric programming model (Section 2.1), in which an algorithm is decomposed into atomic vertex-centric computations and high-level combinators, and a vertex can access the entire graph through the references it stores locally. Next we define the Palgol language based on this model, and explain its syntax and semantics (Section 2.2). Then, we show how to write the classic shortest path algorithm in Palgol. Finally we use two representative examples, the Shiloach-Vishkin connected component algorithm (Section 2.4) and the list ranking algorithm (Section 2.5), to demonstrate how Palgol can concisely describe vertex-centric algorithms with dynamic internal structures using remote access.

(Palgol stands for Pregel algorithmic language. The system with all implementation code and test examples is available at https://bitbucket.org/zyz915/palgol.)

The high-level model we propose uses remote reads and writes instead of message passing to allow programmers to describe vertex-centric computation more intuitively. Moreover, the model remains close to the Pregel computation model, in particular keeping the vertex-centric paradigm and barrier synchronization, making it possible to automatically derive a valid and efficient Pregel implementation from an algorithm description in this model, and in particular to arrange remote reads and writes without data conflicts.

In our high-level model, the computation is constructed from basic components which we call algorithmic supersteps. An algorithmic superstep is a piece of vertex-centric computation which takes a graph containing a set of vertices with local states as input, and outputs the same set of vertices with new states.
Using algorithmic supersteps as basic building blocks, two high-level operations, sequence and iteration, can be used to glue them together to describe more complex vertex-centric algorithms that are iterative and/or consist of multiple computation stages: the sequence operation concatenates two algorithmic supersteps by taking the result of the first step as the input of the second one, and the iteration operation repeats a piece of vertex-centric computation until some termination condition is satisfied.

The distinguishing feature of algorithmic supersteps is remote access. Within each algorithmic superstep (illustrated in Figure 1), all vertices compute in parallel, performing the same computation specified by programmers. A vertex can read the fields of any vertex in the input graph; it can also write to arbitrary vertices to modify their fields, but the writes are performed on a separate graph rather than the input graph (so there are no read-write conflicts). We further distinguish local writes and remote writes in our model: local writes can only modify the current vertex's state, and are first performed on an intermediate graph (which is initially a copy of the input graph); next, remote writes are propagated to the destination vertices to further modify their intermediate states.
Here, a remote write consists of a remote field, a value and an "accumulative" assignment (like += and |=), and that field of the destination vertex is modified by executing the assignment with the value on its right-hand side. We choose to support only accumulative assignments so that the order of performing remote writes does not matter.

More precisely, an algorithmic superstep is divided into the following two phases:

• a local computation (LC) phase, in which a copy of the input graph is created as the intermediate graph, and then each vertex can read the state of any vertex in the input graph, perform local computation, and modify its own state in the intermediate graph, and

• a remote updating (RU) phase, in which each vertex can modify the states of any vertices in the intermediate graph by sending remote writes. After all remote writes are processed, the intermediate graph is returned as the output graph.

Figure 1: In an algorithmic superstep, every vertex performs local computation (including field reads and local writes) and remote updating in order.

Of these two phases, the RU phase is optional, in which case the intermediate graph produced by the LC phase is used directly as the final result.
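The two phases can be simulated sequentially in a few lines of Python. The sketch below is our own illustration of the model, not part of Palgol: reads go to a snapshot of the input graph, local writes go to an intermediate copy, and remote writes (each carrying an accumulative operator) are applied afterwards, so their order is irrelevant:

```python
import copy

def algorithmic_superstep(graph, local_compute):
    # LC phase: every vertex reads a snapshot of the input graph and
    # performs local writes on an intermediate copy.
    # RU phase: the queued remote writes, each with an accumulative
    # operator, are applied to the intermediate graph.
    snapshot = graph                        # read-only input graph
    intermediate = copy.deepcopy(graph)     # receives the local writes
    remote_writes = []                      # (dest, field, operator, value)
    for vid in graph:                       # conceptually parallel
        local_compute(vid, snapshot, intermediate[vid], remote_writes)
    for dest, field, op, value in remote_writes:
        intermediate[dest][field] = op(intermediate[dest][field], value)
    return intermediate

# Example: computing in-degrees with accumulative "+=" remote writes.
graph = {0: {"out": [1, 2], "indeg": 0},
         1: {"out": [2], "indeg": 0},
         2: {"out": [], "indeg": 0}}

def count_in_edges(vid, snap, me, writes):
    for w in snap[vid]["out"]:
        writes.append((w, "indeg", lambda a, b: a + b, 1))

result = algorithmic_superstep(graph, count_in_edges)
```

Because "+" is commutative and associative, the final in-degrees do not depend on the order in which the queued writes are applied, which is exactly why the model restricts remote writes to accumulative assignments.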
We present our DSL Palgol next, whose design follows the high-level model introduced in the previous subsection. Figure 2 shows the essential part of the syntax of Palgol. As described by the syntactic category step, an algorithmic superstep in Palgol is a code block enclosed by "for var in V" and "end", where var is a variable name that can be used in the code block for referring to the current vertex (and V stands for the set of vertices of the input graph). Such steps can then be composed (by sequencing) or iterated until a termination condition is met (by enclosing them in "do" and "until ..."). Palgol supports several kinds of termination condition, but in this paper we focus on only one kind, called fixed point, since it is extensively used in many algorithms. The semantics of fixed-point iteration is to iteratively run the program enclosed by do and until, until the specified fields stabilize.

Corresponding to an algorithmic superstep's remote access capabilities, in Palgol we can read a field of an arbitrary vertex using a global field access expression of the form field[exp], where field is a user-specified field name and exp should evaluate to a vertex id. Such an expression can be updated by local or remote assignments, where an assignment to a remote vertex should always be accumulative and prefixed with the keyword remote. One more thing about remote assignments is that they take effect only in the RU phase (after the LC phase), regardless of where they occur in the program.

There are some predefined fields that have special meanings in our language. Nbr is the edge list in undirected graphs, and In and Out respectively store incoming and outgoing edges for directed graphs.
Essentially, these are normal fields of a predefined type for representing edges; most importantly, the compiler assumes a form of symmetry on these fields (namely that every edge is stored consistently on both of its end vertices), and uses the symmetry to produce more efficient code. Then, Id is an immutable field that stores the vertex identifier for each vertex (required by the Pregel framework), whose type is user-specified but currently we can simply treat it as an integer.

    int   = integer
    float = floating-point number
    var   = identifier starting with lowercase letter
    field = identifier starting with capital letter

    prog  ::= step | prog1 ... progn | iter
    iter  ::= do ⟨prog⟩ until fix [field1, ..., fieldn]
    step  ::= for var in V ⟨block⟩ end
    block ::= stmt1 ... stmtn
    stmt  ::= if exp ⟨block⟩
            | if exp ⟨block⟩ else ⟨block⟩
            | for (var ← exp) ⟨block⟩
            | let var = exp
            | local_opt field[var] op_local exp
            | remote field[exp] op_remote exp
    exp   ::= int | float | var | true | false | inf
            | fst exp | snd exp | (exp, exp)
            | exp.ref | exp.val | {exp, exp} | {exp}
            | exp ? exp : exp | (exp) | exp op_b exp | op_u exp
            | field[exp]
            | func_opt [exp | var ← exp1, exp2, ..., expn]
    func  ::= maximum | minimum | sum | ...

    Figure 2: Essential part of Palgol's syntax. Palgol is indentation-based, and two special tokens '⟨' and '⟩' are introduced to delimit indented blocks.

The rest of the syntax for Palgol steps is similar to an ordinary programming language. In particular, we introduce a specialized pair type (expressions of the form {exp, exp}) for representing a reference with its corresponding value (e.g., an edge in a graph), and use .ref and .val respectively to access the reference and the value, to make the code easy to read.
Some functional programming constructs are also used here, like let-binding and list comprehension. There is also a foreign function interface that allows programmers to invoke functions written in a general-purpose language, but we omit the details from the paper.

The single-source shortest path problem is among the best known in graph theory and arises in a wide variety of applications. The idea of this algorithm is fairly simple: it is an iterative computation until the following equation holds:

    dist[v] = 0                                        if v is the source
    dist[v] = min_{u ∈ In(v)} (dist[u] + len(v, u))    otherwise

We can concisely capture the essence of the shortest path algorithm in a Palgol program, as shown in Figure 3. In this program, we store the distance of each vertex from the source in the D field, and use a boolean field A to indicate whether the vertex is active. There are two steps in this program. In the first step (lines 1-4), every vertex initializes its own distance and the A field. Then comes the iterative step (lines 6-13) inside do ... until fix [D], which runs until every vertex's distance stabilizes. Using a list comprehension (lines 7-8), each vertex iterates over all its active incoming neighbors (those whose A field is true), and generates a list containing the sums of their current distances and the corresponding edge weights. More specifically, the list comprehension goes through every edge e in the incoming edge list In[u] such that A[e.ref] is true, and puts D[e.ref] + e.val in the generated list, where e.ref represents the neighbor's vertex id and e.val the edge weight. Finally, we pick the minimum value from the generated list as minD, and update the local fields.

     1  for u in V
     2    D[u] := (Id[u] == 0 ? 0 : inf)
     3    A[u] := (Id[u] == 0)
     4  end
     5  do
     6    for u in V
     7      let minD = minimum [ D[e.ref] + e.val
     8                         | e <- In[u], A[e.ref] ]
     9      A[u] := false
    10      if (minD < D[u])
    11        A[u] := true
    12        D[u] := minD
    13    end
    14  until fix [D]

    Figure 3: The SSSP program in Palgol
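For readers who want to check the semantics, the Palgol program above can be transcribed almost line by line into a sequential Python simulation (our own illustration): `in_edges[v]` plays the role of In[v], and each pair `(u, w)` stands for an edge e with `e.ref = u` and `e.val = w`. All reads in one iteration see the previous iteration's snapshot, matching the algorithmic-superstep model:

```python
INF = float("inf")

def sssp(n, in_edges, source=0):
    # Initialization step (lines 1-4 of the Palgol program).
    D = [0 if v == source else INF for v in range(n)]
    A = [v == source for v in range(n)]
    # Fixed-point iteration on D (the do ... until fix [D] block).
    while True:
        # The list comprehension over active incoming neighbors,
        # evaluated against the snapshot D/A of the previous round.
        minD = [min([D[u] + w for (u, w) in in_edges[v] if A[u]],
                    default=INF)
                for v in range(n)]
        newA = [minD[v] < D[v] for v in range(n)]
        newD = [minD[v] if minD[v] < D[v] else D[v] for v in range(n)]
        if newD == D:           # fix [D]: distances have stabilized
            return newD
        D, A = newD, newA

# Example: 0 -> 1 (1), 0 -> 2 (4), 1 -> 2 (2), 2 -> 3 (1).
distances = sssp(4, [[], [(0, 1)], [(0, 4), (1, 2)], [(2, 1)]])
```

The A field reproduces the usual optimization that only vertices whose distance changed in the previous round propagate values in the next one.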
Here is our first representative Palgol example: the Shiloach-Vishkin (S-V) connected component algorithm [24], which can be expressed as the Palgol program in Figure 4. A traditional HashMin connected component algorithm [24] based on neighborhood access takes time proportional to the input graph's diameter, which can be large in real-world graphs. In contrast, the S-V algorithm can calculate the connected components of an undirected graph in a logarithmic number of supersteps; to achieve this fast convergence, the capability of accessing data on non-neighboring vertices is essential.

In the S-V algorithm, the connectivity information is maintained using the classic disjoint-set data structure [6]. Specifically, the data structure is a forest, and vertices in the same tree are regarded as belonging to the same connected component. Each vertex maintains a parent pointer that either points to some other vertex in the same connected component, or points to itself, in which case the vertex is the root of a tree. We henceforth use D[u] to represent this pointer for each vertex u. The S-V algorithm is an iterative algorithm that begins with a forest of n root nodes, and in each step it tries to discover edges connecting different trees and merge the trees together. In a vertex-centric way, every vertex u performs one of the following operations depending on whether its parent D[u] is a root vertex:

• tree merging: if D[u] is a root vertex, then u chooses one of its neighbors' current parents (to which we give the name t), and makes D[u] point to t if t < D[u] (to guarantee the correctness of the algorithm). When there are multiple choices of the neighbors' parents, or when different vertices try to modify the same parent vertex's pointer, the algorithm always uses the minimum as the tiebreaker, for fast convergence.

• pointer jumping: if D[u] is not a root vertex, then u modifies its own pointer to its current "grandfather" (D[u]'s current pointer).
This operation reduces u's distance to the root vertex, and will eventually make u a direct child of the root vertex so that it can perform the above tree merging operation.

The algorithm terminates when no vertex's pointer changes after an iteration, in which case all vertices point to some root vertex and no more tree merging can be performed. Readers interested in the correctness of this algorithm are referred to the original paper [24] for more details.

The Pregel implementation of this algorithm is complicated, containing roughly 120 lines of code for the compute() function alone. Even detecting whether the parent vertex D[u] is a root vertex, for each vertex u, has to be translated into three supersteps containing a query-reply conversation between each vertex and its parent. In contrast, the Palgol program in Figure 4 describes this algorithm concisely in 13 lines, thanks to the declarative remote access syntax. This piece of code contains two steps: the first one (lines 1-3) performs simple initialization, and the other (lines 5-12), inside an iteration, is the main computation. We again use the field D to store the pointer to the parent vertex.

Let us focus on line 6, which checks whether u's parent is a root. Here we simply check D[D[u]] == D[u], i.e., whether the pointer of the parent vertex D[D[u]] is equal to the parent's id D[u]. This expression is completely declarative, in the sense that we only specify what data is needed and what computation we want to perform, instead of explicitly implementing a message passing scheme.

     1  for u in V
     2    local D[u] := u
     3  end
     4  do
     5    for u in V
     6      if (D[D[u]] == D[u])
     7        let t = minimum [ D[e.ref] | e <- Nbr[u] ]
     8        if (t < D[u])
     9          remote D[D[u]] <?= t
    10      else
    11        local D[u] := D[D[u]]
    12    end
    13  until fix [D]

    Figure 4: The S-V program in Palgol
The rest of the algorithm can be straightforwardly associated with the Palgol program. If u's parent is a root, we generate a list containing all neighboring vertices' parent ids (D[e.ref]), and then bind the minimum one to the variable t (line 7). Now t is either inf, if the neighbor list is empty, or a vertex id; in both cases we can use it to update the parent's pointer (lines 9-10) via a remote assignment. One important thing is that the parent vertex (D[u]) may receive many remote writes from its children, of which only the child providing the minimum t can successfully perform the update. Here, the accumulative remote assignment guarantees exactly this behavior, regardless of the order in which the remote writes are performed.
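The S-V rounds described above can be simulated sequentially. In the sketch below (our own illustration, not generated code), all reads go to a snapshot of the previous round, and the remote writes to each parent are min-accumulated, mirroring the accumulative remote assignment of the Palgol program:

```python
def shiloach_vishkin(n, edges):
    # D is the parent pointer; each round, a vertex whose parent is a
    # root tries to hook that parent onto the smallest neighboring
    # parent, and every other vertex pointer-jumps.
    nbr = [[] for _ in range(n)]
    for u, v in edges:
        nbr[u].append(v)
        nbr[v].append(u)
    D = list(range(n))
    while True:
        snap = D[:]                  # all reads see the previous round
        newD = D[:]
        writes = {}                  # dest -> minimum of proposed values
        for u in range(n):
            if snap[snap[u]] == snap[u]:        # parent is a root
                t = min((snap[v] for v in nbr[u]), default=None)
                if t is not None and t < snap[u]:
                    d = snap[u]
                    writes[d] = min(writes.get(d, t), t)
            else:                               # pointer jumping
                newD[u] = snap[snap[u]]
        for d, t in writes.items():             # min-accumulated writes
            newD[d] = min(newD[d], t)
        if newD == D:                # fix [D]: pointers have stabilized
            return D
        D = newD

# Two components: {0, 1, 2} and {3, 4}.
components = shiloach_vishkin(5, [(0, 1), (1, 2), (3, 4)])
```

On termination every vertex holds the id of its component's root, so two vertices are connected exactly when their final D values coincide.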
Another example is the list ranking algorithm, which also needs communication over a dynamic structure during computation. Consider a linked list L with n elements, where each element u stores a value val(u) and a link to its predecessor pred(u). At the head of L is a virtual element v such that pred(v) = v and val(v) = 0. For each element u in L, define sum(u) to be the sum of the values of all the elements from u to the head (following the predecessor links). The list ranking problem is to compute sum(u) for each element u. If val(u) = 1 for every vertex u in L, then sum(u) is simply the rank of u in the list. List ranking can be solved using a typical pointer-jumping algorithm in parallel computing with a strong performance guarantee. Yan et al. [24] demonstrated how to compute the pre-ordering numbers for all vertices in a tree in O(log n) supersteps using this algorithm, as an internal step of computing bi-connected components (BCC).

(BCC is a complicated algorithm, whose efficient implementation requires constructing an intermediate graph, which is currently beyond Palgol's capabilities. Palgol is powerful enough to express the rest of the algorithm, however.)

We give the Palgol implementation of list ranking in Figure 5 (which is a 10-line program, whereas the Pregel implementation contains around 60 lines of code). Sum[u] is initially set to Val[u] for every u at line 2; inside the fixed-point iteration (lines 5-9), every u moves Pred[u] toward the head of the list and updates Sum[u] to maintain the invariant that Sum[u] stores the sum of the sublist from itself to the successor of Pred[u]. Line 6 checks whether u points to the virtual head of the list, which is achieved by checking Pred[Pred[u]] == Pred[u], i.e., whether the current predecessor Pred[u] points to itself. If the current predecessor is not the head, we add the sum of the sublist maintained in Pred[u] to the current vertex u, by reading Pred[u]'s Sum and Pred fields and modifying u's own fields accordingly. Note that since all the reads are performed on a snapshot of the input graph and the assignments are performed on an intermediate graph, there is no need to worry about data dependencies.

In some Pregel algorithms, we may want to inactivate vertices during computation. Typical examples include matching algorithms like randomized bipartite matching [14] and approximate maximum weight matching [18], where matched vertices are no longer needed in subsequent computation, and the minimum spanning forest algorithm [18], where the graph gradually shrinks during computation.

In Palgol, we model the behavior of inactivating vertices as a special Palgol step, which can be freely composed with other Palgol programs. The syntactic category of step is now defined as follows:

    step ::= for var in V ⟨block⟩ end
           | stop var where exp

The special Palgol step stops those vertices satisfying the condition specified by the boolean-valued expression exp, which can refer to the current vertex var. The semantics of stopping vertices is different from Pregel's voting-to-halt mechanism. In Pregel, an inactive vertex can be activated by receiving messages, but such semantics is unsuitable for Palgol, since we already hide message passing from programmers. Instead, a stopped vertex in Palgol becomes immutable and never performs any subsequent local computation, but other vertices can still access its fields. This feature is still experimental and we do not discuss it further in this paper; it is, however, essential for achieving the performance reported in Section 4.
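Returning to list ranking, the pointer-jumping rounds and the Sum invariant can be simulated directly (a sequential sketch of our own, mirroring the Palgol program): each round, Sum[u] absorbs the sublist maintained at Pred[u], and Pred[u] jumps to Pred[Pred[u]]:

```python
def list_ranking(pred, val):
    # Invariant: Sum[u] is the sum of the sublist from u back to the
    # successor of Pred[u]; both Sum and Pred jump each round.
    n = len(pred)
    Sum, Pred = val[:], pred[:]
    while True:
        newSum, newPred = Sum[:], Pred[:]
        for u in range(n):
            # Stop once Pred[u] is the virtual head (it points to itself).
            if Pred[Pred[u]] != Pred[u]:
                newSum[u] = Sum[u] + Sum[Pred[u]]   # snapshot reads
                newPred[u] = Pred[Pred[u]]
        if newPred == Pred and newSum == Sum:
            return newSum
        Sum, Pred = newSum, newPred

# List 0 <- 1 <- 2 <- 3 with virtual head 0 (val 0) and unit values:
# the result is each element's rank.
ranks = list_ranking([0, 0, 1, 2], [0, 1, 1, 1])
```

Each round doubles the length of the sublist summarized by Sum[u], so the fixed point is reached after O(log n) rounds, which is the convergence guarantee quoted above.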
3. COMPILING PALGOL TO PREGEL
In this section, we present the compilation algorithm that transforms Palgol to Pregel. The overall task is complicated and highly technical, but the main problems are the following two: how to translate Palgol steps into Pregel supersteps, and how to implement sequence and iteration, which will be presented in Section 3.2 and Section 3.3 respectively. When compiling a single Palgol step, the most challenging part is the remote reads, for which we first give a detailed explanation in Section 3.1. We also mention an optimization based on Pregel's combiners in Section 3.4.
In current Palgol, our compiler recognizes two forms of remote reads. The first is called consecutive field access (or chain access for short), which uses nested field access expressions to acquire remote data. The second is called neighborhood access, where a vertex may use chain access to acquire data from all its neighbors; this can be described using the list comprehension (e.g., line 7 in Figure 4) or for-loop syntax in Palgol. The combination of these two remote read patterns is already sufficient to express quite a wide range of practical Pregel algorithms according to our experience. In this section, we present the key algorithms to compile these two remote read patterns to message passing in Pregel.
Definition and challenge of compiling: Let us begin with the first form of remote reads, consecutive field access expressions (chain access) starting from the current vertex. As an example, supposing that the current vertex is u, and D is a field storing a vertex id, then D[D[u]] is a consecutive field access expression, and so is D[D[D[D[u]]]] (which we abbreviate to D⁴[u] in the rest of this section). Generally speaking, there is no limitation on the depth of a chain access or the number of fields involved in it.

As a simple example of the compilation, to evaluate D[D[u]] on every vertex u, a straightforward scheme is a request-reply conversation which takes two rounds of communication: in the first superstep, every vertex u sends a request to (the vertex whose id is) D[u], and the request message contains u's own id; then in the second superstep, the vertices receiving the requests extract the senders' ids from the messages, and reply with their D fields.

When the depth of such a chain access increases, it is no longer trivial to find an efficient scheme, where efficiency is measured in terms of the number of supersteps taken. For example, to evaluate D⁴[u] on every vertex u, a simple query-reply method takes six rounds of communication by evaluating D²[u], D³[u] and D⁴[u] in turn, each taking two rounds, but the evaluation can actually be done in only three rounds with our compilation algorithm, which is not based on request-reply conversations.

    (∀u. K_u u)          ∧ (∀u. K_u D[u])          ⟹ ∀u. K_{D[u]} u
    (∀u. K_{D[u]} u)     ∧ (∀u. K_{D[u]} D²[u])    ⟹ ∀u. K_{D²[u]} u
    (∀u. K_{D[u]} D²[u]) ∧ (∀u. K_{D[u]} u)        ⟹ ∀u. K_u D²[u]
    (∀u. K_{D²[u]} D⁴[u]) ∧ (∀u. K_{D²[u]} u)      ⟹ ∀u. K_u D⁴[u]

    Figure 6: A derivation of ∀u. K_u D⁴[u]
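These round counts (two rounds for D[D[u]], three rounds instead of six for the depth-four chain) can be cross-checked by brute force. The sketch below is our own illustration, not the compiler's implementation: a chain access is represented as a tuple of field names applied to u innermost-first (so `()` is u, `('D',)` is D[u], `('D','D')` is D[D[u]]), and in each round a vertex may forward anything it knows to any vertex whose id it knows:

```python
from functools import lru_cache

def prefixes(p, proper=False):
    # Subpatterns of p: every chain access that p "starts from".
    return [p[:i] for i in range(len(p) + (0 if proper else 1))]

def generalize(holder, expr):
    # If holder is a subpattern of expr, restate the goal relative to
    # the vertex u itself (replace the innermost holder by u).
    if expr[:len(holder)] == holder:
        return ((), expr[len(holder):])
    return (holder, expr)

@lru_cache(maxsize=None)
def feasible(holder, expr, budget):
    # Can every vertex holder(u) learn expr(u) within `budget` rounds?
    if holder == () and len(expr) <= 1:
        return True              # a vertex knows its own id and fields
    if budget == 0:
        return False
    # Candidate forwarding vertices: subpatterns of expr, or proper
    # subpatterns of holder.
    for w in set(prefixes(expr)) | set(prefixes(holder, proper=True)):
        if (feasible(*generalize(w, expr), budget - 1)
                and feasible(*generalize(w, holder), budget - 1)):
            return True
    return False

def rounds(expr):
    # Minimum number of communication rounds, by iterative deepening.
    b = 0
    while not feasible((), tuple(expr), b):
        b += 1
    return b
```

The iterative-deepening search plays the role of the minimization in the paper's step function; it confirms that the depth-four chain needs only three rounds.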
Logic system for compiling chain access: The key insight leading to our compilation algorithm is that we should consider not only the expression to evaluate but also the vertex on which the expression is evaluated. To use a slightly more formal notation (inspired by Halpern and Moses [8]), we write ∀u. K_{v(u)} e(u), where v(u) and e(u) are chain access expressions starting from u, to describe the state where every vertex v(u) "knows" the value of the expression e(u); then the goal of the evaluation of D[D[D[u]]] can be described as ∀u. K_u D[D[D[u]]]. Having introduced the notation, the problem can now be treated from a logical perspective, where we aim to search for a derivation of a target proposition from a few axioms.

There are three axioms in our logic system:

1. ∀u. K_u u
2. ∀u. K_u D[u]
3. (∀u. K_{w(u)} e(u)) ∧ (∀u. K_{w(u)} v(u)) ⟹ ∀u. K_{v(u)} e(u)

The first axiom says that every vertex knows its own id, and the second says that every vertex can directly access its local field D. The third axiom encodes message passing: if we want every vertex v(u) to know the value of the expression e(u), it suffices to find an intermediate vertex w(u) that knows both the value of e(u) and the id of v(u), and can thus send the value to v(u). As an example, Figure 6 shows the solution generated by our algorithm to solve ∀u. K_u D[D[D[u]]], where each line is an instance of the message passing axiom.

Figure 7 is a direct interpretation of the implications in Figure 6. To reach ∀u. K_u D[D[D[u]]], only three rounds of communication are needed. Each solid arrow represents an invocation of the message passing axiom in Figure 6, and the dashed arrows represent two logical inferences, one from ∀u. K_u D[u] to ∀u. K_{D[u]} D[D[u]] and the other from ∀u. K_u D[u] to ∀u. K_{D[D[u]]} D[D[D[u]]].

[Figure 7: Interpretation of the derivation of ∀u. K_u D[D[D[u]]]. Step 1: u knows u, and u knows D[u] (axioms). Step 2: D[u] knows u (message passing); D[u] knows D[D[u]] (logical inference). Step 3: D[D[u]] knows u (message passing); D[D[u]] knows D[D[D[u]]] (logical inference). Step 4: u knows D[D[D[u]]] (message passing).]

The derivation of ∀u. K_u D[D[D[u]]] is not unique, and there are derivations that correspond to inefficient solutions; for example, there is also a derivation for the six-round solution based on request-reply conversations. However, when searching for derivations, our algorithm minimizes the number of rounds of communication, as explained below.

The compiling algorithm: The algorithm starts from a proposition ∀u. K_{v(u)} e(u). The key problem here is to choose a proper w(u) so that, by applying the message passing axiom backwards, we get two potentially simpler new target propositions ∀u. K_{w(u)} e(u) and ∀u. K_{w(u)} v(u), and solve them respectively. The range of such choices is in general unbounded, but our algorithm considers only those simpler than v(u) or e(u). More formally, we say that a is a subpattern of b, written a ⪯ b, exactly when b is a consecutive field access expression starting from a. For example, u and D[u] are subpatterns of D[D[u]], while they are all subpatterns of D[D[D[u]]]. The range of intermediate vertices we consider is then Sub(e(u), v(u)), where Sub is defined by

    Sub(a, b) = { c | c ⪯ a or c ≺ b }

We can further simplify the new target propositions with the following function before solving them:

    generalize(∀u. K_{a(u)} b(u)) = ∀u. K_u (b(u)/a(u))    if a(u) ⪯ b(u)
    generalize(∀u. K_{a(u)} b(u)) = ∀u. K_{a(u)} b(u)      otherwise

where b(u)/a(u) denotes the result of replacing the innermost a(u) in b(u) with u. (For example, A[B[C[u]]]/C[u] = A[B[u]].) This is justified because the original proposition can be instantiated from the new proposition. (For example, ∀u. K_{C[u]} A[B[C[u]]] can be instantiated from ∀u. K_u A[B[u]].)

It is now possible to find an optimal solution with respect to the following inductively defined function step, which calculates the number of rounds of communication for a proposition:

    step(∀u. K_u u)          = 0
    step(∀u. K_u D[u])       = 0
    step(∀u. K_{v(u)} e(u))  = 1 + min_{w(u) ∈ Sub(e(u), v(u))} max(x, y)
      where x = step(generalize(∀u. K_{w(u)} e(u)))
            y = step(generalize(∀u. K_{w(u)} v(u)))

It is straightforward to see that this is an optimization problem with optimal and overlapping substructure, which we can solve efficiently with memoization techniques.

With this powerful compiling algorithm, we are now able to handle any chain access expression. Furthermore, the algorithm optimizes the generated Pregel program in two respects. First, it derives a message passing scheme with a minimum number of supersteps, thus reducing the unnecessary cost of launching supersteps in a Pregel framework. Second, by extending the memoization technique, we can ensure that a chain access expression is evaluated exactly once even if it appears multiple times in a Palgol step, avoiding redundant message passing for the same value.

Neighborhood access is another important communication pattern widely used in Pregel algorithms. Precisely speaking, neighborhood access refers to those chain access expressions inside a non-nested loop traversing an edge list (
Nbr, In or Out), where the chain access expressions start from the neighboring vertex. The following code is a typical example of neighborhood access, which is a list comprehension used in the S-V algorithm program (Figure 4):

    let t = minimum [ D[e.ref] | e <- Nbr[u] ]
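Returning to chain access, the cost function step admits a direct memoized prototype. The following Python sketch is our own illustration, not part of Palgol's formalism: we encode a chain access pattern as a tuple of field names applied to u, innermost first, so () is u, ('D',) is D[u], and ('D', 'D', 'D') is D[D[D[u]]]; the cycle guard (marking in-progress goals) is likewise our own device for ruling out circular derivations.

```python
import math

def subpatterns(p):
    """All subpatterns of p: u, then successively longer field accesses."""
    return [p[:i] for i in range(len(p) + 1)]

def generalize(v, e):
    """If v(u) ⪯ e(u), rewrite ∀u. K_{v(u)} e(u) to ∀u. K_u (e(u)/v(u))."""
    if e[:len(v)] == v:                   # v ⪯ e
        return ((), e[len(v):])           # replace the innermost v(u) in e(u) by u
    return (v, e)

memo = {}

def step(v, e):
    """Minimum number of communication rounds to reach ∀u. K_{v(u)} e(u)."""
    v, e = generalize(v, e)
    if v == () and len(e) <= 1:           # axioms: K_u u and K_u D[u]
        return 0
    if (v, e) in memo:                    # None marks a goal still being explored,
        val = memo[(v, e)]                # i.e. a circular derivation attempt
        return math.inf if val is None else val
    memo[(v, e)] = None
    best = math.inf
    # candidate intermediate vertices: w ⪯ e or w a proper subpattern of v
    for w in set(subpatterns(e)) | set(subpatterns(v)[:-1]):
        best = min(best, 1 + max(step(w, e), step(w, v)))
    memo[(v, e)] = best
    return best

# D[D[u]] needs two rounds (a request-reply); D[D[D[u]]] needs only three, not six.
assert step((), ('D', 'D')) == 2
assert step((), ('D', 'D', 'D')) == 3
```

Choosing max rather than sum in the recursive case reflects that the two subgoals can be pursued in parallel supersteps.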
Syntactically, a field access expression D[e.ref] can be easily identified as a neighborhood access.

The compilation of this data access pattern is based on the symmetry that if all vertices need to fetch the same field of their neighbors, this is equivalent to making all vertices send that field to all their neighbors. This is a well-known technique that is also adopted by Green-Marl and Fregel, so we do not go into the details and simply summarize the compilation procedure as follows:

1. In the first superstep, we prepare the data from the neighbors' perspective. Field access expressions like D[e.ref] now become neighboring vertices' local fields D[u]. Every vertex then sends messages containing those values to all its neighboring vertices.

2. In the next superstep, every vertex scans its message list to obtain all the values of neighborhood access, and then executes the loop according to the Palgol program.

Having introduced the compiling algorithm for remote data reads in Palgol, we now give a general picture of the compilation of a single Palgol step, as shown in Figure 8. The computational content of every Palgol step is compiled into a main superstep. Depending on whether there are remote reads and writes, there may be a number of remote reading supersteps before the main superstep, and a remote updating superstep after the main superstep.

We will use the main computation step of the S-V program (lines 5–13 in Figure 4) as an illustrative example for explaining the compilation algorithm, which consists of the following four steps:

1. We first handle neighborhood access, which requires a sending superstep that provides all the remote data for the loops from the neighbors' perspective (for the S-V algorithm, all vertices send their D field to all their neighbors). This sending superstep is inserted as a remote reading superstep immediately before the main superstep.

2.
We analyze the chain access expressions appearing in the Palgol step with the algorithm in Section 3.1, and the corresponding remote reading supersteps are inserted in front. (For the S-V algorithm, the only interesting chain access expression is D[D[u]], which induces two remote reading supersteps realizing a request-reply conversation.)

[Figure 8: Compilation of a single Palgol step. For the S-V example, the chain access D[D[u]] is resolved by two remote reading supersteps (∀u. K_u u ∧ ∀u. K_u D[u] ⟹ ∀u. K_{D[u]} u; then ∀u. K_{D[u]} D[D[u]] ∧ ∀u. K_{D[u]} u ⟹ ∀u. K_u D[D[u]]), and for neighborhood communication every vertex sends D[u] to all its neighbors so that each vertex obtains all D[e.ref]. The main superstep then performs the local computation (let t = minimum [D[e.ref] | e <- Nbr[u]]) and sends the value of t to the remote vertex D[u].]
3. Having handled all remote reads, the main superstep receives all the values needed and proceeds with the local computation. Since the local computational content of a Palgol step is similar to that of an ordinary programming language, the transformation is straightforward.

4. What remain to be handled are the remote assignments, which require sending the updated values as messages to the target vertices in the main superstep. (For the S-V algorithm, there is one remote updating statement at line 10, requiring that the value of t be sent to D[u].) An additional remote updating superstep is then added after the main superstep; this superstep reads these messages and updates each field using the corresponding remote updating operator.

We finally tackle the problem of compiling sequence and iteration, to assemble Palgol steps into larger programs. A Pregel program generated from Palgol code is essentially a state transition machine (STM) combined with computation code for each state. In the simplest case, every Palgol step is translated into a "linear" STM consisting of a chain of states corresponding to the supersteps like those shown in Figure 8. In general, a generated STM has a start state and an end state, between which there can be more states and transitions, not necessarily in a linear structure.

[Figure 9: The compilation of sequence. The most straightforward way is shown on the left; our compiler merges the states S_n and S_{n+1} and creates the STM on the right.]

A sequence of two Palgol programs uses the first program to transform an initial graph into an intermediate one, which is then transformed into a final graph by the second program. To compile the sequence, we first compile the two component programs into STMs; a composite STM is then built from these two STMs, implementing the sequence semantics. We illustrate the compilation in Figure 9.
The left side is a straightforward way of compiling, and the right side is an optimized one produced by our compiler, with states S_n and S_{n+1} merged together. This is because the separation of S_n and S_{n+1} is unnecessary: every Palgol program describes an independent vertex-centric computation that does not rely on any incoming messages (according to our high-level model); correspondingly, our compilation ensures that the first superstep of a compiled program ignores the incoming messages. We call this the message-independence property. Since S_{n+1} is the beginning of the second Palgol program, it ignores the incoming messages, and therefore the barrier synchronization between S_n and S_{n+1} can be omitted.

Fixed-point iteration repeatedly runs a program enclosed by 'do' and 'until ...' until the specified fields stabilize. To compile an iteration, we first compile its body into an STM, and then extend this STM to implement the fixed-point semantics. The output STM is presented in Figure 10, where the left one is generated by our general approach, and the right one performs the fusion optimization when a certain condition is satisfied.

Let us start with the general approach on the left. Temporarily ignoring the initialization state, the STM implements a while loop: a check of the termination condition takes place right before the state S_1; if the termination condition holds, we immediately enter the state Exit, and otherwise we execute the body, after which we go back to the check. The termination check is implemented with an OR aggregator to make sure that every vertex makes the same decision: basically, every vertex determines whether its local fields changed during a single iteration by storing the original values before S_1, and sends the result (as a boolean) to the aggregator, which can then decide globally whether any vertex has not yet stabilized.
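Ignoring distribution, the do-until semantics and its OR-aggregator termination check behave as in the following sequential Python sketch. This is our own simulation, not generated code: `fixed_point` plays the role of the iteration STM, and `propagate_max` is an arbitrary toy loop body.

```python
def fixed_point(body, vertices):
    """Run `body` at least once, then repeat until no vertex's fields change.
    The `any(...)` below plays the role of the OR aggregator: each vertex
    contributes a boolean "did my fields change?", and the loop exits only
    when the global OR of those booleans is false."""
    while True:
        old = {v: dict(state) for v, state in vertices.items()}  # save fields
        body(vertices)                                           # one iteration
        changed = any(vertices[v] != old[v] for v in vertices)   # OR aggregator
        if not changed:
            break

def propagate_max(vs):
    """Toy body: every vertex adopts the global maximum of field x."""
    m = max(state['x'] for state in vs.values())
    for state in vs.values():
        state['x'] = m

vs = {1: {'x': 3}, 2: {'x': 7}, 3: {'x': 5}}
fixed_point(propagate_max, vs)
assert all(state['x'] == 7 for state in vs.values())  # stabilizes after one change
```

The second pass through the loop observes no change and exits, mirroring the STM's behavior of executing the body once before the first successful exit check.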
What remains is the initialization state, which guarantees that the first termination check does not exit the loop, turning the while loop into a do-until loop.

[Figure 10: An STM for general iteration is shown on the left. The fusion optimization applies when the iteration body begins with a remote reading superstep (S_1), and yields the STM on the right.]

There is a chance to reduce the number of supersteps in the loop body of the iteration STM when the first state S_1 of the loop body is a remote reading superstep (see Section 3.2). In this case, as shown on the right side of Figure 10, the termination check is moved to the beginning of the second state S_2, and the state S_1 is duplicated and attached to the end of both the initialization state and S_n. This transformation ensures that, no matter from where we reach the state S_2, we always execute the code of S_1 in the previous superstep to send the necessary messages. With this property guaranteed, we can simply iterate from S_2 to S_n to implement the iteration, so that the number of supersteps inside the iteration is reduced. The only difference from the left STM is that we execute an extra S_1 attached at the end of S_n when we exit the iteration. However, this still correctly implements the semantics of iteration: the only action performed by a remote reading superstep is sending some messages; although unnecessary messages are emitted, the Palgol program following the extra S_1 will ignore all incoming messages in its first state, as dictated by the message-independence property.
Table 1: Datasets for Performance Evaluation

    Dataset    Type        |V|         |E|
    Wikipedia  Directed    18,268,992  172,183,984
    Facebook   Undirected  59,216,214  185,044,032
    USA        Weighted    23,947,347  58,333,344
    Random     Chain       10,000,000  10,000,000

Combiners are a mechanism in Pregel that may reduce the number of messages transmitted during the computation. Essentially, in a single superstep, if all the messages sent to a vertex are only meant to be consumed by a reduce operator (e.g., sum or maximum) to produce a value on that vertex, and the values of the individual messages are not important, then the system can combine the messages intended for the vertex into a single one by that operator, reducing the number of messages that must be transmitted and buffered.

In Pregel, combiners are not enabled by default, since "there is no mechanical way to find a useful combining function that is consistent with the semantics of the user's compute() method" [14]. However, Palgol's list comprehension syntax combines remote access with a reduce operator, and naturally represents this type of computation, which can potentially be optimized by a combiner. A typical example is the SSSP program (lines 7–8 in Figure 3), where the distances received from the neighbors (D[e.ref] + e.val) are transmitted and reduced by the minimum operator. Since the algorithm only cares about the minimum of the messages, and the compiler knows that nothing else is carried by the messages in that superstep, the compiler can automatically implement a combiner with the minimum operator to optimize the program.
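The effect of such a combiner can be illustrated with a small Python sketch. This is our own simulation, not the Pregel+ API: `deliver` is a hypothetical helper that groups messages by destination, optionally folding them with a combining function while "in transit".

```python
from collections import defaultdict

def deliver(sends, combiner=None):
    """Group (destination, message) pairs into per-vertex inboxes.
    With a combiner, messages for the same destination are folded into a
    single message, so the destination buffers at most one value."""
    inbox = defaultdict(list)
    for dst, msg in sends:
        if combiner is not None and inbox[dst]:
            inbox[dst] = [combiner(inbox[dst][0], msg)]
        else:
            inbox[dst].append(msg)
    return inbox

# SSSP-style traffic: candidate distances sent to vertices 1 and 2.
sends = [(1, 5.0), (1, 3.0), (2, 7.0), (1, 4.0)]

plain = deliver(sends)
combined = deliver(sends, combiner=min)
assert plain[1] == [5.0, 3.0, 4.0]        # three messages buffered at vertex 1
assert combined[1] == [3.0]               # a single message: only the minimum survives
assert min(plain[1]) == combined[1][0]    # the reduced result is unchanged
```

Since the receiving vertex only applies `min` to its inbox anyway, folding with `min` en route changes the message count but not the computed value, which is exactly the property the compiler checks before emitting a combiner.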
4. EXPERIMENTS
In this section, we evaluate the overall performance of Palgol and the state-merging optimizations introduced in the previous section. We compile Palgol code to Pregel+, an open-source implementation of Pregel written in C++. We have implemented the following six graph algorithms on Pregel+'s basic mode:

• PageRank [14]
• Single-Source Shortest Path (SSSP) [14]
• Strongly Connected Components (SCC) [24]
• Shiloach-Vishkin Connected Component Algorithm (S-V) [24]
• List Ranking Algorithm (LR) [24]
• Minimum Spanning Forest (MSF) [3]

Among these algorithms, SCC, S-V, LR and MSF are non-trivial ones that contain multiple computing stages. Their Pregel+ implementations are included in our repository for interested readers.

We use 4 graph datasets in our performance evaluation, listed in Table 1: (1) Wikipedia: the hyperlink network of Wikipedia; (2) Facebook: a friendship network of the Facebook social network; (3) USA: the USA road network; (4) Random: a chain with randomly generated values.

The experiment is conducted on an Amazon EC2 cluster with 16 nodes (instance type m4.large, each with 2 vCPUs and 8 GB memory). Each algorithm is run on the type of input graph to which it is applicable (PageRank on directed graphs, for example) with 4 configurations, where the number of nodes ranges from 4 to 16. We measure the execution time for each experiment, and all results are averaged over three repeated runs. The runtime results of our experiments are summarized in Table 2.

Palgol does not target a specific Pregel-like system. Instead, by properly implementing different backends of the compiler, Palgol can be compiled to any Pregel-like system, as long as the system supports the basic Pregel interfaces, including message passing between arbitrary pairs of vertices and aggregators.

http://konect.uni-koblenz.de/networks/dbpedia-link
https://archive.is/o/cdGrj/konect.uni-koblenz.de/networks/facebook-sg
As Table 2 shows, for PageRank, SCC, S-V and MSF the performance of Palgol ranges from a 2.53% speedup to a 6.42% slowdown relative to the hand-written code. The generated programs for PageRank and S-V are almost identical to the hand-written versions, while some subtle differences exist in the SCC and MSF programs. For SCC, the whole algorithm is a global iteration with several iterative sub-steps, and the hand-written code can exit the outermost iteration earlier by adding an extra assertion in the code (like a break inside a do ... until loop). This optimization is not currently supported by Palgol. For MSF, the hand-written code optimizes the evaluation of the special expression D[D[u]] == u to only one round of communication, while Palgol's strategy always evaluates the chain access D[D[u]] using a request step followed by a reply step, and then compares the result with u. These differences are, however, not critical to the performance.

For SSSP, we observed a slowdown of up to 29.55%. The main reason is that the hand-written code uses the vote_to_halt() API to deactivate converged vertices during the computation; this accelerates the execution, since the Pregel system skips invoking the compute() function for those inactive vertices, while in Palgol we check the states of the vertices to decide whether to perform computation. Similarly, we observed a 24% slowdown for LR, since the hand-written code deactivates all vertices after each superstep, and this turns out to work correctly. While voting to halt may look important to efficiency, we would argue against supporting voting to halt as is, since it makes programs impossible to compose: in general, an algorithm may contain multiple computation stages, and we need to control when to end a stage and enter the next; voting to halt, however, does not help with such stage transitions, since it is designed to deactivate all vertices and end the whole computation right away.
Table 2: Comparison of Execution Time (in seconds) between Pregel+ and Palgol Implementations

                          4 nodes          8 nodes          12 nodes         16 nodes
    Dataset    Algorithm  Pregel+  Palgol  Pregel+  Palgol  Pregel+  Palgol  Pregel+  Palgol  Difference
    Wikipedia  SSSP       8.33     10.80   4.47     5.61    3.18     3.83    2.41     2.85    18.06% – 29.55%
    Wikipedia  PageRank   153.40   152.36  83.94    82.58   61.82    61.24   48.36    47.66   -1.62% – 2.26%
    Wikipedia  SCC        177.51   178.87  85.87    86.52   61.75    61.89   46.64    46.33   -0.66% – 0.77%
    Facebook   S-V        143.09   142.16  87.98    86.22   67.62    65.90   58.29    57.49   -2.53% – -0.65%
    Random     LR         56.18    64.69   29.58    33.17   19.76    23.48   14.64    18.16   12.14% – 24.00%
    USA        MSF        78.80    82.57   43.21    45.98   29.47    31.07   22.84    24.29   4.79% – 6.42%

In this subsection, we evaluate the effectiveness of the "state merging" optimization described in Section 3.3, by generating both the optimized and unoptimized versions of the code and executing them in the same configurations. We use all six graph applications from the previous experiment, and fix the number of nodes to 16.

The experimental results are shown in Table 3. From this table, we observe a significant reduction in the number of supersteps for all graph algorithms after optimization. SSSP and SCC are roughly twice as fast as their unoptimized versions, but for the other algorithms the optimization has a relatively small effect on execution time.

First, the reduction of the number of supersteps in execution has a strong connection with the optimization of the main iteration in these graph algorithms. For applications containing only a simple iteration, like PageRank and SSSP, nearly 2/3 and 1/2 of the supersteps are eliminated, respectively.
Table 3: Comparison of the Compiler-Generated Programs Before/After Optimization
The effect of this optimization on execution time is, however, related not only to the number of supersteps that are eliminated, but also to the characteristics of the applications. On one hand, the optimization reduces the number of global synchronizations, which improves performance. On the other hand, it brings a small overhead, because we obtain a tighter loop body by unconditionally sending the necessary messages for the next iteration at the end of each iteration. As a result, when exiting the loop, some redundant messages are emitted (although the correctness of the generated code is ensured). In our experiments, SSSP and SCC are roughly twice as fast after optimization, since they are not computationally intensive, so the number of global synchronizations matters.
5. RELATED WORK
Google's Pregel [14] proposed the vertex-centric computing paradigm, which allows programmers to think naturally like a vertex when designing distributed graph algorithms. There are many open-source alternatives to the official and proprietary Pregel system, such as Apache Hama [2], Apache Giraph [1], Catch the Wind [19], GPS [17], GraphLab [13], PowerGraph [7] and Mizan [11]. This paper does not target a specific Pregel-like system. Some graph-centric (or block-centric) systems like Giraph+ [20] and Blogel [23] extend Pregel's vertex-centric approach by making the partitioning mechanism open to programmers, but it is still unclear how to optimize general vertex-centric algorithms (especially complicated ones containing non-trivial communication patterns) using such extensions.

Domain-specific languages (DSLs) are a well-known mechanism for describing solutions in specialized domains. To ease Pregel programming, many DSLs have been proposed, such as Palovca [12], s6raph [16], Fregel [4] and Green-Marl [10]. We briefly introduce each of them below.

Palovca [12] exposes the Pregel APIs in Haskell using a monad, and a vertex-centric program is written in a low-level way, as in typical Pregel systems. Since this language is still low-level, programmers face the same challenges as in Pregel programming, mainly having to tackle all the low-level details.

At the other extreme, the s6raph system [16] is a special graph processing framework with a functional interface. It models a particular type of iterative vertex-centric computation by six programmer-specified functions, and can only express graph algorithms that contain a single iterative computation (such as PageRank and Shortest Path), whereas many practical Pregel algorithms are far more complicated.

A more comparable and (in fact) closely related piece of work is Fregel [4], which is a functional DSL for declarative programming on big graphs.
In Fregel, a vertex-centric computation is represented by a pure step function that takes a graph as input and produces a new vertex state; such functions can then be composed using a set of predefined higher-order functions to implement a complete graph algorithm. Palgol borrows this idea in its language design by letting programmers write atomic vertex-centric computations called Palgol steps, and put them together using two combinators, namely sequence and iteration. Compared with Fregel, the main strength of Palgol lies in its remote access capabilities:

• a Palgol step consists of local computation and remote updating phases, whereas a Fregel step function can be thought of as describing only local computation, lacking the ability to modify other vertices' states;

• even when considering local computation only, Palgol has highly declarative field access expressions to express remote reading of arbitrary vertices, whereas Fregel allows only neighboring access.

These two features are essential for implementing the examples in Section 2, especially the S-V algorithm. Moreover, Palgol shows that Fregel's combinator-based design can benefit from Green-Marl's fusion optimizations (Section 3.3) and achieve efficiency comparable to hand-written code.

Another comparable DSL is Green-Marl [9], which lets programmers describe graph algorithms in a higher-level imperative language. This language was initially proposed for graph processing on the shared-memory model, and a "Pregel-canonical" subset of its programs can be compiled to Pregel. Since it does not have a Pregel-specific language design, programmers may easily get compilation errors if they are not familiar with the implementation of the compiler. In contrast, Palgol (and Fregel) programs are by construction vertex-centric and distinguish the current and previous states of the vertices, and thus have a closer correspondence with the Pregel model.
For remote reads, Green-Marl only supports neighboring access, so it suffers from the same problem as Fregel: programmers cannot fetch data from an arbitrary vertex. While it supports graph traversal skeletons like BFS and DFS, these traversals can be encoded as neighborhood access with modest effort, so it actually has the same expressiveness as Fregel in terms of remote reading. Green-Marl supports remote writing, but in our experience this support is quite restricted; at least, it cannot be used inside a loop iterating over a neighbor list, and it is thus less expressive than Palgol's.
6. CONCLUDING REMARKS
This paper has introduced Palgol, a high-level domain-specific language for Pregel systems with flexible remote data access, which makes it possible for programmers to express Pregel algorithms that communicate over dynamic internal data structures. We have demonstrated the power of Palgol's remote access by giving two representative examples, the S-V algorithm and the list ranking algorithm, and have presented the key algorithm for compiling remote access. Moreover, we have shown that Fregel's more structured approach to vertex-centric computing can achieve high efficiency: the experimental results show that graph algorithms written in Palgol can be compiled to efficient Pregel programs comparable to hand-written ones.

We expect Palgol's remote access capabilities to help with developing more sophisticated vertex-centric algorithms, where each vertex decides its action by looking at not only its immediate neighborhood but also an extended and dynamic neighborhood. The S-V and list ranking algorithms are just a start; for a differently flavored example, graph pattern matching [5] might be greatly simplified when the pattern has a constant size and can be translated declaratively into a remote access expression deciding whether a vertex and some other "nearby" vertices exhibit the pattern. Algorithm design and language design are interdependent, with algorithmic ideas prompting more language features, and higher-level languages making it easier to formulate and reason about more sophisticated algorithms. We believe that Palgol is a much-needed advance in language design that can bring vertex-centric algorithm design forward.
7. REFERENCES

[1] Apache Giraph. http://giraph.apache.org/.
[2] Apache Hama. http://hama.apache.org/.
[3] S. Chung and A. Condon. Parallel implementation of Borůvka's minimum spanning tree algorithm. In International Parallel Processing Symposium, pages 302–308. IEEE, 1996.
[4] K. Emoto, K. Matsuzaki, A. Morihata, and Z. Hu. Think like a vertex, behave like a function! A functional DSL for vertex-centric big graph processing. In International Conference on Functional Programming, pages 200–213. ACM, 2016.
[5] A. Fard, M. U. Nisar, L. Ramaswamy, J. A. Miller, and M. Saltz. A distributed vertex-centric approach for pattern matching in massive graphs. In BigData, pages 403–411. IEEE, 2013.
[6] H. N. Gabow and R. E. Tarjan. A linear-time algorithm for a special case of disjoint set union. Journal of Computer and System Sciences, 30(2):209–221, 1985.
[7] J. E. Gonzalez, Y. Low, H. Gu, D. Bickson, and C. Guestrin. PowerGraph: Distributed graph-parallel computation on natural graphs. In USENIX Symposium on Operating Systems Design and Implementation, pages 17–30, 2012.
[8] J. Y. Halpern and Y. Moses. Knowledge and common knowledge in a distributed environment. Journal of the ACM, 37(3):549–587, 1990.
[9] S. Hong, H. Chafi, E. Sedlar, and K. Olukotun. Green-Marl: a DSL for easy and efficient graph analysis. In International Conference on Architectural Support for Programming Languages and Operating Systems, pages 349–362. ACM, 2012.
[10] S. Hong, S. Salihoglu, J. Widom, and K. Olukotun. Simplifying scalable graph processing with a domain-specific language. In International Symposium on Code Generation and Optimization, page 208. ACM, 2014.
[11] Z. Khayyat, K. Awara, A. Alonazi, H. Jamjoom, D. Williams, and P. Kalnis. Mizan: a system for dynamic load balancing in large-scale graph processing. In European Conference on Computer Systems, pages 169–182. ACM, 2013.
[12] M. Lesniak. Palovca: describing and executing graph algorithms in Haskell. In International Symposium on Practical Aspects of Declarative Languages, pages 153–167. Springer, 2012.
[13] Y. Low, D. Bickson, J. Gonzalez, C. Guestrin, A. Kyrola, and J. M. Hellerstein. Distributed GraphLab: a framework for machine learning and data mining in the cloud. Proceedings of the VLDB Endowment, 5(8):716–727, 2012.
[14] G. Malewicz, M. H. Austern, A. J. Bik, J. C. Dehnert, I. Horn, N. Leiser, and G. Czajkowski. Pregel: a system for large-scale graph processing. In International Conference on Management of Data, pages 135–146. ACM, 2010.
[15] L. Quick, P. Wilkinson, and D. Hardcastle. Using Pregel-like large scale graph processing frameworks for social network analysis. In International Conference on Advances in Social Networks Analysis and Mining (ASONAM 2012), pages 457–463. IEEE, 2012.
[16] O. C. Ruiz, K. Matsuzaki, and S. Sato. s6raph: vertex-centric graph processing framework with functional interface. In International Workshop on Functional High-Performance Computing, pages 58–64. ACM, 2016.
[17] S. Salihoglu and J. Widom. GPS: a graph processing system. In International Conference on Scientific and Statistical Database Management, number 22. ACM, 2013.
[18] S. Salihoglu and J. Widom. Optimizing graph algorithms on Pregel-like systems. Proceedings of the VLDB Endowment, 7(7):577–588, 2014.
[19] Z. Shang and J. X. Yu. Catch the wind: graph workload balancing on cloud. In International Conference on Data Engineering, pages 553–564. IEEE, 2013.
[20] Y. Tian, A. Balmin, S. A. Corsten, S. Tatikonda, and J. McPherson. From "think like a vertex" to "think like a graph". Proceedings of the VLDB Endowment, 7(3):193–204, 2013.
[21] L. G. Valiant. A bridging model for parallel computation. Communications of the ACM, 33(8):103–111, 1990.
[22] M. Xie, Q. Yang, J. Zhai, and Q. Wang. A vertex centric parallel algorithm for linear temporal logic model checking in Pregel. Journal of Parallel and Distributed Computing, 74(11):3161–3174, 2014.
[23] D. Yan, J. Cheng, Y. Lu, and W. Ng. Blogel: A block-centric framework for distributed computation on real-world graphs. Proceedings of the VLDB Endowment, 7(14):1981–1992, 2014.
[24] D. Yan, J. Cheng, K. Xing, Y. Lu, W. Ng, and Y. Bu. Pregel algorithms for graph connectivity problems with performance guarantees.