Partitioning SKA Dataflows for Optimal Graph Execution
Chen Wu
International Centre for Radio Astronomy Research, The University of Western Australia, Perth, Australia

Andreas Wicenec
International Centre for Radio Astronomy Research, The University of Western Australia, Perth, Australia

Rodrigo Tobar
International Centre for Radio Astronomy Research, The University of Western Australia, Perth, Australia
ABSTRACT
Optimizing data-intensive workflow execution is essential to many modern scientific projects such as the Square Kilometre Array (SKA), which will be the largest radio telescope in the world, collecting terabytes of data per second for the next few decades. At the core of the SKA Science Data Processor is the graph execution engine, scheduling tens of thousands of algorithmic components to ingest and transform millions of parallel data chunks in order to solve a series of large-scale inverse problems within the power budget. To tackle this challenge, we have developed the Data Activated Liu Graph Engine (DALiuGE) to manage data processing pipelines for several SKA pathfinder projects. In this paper, we discuss the DALiuGE graph scheduling sub-system. By extending previous studies on graph scheduling and partitioning, we lay the foundation on which we can develop polynomial-time optimization methods that minimize both workflow execution time and resource footprint while satisfying resource constraints imposed by individual algorithms. We show preliminary results obtained from three radio astronomy data pipelines.
Keywords
Graph execution; Scheduling; Square Kilometre Array
1. INTRODUCTION
The Square Kilometre Array (SKA) will be the largest radio telescope in the world [3]. The two components of the first phase of the SKA (SKA1) — SKA-Mid and SKA-Low — will jointly produce large amounts of data at a rate of one Terabyte (TB) per second, with the second phase data rate reaching at least ten times higher. All this data has to be captured, reduced, processed, and analyzed in near real-time. This poses a great challenge, since the current generation of radio astronomy data processing systems is designed to handle data approximately two to three orders of magnitude smaller than that of SKA1.
To tackle this challenge, we developed the Data Activated Liu Graph Engine (DALiuGE, available at https://github.com/ICRAR/daliuge) to execute continuous, time-critical, data-intensive workflows in order to produce science-ready data products. Compared to existing astronomical workflow systems, DALiuGE has several advantages such as separation of concerns, data-centric execution, graph-based dataflow scheduling, and native support for streaming processing. A technical overview of DALiuGE and its operational production systems are described in [21]. In this paper, we focus on the DALiuGE graph scheduling sub-system. In particular, we discuss technical details of the dataflow partitioning algorithms and their implementations.
2. RELATED WORK
The dataflow computation model [7] represents workflows as Directed Acyclic Graphs (DAGs), where vertices are stateless computational tasks (i.e. functions) and edges connect the output of one task with the input of another. Although the dataflow model exploits parallelism inherent in DAGs through data dependencies, mapping an irregular DAG onto hardware resources for optimal execution is an NP-hard problem [5]. Early work attempted to derive data structures (e.g. the assignment graph [2] or the allocation graph [19]) from the original DAG in order to perform tractable searching and optimisation algorithms (e.g. using maximum flow solutions [17]). While these algorithms were able to uncover an optimal solution in polynomial time, the growth rate of the assignment graph is O(N × M), where N denotes the number of vertices in the original DAG and M denotes the number of available processors. Therefore, as the DAG size and resource pool grow substantially (e.g. from tens of tasks running on a laptop to millions of tasks running on thousands of processors), these exact optimisation methods quickly become intractable.

A variety of heuristics-based algorithms [12] have been developed for scheduling DAGs on multiprocessors. These heuristics in general fall into two alternative approaches — one-phase or two-phase. In the one-phase approach (e.g., the widely-used HEFT algorithm [18]), DAG scheduling is performed by directly mapping a ranked list of workflow tasks to another ranked list of resource units (e.g. processors or nodes) based on aggregated run-time workflow profiles and resource statistics. In contrast, the two-phase approach [13, 16] first partitions the DAG into a number of clusters based on heuristics such as load balancing [11], minimal data movement, etc. In the second phase, these clusters are then mapped onto actual hardware resources for execution. We currently adopt the two-phase approach because the output from the first phase encodes a resource demand abstraction (RDA) from intrinsic properties of the DAG. The RDA becomes the input for resource mapping in the second phase. More importantly, the RDA provides a more accurate estimate of resource demand for future capacity planning and observation scheduling by the telescope manager. However, most two-phase algorithms target multiprocessors on a single compute node, where each workflow task consumes exactly one processor. Our workflows need to run across clusters of compute nodes, each consisting of multiple processors. More importantly, each workflow task inherently demands a different number of processors/cores and a different amount of memory. Dealing with this kind of complexity in resource demand and multiplicity in resource capabilities is one of our contributions in this paper.
Moreover, unlike most existing DAG scheduling/mapping algorithms, our partitioning algorithm aims to reduce the overall resource footprint given these complexities and constraints. On the other hand, the advantage of the one-phase approach is its flexibility to incorporate run-time resource heterogeneity. We leave for future work a thorough investigation and application of the one-phase approach to our DAG mapping problem.

Although significant progress [1, 15, 20] has been made recently in partitioning very large graphs for various social network analysis and machine learning applications, direct application of these graph partitioning algorithms to dataflow partitioning often leads to sub-optimal solutions. This is because the DAG (or general graph) representation G of the dataflow does not encode the notion of the workflow execution working set W_t — the small set of workflow tasks that are being executed at time t. Only tasks in W_t consume resources; other tasks are either waiting for the completion of their "upstream" tasks in W_t or have already completed their executions. Therefore, partitioning the entire graph G (e.g. on the order of millions of nodes) for subsequent resource mapping is (1) wasteful given that |W_t| ≪ |G|, and (2) ill-posed since W_t is time-dependent and unknown at the time of graph partitioning.
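To make the working-set argument concrete, the short Python sketch below (our illustration, not DALiuGE code) bounds the peak working-set size of a deep, PGT-like DAG by its largest topological generation, assuming unit task durations and unlimited resources; the graph shape, sizes, and networkx usage are our own assumptions.

    import networkx as nx

    def peak_working_set(dag: nx.DiGraph) -> int:
        """Bound max_t |W_t| by the largest topological generation,
        assuming unit-duration tasks and no resource contention."""
        return max(len(gen) for gen in nx.topological_generations(dag))

    # A deep pipeline: 1000 sequential stages with 16-way parallelism each.
    g = nx.DiGraph()
    for stage in range(999):
        g.add_edges_from((f"t{stage}_{i}", f"t{stage + 1}_{j}")
                         for i in range(16) for j in range(16))

    print(g.number_of_nodes(), peak_working_set(g))  # 16000 vertices, |W_t| <= 16

Even for this 16,000-vertex graph, at most 16 tasks ever run concurrently, so partitioning decisions driven by the full |G| rather than |W_t| would over-provision by three orders of magnitude.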
3. OVERVIEW OF GRAPH EXECUTION
Following the two-phase approach, the four steps of graph execution are illustrated in Figure 1. We briefly introduce them in this section. Readers are referred to [21] for a detailed technical discussion on graph execution.

Figure 1: The complete dataflow execution cycle consists of four major steps — unrolling, partitioning, mapping and dynamic scheduling. Both unrolling and partitioning are performed offline. Mapping happens just a few minutes before the workflow execution, and dynamic scheduling is done in real-time during execution. While the first three steps target the entire graph across multiple resources, the last step focuses on tasks using local resources on a single node.

Starting from the top left corner of Figure 1, a staff astronomer composes a logical graph representing high-level data processing capabilities (e.g., "Image deconvolution") using resource-oblivious dataflow constructs and workflow task components.

The first step unrolls the logical graph by expanding all parallel branches and loops, instantiating tasks in all branches and iterations and connecting them with directed edges as per the logical graph definition. The result of unrolling is the Physical Graph Template (PGT) shown in the top right corner. It should be noted that, unlike traditional dataflow graph representations, DALiuGE models data as well as tasks as graph vertices. From a workflow viewpoint, all data items are essentially "data tasks" (shown as parallelograms in Figure 1) that can trigger the execution of their consumer tasks (shown as rectangles).

The second step, i.e. the focus of this paper, divides the PGT into a set of logical partitions such that certain performance requirements (e.g. total completion time, total data movement, etc.) are met under given constraints (e.g. resource footprint, collocation criteria, device locality, etc.). This step outputs the Physical Graph Template Partition (PGTP), which provides the Telescope Manager with an approximate solution to construct the observation scheduling blocks months or weeks prior to observation and compute resource allocation. An example of a PGTP is shown at the bottom right of Figure 1, where 19 partitions are produced and one of them is visually expanded with its 11 enclosed workflow tasks. Furthermore, a resource reservation that contains 19 nodes can be submitted to the telescope manager weeks before the associated observation takes place.

The third step maps each logical partition of the PG onto a given set of currently available resources in certain optimal ways. In principle, each partition is placed onto a physical compute node in the cluster. Such placement requires real-time information on resource availability, and we currently assume resource pools consisting of nodes with identical capabilities of computing, storage, and interconnect. In cases where the number of partitions p is greater than the number of available nodes m, DALiuGE can be configured to merge the p PGT partitions into m virtual clusters, with the goal of balancing the overall workload (both compute time and memory usage) evenly before mapping.

The final step involves optimal execution of the tasks that have been allocated to a single node by the previous two steps. DALiuGE currently offloads this step to the local schedulers provided by the host OS running on each compute node. We are currently working on the integration of graph-based GPU schedulers for dynamically scheduling GPU-accelerated workflow tasks on a single node with multiple GPUs. In the following sections, we focus solely on the technical details of the second step — dataflow partitioning.
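To make the four-step cycle concrete, here is a deliberately toy, end-to-end sketch in Python. None of these function names belong to the real DALiuGE API; the stage list, packing heuristic, and round-robin mapping are stand-ins for the unrolling, partitioning, and mapping algorithms discussed in this paper.

    import itertools

    def unroll(logical_stages):
        """Step 1: expand each logical stage into its parallel task instances."""
        return [[f"{name}_{i}" for i in range(dop)] for name, dop in logical_stages]

    def partition(pgt_stages, tasks_per_partition):
        """Step 2: a stand-in partitioner that simply packs tasks in order."""
        tasks = list(itertools.chain.from_iterable(pgt_stages))
        return [tasks[i:i + tasks_per_partition]
                for i in range(0, len(tasks), tasks_per_partition)]

    def map_to_nodes(pgtp, nodes):
        """Step 3: round-robin placement of partitions onto physical nodes."""
        return {i: nodes[i % len(nodes)] for i in range(len(pgtp))}

    stages = [("ingest", 4), ("grid", 8), ("image", 2)]  # (capability, parallelism)
    pgtp = partition(unroll(stages), tasks_per_partition=4)
    print(map_to_nodes(pgtp, ["node0", "node1"]))
    # Step 4 (dynamic scheduling) is delegated to each node's local scheduler.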
4. DATAFLOW PARTITIONING
During graph partitioning, a PGT of N vertices is decomposed into M partitions, each of which conceptually represents a compute node with a pre-defined resource capacity vector C. The goal of graph partitioning is to obtain an estimate of the minimum number M* of compute nodes needed to execute the PGT, and of the corresponding PGT completion time T_M*. Initially, the partitioning algorithm lets M = N, with each vertex being an individual partition. The algorithm then iteratively decreases M through partition merging (line 12 in Algorithm 1), which keeps the PGT completion time T monotonically non-increasing, as exemplified in Figure 3 (see Theorem 1 for a proof). Therefore, a partition scheme that produces M* ideally also achieves the minimum PGT completion time T*, i.e. T_M* = T* under the current graph partitioning algorithm. This follows the "data locality" principle, which suggests that the unit cost of data movement between two partitions is far greater than that within the same partition: fewer partitions lead to faster completion with less data movement, lower resource usage and lower operational cost.

On the other hand, a smaller M* corresponds to a greater resource demand per partition, since more Drops (the data and task objects that make up a DALiuGE graph) are allocated to each partition. This means the aggregated resource demand from concurrently-running Drops in a given partition is more likely to exceed C, slowing down the graph execution due to resource over-subscription. An ideal partitioning solution not only obtains an optimal M* but also ensures that resource demands in all partitions stay below C at any point during the graph execution. Satisfying this constraint avoids unpredictable execution delays due to resource over-subscription, thus ensuring T_M* = T*. Formally, graph partitioning is formulated as a constrained optimisation problem:

    min_p M(PGT, p)
    s.t. R_i(t) ≤ C,  i = 1, ..., M,  ∀ t ∈ [0, T(PGT, p)]        (1)

where M(·) is a function that outputs the number M of partitions given a PGT and a partition solution p, T(·) is a function that outputs the completion time T given a PGT and a partition solution p, and R_i(t) denotes the aggregated resource demand from all running Drops in partition i at time t. We refer to the constraint defined in Equation 1 as the DoP constraint, where "DoP" stands for
Degree of Parallelism. Figure 2 exemplifies partitioning solutions that do and do not satisfy the DoP constraint.
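As a minimal illustration of what Equation 1 demands, the sketch below checks the DoP constraint for a single partition and a single scalar resource (cores), assuming each Drop's execution interval and peak demand are known; note that the paper's C is a vector, and the task tuples here are made-up numbers mirroring Figure 2.

    def satisfies_dop(tasks, capacity):
        """tasks: (start, end, cores) per Drop; capacity: cores per node."""
        # Sweep over start/end events, tracking the aggregated load R_i(t).
        events = sorted((t, d) for s, e, c in tasks for t, d in ((s, c), (e, -c)))
        load = 0
        for _, delta in events:
            load += delta
            if load > capacity:
                return False   # over-subscribed at some instant t
        return True

    workers = [(0, 10, 4), (0, 10, 4)]                 # two 4-core worker Drops
    print(satisfies_dop(workers, capacity=8))          # True, like solution (a)
    print(satisfies_dop(workers + [(0, 10, 8)], 8))    # False, like solution (c)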
Figure 2: Three solutions to partitioning a simple fork-like Physical Graph Template. Solution (a) places all three Drops inside a single partition, so the two worker Drops will run in parallel after the data becomes available, consuming 8 cores (four threads each) at the same time. This satisfies the DoP constraint, given that the resource capacity C for a compute node includes 8 cores. However, solution (c) does not satisfy the DoP constraint, since at some point 16 threads will be running in parallel on a single 8-core machine. Consequently, the expected completion time for either worker is no longer guaranteed due to resource over-subscription. To remedy this, solution (b) separates the two worker Drops into two different partitions, each of which has sufficient resource capacity to execute 8 threads. Although the data movement between the two partitions incurs additional cost compared to solution (c), solution (b) produces far more reliable estimates of both completion time and resource demands, with a potentially shorter completion time thanks to adequate resource provisioning.

Once the optimal graph partitioning solution p is available, both M* (known as the Physical Graph Template Partition) and T_M* (i.e. T*) are used by the telescope manager for the generation of observation and computing resource schedules well before the observation takes place.

4.1 Partitioning Algorithm

The main idea of the partitioning algorithm (Algorithm 1) is to iteratively reduce data movement between inter-node Drops by "merging" them onto the same node, where the cost of intra-node communication is negligible. Given a PGT g, the algorithm sorts all edges in g by their weights in descending order. The edge weight here denotes the volume of data "on the move" from one Drop to the next. Each Drop is initially allocated to a separate node. Then, going through all edges in descending order of their weights, the algorithm merges the two partitions associated with the two Drops on both ends of an edge if the merged partition meets the DoP constraint defined in Equation 1. The algorithm is "greedy" since it reduces larger costs before dealing with smaller ones. However, this may not necessarily lead to a globally optimal solution, especially for large graphs. We are currently investigating various local search heuristics to overcome this limitation.

Although the iterative edge zeroing procedure is based on the graph clustering algorithm [16], we added two important changes. First, we allow two existing partitions to merge again in order to further reduce the number of partitions, which in turn reduces the total completion time as suggested in Theorem 1. Second, we evaluate the DoP constraint in order to accept or reject partition merging proposals (line 12). The evaluation of the DoP constraint not only considers each graph vertex's processing requirements in terms of maximum number of concurrent threads, memory usage, etc., but also incorporates the predefined resource capacities of each partition, including number of cores, memory capacity, etc.
input : A DAG g with a list el of edges and a list nl of nodes
output: A list l of partitions
1   initialise l as an empty list
2   el.sort_by_weight(reverse ← true)
3   foreach element n of nl do
4       part ← create_partition()
5       part.add(n)
6       l.add(part)
7   end
8   foreach element e of el do
9       origin_weight ← e.weight
10      e.weight ← 0                            // edge zeroing
11      u, v ← e.nodes()
12      new_part ← try_merge_partition(l, u, v)
13      if new_part == NULL then
14          e.weight ← origin_weight            // merge rejected; restore the edge
15      end
16  end
17  return l

Algorithm 1:
The partitioning algorithm based on [16], with two important additions — evaluation of the DoP constraint and merging of existing partitions.
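For reference, a compact Python rendering of Algorithm 1 might look as follows. All names are ours rather than DALiuGE's, and meets_dop is a deliberately conservative stand-in (summing peak demands as if all Drops ran at once) for the antichain-based evaluation developed in the next subsection.

    def partition_pgt(nodes, edges, demand, capacity):
        """nodes: vertex ids; edges: (u, v, data_volume) tuples;
        demand: per-vertex peak resource usage; capacity: per-partition budget."""
        part_of = {n: frozenset([n]) for n in nodes}    # every Drop starts alone
        for u, v, _volume in sorted(edges, key=lambda e: e[2], reverse=True):
            a, b = part_of[u], part_of[v]
            if a is b:
                continue                       # edge is already intra-partition
            merged = a | b
            if meets_dop(merged, demand, capacity):
                for n in merged:               # accept: zero the edge by co-locating
                    part_of[n] = merged
            # else: reject the merge, i.e. restore the edge weight
        return set(part_of.values())

    def meets_dop(partition, demand, capacity):
        return sum(demand[n] for n in partition) <= capacity

On the fork-like PGT of Figure 2 with demand = {"data": 0, "w1": 8, "w2": 8} and capacity = 8 (illustrative numbers), the second merge is rejected and the two workers end up in separate partitions, reproducing solution (b).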
Theorem 1. The edge zeroing statement at line 10 in Algorithm 1 ensures that the completion time T of g is non-increasing.

Proof.
If the edge e is on the longest path L of g with length T, there are two possibilities after e's weight becomes zero — either L remains the longest path of g, or another path L′ becomes the longest path of g. In the first case, let T′ be the new length of L. It is easy to verify that T′ = T − e.weight < T. In the second case, let T′ be the length of L′. It must be true that T′ ≤ T, because otherwise L′ (rather than L) would have been the longest path before the edge zeroing took place.

If the edge e is not on the longest path L of g, there are also two possibilities after e's weight becomes zero — either e remains off the longest path L of g, or e becomes part of the "new" longest path L′ of g. In the first case, since L is not affected whatsoever, its completion time T remains the same, thus non-increasing. In the second case, let T′ be the length of L′. It must be true that T′ ≤ T, because otherwise L′ (rather than L) would have been the longest path before the edge zeroing took place.

4.2 Evaluating the DoP Constraint

In this subsection, we discuss the DoP evaluation algorithm defined in the try_merge_partition function called at line 12 in Algorithm 1. As shown in Equation 1, this boils down to efficiently computing the total resource usage R_i(t), summed over all running Drops inside a given partition i at a particular time t. To do this, we first establish the equivalence between the set D(t) of Drops running in parallel at time t and the concept of an antichain [14] — a set of mutually unreachable vertices of the DAG g associated with a given partition.

Theorem 2. If all Drops in D(t) are running in a non-streaming mode, D(t) is an antichain of g.

Proof.
The non-streaming running mode excludes one possible form of parallelism — pipelining. All other forms of parallelism require the Drops in D(t) to be mutually unreachable on g, because otherwise they could never have been running in parallel, due to the inter-dependencies implied by reachability.

We define the length L of an antichain D(t) as the number of Drops in D(t), and the weighted length W of an antichain D(t) as the aggregated weight summed over all Drops in D(t). The weight of the j-th Drop d_j in an antichain is its pre-determined peak resource usage, denoted by w(d_j). Let A denote the set of all antichains in a partition graph i. It then follows from Theorem 2 that the total resource usage R_i(t) is bounded by the antichain(s) D with the maximum (longest) weighted length amongst all antichains in A:

    R_i(t) ≤ W_max = max_D Σ_{j=1}^{L} w(d_j),
    where d_j ∈ D, L = |D|, D ∈ A, ∀ t ∈ [0, T(PGT, p)]        (2)

Equation 2 bounds the time-dependent value R_i(t) by a time-invariant constant W_max, such that if W_max ≤ C for a given partition, the constraint R_i(t) ≤ C in Equation 1 will be satisfied. However, finding the antichain D* that produces W_max is not trivial, since the cardinality of A — the total number of antichains in a partition graph g — can be on the order of 2^n, with n being the number of vertices in g. Therefore, enumeration and evaluation of all antichains is computationally unfeasible in practice, where a typical partition has at least tens or even hundreds of tasks (e.g. there could be up to one billion antichains for a graph with merely 30 vertices).

To compute the maximum antichain length of a given graph in polynomial time, one can apply Dilworth's Theorem [8], which states that the maximum length of an antichain is equal to the minimum number of chains needed to fully "cover" the graph. In particular, Fulkerson [9] established the equivalence between the maximum antichain length and the maximum matching in a constructed split graph (a.k.a. bipartite graph). As a result, the longest antichain — the antichain that has the maximum cardinality — of a graph can be discovered in O(|E|√|V|) time. However, Equation 2 suggests that the longest antichain does not necessarily have the longest weighted length unless w(d_j) = 1, ∀j ∈ [1, L]. Hence, whilst we can efficiently solve for W_max in the special case where each Drop consumes exactly one unit of resource (e.g. 1 core, 1 GB of RAM, etc.), we need a different algorithm to evaluate the more generic case where Drops consume arbitrary units of resources (e.g. 16 cores, 375 MB of RAM).
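For the unit-weight special case just described, the Dilworth/Fulkerson route can be exercised directly with off-the-shelf tools: build the bipartite split graph of the DAG's transitive closure and run a maximum matching (Hopcroft–Karp, O(|E|√|V|)). The sketch below is ours, not DALiuGE code:

    import networkx as nx
    from networkx.algorithms import bipartite

    def longest_antichain_size(dag: nx.DiGraph) -> int:
        """Max antichain length = n - maximum matching in the split graph
        of the transitive closure (Dilworth [8], Fulkerson [9])."""
        closure = nx.transitive_closure_dag(dag)
        split = nx.Graph()
        left = {v: ("L", v) for v in dag}
        right = {v: ("R", v) for v in dag}
        split.add_nodes_from(left.values())
        split.add_nodes_from(right.values())
        split.add_edges_from((left[u], right[v]) for u, v in closure.edges())
        matching = bipartite.maximum_matching(split, top_nodes=set(left.values()))
        return dag.number_of_nodes() - len(matching) // 2  # dict holds both directions

    g = nx.DiGraph([("a", "b"), ("a", "c"), ("b", "d"), ("c", "d")])
    print(longest_antichain_size(g))  # 2: the antichain {b, c}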
In the following, we discuss the details of Algorithm 2, which efficiently computes W_max for the generic case, based on Cong's method [6] for computing a maximum weighted k-family. While a k-family covers a union of at most k antichains in a DAG, we are interested only in the special case k = 1, in order to solve our problem of computing the maximum weighted length of a single antichain.

The central idea of Algorithm 2 is to exploit the equivalence between the weighted maximum antichain of the original DAG g and the minimum-cost maximum-flow (MCMF) solution of the split graph S created at line 2. The equivalence is proved in [6] and more generally in [4]. Note that the number of nodes in S is 2V + 2, where V is the number of vertices in the original DAG g, i.e. the union of the two DAGs g_A and g_B. This ensures that a polynomial algorithm on S remains tractable on g. To find the MCMF solution, we first derive the admissible graph H from S (line 3), and run the normal maximum flow algorithm [10] to obtain the flow f′ in O(V√E) time (line 4). We then construct the residual graph R from f′ (line 5). R has the identical set of vertices as H, and if there is no path from the source vertex s of R to some vertex x, we set the node potential π of x to 1 (line 8). In the end, the maximum weighted antichain length W_max is calculated (lines 14 to 20) based on the expressions defined in Theorem 3.1 of [6].

input : partitions A and B with their associated DAGs g_A and g_B; an optional g_dag representing the unpartitioned physical graph template
output: the maximum weighted antichain length of the merged partition A ∪ B
1   function get_pi_solution(g):
2       S ← create_split_graph(g)
3       H ← admissible_graph(S)
4       f′ ← maximum_flow(H, s, t)
5       R ← residual_graph(H, f′)
6       foreach element r_node ∈ R.nodes() do
7           if R.has_path(s, r_node) then pi[r_node] ← 0
8           else pi[r_node] ← 1
9       end
10      return pi
11  end
12  pi ← get_pi_solution(g_A ∪ g_B)
13  W_max ← 0
14  for h ← 0 to 1 do
15      foreach element nd_x ∈ S.X do
16          nd_y ← S.counterpart(nd_x)
17          if h = 1 − pi[nd_x] + pi[node_s] and pi[nd_y] = pi[nd_x] then
18              W_max ← W_max + S.edge(node_s, nd_x).capacity
19          end
20      end
21  end
22  return W_max

Algorithm 2: Calculate W_max, the maximum weighted antichain length in a partition.

Figure 3 shows the results of running Algorithms 1 and 2 to schedule three different radio interferometry imaging workflows.

Figure 3: The PGT completion time is monotonically non-increasing as the number of partitions decreases, for three different radio astronomy pipeline graphs. The partition solution that produces the minimum number M* of partitions (i.e. the bottom right end of each curve) also results in the shortest execution time T*.
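Given the subtlety of Algorithm 2, a brute-force oracle is useful when validating an implementation on small partition graphs. The sketch below enumerates every antichain via networkx (exponential, so only viable for a handful of vertices); the graph and per-Drop core weights are made-up examples.

    import networkx as nx

    def w_max_bruteforce(dag: nx.DiGraph, weight: dict) -> int:
        """Maximum weighted antichain length by exhaustive enumeration."""
        return max(sum(weight[v] for v in ac) for ac in nx.antichains(dag))

    g = nx.DiGraph([("read", "fft"), ("read", "degrid"),
                    ("fft", "sum"), ("degrid", "sum")])
    cores = {"read": 1, "fft": 16, "degrid": 4, "sum": 2}
    print(w_max_bruteforce(g, cores))  # 20: {fft, degrid} may run concurrently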
5. CONCLUSIONS
Optimal scheduling of large-scale, data-intensive workflows is challenging. In this paper, we discussed related work on graph scheduling and proposed polynomial-time optimization methods that minimize both workflow execution time and resource footprint while meeting the resource demand constraints imposed by individual algorithms. We showed preliminary results obtained from three radio astronomy data pipelines.
6. REFERENCES

[1] M. Bateni, S. Behnezhad, M. Derakhshan, M. Hajiaghayi, R. Kiveris, S. Lattanzi, and V. Mirrokni. Affinity clustering: Hierarchical clustering at scale. In Advances in Neural Information Processing Systems, pages 6867–6877, 2017.
[2] S. H. Bokhari. A shortest tree algorithm for optimal assignments across space and time in a distributed processor system. IEEE Transactions on Software Engineering, (6):583–589, 1981.
[3] R. Braun, T. Bourke, J. Green, E. Keane, and J. Wagg. Advancing astrophysics with the Square Kilometre Array. Advancing Astrophysics with the Square Kilometre Array (AASKA14), 1:174, 2015.
[4] K. Cameron. Antichain sequences. Order, 2(3):249–255, 1985.
[5] V. Chaudhary and J. K. Aggarwal. A generalized scheme for mapping parallel algorithms. IEEE Transactions on Parallel and Distributed Systems, 4(3):328–346, 1993.
[6] J. Cong. Computing maximum weighted k-families and k-cofamilies in partially ordered sets. Computer Science Department, University of California, 1993.
[7] J. B. Dennis and D. P. Misunas. A preliminary architecture for a basic data-flow processor. In ACM SIGARCH Computer Architecture News, volume 3, pages 126–132. ACM, 1975.
[8] R. P. Dilworth. A decomposition theorem for partially ordered sets. Annals of Mathematics, pages 161–166, 1950.
[9] D. R. Fulkerson. Note on Dilworth's decomposition theorem for partially ordered sets. In Proc. Amer. Math. Soc., volume 7, pages 701–702, 1956.
[10] A. V. Goldberg and R. E. Tarjan. A new approach to the maximum-flow problem. Journal of the ACM (JACM), 35(4):921–940, 1988.
[11] G. Karypis and V. Kumar. Multilevel k-way partitioning scheme for irregular graphs. Journal of Parallel and Distributed Computing, 48(1):96–129, 1998.
[12] Y.-K. Kwok and I. Ahmad. Static scheduling algorithms for allocating directed task graphs to multiprocessors. ACM Computing Surveys (CSUR), 31(4):406–471, 1999.
[13] J.-C. Liou and M. A. Palis. A comparison of general approaches to multiprocessor scheduling. In Parallel Processing Symposium, 1997. Proceedings., 11th International, pages 152–156. IEEE, 1997.
[14] D. Marcus. Graph Theory: A Problem Oriented Approach. The Mathematical Association of America, 2008.
[15] C. Martella, D. Logothetis, A. Loukas, and G. Siganos. Spinner: Scalable graph partitioning in the cloud. In Data Engineering (ICDE), 2017 IEEE 33rd International Conference on, pages 1083–1094. IEEE, 2017.
[16] V. Sarkar. Partitioning and Scheduling Parallel Programs for Execution on Multiprocessors. PhD thesis, 1987.
[17] H. S. Stone. Multiprocessor scheduling with the aid of network flow algorithms. IEEE Transactions on Software Engineering, (1):85–93, 1977.
[18] H. Topcuoglu, S. Hariri, and M.-Y. Wu. Performance-effective and low-complexity task scheduling for heterogeneous computing. IEEE Transactions on Parallel and Distributed Systems, 13(3):260–274, 2002.
[19] D. Towsley. Allocating programs containing branches and loops within a multiple processor system. IEEE Transactions on Software Engineering, (10):1018–1024, 1986.
[20] C. Tsourakakis, C. Gkantsidis, B. Radunovic, and M. Vojnovic. FENNEL: Streaming graph partitioning for massive scale graphs. In Proceedings of the 7th ACM International Conference on Web Search and Data Mining, pages 333–342. ACM, 2014.
[21] C. Wu, R. Tobar, K. Vinsen, A. Wicenec, D. Pallot, B. Lao, R. Wang, T. An, M. Boulton, I. Cooper, et al. DALiuGE: A graph execution framework for harnessing the astronomical data deluge.