Buffered Streaming Graph Partitioning
Marcelo Fonseca Faraj
Heidelberg University
[email protected]

Christian Schulz
Heidelberg University
[email protected]
ABSTRACT
Partitioning graphs into blocks of roughly equal size is a widely used tool when processing large graphs. Currently there is a gap observed in the space of available partitioning algorithms. On the one hand, there are streaming algorithms that have been adopted to partition massive graph data on small machines. In the streaming model, vertices arrive one at a time including their neighborhood and then have to be assigned directly to a block. These algorithms can partition huge graphs quickly with little memory, but they produce partitions with low solution quality. On the other hand, there are offline (shared-memory) multilevel algorithms that produce partitions with high quality but also need a machine with enough memory to partition huge networks. In this work, we make a first step to close this gap by presenting an algorithm that computes high-quality partitions of huge graphs using a single machine with little memory. First, we extend the streaming model to a more reasonable approach in practice – the buffered streaming model. In this model, a PE can store a buffer, or batch, of nodes (including their neighborhood) before making assignment decisions. When our algorithm receives a batch of nodes, we build a model graph that represents the nodes of the batch and the already present partition structure. This model enables us to apply multilevel algorithms and in turn compute high-quality solutions of huge graphs on cheap machines. To partition the model, we develop a multilevel algorithm that optimizes an objective function that has previously been shown to be effective for the streaming setting. Surprisingly, this also removes the dependency of the running time on the number of blocks k compared to the previous state-of-the-art. Overall, our algorithm computes, on average, 55% better solutions than Fennel using a very small buffer size. In addition, for large values of k our algorithm becomes significantly faster than one of the main one-pass partitioning algorithms.
1. INTRODUCTION
Complex graphs are increasingly being used to model phenomena such as social networks, data dependencies in applications, citations of papers, and biological systems like the human brain. Often these graphs are composed of billions of entities that give rise to specific properties and structures. As a concrete example of how to cope with such graphs, graph databases [10] and graph processing frameworks [9, 22, 13] can be used to store a graph, query it, and provide other operations. If the graphs become too large, then they have to be distributed over many machines in order for the system to provide scalable operations. A key operation for scalable computations on huge graphs is to partition their components among k processing elements (PEs) such that each PE receives roughly the same number of components and the communication between PEs in the underlying application is minimized. In the distributed setup, each PE operates on some portion of the graph and communicates with other PEs through message passing. This operation is naturally modeled by the graph partitioning problem, which computes a partition of the graph into k blocks such that the blocks have roughly the same size and the number of edges crossing blocks, i.e., the communication, is minimized. Graph partitioning is NP-complete [12], and there can be no approximation algorithm with a constant ratio factor for general graphs [7]. Thus, heuristic algorithms are used in practice.

There has been an extensive body of work in the area of graph partitioning. Roughly speaking, there are streaming algorithms, internal memory (shared-memory parallel) algorithms, and distributed memory parallel algorithms. However, currently there is a gap observed in the design space of available partitioning algorithms. First of all, the most popular streaming approach in the literature is the one-pass model, in which vertices arrive one at a time including their neighborhood and then have to be permanently assigned to blocks. Algorithms based on this model can partition huge graphs quickly with little memory, but produce low-quality partitions. To improve partition quality, the graph can be restreamed while the one-pass algorithm updates block assignments, but this is still far behind offline approaches. Offline multilevel algorithms such as KaHIP [35] are widely known and can produce partitions of high quality. Nevertheless, they cannot partition huge graphs unless a machine with sufficient memory is used. Lastly, distributed algorithms are able to partition huge graphs successfully and compute solutions of high quality. However, they require a large amount of computational resources and typically access to a supercomputer, which can be infeasible.

Contribution.
In this work, we start to fill the gap currently observed among existing graph partitioning algorithms. We propose an algorithm that can produce high-quality partitions of huge graphs using a single machine without a lot of memory. First, we relax a limiting constraint of streaming algorithms and allow a buffer of nodes to be received and stored before making assignment decisions. We believe that this is a more reasonable approach in practice, as a compute server typically has much more memory than the amount required to store a single node and its neighbors. Furthermore, this one-vertex policy contrasts with the following facts: (i) the best one-pass algorithms already assume enough memory to keep all node assignments throughout the execution, and (ii) most real-world large graphs are very sparse. Our algorithm is then carefully engineered to produce partitions of high quality by using a sophisticated multilevel scheme on a compressed representation of the buffer and the already assigned vertices. Our multilevel algorithm optimizes for the same objective as the previous state-of-the-art, Fennel. However, due to the multilevel scheme used on the compressed model, our local search algorithms have a global view of the optimization problem and hence compute better solutions overall. Lastly, using the multilevel scheme reduces the time complexity by a factor of k compared to Fennel, where k is the number of blocks the graph has to be partitioned into. To this end, experiments indicate that our algorithm can partition huge networks on machines with small memory while computing better solutions than the previous state-of-the-art in the streaming setting. At the same time, our algorithm is faster than the previous state-of-the-art for larger numbers of blocks k.
2. PRELIMINARIES

2.1 Basic Concepts
Let G = (V = {0, ..., n−1}, E) be an undirected graph with no multiple or self edges allowed, such that n = |V| and m = |E|. Let c : V → ℝ≥0 be a node-weight function, and let ω : E → ℝ>0 be an edge-weight function. We generalize c and ω to sets, such that c(V′) = Σ_{v ∈ V′} c(v) and ω(E′) = Σ_{e ∈ E′} ω(e). Let N(v) = {u : {v, u} ∈ E} denote the neighbors of v. A graph S = (V′, E′) is said to be a subgraph of G = (V, E) if V′ ⊆ V and E′ ⊆ E ∩ (V′ × V′). When E′ = E ∩ (V′ × V′), S is an induced subgraph. Let d(v) be the degree of node v, Δ be the maximum degree of G, and Δ_V′ be the maximum degree of the subgraph induced by V′ ⊆ V.

The graph partitioning problem (GP) consists of assigning each node of G to exactly one of k distinct blocks respecting a balancing constraint in order to minimize the edge-cut. More precisely, GP partitions V into k blocks V_1, ..., V_k (i.e., V_1 ∪ ... ∪ V_k = V and V_i ∩ V_j = ∅ for i ≠ j), which is called a k-partition of G. The balancing constraint demands that the sum of node weights in each block does not exceed a threshold associated with some allowed imbalance ε. More specifically, ∀i ∈ {1, ..., k}: c(V_i) ≤ L_max := ⌈(1 + ε) c(V)/k⌉. The edge-cut of a k-partition consists of the total weight of the edges crossing blocks, i.e., Σ_{i<j} ω(E_ij), where E_ij := {{u, v} ∈ E : u ∈ V_i, v ∈ V_j}.

[Figure 1: Multilevel graph partitioning. The graph is recursively contracted to obtain smaller graphs. After the coarsest graph is initially partitioned, a local search method is used on each level to improve the partitioning induced by the coarser level.]

KaHIP. In our approach, we use parts of the KaHIP multilevel graph partitioning framework; we shortly outline its main components. KaHIP – Karlsruhe High Quality Partitioning – is a family of graph partitioning programs that tackle the balanced graph partitioning problem [34, 36]. The algorithms in KaHIP have been able to compute the best results in various benchmarks. It implements different sequential and parallel algorithms to compute k-way partitions and node separators. In this work, we use parts of the sequential multilevel graph partitioner KaFFPa (Karlsruhe Fast Flow Partitioner) to obtain high-quality partitions of subgraphs. In particular, we use specialized partitioning techniques based on multilevel size-constrained label propagation [24] for coarsening.

Computational Models. The focus of this paper is to engineer a graph partitioning algorithm for a streaming input. In particular, the input is a stream of nodes alongside their respective adjacency lists. The classic streaming model is the one-pass model, in which the nodes have to be permanently assigned to a block as soon as they are loaded from the input. As soon as the assignment decisions of an algorithm for the current node depend on previous decisions, an algorithm in this model has to store the assignments of the previously loaded nodes and hence needs Ω(n) space. We use an extended version of this model, which we call the buffered streaming model. More precisely, a δ-sized buffer, or batch, of input nodes with their neighborhoods is repeatedly loaded. Partition (block) assignment decisions have to be made after the whole buffer is loaded.
We believe that this is a more reasonable approach in practice, as a computing server typically has enough memory to store much more than a single node and its neighbors. For the purposes of this work, we assume 1 ≤ δ ≤ n. Every offline GP algorithm fits into this computational model if we assume δ = n; however, this assumption is not realistic for massive real-world graphs, as it would require storing all edges of the network and hence Ω(m) space. At the other extreme, δ = 1 is the setup adopted by the one-pass streaming algorithms. While we investigate the dependence of our algorithm on this parameter, in practice the parameter will depend on the amount of memory available on a machine. Note that the parameter can also be chosen dynamically such that the buffer is "full" once Θ(n) space has been loaded from the disk. Hence, the buffered streaming model asymptotically does not need more space than a one-pass streaming algorithm if this setting is used. In this paper, we use a constant δ throughout the run of an algorithm. In restreaming versions of these models, the whole graph can be loaded multiple times and block assignments can change in each iteration. For a predefined batch size δ, let t = ⌈n/δ⌉ be the total number of batches of G to be consecutively loaded and assigned to blocks. Let i ∈ {1, 2, ..., t} enumerate these batches according to the input order.

2.2 Related Work

There has been a large amount of research on graph partitioning. We refer the reader to [5, 8, 37] for extensive material and references. Here, we focus on results close to our main contribution. The most successful general-purpose offline algorithms for solving the graph partitioning problem on huge real-world graphs are based on the multilevel approach. The basic idea of this approach can be traced back to multigrid solvers for systems of linear equations [38], and its first application to GP was in [4]. The most commonly used formulation of the multilevel scheme for graph partitioning was proposed in [15]. However, these algorithms require the graph to fit into the main memory of a single machine, or into the memory of a distributed machine if a distributed memory parallel partitioning algorithm is used. Similar multilevel schemes are used for other graph partitioning problem formulations such as DAG partitioning [26, 16, 25], hypergraph partitioning [20, 1], or the node separator problem [14].

Stanton and Kliot [39] introduced graph partitioning in the streaming model and proposed many natural heuristics to solve it. These heuristics include one-pass methods such as hashing, chunking, and linear deterministic greedy (LDG), and some buffered methods such as greedy evocut. The evocut buffered model is different from our model, as it is an extended one-pass model in which a buffer of fixed size is kept and the algorithm can assign any node from the buffer to a block, rather than only the one that has been received most recently. In their experiments, linear deterministic greedy had the best overall results in terms of total edge-cut. In this algorithm, node assignments prioritize blocks containing more neighbors while using a penalty multiplier to control imbalance. In particular, it assigns a node v to the block V_i that maximizes |V_i ∩ N(v)| · λ(i), with λ(i) being a multiplicative penalty defined as (1 − |V_i|/L_max). The intuition here is that the penalty avoids overloading blocks that are already very heavy.
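To make the assignment rule concrete, the following sketch shows one LDG assignment step in C++. The data layout (a per-node list of the blocks of already assigned neighbors, an array of block sizes) is our own illustrative assumption, not taken from any published implementation:

#include <cstdint>
#include <vector>

// Returns the block i that maximizes |V_i ∩ N(v)| * (1 - |V_i| / L_max).
int ldg_assign(const std::vector<int>& neighbor_blocks, // block of each assigned neighbor of v, or -1
               const std::vector<int64_t>& block_size,
               double l_max) {
    const int k = static_cast<int>(block_size.size());
    std::vector<int> common(k, 0);                      // |V_i ∩ N(v)| per block
    for (int b : neighbor_blocks)
        if (b >= 0) ++common[b];

    int best_block = 0;                                 // fallback if every block is full
    double best_score = -1.0;
    for (int i = 0; i < k; ++i) {
        if (block_size[i] >= l_max) continue;           // respect the balance constraint
        double score = common[i] * (1.0 - block_size[i] / l_max);
        if (score > best_score) { best_score = score; best_block = i; }
    }
    return best_block;
}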
Tsourakakis et al. [40] proposed a one-pass partitioning heuristic named Fennel, which is an adaptation of the widely known clustering objective modularity [6]. Roughly speaking, Fennel assigns a node v to a block V_i, respecting a balancing threshold, in order to maximize an expression of the type |V_i ∩ N(v)| − f(|V_i|), i.e., with an additive penalty. The assignment decision of Fennel is based on an interpolation of two properties: attraction to blocks with more neighbors and repulsion from blocks with more non-neighbors. When f(|V_i|) is a constant, the resulting objective function coincides with the first property. If f(|V_i|) = |V_i|, the objective function coincides with the second property. More specifically, the authors defined the Fennel objective function by using f(|V_i|) = α · γ · |V_i|^{γ−1}, in which γ is a free parameter and α = m · k^{γ−1} / n^γ. After a parameter tuning made by the authors, Fennel uses γ = 3/2, which provides α = √k · m / n^{3/2}.
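Analogously to the LDG sketch above, the following sketch shows one Fennel assignment step with the additive penalty; again, the data layout is an illustrative assumption, not the authors' code:

#include <cmath>
#include <cstdint>
#include <limits>
#include <vector>

// Returns the block i that maximizes |V_i ∩ N(v)| - alpha * gamma * |V_i|^(gamma-1),
// subject to the balancing threshold l_max.
int fennel_assign(const std::vector<int>& neighbor_blocks,
                  const std::vector<int64_t>& block_size,
                  double alpha, double gamma, double l_max) {
    const int k = static_cast<int>(block_size.size());
    std::vector<int> common(k, 0);                      // |V_i ∩ N(v)| per block
    for (int b : neighbor_blocks)
        if (b >= 0) ++common[b];

    int best_block = 0;
    double best_gain = -std::numeric_limits<double>::infinity();
    for (int i = 0; i < k; ++i) {
        if (block_size[i] + 1 > l_max) continue;        // balancing threshold
        double gain = common[i]
                    - alpha * gamma * std::pow(static_cast<double>(block_size[i]), gamma - 1.0);
        if (gain > best_gain) { best_gain = gain; best_block = i; }
    }
    return best_block;
}

For γ = 3/2 one would set alpha = sqrt(k) * m / pow(n, 1.5), matching the formula stated above.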
A restreaming approach to graph partitioning was introduced by Nishimura and Ugander [27]. In this model, multiple passes through the entire input are allowed, which enables iterative improvements. The authors proposed easily implementable restreaming versions of linear deterministic greedy and Fennel: ReLDG and ReFennel. The objective function of ReFennel during restreaming is unchanged, i.e., it uses the correct block weights of the already and newly assigned vertices. In contrast, ReLDG modifies its objective function during restreaming, i.e., it maximizes |V_i ∩ N(v)| · λ̃(i). The function λ̃(i) is defined as (1 − |V_i^+|/L_max), where V_i^+ is the set of vertices assigned to block V_i in the current iteration.

Awadelkarim and Ugander [2] studied the effect of node ordering for streaming graph partitioning. The authors introduced the notion of prioritized streaming, in which (re)streamed nodes are statically or dynamically reordered based on some priority. They propose a prioritized version of ReLDG, which uses the multiplicative weights of restreaming algorithms and adapts the ordering of the streaming process inspired by balanced label propagation. Their experimental results consider a range of stream orders, where a dynamic ordering based on their own metric ambivalence is the best regarding edge-cut, with a static ordering based on degree being nearly as good.

Patwary et al. [28] proposed WStream, a greedy streaming graph partitioning algorithm that keeps a sliding stream window. This sliding window contains a few hundred nodes, which gives more information about the next node to be allocated to a block. After a candidate vertex is greedily assigned to a block, one more vertex is loaded from the input into the stream window to keep the window size. In their experiments, WStream outperforms the linear deterministic greedy algorithm on most of the tested graphs; however, solution quality and balance are still not comparable to offline partitioning methods such as Metis.

Jafari et al. [18] proposed a shared-memory partitioning algorithm based on the same buffered streaming computing model we use here. Their algorithm uses the idea of a multilevel algorithm but with a simplified structure in which the LDG one-pass algorithm constitutes the coarsening step, the initial partitioning, and the local search during uncoarsening. Our work differs from theirs in that we focus on single-threaded execution, we use a traditional multilevel scheme, we construct a sophisticated model instead of processing the nodes of a batch directly, and our algorithm is inspired by the Fennel one-pass algorithm, which outperformed LDG in previously published studies.

There is also a wide range of algorithms that focus on streaming edge partitioning [30, 23, 33]. This is a different graph partitioning approach in which the input is a stream of edges rather than a stream of nodes. These edges should be assigned to blocks of roughly equal size, and the objective normally is the minimization of the vertex-cut. Although closely related from an application perspective, this partitioning problem is not the focus of this work.

3. BUFFERED GRAPH PARTITIONING

We now present our main contribution, namely HeiStream: a novel algorithm to solve graph partitioning in the buffered streaming model. We start this section by outlining the overall structure of HeiStream and then present each of its components in more detail.

Overall Structure. We slide through the streamed graph G by repeating the following successive operations until all nodes of G are assigned to blocks. First, we load a batch containing δ nodes alongside their adjacency lists. Second, we build a model B to be partitioned. This model represents the already partitioned vertices as well as the nodes of the current batch. Third, we partition B with a multilevel partitioning algorithm to optimize for the Fennel objective function. Finally, we permanently assign the nodes from the current batch to blocks. We summarize this general structure in Algorithm 1.

Algorithm 1 Structure of HeiStream
    while G has not been completely streamed do
        Load batch of vertices
        Build model B
        Run multilevel partitioning on model B
        Assign nodes of batch to permanent blocks
    end while

Model Construction. We build two different models, which yield a runtime-quality tradeoff. We start by describing the basic model and then extend it later. When a batch is loaded, we build the model B as follows. We initialize B as the subgraph of G induced by the nodes of the current batch. If the current batch is not the first one, we add k artificial nodes to the model. These represent the k preliminary blocks in their current state, i.e., filled with the nodes from the previous batches, which were already assigned. The weight of artificial node i is set to the weight of block V_i. A node of the current batch is connected to artificial node i if it has a neighbor from a previous batch that has been assigned to block V_i. If this creates parallel edges, we replace them by a single edge whose weight is set to the sum of the weights of the parallel edges. Note that the basic model ignores edges towards nodes that will be streamed in future batches, i.e., batches that have not been streamed yet.
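The following sketch illustrates how the basic model could be assembled for one batch. The structures (an adjacency list of the batch in global IDs, a global block[] array for already assigned nodes, per-block weights) are hypothetical and merely mirror the description above; this is a sketch, not the authors' implementation:

#include <cstdint>
#include <map>
#include <vector>

struct Model {
    int64_t num_nodes;                                   // delta batch nodes + k artificial nodes
    std::vector<std::map<int64_t, int64_t>> adj;         // neighbor -> accumulated edge weight
    std::vector<int64_t> node_weight;
};

Model build_basic_model(const std::vector<std::vector<int64_t>>& batch_adj, // neighbors, global IDs
                        int64_t first_global_id, int64_t delta,
                        const std::vector<int>& block,   // block of already assigned nodes
                        const std::vector<int64_t>& block_weight, int k) {
    Model B;
    B.num_nodes = delta + k;
    B.adj.resize(B.num_nodes);
    B.node_weight.assign(B.num_nodes, 1);
    for (int i = 0; i < k; ++i)                          // artificial nodes carry the block weights
        B.node_weight[delta + i] = block_weight[i];

    for (int64_t u = 0; u < delta; ++u) {
        for (int64_t v : batch_adj[u]) {
            if (v >= first_global_id && v < first_global_id + delta) {
                B.adj[u][v - first_global_id] += 1;      // edge inside the current batch
            } else if (v < first_global_id) {
                int64_t art = delta + block[v];          // edge to an already assigned node
                B.adj[u][art] += 1;                      // parallel edges merge into one weight
            }
            // edges towards future batches are ignored in the basic model
        }
    }
    return B;
}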
Our extended model additionally incorporates edges towards nodes from future, not yet loaded, batches – if the stream contains such edges. We call edges towards nodes of future batches ghost edges and the corresponding endpoints in the future batch ghost nodes. Ghost nodes and ghost edges provide partial information about parts of G that have not yet been streamed. Hence, representing them in the model B enhances the partitioning process. Note, though, that simply inserting all ghost nodes and edges can overload memory if there is an excessive amount of them. Thus, our approach consists of randomly contracting each ghost node with one of its neighboring nodes from the current batch. Note that this contraction increments the weight of a node within our model and ensures that, if more than one node from the current batch is connected to the same future node, there will be edges between those nodes in our model. Also note that the contraction ensures that the number of nodes in all models throughout the batched streaming process is constant. This prevents memory from being overloaded and makes it unnecessary to reallocate memory for B between successive batches. In order to give ghost edges a lower priority than the other edges, we divide the weight of each ghost edge by 2 in our model.

Multilevel Weighted Fennel. Our approach to partition the model B is a multilevel version of the algorithm Fennel. Recall that the multilevel scheme consists of three successive phases: coarsening, initial partitioning, and uncoarsening. We implement our multilevel scheme within the KaHIP framework. Note, however, that the artificial nodes in our model can become very heavy and are not allowed to change their block. As a consequence, these nodes need special handling in the multilevel scheme. Moreover, the Fennel algorithm is designed for unweighted graphs. Hence, we introduce a generalization of Fennel for weighted graphs that can be directly employed in a multilevel algorithm. We now explain the details of the generalized Fennel objective and our multilevel algorithm to partition the model.

As already mentioned, adaptations are necessary to implement Fennel for weighted graphs, in particular in a multilevel scheme. First, note that the original formulation of Fennel only works for unweighted graphs [40]. However, our model B has nodes and edges that are weighted – due to connections to the artificial nodes as well as future nodes that may be contracted into the model. Moreover, the multilevel scheme creates a sequence of weighted graphs. To generalize the Fennel gain function, it is important to ensure that the gain of a node on a coarser level corresponds to the sum of the gains of the nodes that it represents on the finest level. This way, the algorithm gets a global view of the optimization problem on the coarser levels and a very local view on the finer levels. Moreover, it is ensured that on each level of the hierarchy the algorithm works towards optimizing the given objective function.

Our generalization of the gain function of Fennel is as follows. Let u be the node that should be assigned to a block or moved to another block. Our generalized Fennel assigns u to a block i that maximizes Σ_{v ∈ V_i ∩ N(u)} ω(u, v) − c(u) f(c(V_i)), where f(c(V_i)) = α · γ · c(V_i)^{γ−1}. Note that this is a direct generalization of the unweighted case. First, if the graph does not have edge weights, then the first term becomes |V_i ∩ N(u)|, which is the first term of the Fennel objective. Second, if the graph also does not have node weights, then the second term is the same as the second term of the Fennel objective. Moreover, observe that the penalty term f(c(V_i)) in our objective is multiplied by c(u). This is done to obtain the property stated above: the gain of a coarse vertex corresponds to the sum of the gains of the represented vertices on the finest level, i.e., moving u has the same gain as moving the represented vertices all at once on the finest level of the graph. Finally, we multiply α by a tuning factor in order to better adapt the Fennel function to the multilevel nature of HeiStream.
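A minimal sketch of the generalized gain computation follows, under the same illustrative data-layout assumptions as the earlier snippets. The balance check shown corresponds to the explicit constraint used during initial partitioning (described below); during local search, only the blocks of adjacent nodes are considered:

#include <cmath>
#include <cstdint>
#include <limits>
#include <utility>
#include <vector>

// Picks the block i maximizing  Σ_{v in V_i ∩ N(u)} ω(u,v) - c(u) * alpha * gamma * c(V_i)^(gamma-1).
int generalized_fennel_assign(
        const std::vector<std::pair<int, int64_t>>& nbr_block_weight, // (block of neighbor v, ω(u,v))
        int64_t c_u,                                     // node weight c(u)
        const std::vector<int64_t>& block_weight,        // c(V_i) per block
        double alpha, double gamma, double l_max) {
    const int k = static_cast<int>(block_weight.size());
    std::vector<int64_t> conn(k, 0);                     // Σ ω(u,v) over v in V_i ∩ N(u)
    for (auto [b, w] : nbr_block_weight)
        if (b >= 0) conn[b] += w;

    int best = 0;
    double best_gain = -std::numeric_limits<double>::infinity();
    for (int i = 0; i < k; ++i) {
        if (block_weight[i] + c_u > l_max) continue;     // explicit balance constraint
        double gain = conn[i]
                    - c_u * alpha * gamma * std::pow(static_cast<double>(block_weight[i]), gamma - 1.0);
        if (gain > best_gain) { best_gain = gain; best = i; }
    }
    return best;
}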
Multilevel Algorithm. We now explain our multilevel algorithm to partition the model B. Our coarsening phase is an adapted version of the size-constrained label propagation approach implemented within KaHIP. To be self-contained, we shortly outline the coarsening approach and then show how to modify it to handle artificial nodes. To compute a graph hierarchy, the algorithm computes a size-constrained clustering on each level and contracts it to obtain the next level. The clustering is contracted by replacing each cluster by a single node, and the process is repeated recursively until the graph becomes small enough. This hierarchy is then used by the algorithm. Due to the way we define contraction, it is ensured that a partition of a coarse graph corresponds to a partition of all the finer graphs in the hierarchy with the same edge-cut and balance. Note that cluster contraction is an aggressive coarsening strategy. In contrast to matching-based approaches, it enables us to drastically shrink the size of irregular networks. The intuition behind this technique is that a clustering of the graph (one hopes) contains many edges running inside the clusters and only a few edges running between clusters, which is favorable for the edge-cut objective.

The algorithm to compute clusters is based on label propagation [31] and avoids large clusters by using a size constraint, as described in [24]. For a graph with n nodes and m edges, one round of size-constrained label propagation can be implemented to run in O(n + m) time. Initially, each node is in its own cluster/block, i.e., the initial block ID of a node is set to its node ID. The algorithm then works in rounds. In each round, all nodes of the graph are traversed. When a node v is visited, it is moved to the block with the strongest connection to v, i.e., the cluster V_i that maximizes ω({(v, u) | u ∈ N(v) ∩ V_i}). Ties are broken randomly. This is repeated until the process has converged. We perform at most L rounds, where L is a tuning parameter.

In HeiStream, we have to ensure that two artificial nodes are not contracted together, since each of them should remain in its previously assigned block. We achieve this by ignoring artificial nodes and artificial edges during the label propagation, i.e., artificial nodes cannot change their label, and nodes from the batch cannot change their label to the label of an artificial node. As a consequence, artificial nodes are not contracted during coarsening. Overall, we repeat the process of computing a size-constrained clustering and contracting it, recursively. As soon as the graph is small enough, i.e., it has fewer nodes than an O(max(|B|/k, k)) threshold, it is initially partitioned by an initial partitioning algorithm. More precisely, we use the threshold max(|B|/(xk), xk), in which x is a tuning parameter. Note that, for large enough buffer sizes, this threshold will be O(|B|/k).
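The following sketch outlines one round of size-constrained label propagation with the artificial-node handling described above. The Graph structure and the cluster arrays are assumptions for illustration only:

#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

struct Graph {                                           // assumed minimal interface
    int64_t n;
    std::vector<std::vector<std::pair<int64_t, int64_t>>> adj; // (neighbor, ω)
    std::vector<int64_t> node_weight;
};

void label_propagation_round(const Graph& g, int64_t first_artificial,
                             int64_t max_cluster_weight,
                             std::vector<int64_t>& cluster,
                             std::vector<int64_t>& cluster_weight) {
    for (int64_t v = 0; v < g.n; ++v) {
        if (v >= first_artificial) continue;             // artificial nodes never move
        std::unordered_map<int64_t, int64_t> conn;       // cluster -> Σ ω(v,u)
        for (auto [u, w] : g.adj[v])
            if (u < first_artificial)                    // ignore artificial endpoints
                conn[cluster[u]] += w;
        int64_t best = cluster[v];
        int64_t best_conn = -1;
        for (auto [c, wsum] : conn) {
            bool fits = c == cluster[v]                  // staying is always allowed
                        || cluster_weight[c] + g.node_weight[v] <= max_cluster_weight;
            if (fits && wsum > best_conn) { best_conn = wsum; best = c; }
        }
        cluster_weight[cluster[v]] -= g.node_weight[v];  // move v to the chosen cluster
        cluster[v] = best;
        cluster_weight[best] += g.node_weight[v];
    }
}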
When the coarsening phase ends, we run an initial partitioning algorithm to compute an initial k-partition of the coarsest version of B. That means that all nodes other than the artificial nodes, which are already assigned, will be assigned to blocks. To assign the nodes, we run our generalized Fennel algorithm with the explicit balancing constraint L_max, i.e., the weight of no block will exceed L_max. To be precise, a node u is assigned to a block i that maximizes Σ_{v ∈ V_i ∩ N(u)} ω(u, v) − c(u) f(c(V_i)), with f(c(V_i)) = α · γ · c(V_i)^{γ−1}, subject to c(V_i) + c(u) ≤ L_max. Note that the algorithm at this point considers all possible blocks i ∈ {1, ..., k} and hence has complexity proportional to k. However, as the coarsest graph has O(|B|/k) nodes, the initial partitioning overall needs time that is linear in the size of the input model. When initial partitioning is done, we transfer the current solution to the next finer level by assigning each node of the finer level to the block of its coarse representative. At each level of the hierarchy, we apply a local search algorithm.

Our local search algorithm is the same size-constrained label propagation algorithm used in the contraction phase, but with a different objective function. Namely, we assign a visited node to the neighboring block that maximizes the generalized Fennel function described above. Note that, in contrast to initial partitioning, only blocks of adjacent nodes are considered. Hence, one round of the algorithm can still be implemented to run in time linear in the size of the current level. As in the coarsening phase, artificial nodes cannot be moved between blocks. In contrast to coarsening, however, we do not exclude the artificial nodes from the label propagation here, because the artificial nodes and their edges are needed to compute the generalized Fennel gain function of the other nodes. As in the initial partitioning, we use the explicit size constraint L_max of G.

Assuming geometrically shrinking graphs throughout the hierarchy and assuming that the buffer size δ is larger than the number of blocks k, the overall running time to partition a batch is linear in the size of the batch. This is due to the fact that the overall running time of coarsening and local search sums up to be linear in the size of the batch, while the overall running time of the initial partitioning depends linearly on the size of the input model. Summing this up over all batches yields an overall linear running time of O(n + m).

Restreaming. We now extend HeiStream to operate in a restreaming setting. During restreaming, the overall structure of the algorithm is roughly the same; nevertheless, we need some adaptations, which we explain in this section. The first adaptations concern model construction. Recall that the nodes from the current batch were already assigned to blocks during the previous pass over the input. We explicitly assign these nodes to their respective blocks in B. Furthermore, ghost nodes and edges are not needed to construct B. This is the case since all nodes from future batches are already known and assigned to blocks, i.e., these nodes will be represented by the artificial nodes. More precisely, we adapt the artificial nodes to represent the nodes from all batches except the current one.

Since a partition of the graph is already given, we do not allow the contraction of cut edges during restreaming in the coarsening phase of our multilevel scheme. That means that clusters are only allowed to grow inside blocks. As a consequence, we can directly use the partition computed in the previous pass as an initial partition of B, so we do not need to run an initial partitioning algorithm.
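As a small illustration of the restreaming restriction, a cluster move in the label propagation can be filtered as sketched below, assuming the arrays of the earlier snippets. This is one possible way to realize the rule, not necessarily how the authors implement it:

#include <cstdint>
#include <vector>

// A node may only join a cluster inside its own block, so cut edges are never
// contracted and the previous pass's partition stays valid as the initial solution.
bool restream_move_allowed(int64_t v, int64_t target_cluster,
                           const std::vector<int>& block,
                           const std::vector<int64_t>& cluster_rep) {
    // cluster_rep maps a cluster to one of its members; by induction, all
    // members of a cluster share the same block.
    return block[v] == block[cluster_rep[target_cluster]];
}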
Implementation Details. We now give some implementation details. Our implementation of B is based on an adjacency array and consecutive node IDs. We reserve the first δ IDs for the nodes from the current batch, which keep their global order. This means that, when we process the i-th batch, node IDs can easily be converted from our model B to G and the other way around by respectively adding or subtracting (i − 1) · δ to or from their ID. Similarly, we reserve the last k IDs of B for the artificial nodes and keep their relative order over all batches. Note that this configuration separates mutable nodes (nodes from the current batch) and immutable nodes (artificial nodes). This allows us to efficiently control which nodes are allowed to move during coarsening, initial partitioning, and local search. We keep an array of size n to store the permanent block assignments of the nodes of G. To improve running time, we use an approximate computation of the powering operation in our Fennel function.
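The ID scheme described above could be realized with helpers like the following; the function names are hypothetical:

#include <cstdint>

// Batch nodes occupy model IDs [0, delta); artificial nodes occupy [delta, delta + k).
inline int64_t model_to_global(int64_t model_id, int64_t batch_index /* 1-based */,
                               int64_t delta) {
    return model_id + (batch_index - 1) * delta;         // only valid for batch nodes
}

inline int64_t global_to_model(int64_t global_id, int64_t batch_index,
                               int64_t delta) {
    return global_id - (batch_index - 1) * delta;
}

inline bool is_artificial(int64_t model_id, int64_t delta) {
    return model_id >= delta;                            // the last k IDs are artificial
}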
4. EXPERIMENTAL EVALUATION

Methodology. We implemented HeiStream and the competing algorithms inside the KaHIP framework (using C++) and compiled them using gcc 9.3 with full optimization turned on (-O3 flag). Since no official versions of the one-pass streaming and restreaming algorithms are available in public repositories, we implemented them in our framework. Our implementations of these algorithms reproduce the results presented in the respective papers and are optimized for running time as much as possible. To this end, we implemented Hashing, LDG, Fennel, and ReFennel. Multilevel LDG [18] is also not publicly available, and we could not implement it since it is a complex algorithm. We contacted the authors requesting an executable version of their algorithm for our tests, but had not received a response by the time this work was submitted. Hence, we can only compare HeiStream against Multilevel LDG based on the results explicitly reported in [18]. We used two machines. Machine A has two six-core Intel Xeon E5-2630 processors running at 2.3 GHz; it runs Ubuntu GNU/Linux 20.04.1 with Linux kernel version 5.4.0-48. Machine B has a four-core Intel Xeon E5420 processor running at 2.5 GHz and 16 GB of main memory; it runs Ubuntu GNU/Linux 20.04.1 with Linux kernel version 5.4.0-65. Most of our experiments were run on a single core of Machine A. The only exceptions are the experiments with huge graphs, which were run on a single core of Machine B. When using Machine A, we stream the input directly from internal memory; when using Machine B, which only has 16 GB of main memory, we stream the input from the hard disk. We use a range of values of k for most experiments. We allow the same fixed imbalance for all experiments (and all algorithms). All partitions computed by all algorithms were balanced. Depending on the focus of the experiment, we measure running time and/or edge-cut. In general, we perform ten repetitions per algorithm and instance using different random seeds for initialization, and we compute the arithmetic average of the computed objective functions and running times per instance. When further averaging over multiple instances, we use the geometric mean in order to give every instance the same influence on the final score. Unless explicitly mentioned otherwise, we average all results of each algorithm grouped by k. Given a result σ_A of an algorithm A for k = k_o (which can be an objective value or a running time), we express it using one or more of the following tools: improvement over an algorithm B, computed as (σ_B/σ_A − 1) × 100%; ratio, computed as σ_A/σ_max, with σ_max being the maximum result for k_o among all competitors including A; and relative value over an algorithm B, computed as σ_A/σ_B. We also present performance profiles. These plots relate the running time (resp. solution quality) of a group of algorithms to the fastest (resp. best) one on a per-instance basis (rather than grouped by k). Their x-axis shows a factor τ, while their y-axis shows the percentage of instances for which A has up to τ times the running time (resp. solution quality) of the fastest (resp. best) algorithm.
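As a concrete illustration of the improvement metric (with invented numbers): if algorithm B cuts σ_B = 120 000 edges and algorithm A cuts σ_A = 100 000 edges on the same instance, then A improves over B by (120 000/100 000 − 1) × 100% = 20%.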
Instances. We collected graphs from various sources to test our algorithm. Most of the considered graphs have been used as benchmarks in previous works on graph partitioning. The graphs wiki-Talk and web-Google, as well as most networks of the co-purchasing, roads, social, web, autonomous systems, citations, circuits, similarity, meshes, and miscellaneous types, are publicly available either in [21] or in [32]. Prior to our experiments, we converted these graphs to a vertex-stream format while removing parallel edges, self-loops, and directions, and assigning unit weight to all nodes and edges. We also use graphs such as eu-2005, in-2004, uk-2002, and uk-2007-05, which are available at the 10th DIMACS Implementation Challenge website [3]. Finally, we include some artificial random graphs. We use the name rggX for a random geometric graph with 2^X nodes, where nodes represent random points in the unit square and edges connect nodes whose Euclidean distance is below 0.55 √(ln n / n). We use the name delX for a graph based on the Delaunay triangulation of 2^X random points in the unit square [17]. We use the name RHGX for random hyperbolic graphs [11, 29] with 10^8 nodes and roughly X × 10^9 edges. Basic properties of the graphs under consideration can be found in Table 1. For our experiments, we split the graphs into three disjoint sets: a tuning set for the parameter study experiments, a test set for the comparisons against the state-of-the-art, and a set of huge graphs for special larger-scale tests. In all cases, when streaming the graphs we use the natural given order of the nodes.

Tuning. We now present experiments to tune HeiStream and explore its behavior. In our strategy, each experiment focuses on a single parameter of the algorithm while all others are kept fixed. We start with a baseline configuration consisting of a fixed number of rounds in the coarsening label propagation, a single round in the local search label propagation, x = 64 in the expression for the coarsest model size, and a constant multiplicative tuning factor on α. After each tuning experiment, we update the baseline to integrate the best parameter found. Unless explicitly mentioned otherwise, we run the experiments of this section over all tuning graphs from Table 1 based on the extended model construction, i.e., including ghost nodes and edges, for a buffer size of 32 768.

We begin by evaluating how the number of label propagation rounds during local search affects running time and solution quality. In particular, we run configurations of HeiStream with increasing numbers of rounds and report results in Figures 2a and 2b. Observe that the baseline has considerably lower solution quality than the other configurations overall, while the two configurations with more rounds differ only slightly from each other. On average, both improve solution quality over the baseline at the cost of a moderate increase in running time. Since the difference in quality between these two configurations is not significant, we integrated the faster of the two, i.e., the one with fewer rounds, into the algorithm.

Next we look at the parameter x associated with the expression max(|B|/(xk), xk), which determines the size of the coarsest model. We run experiments for x = 2^i over a range of exponents i and report results in Figure 2c. We omit running time charts for this experiment since the tested configurations show comparable behavior in this regard. Figure 2c shows that the baseline has the overall worst solution quality, while x = 2 and x = 4 have the overall best solution quality, both clearly better than the baseline on average. In light of that, we decided in favor of x = 4 for HeiStream.

Table 1: Graphs for experiments.

Graph                  n            m              Type
Tuning Set
coAuthorsCiteseer      227 320      814 134        Citations
citationCiteseer       268 495      1 156 647      Citations
amazon0312             400 727      2 349 869      Co-Purch.
amazon0601             403 364      2 443 311      Co-Purch.
amazon0505             410 236      2 439 437      Co-Purch.
roadNet-PA             1 087 562    1 541 514      Roads
com-Youtube            1 134 890    2 987 624      Social
soc-lastfm             1 191 805    4 519 330      Social
roadNet-TX             1 351 137    1 879 201      Roads
in-2004                1 382 908    13 591 473     Web
G3_circuit             1 585 478    3 037 674      Circuit
soc-pokec              1 632 803    22 301 964     Social
as-Skitter             1 694 616    11 094 209     Aut.Syst.
wiki-topcats           1 791 489    28 511 807     Social
roadNet-CA             1 957 027    2 760 388      Roads
wiki-Talk              2 388 953    4 656 682      Web
soc-flixster           2 523 386    7 918 801      Social
del22                  4 194 304    12 582 869     Artificial
rgg22                  4 194 304    30 359 198     Artificial
del23                  8 388 608    25 165 784     Artificial
rgg23                  8 388 608    63 501 393     Artificial
Test Set
Dubcova1               16 129       118 440        Meshes
hcircuit               105 676      203 734        Circuit
coAuthorsDBLP          299 067      977 676        Citations
Web-NotreDame          325 729      1 090 108      Web
Dblp-2010              326 186      807 700        Citations
ML_Laplace             377 002      13 656 485     Meshes
coPapersCiteseer       434 102      16 036 720     Citations
coPapersDBLP           540 486      15 245 729     Citations
Amazon-2008            735 323      3 523 472      Similarity
eu-2005                862 664      16 138 468     Web
web-Google             916 428      4 322 051      Web
ca-hollywood-2009      1 087 562    1 541 514      Roads
Flan_1565              1 564 794    57 920 625     Meshes
Ljournal-2008          1 957 027    2 760 388      Social
HV15R                  2 017 169    162 357 569    Meshes
Bump_2911              2 911 419    62 409 240     Meshes
del21                  2 097 152    6 291 408      Artificial
rgg21                  2 097 152    14 487 995     Artificial
FullChip               2 987 012    11 817 567     Circuit
soc-orkut-dir          3 072 441    117 185 083    Social
patents                3 750 822    14 970 766     Citations
cit-Patents            3 774 768    16 518 947     Citations
soc-LiveJournal1       4 847 571    42 851 237     Social
circuit5M              5 558 326    26 983 926     Circuit
italy-osm              6 686 493    7 013 978      Roads
great-britain-osm      7 733 822    8 156 517      Roads
Huge Graphs
uk-2005                39 459 923   783 027 125    Web
twitter7               41 652 230   1 202 513 046  Social
sk-2005                50 636 154   1 810 063 330  Web
soc-friendster         65 608 366   1 806 067 135  Social
er-fact1.5-s26         67 108 864   907 090 182    Artificial
RHG1                   100 000 000  1 000 913 106  Artificial
RHG2                   100 000 000  1 999 544 833  Artificial
uk-2007-05             105 896 555  3 301 876 564  Web
Exploration. We start the exploration of open parameters by investigating how the buffer size affects solution quality and running time. We use a small buffer as the baseline and successively double its capacity until any graph from the tuning set in Table 1 fits into a single buffer. We plot our results in Figures 2e and 2f. Note that solution quality and running time increase regularly as the buffer size becomes larger. This behavior occurs because larger buffers enable more comprehensive and complex graph structures to be exploited by our multilevel algorithm. As a consequence, there is a trade-off between solution quality and resource consumption. In other words, we can improve partitioning quality at the cost of considerable extra memory and slightly more running time; conversely, we can save as much memory as possible and get a faster partitioning process at the cost of lower solution quality. In practice, this means that HeiStream can be adjusted to produce partitions as refined as possible with the resources available in a specific system. For the extreme case of a single-node buffer, HeiStream behaves exactly like Fennel, while for the opposite extreme it behaves like an internal memory partitioning algorithm.

Next, we compare the effect of using the extended model, which incorporates ghost nodes, against the basic model, which ignores them. Figures 2g and 2h display the results. The extended model provides improved quality over the basic model on average. This happens because the presence of ghost nodes and edges expands the perspective of the partitioning algorithm to future batches. This has a similar effect to increasing the size of the buffer, but at no considerable extra memory cost. Regarding running time, the extended model is consistently slower than the basic model for all values of k, which is explained by the higher number of edges to be processed when ghost nodes are incorporated into the model. As a practical conclusion from this experiment, the extended model can be used for better overall partitions with no significant extra memory but at the cost of extra running time; alternatively, the basic model can be used for a consistently faster execution at the cost of lower solution quality.

Finally, we test to what extent solution quality can be improved by restreaming HeiStream multiple times. We investigate this by restreaming each input graph several times, collecting results after each pass, and plotting them in Figure 2d. The first restream generates a considerable quality jump over the baseline on average; each following pass has a decreasing impact on solution quality, which converges after the last pass. On the other hand, the running time increases roughly linearly with each pass over the graph. In practice, this adds another degree of freedom for configuring HeiStream for the needs of real systems. In particular, we can improve the partition further with no extra memory at the cost of an integer multiplicative factor on the running time.
Comparison against the State-of-the-Art. In this section, we show experiments in which we compare HeiStream against the current state-of-the-art algorithms. Except when mentioned otherwise, these experiments involve all graphs from the Test Set in Table 1 and focus on two particular configurations of HeiStream, which we refer to as HeiStream(32k) and HeiStream(Int.). The first configuration is based on batches of size 32 768, while the second one has enough batch capacity to operate as an internal memory algorithm. For both configurations, we perform a single pass over the input based on the extended model construction. From the results of these two configurations, the reader can infer the relative behavior of other configurations.

[Figure 2: Results for tuning and exploration experiments. Panels: (a) quality improvement and (b) running time ratio for label propagation rounds during uncoarsening; (c) quality improvement for parameter x from max(|B|/(xk), xk); (d) quality improvement for restreaming; (e) quality improvement and (f) running time ratio for buffer size; (g) quality improvement and (h) running time ratio for model construction. Higher is better for quality improvement plots; lower is better for running time ratio plots.]

[Figure 3: Comparison of HeiStream against Multilevel LDG with a buffer containing the whole graph; the plots show the percentage of cut edges for Multi.LDG(Int.), HeiStream(32k), and HeiStream(Int.).]

[Figure 4: Results for the comparison against state-of-the-art one-pass (re)streaming algorithms (HeiStream(Int.), HeiStream(32k), 2-ReFennel, Fennel, LDG, Hashing): (a) quality improvement plot over Fennel, (b) quality performance profile, (c) running time ratio plot, (d) running time performance profile. Higher is better for quality improvement plots.]

Competitor Algorithms. We can identify two groups of approaches that constitute the state-of-the-art of streaming graph partitioning. The first group comprises one-pass (re)streaming algorithms. From this group, we select Hashing, LDG, Fennel, and ReFennel as representatives for our tests. These algorithms require O(n + Δ) memory and operate in O(kn + m) running time, with the exception of Hashing, which uses constant memory and constant running time per node. Intuitively, the algorithms from this group are the fastest ones, since they make simple, greedy decisions for each node based on local information. The second group of state-of-the-art algorithms comprises buffered streaming partitioning algorithms, of which Multilevel LDG [18] is the representative. As the implementation of this algorithm is not available, we use results from the respective paper whenever we report results of Multilevel LDG. This algorithm requires O(n + δΔ) memory, which can be controlled by increasing or decreasing the buffer size δ. Moreover, it inherits the O(nk + m) running time of the one-pass streaming algorithm LDG, on which it is based. For comparison, recall that HeiStream requires O(n + δΔ) memory and has running time O(n + m) provided that the buffer size is big enough.
Streaming approaches that specifically tackle the node order in the input, such as [2], as well as internal memory algorithms, such as Metis [19] and KaFFPa [35], are beyond the scope of this work. In particular, for instances that fit into the memory of a machine, it is common knowledge that internal memory algorithms are better than streaming algorithms regarding partition quality. For the sake of reproducibility, we ran Metis and the fast social version of KaFFPa over our Test Set. We found that Fennel, on average, cuts considerably more edges than both Metis and the fast social version of KaFFPa, while Metis and the fast social version of KaFFPa are, in turn, considerably slower than Fennel on average.

[Table 2: Percentage of cut edges per algorithm and graph of the Test Set for k = 32. Internal memory algorithms (HeiStream(Int.), Multi.LDG(Int.)) are in the 2 left columns and streaming algorithms (HeiStream(32k), 2-ReFennel, Fennel, LDG, Hashing) in the 5 right columns. The best result for each graph is bold separately for internal memory approaches and for streaming approaches. The results for Multilevel LDG for the five bottom graphs are missing, as those graphs were not part of their benchmark set. Lower is better.]

Results. We now present a detailed comparison of HeiStream against the state-of-the-art. In the results, we refer to the internal memory version of Multilevel LDG as Multi.LDG(Int.). Moreover, we refer to the restreaming version of Fennel as X-ReFennel, in which X is the number of passes over the input graph. We start by looking at the particular value k = 32. Table 2 shows the percentage of edges cut in the partitions generated by each algorithm for the graphs in the Test Set. HeiStream(Int.) and HeiStream(32k) outperform all other competitors for the majority of instances. Both outperform Hashing for all graphs, and LDG, Fennel, and 2-ReFennel each for most of the graphs. Considering only the instances for which results for Multi.LDG(Int.) are reported in the literature, HeiStream(Int.) and HeiStream(32k) also compute better partitions for most of the instances.

For a closer comparison against Multi.LDG(Int.), we show Figure 3, where we plot edge-cut values for Multi.LDG(Int.) based on results graphically reported in [18]. In this figure, we show results of HeiStream(Int.) and HeiStream(32k) for particular graphs over a range of values of k. For all these instances, HeiStream(Int.) outperforms Multi.LDG(Int.) by a considerable margin. Observe that HeiStream(32k), a setup of HeiStream with a relatively small buffer size, outperforms the internal memory version of Multilevel LDG for the majority of instances. In light of that, we omit comparisons against buffered versions of Multilevel LDG, which produce solutions of lower quality than the internal memory version. Finally, we stress again that we cannot show direct running time comparisons, since we cannot run both algorithms under the same computational conditions.

We ran wider experiments over our whole Test Set for different values of k. To this end, we present Figure 4, in which we plot a quality improvement plot over Fennel, a running time ratio plot, and performance profiles for solution quality and running time.
Observe that HeiStream(Int.) produces solutions of the highest quality overall. In particular, it produces partitions with the smallest edge-cut for a large share of the instances and clearly improves solution quality over Fennel on average. We now provide some results that exclude HeiStream(Int.), since it has access to the whole graph. The best remaining algorithm is HeiStream(32k), which produces the best solution quality for a large share of the instances and clearly improves solution quality over Fennel on average. It is followed by 2-ReFennel, the best of the tested algorithms from the previous state-of-the-art, which computes the best partition for some of the instances and improves over Fennel on average. LDG and Fennel come next. In particular, LDG finds the best partition for a few of the instances, while Fennel does not find the best partition for a considerable share of the instances; nevertheless, both algorithms have roughly the same solution quality on average over all instances. Finally, Hashing produces the worst solutions, with considerably worse quality than Fennel on average.

Regarding running time, Hashing is the fastest one for almost all instances, which is expected since it is the only one with constant time complexity per node. The second fastest one is LDG. HeiStream(32k) and HeiStream(Int.) come next, and Fennel and ReFennel are the slowest algorithms in the test. Note that both configurations of HeiStream are faster than Fennel on average and considerably faster for larger values of k. As explained before, this happens because the running time of Fennel is proportional to k, while the running time of HeiStream does not depend on k under the stated conditions. Observe that LDG is considerably faster than Fennel, although its running time also depends linearly on k. This big difference happens because Fennel repeats the expensive powering operation a number of times proportional to k, while LDG only uses cheap operations.

Huge Graphs. We now switch to the main use case of streaming algorithms: computing high-quality partitions of huge graphs on small machines. The experiments in this section are based on the huge graphs listed in Table 1 and were run on the comparatively small Machine B. Since our objective here is to demonstrate a technical capability of HeiStream, these experiments are not as extensive as the previous ones: we only ran experiments for a smaller set of values of k, and we did not repeat each test multiple times with different seeds as in the previous experiments. For all instances, HeiStream performs a single pass over the input based on the extended model construction. We refer to setups of HeiStream with specific buffer sizes as HeiStream(Xk), in which the exact buffer size is X · 1024. In order to report the best possible partitions using the 16 GB of main memory available on Machine B, we experimentally maximized the buffer size used, running experiments over a range of values of X. In Table 3, we report detailed per-instance results with the largest buffer size able to run on Machine B.
For comparison, we also report in Table 3 results for each instance using Fennel, LDG, and Hashing.

[Table 3: Detailed per-instance results on the huge graphs: edge-cut percentage (EC) and running time in seconds (RT) for HeiStream(Xk), Fennel, LDG, and Hashing, per value of k. The starred instance was slowed because it used around 1 GB of virtual memory during execution.]

The results show that HeiStream outperforms all the competitor algorithms regarding solution quality for most instances. Notably, HeiStream computes partitions with considerably lower edge-cut than the one-pass algorithms for 4 of the tested graphs: uk-2005, sk-2005, uk-2007-05, and RHG1. For the social networks soc-friendster and twitter7, HeiStream is the best for all instances except one, but the improvement over Fennel and LDG is not as large as for the other instances. One outlier can be seen for the network RHG2: while HeiStream produces fairly small edge-cut values, Fennel does outperform it, and LDG improves solution quality even further. Regarding running time, we see once again the behavior observed above: the running time of HeiStream does not change considerably as k increases, while the running times of Fennel and LDG grow roughly in the same proportion as k. The only unexpected running time obtained with HeiStream is associated with twitter7 for k = 256. We found that the cause of this behavior is that, for this particular instance, HeiStream used around 1 GB of virtual memory during its execution. Furthermore, note that the running time of Fennel increases to the point where it becomes higher than the running time of HeiStream for 5 out of the 8 huge graphs tested.

5. CONCLUSION

In this work, we proposed HeiStream, a graph partitioning algorithm based on the buffered streaming computational model. Our objective with this work is to start filling the gap between streaming graph partitioning and offline graph partitioning. In particular, HeiStream is designed to require low computational resources, like a streaming algorithm, while integrating advanced partitioning techniques that are typically used by offline algorithms. To achieve this, we used an extended streaming model in which batches of nodes are consecutively loaded and processed before their nodes are permanently assigned to blocks. We engineered all aspects of HeiStream to use as much of the available main memory as possible for the purpose of increasing partitioning quality. In particular, our algorithm receives a batch of nodes, builds a model graph that represents the nodes of the batch and the already known partition structure, and then applies a multilevel partitioning algorithm based on the generalized Fennel objective.

We presented extensive experimental results in which we tune HeiStream, explore its parameters, compare it against the previous state-of-the-art, and demonstrate that it can compute high-quality partitions of huge graphs on a fairly small machine. Compared to the previous state-of-the-art of streaming graph partitioning, HeiStream computes significantly better solutions than known streaming algorithms while at the same time being faster in many cases. A characteristic of HeiStream is that its running time does not depend on the number of blocks, while the previous state-of-the-art streaming partitioning algorithms have running time proportional to this number. In particular, Fennel becomes significantly slower than HeiStream as the number of blocks increases.

Acknowledgements. Partially supported by DFG grant SCHU 2567/1-2.

6. REFERENCES

[1] R. Andre, S. Schlag, and C. Schulz. Memetic multilevel hypergraph partitioning. In Proceedings of the Genetic and Evolutionary Computation Conference (GECCO), pages 347–354, 2018.
[2] A. Awadelkarim and J.
We present extensive experimental results in which we tune HeiStream, explore its parameters, compare it against the previous state-of-the-art, and demonstrate that it can compute high-quality partitions for huge graphs on a fairly small machine. Compared to the previous state-of-the-art of streaming graph partitioning, HeiStream computes significantly better solutions than known streaming algorithms while at the same time being faster in many cases. A characteristic of HeiStream is that its running time does not depend on the number of blocks, while the previous state-of-the-art streaming partitioning algorithms have running time proportional to this number. In particular, Fennel becomes significantly slower than HeiStream as the number of blocks increases.

Acknowledgements. Partially supported by DFG grant SCHU 2567/1-2.

6. REFERENCES

[1] R. Andre, S. Schlag, and C. Schulz. Memetic multilevel hypergraph partitioning. In Proceedings of the Genetic and Evolutionary Computation Conference (GECCO), pages 347–354, 2018.
[2] A. Awadelkarim and J. Ugander. Prioritized restreaming algorithms for balanced graph partitioning. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 1877–1887, 2020.
[3] D. A. Bader, H. Meyerhenke, P. Sanders, C. Schulz, A. Kappes, and D. Wagner. Benchmarking for graph clustering and partitioning. In Encyclopedia of Social Network Analysis and Mining, pages 73–82. Springer, 2014.
[4] S. T. Barnard and H. D. Simon. A fast multilevel implementation of recursive spectral bisection for partitioning unstructured problems. In Proc. of the 6th SIAM Conference on Parallel Processing for Scientific Computing, pages 711–718, 1993.
[5] C. Bichot and P. Siarry, editors. Graph Partitioning. Wiley, 2011.
[6] U. Brandes, D. Delling, M. Gaertler, R. Görke, M. Hoefer, Z. Nikoloski, and D. Wagner. On modularity clustering. IEEE Transactions on Knowledge and Data Engineering, 20(2):172–188, 2007.
[7] T. N. Bui and C. Jones. Finding Good Approximate Vertex and Edge Partitions is NP-Hard. IPL, 42(3):153–159, 1992.
[8] A. Buluç, H. Meyerhenke, I. Safro, P. Sanders, and C. Schulz. Recent Advances in Graph Partitioning, pages 117–158. Springer International Publishing, Cham, 2016.
[9] A. Ching, S. Edunov, M. Kabiljo, D. Logothetis, and S. Muthukrishnan. One trillion edges: Graph processing at Facebook-scale. Proceedings of the VLDB Endowment, 8(12):1804–1815, 2015.
[10] G. V. Demirci, H. Ferhatosmanoglu, and C. Aykanat. Cascade-aware partitioning of large graph databases. VLDB J., 28(3):329–350, 2019.
[11] D. Funke, S. Lamm, U. Meyer, M. Penschuck, P. Sanders, C. Schulz, D. Strash, and M. von Looz. Communication-free massively distributed graph generation. Journal of Parallel and Distributed Computing, 131:200–217, 2019.
[12] M. R. Garey, D. S. Johnson, and L. Stockmeyer. Some Simplified NP-Complete Problems. In Proc. of the 6th ACM Symposium on Theory of Computing (STOC), pages 47–63. ACM, 1974.
[13] J. E. Gonzalez, Y. Low, H. Gu, D. Bickson, and C. Guestrin. PowerGraph: Distributed graph-parallel computation on natural graphs. In 10th USENIX Symposium on Operating Systems Design and Implementation (OSDI 12), pages 17–30, 2012.
[14] W. W. Hager, J. T. Hungerford, and I. Safro. A multilevel bilinear programming algorithm for the vertex separator problem. Comp. Opt. and Appl., 69(1):189–223, 2018.
[15] B. Hendrickson and R. Leland. A Multilevel Algorithm for Partitioning Graphs. In Proc. of the ACM/IEEE Conference on Supercomputing '95. ACM, 1995.
[16] J. Herrmann, M. Y. Özkaya, B. Uçar, K. Kaya, and Ü. V. Çatalyürek. Multilevel algorithms for acyclic partitioning of directed acyclic graphs. SIAM J. Scientific Computing, 41(4):A2117–A2145, 2019.
[17] M. Holtgrewe, P. Sanders, and C. Schulz. Engineering a Scalable High Quality Graph Partitioner. In Proc. of the 24th IPDPS, pages 1–12, 2010.
[18] N. Jafari, O. Selvitopi, and C. Aykanat. Fast shared-memory streaming multilevel graph partitioning. Journal of Parallel and Distributed Computing, 147:140–151, 2021.
[19] G. Karypis and V. Kumar. A Fast and High Quality Multilevel Scheme for Partitioning Irregular Graphs. SIAM Journal on Scientific Computing, 20(1):359–392, 1998.
[20] G. Karypis and V. Kumar. Multilevel k-way hypergraph partitioning. In Proceedings of the 36th Conference on Design Automation, pages 343–348, 1999.
[21] J. Leskovec and A. Krevl. SNAP Datasets: Stanford large network dataset collection. http://snap.stanford.edu/data, June 2014.
[22] G. Malewicz, M. H. Austern, A. J. Bik, J. C. Dehnert, I. Horn, N. Leiser, and G. Czajkowski. Pregel: a system for large-scale graph processing. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, pages 135–146, 2010.
[23] C. Mayer, R. Mayer, M. A. Tariq, H. Geppert, L. Laich, L. Rieger, and K. Rothermel. ADWISE: Adaptive window-based streaming edge partitioning for high-speed graph processing. In Proceedings of the 38th IEEE International Conference on Distributed Computing Systems (ICDCS), pages 685–695. IEEE, 2018.
[24] H. Meyerhenke, P. Sanders, and C. Schulz. Partitioning Complex Networks via Size-constrained Clustering. In Proceedings of the 13th International Symposium on Experimental Algorithms (SEA), volume 8504 of LNCS. Springer, 2014.
[25] O. Moreira, M. Popp, and C. Schulz. Graph partitioning with acyclicity constraints. In Proceedings of the 16th International Symposium on Experimental Algorithms (SEA), volume 75 of LIPIcs, pages 30:1–30:15, 2017.
[26] O. Moreira, M. Popp, and C. Schulz. Evolutionary multi-level acyclic graph partitioning. In Proceedings of the Genetic and Evolutionary Computation Conference (GECCO), pages 332–339, 2018.
[27] J. Nishimura and J. Ugander. Restreaming graph partitioning: simple versatile algorithms for advanced balancing. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1106–1114, 2013.
[28] M. A. K. Patwary, S. Garg, and B. Kang. Window-based streaming graph partitioning algorithm. In Proceedings of the Australasian Computer Science Week Multiconference, pages 1–10, 2019.
[29] M. Penschuck, U. Brandes, M. Hamann, S. Lamm, U. Meyer, I. Safro, P. Sanders, and C. Schulz. Recent advances in scalable network generation. CoRR, abs/2003.00736, 2020.
[30] F. Petroni, L. Querzoni, K. Daudjee, S. Kamali, and G. Iacoboni. HDRF: Stream-based partitioning for power-law graphs. In Proceedings of the 24th ACM International Conference on Information and Knowledge Management, pages 243–252, 2015.
[31] U. N. Raghavan, R. Albert, and S. Kumara. Near linear time algorithm to detect community structures in large-scale networks. Physical Review E, 76(3):036106, 2007.
[32] R. A. Rossi and N. K. Ahmed. The network data repository with interactive graph analytics and visualization. http://networkrepository.com, 2015.
[33] H. P. Sajjad, A. H. Payberah, F. Rahimian, V. Vlassov, and S. Haridi. Boosting vertex-cut partitioning for streaming graphs. In 2016 IEEE International Congress on Big Data (BigData Congress), pages 1–8. IEEE, 2016.
[34] P. Sanders and C. Schulz. KaHIP – Karlsruhe High Quality Partitioning Homepage. http://algo2.iti.kit.edu/documents/kahip/index.html.
[35] P. Sanders and C. Schulz. Engineering Multilevel Graph Partitioning Algorithms. In Proc. of the 19th European Symposium on Algorithms (ESA), volume 6942 of LNCS, pages 469–480. Springer, 2011.
[36] P. Sanders and C. Schulz. Think Locally, Act Globally: Highly Balanced Graph Partitioning. In Proceedings of the 12th International Symposium on Experimental Algorithms (SEA), LNCS. Springer, 2013.
[37] C. Schulz and D. Strash. Graph partitioning: Formulations and applications to big data. In Encyclopedia of Big Data Technologies. Springer, 2019.
[38] R. V. Southwell. Stress-Calculation in Frameworks by the Method of "Systematic Relaxation of Constraints". Proc. of the Royal Society of London, 151(872):56–95, 1935.
[39] I. Stanton and G. Kliot. Streaming graph partitioning for large distributed graphs. In Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1222–1230, 2012.
[40] C. Tsourakakis, C. Gkantsidis, B. Radunovic, and M. Vojnovic. Fennel: Streaming graph partitioning for massive scale graphs. In Proceedings of the 7th ACM International Conference on Web Search and Data Mining (WSDM), 2014.