On Greedy Approaches to Hierarchical Aggregation
Alexandra Porter
Department of Computer Science
Stanford University
Stanford, CA
[email protected]

Mary Wootters
Departments of Computer Science and Electrical Engineering
Stanford University
Stanford, CA
[email protected]
Abstract
We analyze greedy algorithms for the Hierarchical Aggregation (HAG) problem, a strategy introduced in [Jia et al., KDD 2020] for speeding up learning on Graph Neural Networks (GNNs). The idea of HAG is to identify and remove redundancies in computations performed when training GNNs. The associated optimization problem is to identify and remove the most redundancies. Previous work introduced a greedy approach for the HAG problem and claimed a 1 − 1/e approximation factor. We show by example that this is not correct, and one cannot hope for better than a 1/2 approximation factor. We prove that this greedy algorithm does satisfy some (weaker) approximation guarantee, by showing a new connection between the HAG problem and maximum matching problems in hypergraphs. We also introduce a second greedy algorithm which can out-perform the first one, and we show how to implement it efficiently in some parameter regimes. Finally, we introduce some greedy heuristics that are much faster than the above greedy algorithms, and we demonstrate that they perform well on real-world graphs.
1 Introduction

In this work, we analyze an optimization problem that arises from Hierarchical Aggregation (HAG), a strategy that was recently introduced in [4] for speeding up learning on Graph Neural Networks (GNNs). At a high level, HAG identifies redundancies in the computations performed in training GNNs and eliminates them. This gives rise to an optimization problem, the HAG problem, which is to find and eliminate the most redundancies possible. In this paper, we study greedy algorithms for this optimization problem. Our contributions are as follows.

1. The work [4] proposed a greedy algorithm, which we call FullGreedy, for the HAG optimization problem, and claimed that it gives a 1 − 1/e approximation. Unfortunately, this is not true, and we show by example that one cannot hope for better than a 1/2 approximation. We prove a weaker approximation guarantee for FullGreedy in Theorem 13. In more detail, we are able to establish a (1/d)(1 − 1/e) approximation ratio for a related objective function, where d is a parameter of the problem (d = 2 is a reasonable value).

2. We propose a second greedy algorithm, PartialGreedy, for the HAG optimization problem. We show by example that this algorithm can obtain strictly better results than the FullGreedy mentioned above. It is not obvious that PartialGreedy is efficient, and in Theorem 12 we show that it can be implemented in polynomial time in certain parameter regimes.

3. While both of the greedy algorithms we study are "efficient," in the sense that they are polynomial time, they can still be slow on massive graphs. To that end, we introduce greedy heuristics and demonstrate that they perform well on real-world graphs.

Our approach is based on a new connection between the HAG problem and a problem related to maximum hypergraph matching. We use this connection both in our approximation guarantees for FullGreedy and in our efficient implementation of PartialGreedy.

In Section 2, we define the HAG problem and set notation. In Section 3, we define the algorithms FullGreedy and PartialGreedy. In Section 4, we discuss the efficiency of these algorithms and show that both can be implemented in polynomial time in certain parameter regimes. In Section 5, we give a new approximation guarantee for FullGreedy, and show by example that PartialGreedy can do strictly better. In Section 6, we compare FullGreedy and PartialGreedy in practice. We then discuss faster greedy heuristics and show empirically that they perform well.

∗ This work is partially supported by NSF grant CCF-1657049 and NSF CAREER grant CCF-1844628. AP is partially supported by the National Science Foundation Graduate Research Fellowship under Grant No. DGE-1656518.
2 Preliminaries

Let G = (V, E) be a directed graph that represents some underlying data.¹ For example, G could arise from a social network, a graph of transactions, and so on. The goal of a GNN defined on G is to learn a representation h_v ∈ R^s for each v ∈ V, with the goal of minimizing some loss function ℒ({h_v : v ∈ V}), which is typically designed so that the representations h_v can be used for prediction (for example, classifying unlabeled nodes).² Graph neural networks were originally introduced by [9], and have numerous extensions and applications [3, 5, 10, 11, 1].

Learning these representations h_v follows the abstract process depicted in Algorithm 1. Each node v calls a function Aggregate on the values h_w for w ∈ Γ_in(v), resulting in an aggregated value a_v. Here, Γ_in(v) represents the set of nodes w ∈ V so that (w, v) ∈ E. Next, the node v calls a function Update on a_v and the current value of h_v to obtain an updated h_v. Then this repeats. Here, the function Aggregate can be as simple as a summation (e.g. in GCN [5]), or it can be more complicated (e.g. in GraphSAGE-P [3]). In this work, we assume that Aggregate does not depend on the order of its inputs and can be applied hierarchically. For example, we would have

  Aggregate(Aggregate(x, y), Aggregate(z, w)) = Aggregate(Aggregate(x, w), Aggregate(z, y)) = Aggregate(x, y, z, w).

This is often the case in GNNs (see [4] for more details).
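To make this property concrete, the following minimal sketch (our illustration, not code from [4]) uses elementwise summation as the Aggregate; any commutative, associative operator behaves the same way.

    def aggregate(*vectors):
        # Elementwise sum: order-invariant and hierarchical, like the Aggregate in GCN.
        return [sum(components) for components in zip(*vectors)]

    x, y, z, w = [1, 0], [2, 1], [0, 3], [5, 5]
    flat = aggregate(x, y, z, w)
    grouped1 = aggregate(aggregate(x, y), aggregate(z, w))
    grouped2 = aggregate(aggregate(x, w), aggregate(z, y))
    assert flat == grouped1 == grouped2  # all equal [8, 9]

Because all groupings agree, a partially aggregated value can be computed once and reused, which is exactly what the HAG rewrite below exploits.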
Algorithm 1 Abstract GNN aggregation [4]
Require: Graph G = (V, E); depth K
  Initialize h_v^(0) appropriately.   ▷ Typically, set h_v^(0) to the feature vector x_v.
  for k = 1, ..., K do
    for v ∈ V do
      a_v^(k) ← Aggregate({h_u^(k−1) | u ∈ Γ_in(v)})
      h_v^(k) ← Update(a_v^(k), h_v^(k−1))

¹ Throughout the paper we work with directed graphs, but if the underlying graph is undirected we may treat it as a directed graph by adding directed edges in both directions.
² A typical set-up for a GNN might be the following. The representations h_v are some function f of the features x_u and representations h_u of the nodes u in the neighborhood of v; a prediction o_v is a function g of x_v as well as of h_v; and both f and g are fully connected feed-forward neural networks. However, the details of GNNs will not actually matter for this work.

[Figure 1: Example of hierarchical aggregation: (a) shows a directed graph G = (V, E), (b) shows the GNN computation graph 𝒢, and (c) shows a possible HAG computation graph Ĝ, equivalent to 𝒢. The notation [D : A, B, C] means that the node D is requesting information from nodes A, B, C. The notation A ⊕ B means that this intermediate node computes Aggregate(A, B).]

The starting point for our work is the paper [4], which showed that there are significant improvements to be made (up to 2.8x, empirically) by cutting out redundant computations in Algorithm 1. To see where redundant computations might arise, suppose that two nodes u, v ∈ G have a large shared out-neighborhood Γ_out(u) ∩ Γ_out(v). In Algorithm 1, we would call Aggregate on the nodes u and v many times, once for each node in this shared out-neighborhood. However, we can save computation by introducing an intermediate node m so that Γ_in(m) = {u, v} and Γ_out(m) = Γ_out(u) ∩ Γ_out(v), and then disconnecting u and v from their original shared out-neighborhood. Then, we only call Aggregate on u and v once, and we can use the stored computation many times. This process is shown in Figure 1.

The Hierarchical Aggregation (HAG) problem is to find the best way to introduce such intermediate nodes. We formally define the problem below.

Definition 1 (GNN Computation Graph). Given a directed graph G = (V, E), the GNN Computation Graph 𝒢 for G is a bipartite graph (L, R, E), where L and R are copies of V, and, for u ∈ L and v ∈ R, (u_L, v_R) ∈ E if and only if (u, v) ∈ E. We use Γ_in(v) to denote the set of in-neighbors of a vertex v in 𝒢, and we use Γ_out(v) to denote the set of out-neighbors of v in 𝒢.

Definition 2 (HAG Computation Graph). Given a directed graph G = (V, E), a HAG Computation Graph Ĝ for G is a graph (V̂, Ê), where V̂ = L ∪ M ∪ R and L and R are copies of V. Ê contains directed edges from L to M, from M to R, and possibly within M, and the following property holds: for every directed edge (u, v) ∈ E, there is a unique directed path from u_L to v_R in Ĝ. We use Γ̂_in(v) to denote the set of in-neighbors of a vertex v in Ĝ, and we use Γ̂_out(v) to denote the set of out-neighbors of v in Ĝ. When no edges in Ê have both endpoints in M, so that Ĝ is tripartite, we call Ĝ a single-layer HAG computation graph. When there exists an integer d such that |Γ̂_in(w)| = d for all w ∈ M, then we call Ĝ a d-HAG computation graph.
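The following sketch (ours; the function name and graph encoding are our own, not from [4]) carries out the rewrite described above on the graph of Figure 1: it adds one intermediate node with cover {A, B} and reroutes every receiver that requests both A and B.

    def add_intermediate(gamma_in, cover, name):
        # gamma_in: dict mapping each receiver r in R to its set of in-neighbors in L.
        cover = frozenset(cover)
        rerouted = [r for r, nbrs in gamma_in.items() if cover <= nbrs]
        new_gamma = {r: (nbrs - cover) | {name} if cover <= nbrs else set(nbrs)
                     for r, nbrs in gamma_in.items()}
        return new_gamma, rerouted

    # The graph of Figure 1: C requests {A, B}, D requests {A, B, C}, E requests {A, B, D}.
    requests = {"C": {"A", "B"}, "D": {"A", "B", "C"}, "E": {"A", "B", "D"}}
    new_requests, hit = add_intermediate(requests, {"A", "B"}, "A+B")
    print(new_requests)  # {'C': {'A+B'}, 'D': {'C', 'A+B'}, 'E': {'D', 'A+B'}}
    print(hit)           # ['C', 'D', 'E']: three aggregations now share one computation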
See Figure 1 for an example of a GNN computation graph and a HAG computation graph arising from a directed graph G. We say that a GNN computation graph 𝒢 and a HAG computation graph Ĝ are equivalent if they are both computation graphs for the same underlying graph G.

We also use the cover function, as defined by [4]. The cover of a vertex v is just the set of all nodes in L that eventually feed into it.

Definition 3. For a vertex v in a HAG computation graph Ĝ = (L ∪ M ∪ R, Ê), the cover of v is defined as

  cover(v) = {w ∈ L : there is a directed path from w to v in Ĝ}.

We will subsequently assume any HAG computation graph has the property that cover(m) is a distinct set for all distinct m ∈ M. This is without loss of generality, because if nodes m and m′ have the same cover, then they can perform the same function in the HAG graph and one of them can be removed.

Given a HAG computation graph, we can re-organize the computation in Algorithm 1 in order to aggregate computations at the intermediate nodes in M. This process is shown in Algorithm 2. Note that we need an ordering of M such that for any v ∈ M, the vertices in Γ̂_in(v) ∩ M appear in the sequence before v. Since Ĝ is a DAG, such a sequence can easily be constructed.

Algorithm 2 Abstract GNN aggregation with added intermediate nodes [4]
Require: HAG Computation Graph Ĝ = (V̂, Ê); depth K.
Require: Sequence M = {m_i}_{i=1}^{|M|} for M ⊂ V̂ such that every v ∈ M appears exactly once and after all its in-neighbors
  Initialize h_v^(0) appropriately.
  for k = 1, ..., K do
    for i = 1, ..., |M| do
      a_{m_i}^(k) ← Aggregate({h_u^(k−1) | u ∈ Γ̂_in(m_i)})
    for v ∈ R do
      a_v^(k) ← Aggregate({h_u^(k−1) | u ∈ Γ̂_in(v)})
      h_v^(k) ← Update(a_v^(k), h_v^(k−1))

In [4], the following cost function for a computation graph was considered. We say that the cost of a computation graph 𝒢 with vertices 𝒱 and right-hand side R (either a HAG computation graph or a GNN computation graph) is

  cost(𝒢) = c_Agg · Σ_{w∈𝒱} (|Γ_in(w)| − 1) + c_Up · |R|,

where c_Agg and c_Up are some constants representing the cost of an aggregation and an update respectively. The reason for this cost function is that the cost to do an aggregation at a node w ∈ 𝒱 is proportional to the number of items in the aggregation, minus one. That is, one can "aggregate" a single item for free, and the cost grows linearly as we add more items. The second term counts the cost of each update. We define the value of a HAG computation graph Ĝ to be proportional to the amount of cost that it saves.
Definition 4. The value of a HAG Computation Graph Ĝ = (V̂, Ê) is given by

  value(Ĝ) = (1/c_Agg)(cost(𝒢) − cost(Ĝ)) = Σ_{v∈M} [ |Γ̂_out(v)|(|cover(v)| − 1) − (|Γ̂_in(v)| − 1) ].

To see that the two quantities are indeed equal, we may write

  (1/c_Agg)(cost(𝒢) − cost(Ĝ))
    = Σ_{w∈R} (|Γ_in(w)| − 1) − ( Σ_{v∈M} (|Γ̂_in(v)| − 1) + Σ_{w∈R} (|Γ̂_in(w)| − 1) )
    = Σ_{w∈R} ( Σ_{v∈Γ̂_in(w)} |cover(v)| − 1 ) − ( Σ_{v∈M} (|Γ̂_in(v)| − 1) + Σ_{w∈R} (|Γ̂_in(w)| − 1) )
    = Σ_{w∈R} Σ_{v∈Γ̂_in(w)∩M} (|cover(v)| − 1) − Σ_{v∈M} (|Γ̂_in(v)| − 1)
    = Σ_{v∈M} [ |Γ̂_out(v)|(|cover(v)| − 1) − (|Γ̂_in(v)| − 1) ],

where in the second line we have used the equivalence of 𝒢 and Ĝ to say that Γ_in(w) is equal to the disjoint union ⋃_{v∈Γ̂_in(w)} cover(v); in the third we have combined summations over w ∈ R and used the fact that v ∈ Γ̂_in(w) \ M implies v ∈ L and thus |cover(v)| = 1; and in the fourth we have switched the order of summation and used the fact that each v ∈ M appears |Γ̂_out(v)| times in Σ_{w∈R} Σ_{v∈Γ̂_in(w)∩M}.

2.3 The HAG Problem

Given the above setup, we can formally define the HAG problem. We additionally take two parameters, d and k. The parameter d is a bound on the left-degree of the aggregation nodes (for example, the work [4] considered d = 2 in their algorithm). The parameter k is a budget on the number of intermediate nodes allowed.

Definition 5 (HAG problem). Let d be an integer. The d-HAG problem is the following. Given a graph G and a node budget k, find a HAG computation graph Ĝ = (V̂, Ê) for G with the largest value, so that |M| ≤ k and |Γ̂_in(w)| = d for all w ∈ M.

We also define a single-layer variation of the problem, which is to find the best way to add intermediate nodes so that the resulting graph is tripartite. The single-layer variation is faster to compute, and we show empirically that single-layer solutions achieve almost as much value as general multi-layer solutions.

Definition 6 (single-layer HAG problem). Let d be an integer. The single-layer d-HAG problem is defined as the d-HAG problem with the additional constraint that Ĝ be tripartite.

We note that if Ĝ is a single-layer d-HAG computation graph, the value can be simplified:

  value(Ĝ) = Σ_{v∈M} (|Γ̂_out(v)| − 1)(|Γ̂_in(v)| − 1).
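As a quick sanity check of this formula, here is a minimal sketch (ours) that evaluates the single-layer value and confirms it on the HAG of Figure 1, whose single intermediate node A ⊕ B has in-degree 2 and out-degree 3.

    def single_layer_value(nodes):
        # nodes: list of (in_degree, out_degree) pairs, one per intermediate node in M.
        return sum((out - 1) * (d_in - 1) for d_in, out in nodes)

    # Figure 1's HAG: A+B feeds C, D, and E, so value = (3 - 1) * (2 - 1) = 2,
    # matching the two aggregations saved relative to the GNN computation graph.
    print(single_layer_value([(2, 3)]))  # 2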
3 Greedy Algorithms

We study two natural greedy algorithms for the HAG problem. We call these two algorithms FullGreedy and PartialGreedy. Intuitively, FullGreedy greedily chooses an internal node, with all of its incoming and outgoing edges, and fixes it. On the other hand, PartialGreedy greedily chooses an internal node with all of its incoming edges, but re-optimizes the outgoing edges when each new internal node is added. That is, FullGreedy is "fully" greedy in the sense that it makes a greedy choice for every edge, while PartialGreedy is only "partially" greedy in the sense that it makes a greedy choice for the incoming edges, subject to fully optimizing over the outgoing edges.

To formally describe these algorithms, we define an additional function on HAG computation graphs.
Definition 7. Given a HAG computation graph Ĝ = (V̂, Ê) with V̂ = L ∪ M ∪ R, for X, Y ∈ {L, M, R} let T_Ĝ(X, Y) denote the set of edges in Ĝ that either connect X to Y in Ĝ, or connect Y to Y in Ĝ:

  T_Ĝ(X, Y) := (Ê ∩ (X × Y)) ∪ (Ê ∩ (Y × Y)).

We begin with the algorithm FullGreedy. This algorithm was proposed by [4], and works as follows. At each step, it chooses the internal node—complete with all ingoing and outgoing edges—that will increase value(Ĝ) by the most. This is shown in Algorithm 3.

We next consider a greedy algorithm, PartialGreedy, in which the edges between the intermediate nodes and receiving nodes are re-assigned at each iteration. In particular, at the i-th step the edge set T_{Ĝ_i}(M_i, R) is chosen to be optimal given M_i and T_{Ĝ_i}(L, R), rather than constructed by adding edges to the set T_{Ĝ_{i−1}}(M_{i−1}, R) from the previous step. Algorithm 4 describes this process.

Remark 8. We note that both FullGreedy and PartialGreedy can be easily modified to find a single-layer solution. In FullGreedy (Algorithm 3), we simply take the arg max over L instead of L ∪ M_{i−1}. In PartialGreedy (Algorithm 4), we replace "C ⊆ L ∪ M_{i−1} s.t. |C| = d" in Line 6 with "C ⊆ L s.t. |C| = d".

In the next two sections, we analyze the efficiency and approximation guarantees of both FullGreedy and PartialGreedy.

4 Efficiency

In this section, we discuss the efficiency of the two greedy algorithms presented above.
We note that FullGreedy (Algorithm 3) is clearly polynomial time if d is constant. In particular, the arg max can be naively implemented in time O(n^d).

Algorithm 3 Greedy Algorithm FullGreedy
Require: GNN Computation Graph 𝒢 = (L, R, E); aggregation node limit k; aggregation in-degree d.
  M_0 ← ∅
  Ê_0 ← E
  Ĝ_0 ← (L ∪ M_0 ∪ R, Ê_0)
  for i = 1, ..., k do
    C ← arg max_{C ⊆ L ∪ M_{i−1} s.t. |C| = d} |⋂_{v∈C} Γ̂_out(v) ∩ R|   ▷ Find the set C of size d that maximizes the number of nodes in R that request all of the nodes in C.
    R_C ← ⋂_{v∈C} Γ̂_out(v) ∩ R
    M_i ← M_{i−1} ∪ {v_i}   ▷ Add a new vertex v_i to M.
    Construct the new edge set Ê_i:
      • Ê_i ← Ê_{i−1}
      • Add edge (ℓ, v_i) to Ê_i for all ℓ ∈ C.
      • Add edge (v_i, r) to Ê_i for all r ∈ R_C.
      • Remove any edges (ℓ, r) from Ê_i with ℓ ∈ C and r ∈ R_C.
    Ĝ_i ← (L ∪ M_i ∪ R, Ê_i)
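For intuition, here is a minimal sketch (ours, following the structure of Algorithm 3 but simplified) of the single-layer variant of FullGreedy for d = 2 (see Remark 8): at each step it picks the pair of left nodes requested together by the most receivers and fixes the corresponding aggregation node.

    from itertools import combinations

    def full_greedy_single_layer(gamma_in, k):
        # gamma_in: dict r -> set of in-neighbors in L; a mutable copy tracks rerouting.
        requests = {r: set(nbrs) for r, nbrs in gamma_in.items()}
        left = set().union(*gamma_in.values())
        covers = []
        for _ in range(k):
            if len(left) < 2:
                break
            pairs = [(pair, [r for r, s in requests.items() if set(pair) <= s])
                     for pair in combinations(sorted(left), 2)]
            best, hits = max(pairs, key=lambda ph: len(ph[1]))
            if len(hits) < 2:
                break  # value gained would be (|hits| - 1) * (d - 1) <= 0
            for r in hits:  # reroute the receivers through the new aggregation node
                requests[r] = (requests[r] - set(best)) | {best}
            covers.append(best)
        return covers

    print(full_greedy_single_layer(
        {"C": {"A", "B"}, "D": {"A", "B", "C"}, "E": {"A", "B", "D"}}, 2))
    # [('A', 'B')] -- after the first pick, no remaining pair saves anything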
Algorithm 4 Greedy Algorithm PartialGreedy
Require: GNN Computation Graph 𝒢 = (L, R, E); aggregation node limit k; aggregation in-degree d.
 1: M_0 ← ∅
 2: Ê_0 ← E
 3: Ĝ_0 ← (L ∪ M_0 ∪ R, Ê_0)
 4: for i = 1, ..., k do
 5:   Suppose that Ĝ_{i−1} has vertices V̂_{i−1} = L ∪ M_{i−1} ∪ R.
 6:   for C ⊆ L ∪ M_{i−1} s.t. |C| = d do
 7:     M_i ← M_{i−1} ∪ {v_i}   ▷ Add a new vertex v_i.
 8:     S_C ← { Ĝ^(C) = (L ∪ M_i ∪ R, Ê^(C)) : Ĝ^(C) is a d-HAG computation graph equivalent to 𝒢 and T_{Ĝ^(C)}(L, M_i) = T_{Ĝ_{i−1}}(L, M_{i−1}) ∪ (C × {v_i}) }
        ▷ S_C is the set of all graphs Ĝ^(C) that extend the left-hand side T_{Ĝ_{i−1}}(L, M_{i−1}) of Ĝ_{i−1} by adding an intermediate node v_i with Γ̂_in(v_i) = C.
 9:     Ĝ^(C)_opt ← arg max_{Ĝ^(C) ∈ S_C} value(Ĝ^(C))
10:   Ĝ_i ← arg max_C value(Ĝ^(C)_opt)

On the other hand, it is not clear that PartialGreedy (Algorithm 4) is even polynomial time (in n), because it is not clear how to solve the optimization problem in Line 9. However, we show that in fact this can be re-cast as a matching problem in hypergraphs, which is efficient in certain parameter regimes. To do this, we need a few more definitions.
Definition 9. Let Ĝ = (L ∪ M ∪ R, Ê) be a HAG computation graph. We define the partial HAG computation graph induced by Ĝ to be P̂ = (L ∪ M, T_Ĝ(L, M)), the induced subgraph on the vertices L ∪ M.

Given a partial HAG computation graph P̂ and a GNN computation graph 𝒢 = (L ∪ R, E), let S(P̂, 𝒢) denote the set of HAG computation graphs Ĝ on the vertices L ∪ M ∪ R, so that:
(a) Ĝ is equivalent to 𝒢, and
(b) P̂ is the partial computation graph induced by Ĝ.

In this language, the arg max in Line 9 of Algorithm 4 is maximizing over the set S(P̂^(C), 𝒢), where P̂^(C) is the partial HAG computation graph induced by Ĝ_{i−1} with an additional intermediate vertex v with Γ̂_in(v) = C.

Below, we show that efficiently computing this arg max is equivalent to solving a hypergraph matching problem.

Definition 10. Let 𝒢 = (L ∪ R, E) be a GNN computation graph, and let P̂ be a partial HAG computation graph with vertices L ∪ M. Then for r ∈ R, define H_r = H_r(P̂, 𝒢) to be the hypergraph with vertices L and edges {cover(v) : v ∈ M and cover(v) ⊆ Γ_in(r)}. For an edge e = cover(v) of H_r, define the weight of e to be |cover(v)| − 1.

Let H = H(P̂, 𝒢) be the disjoint union of the H_r, for r ∈ R. (That is, the vertices of H are |R| disjoint copies of L, and the edges on the r-th copy correspond to the edges in H_r.)

Lemma 11. Let 𝒢 = (L ∪ R, E) be a GNN computation graph, and let P̂ be a partial HAG computation graph. Let H = H(P̂, 𝒢) be as in Definition 10. Let M(H) denote the set of matchings in H. Then there is a bijection φ : M(H) → S(P̂, 𝒢), so that for a matching N ∈ M(H),

  value(φ(N)) = value(N) − c(P̂),

where the value of a matching is defined as the sum of the weights of the edges in that matching, and where c(P̂) is a constant that depends only on the partial HAG graph P̂. When P̂ is a partial d-HAG graph with k intermediate nodes, c(P̂) = k(d − 1). In particular, if N is a maximum weighted hypergraph matching for H, then φ(N) is a maximum value HAG computation graph in S(P̂, 𝒢).

Proof. We define the bijection φ as follows. Let N be a matching in H, and let N_r denote the restriction of N to H_r, recalling that H is the disjoint union of H_r for r ∈ R. Suppose that the edges in N_r correspond to sets cover(v) for v ∈ C_r, for some set C_r. (Notice that the edges in N_r will have this form by the definition of H_r.) Then define φ(N) to be the HAG computation graph Ĝ so that the partial HAG computation graph induced by Ĝ is P̂, and so that

  Γ̂_in(r) = C_r ∪ ( Γ_in(r) \ ⋃_{v∈C_r} cover(v) )    (1)

for r ∈ R. Notice that P̂ sets the edge structure between L and M and within M, so specifying Γ̂_in(r) for each r ∈ R completes the description of Ĝ.

We now verify that Ĝ = φ(N) is an element of S(P̂, 𝒢). First, by construction it induces P̂ as a partial HAG graph. Second, Ĝ = (L ∪ M ∪ R, Ê) is a HAG computation graph that is equivalent to 𝒢 = (L ∪ R, E). To see this, consider any edge (ℓ, r) ∈ E. We need to show that there is a unique path from ℓ to r in Ĝ. This is true because either ℓ is contained in exactly one set cover(v) for v ∈ Γ̂_in(r), in which case the path is the one that goes through v; or ℓ is not in any set cover(v), in which case the edge (ℓ, r) is added to Ê by definition in (1). It cannot be the case that ℓ is contained in cover(v) for multiple v ∈ Γ̂_in(r), because N_r was a matching.

Next, we show that φ is a bijection. To see this, let Ĝ ∈ S(P̂, 𝒢). Then observe that φ^{−1}(Ĝ) is given by the matching N that is the disjoint union of matchings N_r for r ∈ R, so that N_r includes the edges cover(v) for v ∈ Γ̂_in(r) ∩ M.

Finally, we establish the claim about the values of N and φ(N). Let N = φ^{−1}(Ĝ) for some Ĝ ∈ S(P̂, 𝒢). By the definition of the weights, and by the construction of N, we have

  value(N) = Σ_r Σ_{v∈Γ̂_in(r)∩M} (|cover(v)| − 1).

On the other hand, by the definition of the value of a HAG computation graph, we have

  value(Ĝ) = Σ_{v∈M} [ |Γ̂_out(v)|(|cover(v)| − 1) − (|Γ̂_in(v)| − 1) ]
           = Σ_{v∈M} |Γ̂_out(v)|(|cover(v)| − 1) − Σ_{v∈M} (|Γ̂_in(v)| − 1)
           = Σ_{r∈R} Σ_{v∈Γ̂_in(r)∩M} (|cover(v)| − 1) − Σ_{v∈M} (|Γ̂_in(v)| − 1)
           = value(N) − c(P̂),

where we define c(P̂) = Σ_{v∈M} (|Γ̂_in(v)| − 1), which depends only on P̂. In particular, when P̂ is a partial d-HAG with k intermediate nodes, c(P̂) = k(d − 1). ∎

When d = O(1) is a constant, we see that PartialGreedy (Algorithm 4) can be implemented using a polynomial number of maximum weighted-hypergraph matching problems. In particular, when d = 2 or when deg(G) (the maximum degree of the underlying graph) is constant, we can implement Algorithm 4 in polynomial time.
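To illustrate the reduction, here is a minimal sketch (ours) of the hypergraph H_r of Definition 10: for a fixed receiver r, each intermediate node v whose cover fits inside Γ_in(r) contributes the hyperedge cover(v), weighted by |cover(v)| − 1.

    def build_H_r(gamma_in_r, covers):
        # gamma_in_r: the set Γ_in(r) in the GNN computation graph.
        # covers: dict mapping each intermediate node in M to its cover (a set of L-nodes).
        return {v: (frozenset(c), len(c) - 1)
                for v, c in covers.items() if set(c) <= set(gamma_in_r)}

    # r requests {A, B, C, D}; m1 has cover {A, B}, m2 has cover {C, D, E}.
    print(build_H_r({"A", "B", "C", "D"}, {"m1": {"A", "B"}, "m2": {"C", "D", "E"}}))
    # only m1 qualifies: {'m1': (frozenset({'A', 'B'}), 1)}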
Theorem 12. Suppose that either:
• d = 2, and Algorithm 4 is restricted to a single layer (see Remark 8); or
• d = O(1) and deg(G) = O(1), where deg(G) is the maximum degree of the original graph G.
Then Algorithm 4 can be implemented in polynomial time.

Proof. When d = 2 and PartialGreedy is set to return a single-layer graph, the associated hypergraph H is just a graph with at most n² vertices and at most kn edges; indeed, there are at most n vertices and k edges for each H_r, and H is the disjoint union of the H_r over at most n receivers r ∈ R. The problem of maximum weight matching in a graph can be solved using Edmonds' algorithm in time polynomial in the number of vertices and edges. Thus, by Lemma 11, the arg max in Line 9 of Algorithm 4 can be computed in time polynomial in n and k. Algorithm 4 needs to call this subroutine O(k · (n + k)²) times, once for each i = 1, ..., k and for each C ⊆ L ∪ M_{i−1} of size d = 2. Thus, the total running time is polynomial in n and k.

When d > 2 and PartialGreedy is set to return a multi-layer graph, the reduction from Lemma 11 yields a weighted hypergraph maximum matching problem, which unfortunately is NP-hard in general. However, when the degree deg(G) of the underlying graph (and hence of 𝒢) is a constant, this decomposes into n weighted hypergraph maximum matching problems, one for each H_r, and the number of vertices in H_r is |Γ_in(r)| ≤ deg(G) = O(1). Therefore we can solve a maximum weighted hypergraph matching problem in H_r by brute force in time O(1). There are at most n such problems, one for each r, and as above we solve each of them at most k · (n + k)^d times, yielding a running time of O(k · n · (n + k)^d), where the O(·) notation hides a dependence on deg(G). ∎

[Figure 2: Example of a GNN computation graph (a) demonstrating that FullGreedy cannot do better than a 1/2 approximation (in this example, for k = 3) and that PartialGreedy can strictly outperform FullGreedy. The algorithm FullGreedy will arrive at the solution shown in (b) by choosing the internal nodes in the order indicated: B ⊕ C, A ⊕ B, C ⊕ D. The optimal solution for k = 3 is shown in (c). The solution from FullGreedy in (b) achieves a value of 1 while the optimal solution has a value of 2. There is a strict separation between FullGreedy and PartialGreedy because PartialGreedy will reach the solution in (c) even if it chooses the internal nodes in the same order as FullGreedy shown in (b).]
5 Approximation Guarantees

In this section, we consider the approximation guarantees that can be obtained by FullGreedy and PartialGreedy.

5.1 An approximation guarantee for FullGreedy

We begin with FullGreedy. The work [4] introduced FullGreedy and claimed that it gives a 1 − 1/e approximation, in the sense that value(Ĝ_greedy) ≥ (1 − 1/e) value(Ĝ_opt), where Ĝ_opt is the HAG computation graph of maximum value. Unfortunately, as the example in Figure 2 shows, this is not correct, and we cannot hope for better than a 1/2 approximation for FullGreedy (Algorithm 3), in the single-layer case. Our main theorem is the following.
Theorem 13. For a d-HAG computation graph Ĝ with k internal nodes, define

  value♯(Ĝ) := value(Ĝ) + k(d − 1).

Then the single-layer d-HAG computation graph Ĝ_greedy returned by FullGreedy (Algorithm 3, with a restriction to single-layer; see Remark 8) satisfies

  value♯(Ĝ_greedy) ≥ (1/d)(1 − 1/e) · value♯(Ĝ*),

where Ĝ* is the d-HAG computation graph with the largest value (and also the largest value♯).

Unfortunately, we are not able to establish an approximation ratio for the function value(·) itself, although we conjecture that a similar result holds.

The idea of the proof—which we give below in Section 5.3—is as follows. It is a standard result that greedy algorithms for submodular functions achieve a 1 − 1/e approximation ratio; this was the approach taken by [4]. Unfortunately, the FullGreedy objective function is not technically submodular, since the order of the inputs matters, and this prevents the 1 − 1/e approximation result from being true. However, we can use the connection to hypergraph matching developed in Lemma 11 in order to translate the objective function of FullGreedy to an objective function where the order does not matter, at the cost of a factor of d. This results in a (1/d)(1 − 1/e) approximation ratio for value♯.

[Figure 3: Example of a GNN computation graph (a) demonstrating that PartialGreedy cannot do better than a 1/2 approximation (in this example, for k = 2), and that the objective function for PartialGreedy is not submodular. The algorithm PartialGreedy will arrive at the solution in (c) by choosing A ⊕ B (as shown in (b)) and then B ⊕ C, arriving at a value of 1. The optimal solution for k = 2 is shown in (d), and has a value of 2. We see from (e) and (f) that the objective function for PartialGreedy is not submodular, in the sense that adding an internal node A ⊕ D is more valuable after A ⊕ B and B ⊕ C have been added than when just A ⊕ B has been added. In (b) the value is 1 and adding A ⊕ D to get (e) leaves the value at 1. In (c) the value is 1 and adding A ⊕ D to get (f) increases the value to 2.]

5.2 PartialGreedy can strictly outperform FullGreedy
We first observe by example that PartialGreedy also cannot achieve an approximation ratio better than 1/2: the example is given in Figure 3. Notice that this example also shows that the objective function that PartialGreedy is greedily optimizing is not submodular.

However, we also show by example that there are graphs for which PartialGreedy is strictly better than FullGreedy. Indeed, an example is shown in Figure 2.

Thus, the algorithm that runs both FullGreedy and PartialGreedy and takes the better of the two achieves at least the approximation guarantee of Theorem 13, and can sometimes do strictly better than FullGreedy.

5.3 Proof of Theorem 13

In this section we prove Theorem 13. Since we are considering single-layer graphs, we can simplify the notation somewhat. Let

  A_d = {s ⊂ L : |s| = d}

be the set of subsets of size d; we will associate each such subset with a possible intermediate node m ∈ M, so that Γ̂_in(m) = s. Let

  B_{d,k} = {{s_1, s_2, ..., s_k} : s_i ∈ A_d for all i}

be the collection of all ways to choose k sets s ∈ A_d. Thus, an element S = {s_1, s_2, ..., s_k} ∈ B_{d,k} represents a set of possible solutions to the single-layer d-HAG problem, where the intermediate nodes are m_1, ..., m_k so that Γ̂_in(m_i) = s_i.
Remark 14. With the above connection in mind, we will abuse notation and say that "P̂ is the partial HAG computation graph induced by S = {s_1, ..., s_k} and 𝒢," when we mean that P̂ is induced by a HAG computation graph Ĝ that is equivalent to 𝒢 and whose intermediate nodes m_1, ..., m_k have Γ̂_in(m_i) = s_i.

We first define the sequence of HAG graphs chosen by this algorithm.

Definition 15. Let 𝒢 be a GNN computation graph. Let s_1, s_2, ..., s_k ∈ A_d. Define the greedy d-HAG sequence of HAG computation graphs Ĝ_1, ..., Ĝ_k to be the sequence of graphs that arise when we greedily assign edges between M and R while inserting the internal nodes corresponding to s_1, ..., s_k in that order. That is, we define Ĝ_0 = 𝒢, and given Ĝ_{i−1} = (L ∪ M_{i−1} ∪ R, Ê_{i−1}), we recursively define Ĝ_i as follows. Let Ĝ′_i = (L ∪ M_i ∪ R, Ê′_i), where M_i = M_{i−1} ∪ {v_i} and Ê′_i = Ê_{i−1} ∪ {(u, v_i) | u ∈ s_i}. Now let P̂_i denote the partial HAG computation graph induced by Ĝ′_i and 𝒢 (as per Definition 9), and define

  Ĝ_i = arg max { value(Ĝ) : Ĝ ∈ S(P̂_i, 𝒢) and T_{Ĝ_{i−1}}(M_{i−1}, R) ⊆ T_Ĝ(M_i, R) }.

Remark 16. Let Ĝ_i = (L ∪ M_i ∪ R, Ê_i) be the i-th graph in the greedy d-HAG sequence. Then we obtain Ĝ_i from Ĝ_{i−1} by (a) adding an internal vertex v_i with Γ̂_in(v_i) = s_i, and (b) for each r ∈ R, greedily adding the edge (v_i, r) if we can; that is, if cover(v_i) ⊆ (Γ̂_in(r) ∩ L). (And if we do that, we remove any edges between cover(v_i) and r.)

Before we proceed, we set some notation that will be helpful for the rest of the proof.

Definition 17. We will denote a length-i ordered sequence (s_1, ..., s_i) ∈ A_d^i by s⃗_i. Throughout, S* ∈ B_{d,k} will denote an element corresponding to an optimal solution Ĝ* to the single-layer d-HAG problem; that is, the intermediate nodes M of an optimal solution Ĝ* define S* by S* = {Γ̂_in(m) : m ∈ M}. We will order the elements of S* arbitrarily as (s*_1, ..., s*_k), and denote a prefix (s*_1, ..., s*_i) by s⃗*_i. We will use (s⃗, s⃗′) to denote concatenation, e.g., (s⃗_i, s⃗*_i) = (s_1, ..., s_i, s*_1, ..., s*_i).

With this notation, we have the following definition.

Definition 18. For some GNN computation graph 𝒢 = (𝒱, E), we define the functions h : A_d^k → Z_{≥0} and f : B_{d,k} → Z_{≥0} as follows. The ordered matching value function h is defined as

  h({s_i}_{i=1}^j) = (d − 1) Σ_{i=1}^j |Γ̂^(j)_out(m_i)|,

where Ĝ_1, ..., Ĝ_k is the greedy d-HAG sequence, Γ̂^(j)_out(m_i) is the out-neighborhood of m_i in Ĝ_j, and m_i is the vertex in M in Ĝ_j with Γ̂_in(m_i) = s_i. Now let Γ̂^(j)_out be defined with respect to the graph Ĝ = (L ∪ M ∪ R, Ê) that is the maximum value HAG computation graph in S(P̂, 𝒢), where P̂ is the partial HAG computation graph induced by S_j and 𝒢 (c.f. Remark 14), and let M = {m_1, ..., m_j}. Then the maximum matching value function f is defined as

  f(S_j) = (d − 1) Σ_{i=1}^j |Γ̂^(j)_out(m_i)|.

The functions h and f are related by an additive term of (d − 1)k to the values of various graphs, as shown below in Lemma 19. We use them instead of these values because, as per Lemma 11, they correspond directly to the sizes of the matchings in a hypergraph.

Lemma 19. Let 𝒢 be a GNN computation graph. For any s⃗_j ∈ A_d^j, let Ĝ_j be the j-th graph in the greedy d-HAG sequence defined by s⃗_j and 𝒢. Let Ĝ* be the maximum-value element of S(P̂, 𝒢), where P̂ is the partial d-HAG graph induced by S_j = {s_1, ..., s_j} ∈ B_{d,j} (c.f. Remark 14). Then

  value(Ĝ_j) = h(s⃗_j) − (d − 1)j   and   value(Ĝ*) = f(S_j) − (d − 1)j.

In particular, h(s⃗_j) = value♯(Ĝ_j) and f(S_j) = value♯(Ĝ*).

Proof. For the first expression, let Γ̂_out(m_i) denote the out-neighborhood of m_i in Ĝ_j, where m_i is the vertex in M in Ĝ_j with Γ̂_in(m_i) = s_i. Then using the fact that |Γ̂_in(m_i)| = d and cover(m_i) = Γ̂_in(m_i) for all i (recall that we are working with a single-layer d-HAG), we have

  value(Ĝ_j) = Σ_{i=1}^j (|Γ̂_in(m_i)| − 1)(|Γ̂_out(m_i)| − 1)
             = Σ_{i=1}^j [ |Γ̂_out(m_i)| · d − (|Γ̂_out(m_i)| + d − 1) ]
             = Σ_{i=1}^j [ |Γ̂_out(m_i)| · (d − 1) − (d − 1) ]
             = (d − 1) Σ_{i=1}^j [ |Γ̂_out(m_i)| − 1 ]
             = (d − 1) Σ_{i=1}^j |Γ̂_out(m_i)| − (d − 1)j = h(s⃗_j) − (d − 1)j.

Similarly, let Γ̂*_out be defined with respect to the graph Ĝ*. Then again using that |Γ̂*_in(m_i)| = d for all i, we have

  value(Ĝ*) = Σ_{i=1}^j [ |Γ̂*_out(m_i)| · d − (|Γ̂*_out(m_i)| + d − 1) ]
            = (d − 1) Σ_{i=1}^j |Γ̂*_out(m_i)| − (d − 1)j = f({s_1, ..., s_j}) − (d − 1)j. ∎

Observation 20. The function f is monotone.

Proof. By Lemma 19 it suffices to show that max_{Ĝ ∈ S(P̂, 𝒢)} value(Ĝ) does not decrease when P̂ goes from being the partial d-HAG graph induced by {s_1, ..., s_j} to the partial d-HAG graph induced by {s_1, ..., s_j, s_{j+1}}. This is true because the set S(P̂, 𝒢) only grows larger with this change, and so the maximum is being taken over a larger set. ∎
Lemma 21. Let 𝒢 be a GNN computation graph. Let S_t ∈ B_{d,t}. Then for any ordering s⃗_t = (s_1, ..., s_t) of S_t:

  (1/d) · f(S_t) ≤ h(s⃗_t) ≤ f(S_t).

Proof. Let P̂ be the partial d-HAG graph induced by S_t and 𝒢 (c.f. Remark 14), and let S = S(P̂, 𝒢). Let Ĝ_1, ..., Ĝ_t be the greedy d-HAG sequence defined by s⃗_t and 𝒢. Let H^(i) be the hypergraph associated with Ĝ_i = (L ∪ M_i ∪ R, Ê_i) as in Definition 10. Consider the bijection φ from (the proof of) Lemma 11, and let N^(i) be the matching in H^(i) so that φ(N^(i)) = Ĝ_i. Recall that the matching N^(i) can be decomposed into matchings N^(i)_r, each on the graph H_r from Definition 10. In more detail, the proof of Lemma 11 shows that the hyperedge (s_j ∩ Γ_in(r)) is in N^(i)_r if and only if the edge (m_j, r) is in Ĝ_i.

First, we observe by Lemma 11 and Lemma 19 that for any i ≤ t and for any s⃗_i ∈ A_d^i,

  h(s⃗_i) = (d − 1) · Σ_r |N^(i)_r| = value(N^(i)),    (2)

where the value on the right-hand side represents the (weighted) value of the matching. (Notice that since we are looking at the single-layer d-HAG problem, all weights are equal to d − 1.)

Let N* be such that φ(N*) = Ĝ*, where Ĝ* is the maximum-value element of S(P̂, 𝒢) where P̂ is induced by S_t. Lemma 11 implies that N* is a maximum hypergraph matching for H^(t). As above, by the definition of H, N* decomposes into matchings N*_r of H^(t)_r for each r ∈ R. Then for S_t ∈ B_{d,t}, Lemma 11 and Lemma 19 imply that

  f(S_t) = (d − 1) · Σ_r |N*_r| = value(N*).    (3)

Now consider the change from N^(i)_r to N^(i+1)_r. When we pass from H^(i) to H^(i+1), we add a hyperedge e_r := s_{i+1} ∩ Γ_in(r) to each graph H^(i)_r. The hyperedge e_r is added to the matching N^(i+1)_r if and only if it can be: that is, if and only if it does not intersect s_j ∩ Γ_in(r) for some j ≤ i. This is because of the definition of the correspondence φ, and also the observation in Remark 16 about how Ĝ_{i+1} is created from Ĝ_i.

Therefore, for any r ∈ R, the matching N^(t)_r can be found by the following algorithm:
• Let H^(t)_r be as above.
• N^(0)_r = ∅.
• For i = 1, ..., t: if the hyperedge s_i ∩ Γ_in(r) can be added to N^(i−1)_r and still form a hypergraph matching of H^(t)_r, then let N^(i)_r = N^(i−1)_r ∪ {s_i ∩ Γ_in(r)}.

We observe that this is the classical greedy algorithm for maximum hypergraph matching. This algorithm is well-known to achieve an approximation ratio of 1/d [2]. That is,

  (1/d) value(N*) ≤ value(N^(t)) ≤ value(N*).

By (2) and (3), this implies that (1/d) f(S_t) ≤ h(s⃗_t) ≤ f(S_t), as desired. ∎
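For concreteness, here is a minimal sketch (ours) of the greedy matching procedure just described; for d-uniform hyperedges it achieves the 1/d approximation guarantee cited from [2], and the example shows that the factor 1/2 is tight for d = 2.

    def greedy_matching(edges):
        # edges: list of (frozenset of vertices, weight), scanned in arrival order.
        used, matching = set(), []
        for e, w in edges:
            if not e & used:  # e is disjoint from everything kept so far
                matching.append((e, w))
                used |= e
        return matching

    # Arrival order matters: taking {b, c} first blocks both {a, b} and {c, d}.
    edges = [(frozenset("bc"), 1), (frozenset("ab"), 1), (frozenset("cd"), 1)]
    print(greedy_matching(edges))  # keeps only {b, c}: value 1 vs. optimal value 2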
Lemma 22. Let S* be as in Definition 17. Let s⃗*_k = (s*_1, ..., s*_k) be any ordering of the elements of S*. Let s⃗_i = (s_1, ..., s_i) be the nodes added after i steps of FullGreedy. Then

  h((s⃗_i, s⃗*_k)) − h(s⃗*_k) ≥ −((d − 1)/d) · h(s⃗*_k).

Proof. Letting S_i denote the set of elements of s⃗_i, we have

  h((s⃗_i, s⃗*_k)) − h(s⃗*_k) ≥ (1/d) f(S_i ∪ S*) − h(s⃗*_k)
                           ≥ (1/d) f(S*) − h(s⃗*_k)
                           ≥ (1/d) h(s⃗*_k) − h(s⃗*_k) = −((d − 1)/d) h(s⃗*_k).

The first inequality is an application of Lemma 21. The second inequality follows from f being monotone (Observation 20). The third inequality is because f(S*) gives the optimal graph choice given S*, while h(s⃗*_k) gives one option of graph choice given S*. ∎

Lemma 23. Let S* be as in Definition 17. Let s⃗*_k = (s*_1, ..., s*_k) be any ordering of the elements of S*. Let s⃗_i = (s_1, ..., s_i) be the nodes added after i steps of FullGreedy. Then we have

  h((s⃗_i, s⃗*_k)) − h(s⃗_i) ≤ (1 − 1/(k + 1)) ( h((s⃗_i, s⃗*_k)) − h(s⃗_{i−1}) ).

Proof. For any s⃗ and s′_ℓ, let Δ(s⃗, s′_ℓ) = h((s⃗, s′_ℓ)) − h(s⃗). That is, Δ is the marginal benefit of adding the intermediate node s′_ℓ on top of the nodes s⃗, assuming that we greedily attach all of the edges that we can.

For i ≤ k, we have

  h((s⃗_i, s⃗*_k)) − h(s⃗_i) = Σ_{j=1}^k [ h((s⃗_i, s⃗*_j)) − h((s⃗_i, s⃗*_{j−1})) ]
                          = Σ_{j=1}^k Δ((s⃗_i, s⃗*_{j−1}), s*_j)
                          ≤ Σ_{j=1}^k Δ(s⃗_i, s*_j),

where in the last line we have used the fact that the marginal benefit of adding s*_j later is less than adding it earlier. (In this sense, h behaves like a submodular function, except that the order of the inputs to h matters; crucially, the function f, which is defined on sets rather than sequences, is not submodular.) By the definition of FullGreedy, we have Δ(s⃗_i, s*_j) ≤ Δ(s⃗_i, s_{i+1}) for all j, and with the above this implies that

  h((s⃗_i, s⃗*_k)) − h(s⃗_i) ≤ Σ_{j=1}^k Δ(s⃗_i, s_{i+1}) = k · Δ(s⃗_i, s_{i+1}).

Rearranging this, we have

  Δ(s⃗_i, s_{i+1}) ≥ (1/k) ( h((s⃗_i, s⃗*_k)) − h(s⃗_i) )    (4)

for any i ≤ k.

Furthermore,

  h((s⃗_i, s⃗*_k)) = h(s⃗_{i−1}) + Δ(s⃗_{i−1}, s_i) + Σ_{j=1}^k Δ((s⃗_i, s⃗*_{j−1}), s*_j)
                 ≤ h(s⃗_{i−1}) + Δ(s⃗_{i−1}, s_i) + Σ_{j=1}^k Δ((s⃗_{i−1}, s⃗*_{j−1}), s*_j),    (5)

where in the second line we have used the fact that Δ((s⃗_i, s⃗*_{j−1}), s*_j) ≤ Δ((s⃗_{i−1}, s⃗*_{j−1}), s*_j) for any j. Thus, we have

  h((s⃗_i, s⃗*_k)) ≤ h((s⃗_{i−1}, s⃗*_k)) + Δ(s⃗_{i−1}, s_i),

using the fact that the right-hand side above is equal to the second line of (5). Rearranging, this establishes

  h((s⃗_{i−1}, s⃗*_k)) ≥ h((s⃗_i, s⃗*_k)) − Δ(s⃗_{i−1}, s_i).    (6)

Plugging (6) into (4), we obtain

  Δ(s⃗_{i−1}, s_i) ≥ (1/k) ( h((s⃗_i, s⃗*_k)) − Δ(s⃗_{i−1}, s_i) − h(s⃗_{i−1}) ),

and rearranging this implies that

  Δ(s⃗_{i−1}, s_i) ≥ (1/(k + 1)) ( h((s⃗_i, s⃗*_k)) − h(s⃗_{i−1}) ).    (7)

Now we have

  h((s⃗_i, s⃗*_k)) − h(s⃗_i) = h((s⃗_i, s⃗*_k)) − h(s⃗_{i−1}) − Δ(s⃗_{i−1}, s_i)
                          ≤ h((s⃗_i, s⃗*_k)) − h(s⃗_{i−1}) − (1/(k + 1)) ( h((s⃗_i, s⃗*_k)) − h(s⃗_{i−1}) )
                          = (1 − 1/(k + 1)) ( h((s⃗_i, s⃗*_k)) − h(s⃗_{i−1}) ),

where we have used (7) in the second line. ∎

Finally, we can prove Theorem 13.
Proof of Theorem 13. From Lemma 23, we have

  h((s⃗_i, s⃗*_k)) − h(s⃗_i) ≤ (1 − 1/(k+1)) ( h((s⃗_i, s⃗*_k)) − h(s⃗_{i−1}) ),

so

  [h(s⃗*_k) − h(s⃗_i)] + [h((s⃗_i, s⃗*_k)) − h(s⃗*_k)]
    ≤ (1 − 1/(k+1)) [h((s⃗_i, s⃗*_k)) − h(s⃗_{i−1})]
    = (1 − 1/(k+1)) [h(s⃗*_k) − h(s⃗_{i−1})] + (1 − 1/(k+1)) [h((s⃗_i, s⃗*_k)) − h(s⃗*_k)].

Rearranging, this implies that

  h(s⃗*_k) − h(s⃗_i) ≤ (1 − 1/(k+1)) [h(s⃗*_k) − h(s⃗_{i−1})] − (1/(k+1)) [h((s⃗_i, s⃗*_k)) − h(s⃗*_k)].

Using Lemma 22, we see that, for all i,

  h(s⃗*_k) − h(s⃗_i) ≤ (1 − 1/(k+1)) [h(s⃗*_k) − h(s⃗_{i−1})] + (h(s⃗*_k)/(k+1)) · ((d−1)/d).    (8)

Now suppose by induction that

  h(s⃗*_k) − h(s⃗_{i−1}) ≤ ( 1 + (1/d)( (1 − 1/(k+1))^{i−1} − 1 ) ) h(s⃗*_k).

The case i = 1 clearly holds. Plugging this inductive hypothesis into (8),

  h(s⃗*_k) − h(s⃗_i) ≤ (1 − 1/(k+1)) [h(s⃗*_k) − h(s⃗_{i−1})] + (h(s⃗*_k)/(k+1)) · ((d−1)/d)
    ≤ (1 − 1/(k+1)) ( 1 + (1/d)( (1 − 1/(k+1))^{i−1} − 1 ) ) h(s⃗*_k) + (h(s⃗*_k)/(k+1)) · ((d−1)/d)
    = ( 1 + (1/d)( (1 − 1/(k+1))^{i} − 1 ) ) h(s⃗*_k),

which establishes the inductive hypothesis for i. By induction, we conclude that

  h(s⃗*_k) − h(s⃗_k) ≤ ( 1 + (1/d)( (1 − 1/(k+1))^{k} − 1 ) ) h(s⃗*_k) ≤ ( 1 + (1/d)( 1/e − 1 ) ) h(s⃗*_k).

Rearranging, we have

  h(s⃗_k) ≥ (1/d)(1 − 1/e) h(s⃗*_k),

as desired. ∎

6 Experiments

We first show that multi-layer HAG graphs do not have a significantly higher value for small k compared to single-layer HAG graphs; this justifies our focus on single-layer HAG graphs in Theorem 13. We compared FullGreedy single-layer and multi-layer results for three datasets: a Facebook dataset [8], an Amazon co-purchases dataset [6] (the subset from March 2nd, 2003), and the Email-EU dataset [7].³ On average over a range of k, the multi-layer solutions achieve only a few percent more value than the single-layer solutions; the results are shown in Table 1.

                                                  Facebook   Amazon     Email-EU
  Mean value for single-layer HAG                 8636.09    1800.73    3088.73
  Mean value for multi-layer HAG                  8945.83    1806.29    3260.11
  Mean % improvement for multi-layer HAG          3.2%       0.22%      4.9%
  Std. dev. of % improvement for multi-layer HAG  1.02782    0.216026   1.674153

Table 1: The improvement of multi-layer over single-layer for FullGreedy on real-world datasets, averaged over a range of values of k.

³ All three of these datasets can be found at snap.stanford.edu/data.
Next, we investigate how well FullGreedy and PartialGreedy perform compared to the optimal single-layer solution (computing the optimum is only tractable for limited graph parameters even in the single-layer case, so we did not implement it for multi-layer HAGs). Figure 4 shows the quantity 1 − α, where α is the approximation ratio value(Ĝ_greedy)/value(Ĝ_opt); here Ĝ_greedy is the solution returned by FullGreedy or PartialGreedy, and Ĝ_opt is the optimal solution, for Erdős–Rényi graphs G(n, p) with n = 15 and various values of p. Higher values of p result in approximation ratios slightly further from 1 for both k = 2 and k = 3, although in all experiments the approximation ratios are quite close to 1 for both algorithms.

[Figure 4: We compare FullGreedy and PartialGreedy to the optimal HAG computation graph on a set of 50 Erdős–Rényi graphs G(n, p) with n = 15. The y-axis plots average values of 1 − α, where α is the approximation ratio. The x-axis plots the parameter p. Shown are (a) k = 2 and (b) k = 3.]

While FullGreedy and PartialGreedy are much faster in practice than computing the optimal solution, they are still computationally intensive for large values of k and large datasets. We now describe two alternative heuristics, DegreeHeuristic and HubHeuristic, which only achieve a fraction of the value of FullGreedy, but compute the HAG computation graph significantly faster.

              DegreeHeuristic vs. FullGreedy     HubHeuristic vs. FullGreedy
  Dataset     Value Ratio     Runtime Ratio      Value Ratio    Runtime Ratio
  Amazon      0.0699          0.123              0.629          0.124
  Email-EU    0.558           0.0548             0.410          0.107
  Facebook    0.376           0.0408             0.313          0.0894

Table 2: For each dataset, FullGreedy, DegreeHeuristic and HubHeuristic were run 10 times with k = 100. Value Ratio is computed as the value of the DegreeHeuristic result divided by the value of the FullGreedy result for the first column, and the value of the HubHeuristic result divided by the value of FullGreedy for the third column. Runtime Ratio is computed in the same way to compare the two heuristics to FullGreedy.
DegreeHeuristic starts by ranking all of the vertices of the input graph G = (V, E) by degree: {v_i}_{i=1}^n with |Γ_out(v_i)| ≥ |Γ_out(v_{i+1})| for i = 1, ..., n − 1. It then takes the top k adjacent pairs of the sequence (i.e., (v_1, v_2), (v_3, v_4), ..., (v_{2k−1}, v_{2k})) as the covers of the k aggregation nodes and constructs a single-layer 2-HAG computation graph. The out-edges of the aggregation nodes are assigned greedily in the same cover order (v_1, v_2), (v_3, v_4), ..., based on degree. We compare this heuristic to FullGreedy for value and runtime in Table 2. This method performs decently on the Facebook and Email-EU datasets, and significantly worse on the Amazon purchasing network. We conjecture that this is because the Amazon network has a significantly lower average degree (about 2.8) than the other two sets (about 22 for Facebook and 25 for Email-EU).
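A minimal sketch (ours) of the ranking-and-pairing step of DegreeHeuristic; the greedy assignment of out-edges is then done as in the earlier sketches.

    def degree_heuristic_covers(gamma_out, k):
        # gamma_out: dict v -> set of out-neighbors in the underlying graph G.
        ranked = sorted(gamma_out, key=lambda v: len(gamma_out[v]), reverse=True)
        return [(ranked[2 * i], ranked[2 * i + 1])
                for i in range(min(k, len(ranked) // 2))]

    print(degree_heuristic_covers(
        {"A": {"B", "C", "D"}, "B": {"A", "C"}, "C": {"A"}, "D": set()}, 2))
    # [('A', 'B'), ('C', 'D')] -- pairs of consecutive vertices in degree order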
HubHeuristic is based on searching for "good" intermediate aggregation nodes around high-degree nodes of G. This algorithm is motivated by the frequency with which triangles appear in real-world datasets. HubHeuristic also starts by ranking the vertices from highest to lowest degree as {v_i}_{i=1}^n. Then for v_1, ..., v_k the heuristic does the following: for each u ∈ Γ_in(v_i), compute the value of adding an aggregation node with cover {v_i, u}. Then a new node m is added with cover {v_i, u} using the u that allows for maximal out-edges from m. This process is repeated for v_2, ..., v_k in order, so it is greedy in the sense that out-neighbors of previous aggregation nodes remain the same during subsequent iterations. We compare HubHeuristic to FullGreedy for value and runtime, shown in Table 2.
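A minimal sketch (ours) of HubHeuristic, under the same assumed graph encoding as the earlier sketches; the helper names are our own.

    def hub_heuristic_covers(gamma_in, gamma_out, k):
        # gamma_in / gamma_out: in- and out-neighborhoods in the underlying graph G.
        requests = {r: set(nbrs) for r, nbrs in gamma_in.items()}
        hubs = sorted(gamma_out, key=lambda v: len(gamma_out[v]), reverse=True)[:k]
        covers = []
        for v in hubs:
            candidates = [u for u in gamma_in.get(v, ()) if u != v]
            if not candidates:
                continue
            # Pick the in-neighbor u whose pair {v, u} is requested by the most receivers.
            def n_hits(u):
                return sum(1 for s in requests.values() if {v, u} <= s)
            u = max(candidates, key=n_hits)
            for r, s in requests.items():  # fix the out-edges of the new node greedily
                if {v, u} <= s:
                    requests[r] = (s - {v, u}) | {(v, u)}
            covers.append((v, u))
        return covers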
7 Conclusion

In this paper we have analyzed the optimization problem that arises from Hierarchical Aggregation (HAG), as introduced by [4] for speeding up learning on GNNs. We showed that FullGreedy, the algorithm proposed by [4], cannot do better than a 1/2 approximation. We also described a second greedy algorithm, PartialGreedy, which can actually be implemented efficiently for some parameters, and can obtain results strictly better than FullGreedy. We also showed that FullGreedy achieves a (1/d)(1 − 1/e) approximation ratio for a related objective function, where d is the in-degree of the intermediate aggregation nodes.

Next, we showed empirically that single-layer HAGs achieve nearly the same value as multi-layer HAGs, and that FullGreedy and PartialGreedy both get fairly close to the optimal value on small synthetic graphs. Finally, we defined two additional greedy heuristics, DegreeHeuristic and HubHeuristic, and showed that they can achieve about a third to a half of the value of FullGreedy in a tenth or less of the runtime.

Our work suggests many interesting future directions, including pinning down the approximation ratio for both FullGreedy and PartialGreedy, and proving approximation guarantees for the heuristics DegreeHeuristic and HubHeuristic in terms of the characteristics of the graph.
Acknowledgements
We thank Zhihao Jia, Rex Ying, and Jure Leskovec for helpful conversations.
References

[1] Hongyun Cai, Vincent W. Zheng, and Kevin Chen-Chuan Chang. A comprehensive survey of graph embedding: Problems, techniques, and applications. IEEE Transactions on Knowledge and Data Engineering, 30(9):1616–1637, 2018.

[2] Barun Chandra and Magnús M. Halldórsson. Greedy local improvement and weighted set packing approximation. Journal of Algorithms, 39(2):223–240, 2001.

[3] Will Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems, pages 1024–1034, 2017.

[4] Zhihao Jia, Sina Lin, Rex Ying, Jiaxuan You, Jure Leskovec, and Alex Aiken. Redundancy-free computation graphs for graph neural networks. arXiv preprint arXiv:1906.03707, 2019.

[5] Thomas N. Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016.

[6] Jure Leskovec, Lada A. Adamic, and Bernardo A. Huberman. The dynamics of viral marketing. ACM Transactions on the Web (TWEB), 1(1):5–es, 2007.

[7] Jure Leskovec, Jon Kleinberg, and Christos Faloutsos. Graph evolution: Densification and shrinking diameters. ACM Transactions on Knowledge Discovery from Data (TKDD), 1(1):2–es, 2007.

[8] Julian J. McAuley and Jure Leskovec. Learning to discover social circles in ego networks. In NIPS, volume 2012, pages 548–556. Citeseer, 2012.

[9] Franco Scarselli, Marco Gori, Ah Chung Tsoi, Markus Hagenbuchner, and Gabriele Monfardini. The graph neural network model. IEEE Transactions on Neural Networks, 20(1):61–80, 2008.

[10] Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. Graph attention networks. arXiv preprint arXiv:1710.10903, 2017.

[11] Rex Ying, Jiaxuan You, Christopher Morris, Xiang Ren, William L. Hamilton, and Jure Leskovec. Hierarchical graph representation learning with differentiable pooling. arXiv preprint arXiv:1806.08804, 2018.