Augmented Sparsifiers for Generalized Hypergraph Cuts
Austin R. Benson (Computer Science Dept., Cornell University, [email protected]), Jon Kleinberg (Computer Science Dept., Cornell University, [email protected]), Nate Veldt (Center for Applied Math, Cornell University, [email protected])
Abstract
In recent years, hypergraph generalizations of many graph cut problems and algorithms have been introduced and analyzed as a way to better explore and understand complex systems and datasets characterized by multiway relationships. The standard cut function for a hypergraph H = (V, E) assigns the same penalty to a cut hyperedge, regardless of how its nodes are separated by a partition of V. Recent work in theoretical computer science and machine learning has made use of a generalized hypergraph cut function that can be defined by associating each hyperedge e ∈ E with a splitting function w_e, which assigns a (possibly different) penalty to each way of separating the nodes of e. When each w_e is a submodular cardinality-based splitting function, meaning that w_e(S) = g(|S|) for some concave function g, previous work has shown that a generalized hypergraph cut problem can be reduced to a directed graph cut problem on an augmented node set. However, existing reduction procedures introduce up to O(|e|^2) edges for a hyperedge e. This often results in a dense graph, even when the hypergraph is sparse, which leads to slow runtimes (in theory and practice) for algorithms that run on the reduced graph. We introduce a new framework of sparsifying hypergraph-to-graph reductions, where a hypergraph cut defined by submodular cardinality-based splitting functions is (1+ε)-approximated by a cut on a directed graph. Our techniques are based on approximating concave functions using piecewise linear curves, and we show that they are optimal within an existing strategy for hypergraph reduction. We provide bounds on the number of edges needed to model different types of splitting functions. For ε > 0, in the worst case, we need O(ε^{-1} |e| log |e|) edges to reduce any hyperedge e, which leads to faster runtimes for approximately solving generalized hypergraph s-t cut problems. For the common machine learning heuristic of a clique splitting function on a node set e, our approach requires only O(|e|) nodes and O(|e| ε^{-1/2} log log (1/ε)) edges, instead of the O(|e|^2) edges used by existing reductions. Equivalently, we can model the cut properties of a complete graph on n nodes using O(n) nodes and O(n ε^{-1/2} log log (1/ε)) directed and weighted edges. This sparsification leads to faster approximate min s-t graph cut algorithms for certain classes of co-occurrence graphs that are represented implicitly by a collection of sets modeling co-occurrences. Finally, we apply our sparsification techniques to develop the first approximation algorithms for minimizing sums of cardinality-based submodular functions, which arise in numerous machine learning and computer vision applications, producing faster algorithms in a number of settings.

∗ This research was supported by NSF Award DMS-1830274, ARO Award W911NF19-1-0057, ARO MURI, JP Morgan Chase & Co., a Simons Investigator Award, a Vannevar Bush Faculty Fellowship, and a grant from the AFOSR. The authors thank Pan Li for helpful conversations about decomposable submodular function minimization.

1 Introduction
Hypergraphs are a generalization of graphs in which nodes are organized into multiway relationships called hyperedges. Given a hypergraph H = (V, E) and a set of nodes S ⊆ V, a hyperedge e ∈ E is said to be cut by S if both S and S̄ = V \ S contain at least one node from e. Developing efficient algorithms for cut problems in hypergraphs is an active area of research in theoretical computer science [15-17, 23, 36], and has been applied to problems in VLSI layout [4, 28, 35], sparse matrix partitioning [2, 6], and machine learning [44, 46, 67].

Here, we consider recently introduced generalized hypergraph cut functions [44, 46, 66, 70], which assign different penalties to cut hyperedges based on how the nodes of a hyperedge are split into different sides of the bipartition induced by S. To define a generalized hypergraph cut function, each hyperedge e ∈ E is first associated with a splitting function w_e : 2^e → R_+ that maps each node configuration of e (defined by the subset A ⊆ e contained in S) to a nonnegative penalty. In order to mirror edge cut penalties in graphs, splitting functions are typically assumed to be symmetric (w_e(A) = w_e(e \ A)) and to only penalize cut hyperedges (i.e., w_e(∅) = 0). The generalized hypergraph cut function for a set S ⊆ V is then given by

    cut_H(S) = Σ_{e∈E} w_e(S ∩ e).    (1)

The standard hypergraph cut function is all-or-nothing, meaning it assigns the same penalty to a cut hyperedge regardless of how its nodes are separated. Using the splitting function terminology, this means that w_e(A) = 0 if A ∈ {e, ∅}, and w_e(A) = w_e otherwise, where w_e is a scalar hyperedge weight. One particularly relevant class of splitting functions is the class of submodular functions, which for all A, B ⊆ e satisfy w_e(A) + w_e(B) ≥ w_e(A ∩ B) + w_e(A ∪ B).
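A minimal Python sketch may help make Eq. (1) concrete. The hypergraph, the set S, and the two splitting functions below (the all-or-nothing penalty and the clique penalty) are hypothetical choices for illustration only:

```python
# Toy hypergraph (hypothetical example, not from the paper).
hyperedges = [frozenset({0, 1, 2}), frozenset({2, 3, 4}), frozenset({3, 4, 5})]

def all_or_nothing(e, A, weight=1.0):
    """Standard cut penalty: `weight` if e is cut by the partition, else 0."""
    return 0.0 if len(A) in (0, len(e)) else weight

def clique(e, A):
    """Quadratic (clique expansion) splitting penalty: |A| * |e \\ A|."""
    return len(A) * (len(e) - len(A))

def cut_H(S, splitting):
    """Generalized hypergraph cut of Eq. (1): each hyperedge e
    contributes the splitting penalty w_e(S ∩ e)."""
    return sum(splitting(e, S & e) for e in hyperedges)

S = {0, 1, 2, 3}
# {0,1,2} is uncut; {2,3,4} and {3,4,5} are both cut by S.
print(cut_H(S, all_or_nothing))  # -> 2.0
print(cut_H(S, clique))          # -> 4
```

Swapping in a different splitting function only changes the per-hyperedge penalty; the cut function itself is always the sum in Eq. (1).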
When all hyperedge splitting functions are submodular, solving generalized hypergraph cut problems is closely related to minimizing a decomposable submodular function [20, 21, 39, 45, 54, 64], which in turn is closely related to energy minimization problems often encountered in computer vision [24, 37, 38]. The standard graph cut function is another well-known submodular special case of (1).

One of the most common techniques for solving hypergraph cut problems is to reduce the hypergraph to a graph sharing similar (or in some cases identical) cut properties. Arguably the most widely used reduction technique is clique expansion, which replaces each hyperedge with a (possibly weighted) clique [10, 28, 44, 71, 73]. In the unweighted case this corresponds to applying a splitting function of the form w_e(A) = |A| · |e \ A|. Previous work has also explored other classes of submodular hypergraph cut functions that can be modeled as a graph cut problem on a potentially augmented node set [24, 37, 38, 41, 66]. This research primarily focuses on proving when such a reduction is possible, regardless of the number of edges and auxiliary nodes needed to realize the reduction. However, because hyperedges can be very large and splitting functions may be very general and intricate, many of these techniques lead to large and dense graphs. Therefore, the reduction strategy significantly affects the runtime and practicality of algorithms that run on the reduced graph. This leads to several natural questions. Are the graph sizes resulting from existing techniques inherently necessary for modeling hypergraph cuts? Given a class of functions that are known to be graph reducible, can one determine more efficient, or even the most efficient, reduction techniques?
Finally, is it possible to obtain more efficient reductions and faster downstream algorithms if it suffices to only approximately model cut penalties?

To answer these questions, we present a novel framework for sparsifying hypergraph-to-graph reductions with provable guarantees on preserving cut properties. Our framework brings together concepts and techniques from several different theoretical domains, including algorithms for solving generalized hypergraph cut problems [44, 46, 66, 70], standard graph sparsification techniques [8, 10, 62], and tools for approximating functions with piecewise linear curves [49, 50]. We present sparsification techniques for a large and natural class of submodular splitting functions that are cardinality-based, meaning that w_e(A) = w_e(B) whenever |A| = |B|. These are known to always be graph reducible, and are particularly natural for several downstream applications [66]. Our approach leads to graph reductions that are significantly more sparse than previous approaches, and we show that our method is in fact optimally sparse under a certain type of reduction strategy. Our sparsification framework can be directly used to develop faster algorithms for approximately solving hypergraph s-t cut problems [66], and to improve runtimes for a large class of cardinality-based decomposable submodular minimization problems [33, 37, 39, 64]. We also show how our techniques enable us to develop efficient sparsifiers for graphs constructed from co-occurrence data.

Our framework and results share numerous connections with existing work on graph sparsification, which we review here. Let G = (V, E) be a graph with a cut function cut_G, which can be viewed as a very restricted case of the generalized hypergraph cut function in Eq. (1). An ε-cut sparsifier for G is a sparse weighted and undirected graph H = (V, F) with cut function cut_H, such that

    cut_G(S) ≤ cut_H(S) ≤ (1 + ε) cut_G(S),    (2)

for every subset S ⊆ V.
This definition was introduced by Benczúr and Karger [9], who showed how to obtain a sparsifier with O(n log n / ε^2) edges for any graph in O(m log^2 n) time for an n-node, m-edge graph. The more general notion of spectral sparsification, which approximately preserves the Laplacian quadratic form of a graph rather than just the cut function, was later introduced by Spielman and Teng [61]. The best cut and spectral sparsifiers have O(n/ε^2) edges, which is known to be asymptotically optimal for both spectral and cut sparsifiers [5, 8]. Although studied much less extensively, analogous definitions of cut [17, 36] and spectral [60] sparsifiers for hypergraphs have also been developed. However, these apply exclusively to the all-or-nothing cut penalty, and do not preserve generalized cut functions of the form shown in (1). Bansal et al. [7] also considered a weaker notion of graph and hypergraph sparsification, involving additive approximation terms, but in the present work we only consider multiplicative approximations.

In this paper, we introduce an alternative notion of an augmented cut sparsifier. We present our results in the context of hypergraph-to-graph reductions, though our framework also provides a new notion of augmented sparsifiers for graphs. Let H = (V, E) be a hypergraph with a generalized cut function cut_H, and let Ĝ = (V ∪ A, Ê) be a directed graph on an augmented node set V ∪ A. The graph is equipped with an augmented cut function defined for any S ⊆ V by

    cut_Ĝ(S) = min_{T⊆A} dircut_Ĝ(S ∪ T),    (3)

where dircut_Ĝ is the standard directed cut function on Ĝ. We say that Ĝ is an ε-augmented cut sparsifier for H if it is sparse and satisfies

    cut_H(S) ≤ cut_Ĝ(S) ≤ (1 + ε) cut_H(S).    (4)

The minimization involved in (3) is especially natural when the goal is to approximate a minimum cut or minimum s-t cut in H.
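Since the minimization in (3) ranges over placements of the auxiliary nodes, the augmented cut of a small gadget can be evaluated by brute force. The following sketch is our own toy example (the star-shaped gadget and all names are hypothetical); it checks that a single auxiliary node with unit-weight directed edges in both directions realizes the linear penalty min{|A|, |e \ A|}:

```python
from itertools import chain, combinations

def dircut(X, edges):
    """Standard directed cut: total weight of edges leaving the set X."""
    return sum(wt for (u, v), wt in edges.items() if u in X and v not in X)

def augmented_cut(S, aux, edges):
    """Eq. (3): minimize the directed cut over all placements T ⊆ aux of
    the auxiliary nodes (brute force over subsets, for intuition only)."""
    subsets = chain.from_iterable(
        combinations(aux, k) for k in range(len(aux) + 1))
    return min(dircut(set(S) | set(T), edges) for T in subsets)

# Hypothetical gadget: a star on e = {0, 1, 2, 3} with one auxiliary node
# "x" and unit-weight directed edges both ways, modeling min{|A|, |e \ A|}.
e = {0, 1, 2, 3}
edges = {}
for v in e:
    edges[(v, "x")] = 1.0
    edges[("x", v)] = 1.0

for A in [set(), {0}, {0, 1}, {0, 1, 2}, e]:
    assert augmented_cut(A, {"x"}, edges) == min(len(A), len(e - A))
print("star gadget models the linear penalty")
```

Placing "x" on whichever side of the cut is cheaper is exactly the minimization over T in (3).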
If we solve the corresponding cut problem in Ĝ, nodes from the auxiliary node set A will be automatically arranged in a way that yields the minimum directed cut penalty, as required in (3). If Ŝ* is the minimum cut in Ĝ, then S* = V ∩ Ŝ* will be a (1+ε)-approximate minimum cut in H. Even when solving a minimum cut problem is not the goal, our sparsifiers will be designed in such a way that the augmented cut function (3) will be easy to evaluate.

Unlike the standard graph sparsification problem, in some cases it may in fact be impossible to find any directed graph Ĝ satisfying (4), independent of the graph's density. In recent work we showed that hypergraphs with non-submodular splitting functions are never graph reducible [66]. Živný et al. [74] showed that even in the case of four-node hyperedges, there exist submodular splitting functions (albeit asymmetric ones) that are not representable by graph cuts. Nevertheless, there are several special cases in which graph reduction is possible [24, 37, 38].

Augmented Sparsifiers for Cardinality-Based Hypergraph Cuts
We specifically consider the class of submodular splitting functions that are cardinality-based, meaning they satisfy w_e(A) = w_e(B) whenever A, B ⊆ e satisfy |A| = |B|. These are known to be graph reducible [37, 66], though existing techniques will reduce a hypergraph H = (V, E) to a graph with O(|V| + Σ_{e∈E} |e|) nodes and O(Σ_{e∈E} |e|^2) edges. We prove the following sparse reduction result.

Theorem 1.1.
Let H = (V, E) be a hypergraph where each e ∈ E is associated with a cardinality-based submodular splitting function. There exists an augmented cut sparsifier Ĝ for H with O(|V| + ε^{-1} Σ_{e∈E} log |e|) nodes and O(ε^{-1} Σ_{e∈E} |e| log |e|) edges.

For certain types of splitting functions (e.g., the one corresponding to a clique expansion), we show that our reductions are even more sparse.
Augmented Sparsifiers for Graphs
Another relevant class of augmented sparsifiers to consider is the setting where H is simply a graph. In this case, if A is empty and all edges are undirected, condition (4) reduces to the standard definition of a cut sparsifier. A natural question is whether there exist cases where allowing auxiliary nodes and directed edges leads to improved sparsifiers. We show that the answer is yes in the case of dense graphs constructed from co-occurrence data.
Just as spectral sparsifiers generalize cut sparsifiers in the standard graph setting, one can define an analogous notion of an augmented spectral sparsifier for hypergraph reductions. This can be accomplished using existing hypergraph generalizations of the Laplacian operator [14, 46, 47, 70]. However, although developing augmented spectral sparsifiers constitutes an interesting open direction for future research, it is unclear whether the techniques we develop here can be used or adapted to spectrally approximate generalized hypergraph cut functions. We include further discussion on hypergraph Laplacians and spectral sparsifiers in Section 7, and pose questions for future work. Our primary focus in this manuscript is to develop techniques for augmented cut sparsifiers.
Graph reduction techniques work by replacing a hyperedge with a small graph gadget modeling the same cut properties as the hyperedge splitting function. The simplest example of a graph reducible function is the quadratic splitting function, which we also refer to as the clique splitting function:

    w_e(A) = |A| · |e \ A|, for A ⊆ e.    (5)
Figure 1: Three gadgets, each modeling a different hyperedge splitting function: (a) a star gadget, (b) a clique gadget, and (c) a CB-gadget.

This function can be modeled by replacing a hyperedge with a clique (Figure 1b). Another function that can be modeled by a gadget is the linear penalty, which can be modeled by a star gadget [73]:

    w_e(A) = min{|A|, |e \ A|}, for A ⊆ e.    (6)

A star gadget (Figure 1a) contains an auxiliary node v_e for each e ∈ E, which is attached to each v ∈ e with an undirected edge. In order to model the broader class of submodular cardinality-based splitting functions, we previously introduced the cardinality-based gadget [66] (CB-gadget) (Figure 1c). This gadget is parameterized by positive scalars a and b, and includes two auxiliary nodes e′ and e″. For each node v ∈ e, there is a directed edge from v to e′ and a directed edge from e″ to v, both of weight a. Lastly, there is a directed edge from e′ to e″ of weight a · b. This CB-gadget corresponds to the following splitting function:

    w_{a,b}(A) = a · min{|A|, |e \ A|, b}.    (7)

Every submodular, cardinality-based (SCB) splitting function can be modeled by a combination of CB-gadgets with different edge weights [66]. A different reduction strategy for minimizing submodular energy functions with cardinality-based penalties was also previously developed by Kohli et al. [37]. Both techniques require up to O(k^2) directed edges for a k-node hyperedge.

Sparse Combinations of CB-gadgets
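The cut behavior of a single CB-gadget can be verified by enumerating the four placements of e′ and e″ and taking the cheapest directed cut, as in the minimization of Eq. (3). A small sketch with hypothetical parameters a and b:

```python
from itertools import product

def cb_gadget_cut(e, A, a, b):
    """Directed cut penalty of a single CB-gadget for hyperedge e when the
    nodes of A are on the source side, minimized over the four placements
    of the auxiliary nodes e' and e'' (the minimization in Eq. (3))."""
    best = float("inf")
    for ep, epp in product([True, False], repeat=2):  # e', e'' on source side?
        cost = 0.0
        for v in e:
            if v in A and not ep:       # edge v -> e' of weight a is cut
                cost += a
            if epp and v not in A:      # edge e'' -> v of weight a is cut
                cost += a
        if ep and not epp:              # edge e' -> e'' of weight a*b is cut
            cost += a * b
        best = min(best, cost)
    return best

# Hypothetical parameters: a 6-node hyperedge with a = 2.0, b = 2.
e, a, b = set(range(6)), 2.0, 2
for size in range(len(e) + 1):
    A = set(range(size))
    assert cb_gadget_cut(e, A, a, b) == a * min(len(A), len(e) - len(A), b)  # Eq. (7)
print("CB-gadget matches Eq. (7) for every split size")
```

The three cases of the minimum correspond to the three placements: e′ on the sink side costs a·|A|, both auxiliary nodes on the source side costs a·|e \ A|, and e′ on the source side with e″ on the sink side costs a·b.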
Our work introduces a new framework for approximately modeling submodular cardinality-based (SCB) splitting functions using small combinations of CB-gadgets. Figure 2 illustrates our sparsification strategy. We first associate an SCB splitting function with a set of points {(i, w_i)}, where i represents the number of nodes on the "small side" of a cut hyperedge, and w_i is the penalty for such a split. We show that when many of these points are collinear, they can be modeled with a smaller number of CB-gadgets. As an example, the star expansion penalties (6) can be modeled with a single CB-gadget (Figures 2a and 2d), whereas modeling the quadratic penalty with previous techniques [66] requires many more (Figures 2b and 2e). Given this observation, we design new techniques for ε-approximating the set of points {(i, w_i)} with a piecewise linear curve using a small number of linear pieces. We then show how to translate the resulting piecewise linear curve back into a smaller combination of CB-gadgets that ε-approximates the original splitting function. Our piecewise linear approximation strategy allows us to find the optimal (i.e., minimum-sized) graph reduction in terms of CB-gadgets. When ε = 0, our approach finds the best way to exactly model an SCB splitting function, and requires only half the number of gadgets needed by previous techniques [66]. More importantly, for larger ε, we prove the following sparse approximation result, which is used to prove Theorem 1.1.

Figure 2: (a) The linear splitting function (6) can be modeled by a sparse gadget (d). The quadratic splitting function (5) penalties (b) can be modeled by a dense gadget (e). A piecewise linear approximation of the quadratic splitting penalties (c) corresponds to a sparse gadget (f).
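The following sketch illustrates the idea in our own simplified form (this is not the paper's optimal algorithm, and all names are ours): greedily cover the points {(i, w_i)} with secant lines through consecutive points, extending each piece while it stays within a (1 + ε) factor. For a concave square-root penalty, a handful of pieces suffices where an exact representation needs one CB-gadget per point:

```python
import math

def greedy_pl_cover(w, r, eps):
    """Greedy sketch: cover the points {(i, w(i)) : 0 <= i <= r} with few
    lines, each passing through two consecutive points of w. By concavity,
    such a line upper-bounds w at every integer; a piece is extended while
    it stays within a (1 + eps) factor of w."""
    lines = []
    i = 0
    while i < r:
        m = w(i + 1) - w(i)          # slope of the secant through i, i+1
        d = w(i) - m * i             # its intercept (nonnegative by concavity)
        lines.append((m, d))
        j = i + 1
        while j <= r and m * j + d <= (1 + eps) * w(j):
            j += 1
        i = j - 1                    # last index this piece certifies
    lines.append((0.0, w(r)))        # the single piece of slope zero
    return lines

# Square-root splitting penalties on a k-node hyperedge, r = k // 2.
k, eps = 1000, 0.1
r = k // 2
w = lambda i: math.sqrt(i)
lines = greedy_pl_cover(w, r, eps)
f = lambda x: min(m * x + d for m, d in lines)  # lower envelope
assert all(w(i) - 1e-9 <= f(i) <= (1 + eps) * w(i) + 1e-9 for i in range(1, r + 1))
print(f"{len(lines)} linear pieces cover r = {r} points")
```

Each slope-decreasing line of the resulting envelope can then be converted back into one CB-gadget, so few pieces means a sparse gadget combination.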
Theorem 1.2.
For ε ≥ 0, any submodular cardinality-based splitting function on a k-node hyperedge can be ε-modeled by combining O(min{ε^{-1} log k, k}) CB-gadgets.
We show a nearly matching lower bound: Ω(log k / √ε) CB-gadgets are required to model a square root splitting function. Despite these worst-case bounds, we prove that only O(ε^{-1/2} log log (1/ε)) CB-gadgets are needed to approximate the quadratic splitting function, independent of hyperedge size. This is particularly relevant for approximating the widely used clique expansion technique, as well as for modeling certain types of dense co-occurrence graphs. All of our sparse reduction techniques are combinatorial, deterministic, and very simple to use in practice.

When H is just a graph, augmented sparsifiers correspond to a generalization of standard cut sparsifiers that allow directed edges and auxiliary nodes. The auxiliary nodes in this case play a role analogous to Steiner nodes in finding minimum spanning trees. Just as adding Steiner nodes makes it possible to find a smaller weight spanning tree, it is natural to ask whether including an auxiliary node set might lead to better cut sparsifiers for a graph G. We show that the answer is yes for certain classes of dense co-occurrence graphs, which are graphs constructed by inducing a clique on a set of nodes that share a certain property or participate in a certain type of group interaction (equivalently, clique expansions of hypergraphs). Steiner nodes have in fact been previously used in constructing certain types of sparsifiers called vertex and flow sparsifiers [18]. However, these are concerned with preserving certain routing properties between distinguished terminal nodes in a graph, and are therefore distinct from our goal of obtaining ε-cut sparsifiers.

Sparsifying the complete graph
Our ability to sparsify the clique splitting function (5) directly implies a new approach for sparsifying a complete graph. Cut sparsifiers for the complete graph provide a simple case study for understanding the differences in sparsification guarantees that can be obtained when we allow auxiliary nodes and directed edges. Furthermore, better sparsifiers for the complete graph can be used to design useful sparsifiers for co-occurrence graphs. We have the following result.
Theorem 1.3.
Let G = (V, E) be the complete graph on n = |V| nodes. There exists an ε-augmented sparsifier for G with O(n) nodes and O(n ε^{-1/2} log log (1/ε)) edges.

By comparison, the best standard cut and spectral sparsifiers for the complete graph have exactly n nodes and O(n/ε^2) edges. This is tight for spectral sparsifiers [8], as well as for degree-regular cut sparsifiers with uniform edge weights [3]. Thus, by adding a small number of auxiliary nodes, our sparsifiers enable us to obtain a significantly better dependence on ε when cut-sparsifying a complete graph. Our sparsifier is easily constructed deterministically in O(n ε^{-1/2} log log (1/ε)) time. Standard undirected sparsifiers for the complete graph have received significant attention, as they correspond to expander graphs [3, 8, 48, 51]. We remark that the directed augmented cut sparsifiers we produce are very different in nature and should not be viewed as expanders. In particular, unlike for expander graphs, random walks on our complete graph sparsifiers will converge to a very non-uniform distribution. We are interested in augmented sparsifiers for the complete graph simply for their ability to model cut properties in a different way, and for the implications this has for sparsifying hypergraph clique expansions and co-occurrence graphs.

Sparsifying co-occurrence graphs
Co-occurrence relationships are inherent in the construction of many types of graphs. Formally, consider a set of n = |V| nodes that are organized into a set of co-occurrence interactions C ⊆ 2^V. Each interaction c ∈ C is associated with a weight w_c > 0, and an edge between nodes i and j is created with weight w_ij = Σ_{c∈C : i,j∈c} w_c. When w_c = 1 for every c ∈ C, w_ij equals the number of interactions that i and j share. We use d_avg to denote the average number of co-occurrence interactions in which nodes in V participate. The cut value in the resulting graph G = (V, E) for a set S ⊆ V is given by the following co-occurrence cut function:

    cut_G(S) = Σ_{c∈C} w_c · |S ∩ c| · |S̄ ∩ c|.    (8)

Graphs with this co-occurrence cut function arise frequently as clique expansions of a hypergraph [10, 28, 71, 73], or as projections of a bipartite graph [42, 52, 53, 57, 63, 69, 72]. Even when the underlying dataset is not first explicitly modeled as a hypergraph or bipartite graph, many approaches implicitly use this approach to generate a graph from data. When enough group interaction sizes are large, G becomes dense, even if |C| is small. We can significantly sparsify G by applying an efficient sparsifier to each clique induced by a co-occurrence relationship. Importantly, we can do this without ever explicitly forming G. By applying Theorem 1.3 as a black box for clique sparsification, we obtain the following result.

Theorem 1.4.
Let G = (V, E) be the co-occurrence graph for some C ⊆ 2^V and let n = |V|. For ε > 0, there exists an augmented sparsifier Ĝ with O(n + |C| · f(ε)) nodes and O(n · d_avg · f(ε)) edges, where f(ε) = ε^{-1/2} log log (1/ε). In particular, if d_avg is constant and for some δ > 1 we have Σ_{c∈C} |c|^2 = Ω(n^δ), then forming G explicitly takes Ω(n^δ) time, but an augmented sparsifier for G with O(n f(ε)) nodes and O(n f(ε)) edges can be constructed in O(n f(ε)) time.

Importantly, the average co-occurrence degree d_avg is not the same as the average node degree in G, which will typically be much larger. Theorem 1.4 highlights that in regimes where d_avg is a constant, our augmented sparsifiers will have fewer edges than the number needed by standard ε-cut sparsifiers. In Section 5, we consider simple graph models that satisfy these assumptions. We also consider tradeoffs between our augmented sparsifiers and standard sparsification techniques for co-occurrence graphs.

Table 1: Runtimes for cardinality-based decomposable submodular function minimization, and bounds for special regimes. k = k_avg = μ/R is the average hyperedge size. For IBFS when k = Θ(n), we simply list lower bounds indicating why these methods are not practical in this case.

                                                             k = Θ(n)
Method               Runtime                    k = O(1)     n = Ω(R)     R = Ω(n)
Kolmogorov SF [39]   Õ(R k)                     Õ(R)         Õ(R n)       Õ(R n)
IBFS Strong [20,22]  O(n θ_max Σ_e |e|)         O(n)         Ω(n)         Ω(n R)
IBFS Weak [20,22]    Õ(n θ_max + n Σ_e |e|)     Õ(n)         Ω(n)         Ω(n Σ_e |e|)
ACDM [20,21]         Õ(n R k)                   Õ(n R)       Õ(n R)       Õ(n R)
This paper           Õ(min{(Rk/ε)^{3/2}, (Rk/ε)(n + R/ε)^{2/3}})   Õ((R/ε)^{3/2})   Õ(R (n/ε))   Õ(n (R/ε))
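Eq. (8) can be evaluated directly from the interaction set C without ever materializing the dense graph G. A small sketch (the interaction data are hypothetical) that also cross-checks against the explicitly built clique-expansion graph:

```python
from itertools import combinations
from collections import defaultdict

# Hypothetical co-occurrence data: each interaction c ∈ C is a set of
# nodes, all with unit weight w_c = 1.
C = [{0, 1, 2, 3}, {2, 3, 4}, {1, 4, 5}]

def cooccurrence_cut(S, C, weights=None):
    """Eq. (8): sum over interactions of w_c * |S ∩ c| * |S̄ ∩ c|,
    evaluated directly from C without forming the dense graph G."""
    weights = weights or [1.0] * len(C)
    return sum(wc * len(S & c) * len(c - S) for wc, c in zip(weights, C))

def explicit_graph_cut(S, C):
    """Reference implementation: build G explicitly (w_ij = number of
    interactions shared by i and j) and evaluate the ordinary graph cut."""
    w_ij = defaultdict(float)
    for c in C:
        for i, j in combinations(sorted(c), 2):
            w_ij[(i, j)] += 1.0
    return sum(wt for (i, j), wt in w_ij.items() if (i in S) != (j in S))

S = {0, 1, 2}
assert cooccurrence_cut(S, C) == explicit_graph_cut(S, C)
print(cooccurrence_cut(S, C))  # -> 7.0
```

The direct evaluation costs O(Σ_c |c|) per query, while building G costs O(Σ_c |c|^2), which is the gap the theorem exploits.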
Independent of the black-box sparsifier used, implicitly sparsifying G in this way will often lead to significant runtime improvements over forming G explicitly.

Typically in hypergraph cut problems it is natural to assume that splitting functions are symmetric and satisfy w_e(∅) = w_e(e) = 0. However, we show that our sparse reduction techniques apply even when these assumptions do not hold. This allows us to design fast algorithms for approximately solving cardinality-based decomposable submodular minimization problems. Formally, a function f : 2^V → R_+ is a decomposable submodular function if it can be written as

    f(S) = Σ_{e∈E} f_e(S ∩ e),    (9)

where each f_e is a submodular function defined on a set e ⊆ V. Following our previous notation and terminology, we say f_e is cardinality-based if f_e(S) = g_e(|S|) for some concave function g_e. This special case has also received some attention in previous literature on decomposable submodular function minimization [33, 37, 39, 64]. Existing approaches for minimizing these functions focus largely on finding exact solutions. Using our sparse reduction techniques, we develop the first fast algorithms for approximately solving the problem. Let n = |V|, R = |E|, and μ = Σ_{e∈E} |e|. In Appendix B, we show that a result similar to Theorem 1.1 also holds for more general cardinality-based splitting functions. In Section 6, we combine that result with the s-t cut solvers of Goldberg and Rao [27] to prove the following theorem.

Theorem 1.5.
Let ε > 0. Any cardinality-based decomposable submodular function can be minimized to within a multiplicative (1 + ε) factor in Õ(min{ε^{-3/2} μ^{3/2}, ε^{-1} μ (n + ε^{-1} R)^{2/3}}) time.

We compare this runtime against the best previous techniques for exactly minimizing sums of cardinality-based submodular functions. We summarize runtimes for competing approaches in Table 1, which includes both strongly polynomial and weakly polynomial methods, the latter of which assume integer-valued functions. We again note that the runtimes for competing approaches are for finding exact minimizers, whereas our approach provides a (1 + ε) guarantee. Our techniques enable us to highlight regimes of the problem where we can obtain significantly faster algorithms in cases where it is sufficient to solve the problem approximately. For example, whenever n = Ω(R), our algorithms for finding approximate solutions provide a runtime advantage, often a significant one, over approaches for computing an exact solution.

A generalized hypergraph cut function is defined as the sum of its splitting functions. Therefore, if we can design a technique for approximately modeling a single hyperedge with a sparse graph, this in turn provides a method for constructing an augmented sparsifier for the entire hypergraph. We now formalize the problem of approximating a submodular cardinality-based (SCB) splitting function using a combination of cardinality-based (CB) gadgets. We abstract this as the task of approximating a certain class of functions with integer inputs (equivalent to SCB splitting functions), using a small number of simpler functions (equivalent to cut properties of the gadgets). Let [r] = {1, 2, . . . , r}.

Definition 2.1. An r-SCB integer function is a function w : {0} ∪ [r] → R_+ satisfying

    w(0) = 0    (10)
    2w(j) ≥ w(j − 1) + w(j + 1) for j = 1, . . . , r − 1    (11)
    0 ≤ w(1) ≤ w(2) ≤ . . . ≤ w(r)    (12)

We denote the set of r-SCB integer functions by S_r.

The value w(i) represents the splitting penalty for placing i nodes on the small side of a cut hyperedge. In previous work we showed that the inequalities given in Definition 2.1 are necessary and sufficient conditions for a cardinality-based splitting function to be submodular [66]. The r-SCB integer function for a CB-gadget with edge parameters (a, b) (see (7)) is

    w_{a,b}(i) = a · min{i, b}.    (13)

Combining J CB-gadgets produces a combined r-SCB integer function of special importance.

Definition 2.2. An r-CCB (Combined Cardinality-Based gadget) function of order J is an r-SCB integer function ŵ of the form

    ŵ(i) = Σ_{j=1}^J a_j · min{i, b_j}, for i ∈ [r],    (14)

where the J-dimensional vectors a = (a_j) and b = (b_j) parameterizing ŵ satisfy:

    b_j > 0, a_j > 0 for all j ∈ [J]    (15)
    b_j < b_{j+1} for j ∈ [J − 1]    (16)
    b_J ≤ r.    (17)

We denote the set of r-CCB functions of order J by C_r^J.

The conditions on the vectors a and b come from natural observations about combining CB-gadgets. Condition (15) ensures that we do not consider CB-gadgets where all edge weights are zero. The ordering in condition (16) is for convenience; the fact that the b_j values are all distinct implies that we cannot collapse two distinct CB-gadgets into a single CB-gadget with new weights. For condition (17), observe that for any b_J ≥ r, min{i, b_J} = i for all i ∈ [r].
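A quick numerical check of these definitions (with hypothetical gadget parameters): any r-CCB function built from valid vectors (a, b) should satisfy conditions (10)-(12), consistent with the containment C_r^J ⊆ S_r:

```python
def ccb(a, b, r):
    """The r-CCB function of Eq. (14): ŵ(i) = Σ_j a_j · min(i, b_j),
    returned as the list [ŵ(0), ŵ(1), ..., ŵ(r)]."""
    return [sum(aj * min(i, bj) for aj, bj in zip(a, b)) for i in range(r + 1)]

def is_scb(w):
    """Check the r-SCB conditions (10)-(12): w(0) = 0, concavity of the
    integer sequence, and monotonicity."""
    r = len(w) - 1
    return (w[0] == 0
            and all(2 * w[j] >= w[j - 1] + w[j + 1] for j in range(1, r))
            and all(w[j] <= w[j + 1] for j in range(r)))

# Hypothetical gadget parameters: two CB-gadgets (J = 2) on r = 8.
w_hat = ccb(a=[1.5, 0.5], b=[2, 6], r=8)
assert is_scb(w_hat)
print(w_hat)  # -> [0.0, 2.0, 4.0, 4.5, 5.0, 5.5, 6.0, 6.0, 6.0]
```

The printed sequence is concave and increasing, flattening once i passes the largest breakpoint b_2 = 6, exactly as conditions (15)-(17) require.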
For a helpful visual, note that the r-SCB function in (13) represents splitting penalties for the CB-gadget in Figure 1c. An r-CCB function corresponds to a combination of CB-gadgets, as in Figures 2c and 2f. In previous work we showed that any combination of CB-gadgets produces a submodular and cardinality-based splitting function, which is equivalent to stating that C_r^J ⊆ S_r for all J ∈ N [66]. Furthermore, C_r^r = S_r, since any r-SCB splitting function can be modeled by a combination of r CB-gadgets. Our goal here is to determine how to approximate a function w ∈ S_r with some function ŵ ∈ C_r^J where J ≪ r. This corresponds to modeling an SCB splitting function using a small combination of CB-gadgets.

Definition 2.3.
For a fixed w ∈ S_r and an approximation tolerance parameter ε ≥ 0, the Sparse Gadget Approximation Problem (Spa-GAP) is the following optimization problem:

    minimize κ
    subject to w ≤ ŵ ≤ (1 + ε) w
               ŵ ∈ C_r^κ.    (18)
Problem (18) specifically optimizes over functions ŵ that upper bound w. This restriction simplifies several aspects of our analysis without any practical consequence. For example, we could instead fix some δ ≥ 1 and consider approximating functions w̃ satisfying δ^{-1} w ≤ w̃ ≤ δ w. However, this implies that the function ŵ = δ w̃ satisfies w ≤ ŵ ≤ (1 + ε) w, with ε = δ^2 − 1. Thus, the problems are equivalent for the correct choice of δ and ε.
A natural question to ask is whether it would be better to search for a sparsest approximating gadget over a broader class of gadgets. There are several key reasons why we restrict to combinations of CB-gadgets. First of all, we already know these can model any SCB splitting function, and thus they provide a very simple building block with broad modeling capabilities. Furthermore, it is clear how to define an optimally sparse combination of CB-gadgets: since all CB-gadgets for a k-node hyperedge have the same number of auxiliary nodes and directed edges, an optimally sparse reduction is one with a minimum number of CB-gadgets. If we instead wish to optimize over all possible gadgets, it is likely that the best reduction technique will depend on the splitting function that we wish to approximate. Furthermore, the optimality of a gadget may not even be well-defined, since one must take into account both the number of auxiliary nodes as well as the number of edges that are introduced, and the tradeoff between the two is not always clear. Finally, as we shall see in the next section, by restricting to CB-gadgets, we are able to draw a useful connection between sparse gadgets and approximating piecewise linear curves with a smaller number of linear pieces.

We begin by defining the class of piecewise linear functions in which we are interested.
Definition 3.1.
For r ∈ ℕ, F_r is the class of functions f : [0, ∞) → ℝ_+ such that:

1. f(0) = 0,
2. f is constant for all x ≥ r,
3. f is increasing: x_1 ≤ x_2 ⟹ f(x_1) ≤ f(x_2),
4. f is piecewise linear,
5. f is concave (and hence, continuous).

It will be key to keep track of the number of linear pieces that make up a given function f ∈ F_r. Let 𝓛 be the set of linear functions with nonnegative slopes and intercept terms:

    𝓛 = { g(x) = mx + d | m, d ∈ ℝ_+ }.   (19)

Every function f ∈ F_r can be characterized as the lower envelope of a set of these linear functions:

    f(x) = min_{g ∈ L} g(x), where L ⊂ 𝓛.   (20)

We use |L| to denote the number of linear pieces of f. In order for (20) to properly characterize a function in F_r, it must be constant for all x ≥ r (property 2 in Definition 3.1), and thus L must contain exactly one line of slope zero. The continuous extension f̂ of an r-CCB function w parameterized by (a, b) is defined as

    f̂(x) = Σ_{j=1}^{J} a_j · min{x, b_j}  for x ∈ [0, ∞).   (21)

We prove that continuously extending any r-CCB function always produces a function in F_r. Conversely, every f ∈ F_r is the continuous extension of some r-CCB function. Appendix A provides proofs for these results. Lemma 3.1.
Let f̂ be the continuous extension for w, shown in (21). This function is in the class F_r, and has exactly J positive-sloped linear pieces and one linear piece of slope zero. Lemma 3.2.
Let f be a function in F_r with J + 1 linear pieces. Let b_i denote the i-th breakpoint of f, and m_i denote the slope of the i-th linear piece of f. Define vectors a, b ∈ ℝ^J where b(i) = b_i and a(i) = a_i = m_i − m_{i+1} for i ∈ [J]. If w is the r-CCB function parameterized by vectors (a, b), then f is the continuous extension of w.

Let w ∈ S_r be an arbitrary SCB integer function. Lemma 3.2 implies that if we can find a piecewise linear function f that approximates w and has few linear pieces, we can extract from it a CCB function ŵ with a small order J that approximates w. Equivalently, we can find a sparse gadget that approximates an SCB splitting function of interest. Our updated goal is therefore to solve the following piecewise linear approximation problem, for a given w ∈ S_r and ε ≥ 0:

    minimize_{L ⊂ 𝓛}  |L|
    subject to  w(i) ≤ f(i) ≤ (1 + ε)w(i)  for i ∈ [r],
                f ∈ F_r,  f(x) = min_{g ∈ L} g(x),
                for each g ∈ L, g(j) = w(j) for some j ∈ {0} ∪ [r].   (22)

The last constraint ensures that each linear piece g ∈ L we consider passes through at least one point (j, w(j)). We can add this constraint without loss of generality; if any linear piece g were strictly greater than w at all integers, we could obtain an improved approximation by scaling g down until it is tangent to w at some point. This constraint, together with the requirement f ∈ F_r, implies that the constant function g^(r)(x) = w(r) is contained in every set of linear functions L that is feasible for (22). Since all feasible solutions contain this constant linear piece, our focus is on determining the optimal set of positive-sloped linear pieces needed to approximate w.

Figure 3: We restrict our attention to lines in 𝓛 that coincide with w at at least one integer value. Thus, every function we consider is incident to two consecutive values of w (e.g., the solid line, g^(1)), or it touches w at exactly one point (dashed line, g).

Optimal linear covers.
Given a fixed ε ≥ 0 and an integer i ∈ {0} ∪ [r − 1], we say that L ⊂ 𝓛 is a linear cover for a function w ∈ S_r over the range R = {i, i + 1, . . . , r} if each g ∈ L upper bounds w at all points, and if for each j ∈ R there exists g ∈ L such that g(j) ≤ (1 + ε)w(j). The set L is an optimal linear cover if it contains the minimum number of positive-sloped linear pieces needed to cover R. Thus, an equivalent way of expressing (22) is that we wish to find an optimal linear cover for w over the range {0} ∪ [r]. In practice there may be many different functions f ∈ F_r which solve (22), but for our purposes it suffices to find one.

We solve problem (22) by iteratively growing a set of linear functions L ⊂ 𝓛 one function at a time, until all of w is covered. Let f be the piecewise linear function we construct from linear pieces in L. In order for f to upper bound w, every function g ∈ L in problem (22) must upper bound w at every i ∈ {0} ∪ [r]. One way to obtain such a linear function is to connect two consecutive points of w. For i ∈ {0} ∪ [r − 1], the line joining (i, w(i)) and (i + 1, w(i + 1)) is given by

    g^(i)(x) = M_i (x − i) + w(i),   (23)

where the slope of the line is M_i = w(i + 1) − w(i). In order for a line to upper bound w but only pass through a single point (i, w(i)) for some i ∈ [r − 1], it must be of the form

    g(x) = m (x − i) + w(i),   (24)

where the slope m satisfies M_i < m < M_{i−1}. The existence of such a line g is only possible when the points (i − 1, w(i − 1)), (i, w(i)), and (i + 1, w(i + 1)) are not collinear. To understand the strict bounds on m, note that if g passes through (i, w(i)) and has slope exactly M_{i−1}, then g is in fact the line g^(i−1) and also passes through (i − 1, w(i − 1)). If g has slope greater than M_{i−1}, then g(i − 1) < w(i − 1) and g does not upper bound w everywhere. We can similarly argue that the slope of g must be strictly greater than M_i so that it does not touch or cross below the point (i + 1, w(i + 1)).

We illustrate both types of functions (23) and (24) in Figure 3. The following simple observation will later help in comparing approximation properties of different functions in 𝓛. Observation 3.1.
For a fixed w ∈ S_r, let g, h ∈ 𝓛 both upper bound w at all integers i ∈ {0} ∪ [r], and assume that for some j ∈ {0} ∪ [r], g(j) = h(j) = w(j). If m_g and m_h are the slopes of g and h respectively, and m_g ≥ m_h ≥ 0, then

• for every integer i ∈ [0, j], w(i) ≤ g(i) ≤ h(i);
• for every integer i ∈ [j, r], w(i) ≤ h(i) ≤ g(i).

In other words, if g and h are both tangent to w at the same point j, but g has a larger slope than h, then g provides a better approximation for values smaller than j, while h is the better approximation for values larger than j. The first linear piece.
Every set L solving (22) must include a linear piece that goes through the origin, so that f(0) = 0. We specifically choose g^(0)(x) = (w(1) − w(0))x + w(0) = w(1)·x to be the first linear piece in the set L we construct. Given this first linear piece, we can then compute the largest integer i ∈ [r] for which g^(0) provides a (1 + ε)-approximation:

    p = max{ i ∈ [r] | g^(0)(i) ≤ (1 + ε)w(i) }.

The integer ℓ = p + 1 is therefore the smallest integer for which we do not have a (1 + ε)-approximation. If ℓ ≤ r, our task is then to find the smallest number of additional linear pieces needed to cover {ℓ, . . . , r} with (1 + ε)-approximations. By Observation 3.1, any other g ∈ 𝓛 with g(0) = 0 and g(1) > w(1) will be a worse approximation to w at all integer values: w(i) ≤ g^(0)(i) < g(i) for all i ∈ [r]. Therefore, as long as we can find a minimum set of additional linear pieces which provides a (1 + ε)-approximation for all of {ℓ, . . . , r}, our set of functions L will optimally solve objective (22). Iteratively finding the next linear piece.
Consider now a generic setting in which we are given a left integer endpoint ℓ and we wish to find linear pieces to approximate the function w from ℓ to r. We first check whether the constant function g^(r)(x) = w(r) provides the desired approximation:

    g^(r)(ℓ) ≤ (1 + ε)w(ℓ).   (25)

If so, we augment L to include g^(r) and we are done, since this implies that g^(r) also provides at least a (1 + ε)-approximation at every i ∈ {ℓ, ℓ + 1, . . . , r}. If (25) does not hold, we must add another positive-sloped linear function to L in order to get the desired approximation for all i ∈ [r]. We adopt a greedy approach that chooses the next line to be the optimizer of the following objective:

    maximize_{g ∈ 𝓛}  p′   subject to   w(j) ≤ g(j) ≤ (1 + ε)w(j)  for j = ℓ, ℓ + 1, . . . , p′.   (26)

In other words, solving problem (26) means finding a function that provides at least a (1 + ε)-approximation from ℓ to as far towards r as possible, in order to cover the widest possible contiguous interval with the same approximation guarantee. (There is always a feasible point, obtained by adding a line g tangent to w at ℓ.) The following lemma will help us prove that this greedy scheme produces an optimal cover for w.

Lemma 3.3. Let p* be the solution to (26) and g* be the function that achieves it. If L̂ ⊂ 𝓛 is an optimal cover for w over the integer range {p* + 1, p* + 2, . . . , r}, then {g*} ∪ L̂ is an optimal cover for {ℓ, ℓ + 1, . . . , r}.

Proof. Let L̃ be an arbitrary optimal linear cover for w over the range {ℓ, ℓ + 1, . . . , r}. This means that |L̂ ∪ {g*}| ≥ |L̃|. We know L̃ must contain a function g such that g(ℓ) ≤ (1 + ε)w(ℓ). Let p_g be the largest integer satisfying g(p_g) ≤ (1 + ε)w(p_g). By the optimality of p* and g*, we know p* ≥ p_g.
Therefore, the set of functions L̃ − {g} must be a cover for the set {p_g + 1, p_g + 2, . . . , r} ⊇ {p* + 1, p* + 2, . . . , r}. Since L̂ is an optimal cover for a subset of the integers covered by L̃ − {g},

    |L̂| ≤ |L̃ − {g}|  ⟹  |L̂| + 1 ≤ |L̃|  ⟹  |L̂ ∪ {g*}| ≤ |L̃|.

Therefore, |L̂ ∪ {g*}| = |L̃|, so the result follows.

We illustrate a simple procedure for solving (26) in Figure 4. The function g solving (26) must either join two consecutive points of w (the form given in (23)), or coincide with w at exactly one point (the form given in (24)). We first identify the integer j* such that

    g^(j*)(ℓ) ≤ (1 + ε)w(ℓ)   and   g^(j*+1)(ℓ) > (1 + ε)w(ℓ).

In other words, the linear piece connecting (j*, w(j*)) and (j* + 1, w(j* + 1)) provides the needed approximation at the left endpoint ℓ, but g^(i) for every i > j* does not. Therefore, the solution to (26) passes through the point (j*, w(j*)) with a slope m satisfying M_{j*+1} < m ≤ M_{j*}. By Observation 3.1, the line passing through this point with the smallest slope is guaranteed to provide the best approximation for all integers p ≥ j*. To minimize the slope of the line while still preserving the needed approximation at w(ℓ), we select the line passing through the points (ℓ, (1 + ε)w(ℓ)) and (j*, w(j*)). This is given by

    g*(x) = [(w(j*) − (1 + ε)w(ℓ)) / (j* − ℓ)] (x − ℓ) + (1 + ε)w(ℓ).   (27)

After adding this function g* to L, we find the largest integer p ≤ r such that g*(p) ≤ (1 + ε)w(p). If p < r, then we still need to find more linear pieces to approximate w, so we continue with another iteration. If p = r exactly, then we do not need any more positive-sloped linear pieces to approximate w. However, we still add the constant function g^(r) to L before terminating.
This guarantees that the function f(x) = min_{g ∈ L} g(x) we return is in fact in F_r. Furthermore, adding the constant function serves to improve the approximation without affecting the order of the CCB function we will obtain from f by applying Lemma 3.2.

Pseudocode for our procedure for constructing a set of functions L is given in Algorithm 1, which relies on Algorithm 2 for solving (26). We summarize with a theorem about the optimality of our method for solving (22). Theorem 3.4.
Algorithm 1 runs in O(r) time and returns a function f that optimizes (22).

Proof. The optimality of the algorithm follows by inductively applying Lemma 3.3 at each iteration of the algorithm. For the runtime guarantee, note first of all that we can compute and store all slopes and intercepts for linear pieces g^(i) (as given in (23)) in O(r) time and space. As the algorithm progresses, we visit each integer i ∈ [r] once, either to perform a comparison of the form g^(i)(ℓ) ≤ (1 + ε)w(ℓ) for some left endpoint ℓ, or to check whether g*(i) ≤ (1 + ε)w(i) for some linear piece g* we added to our linear cover L. Each such g* can be computed in constant time, and as a loose bound we know we compute at most O(r) such linear pieces for any ε.

Figure 4: Given a left endpoint ℓ for which we do not yet have a (1 + ε)-approximate piece, we find the next linear piece by choosing a function g* that provides the desired approximation at ℓ, while also providing a good approximation for as large an integer p > ℓ as possible.

By combining Algorithm 2 and Lemma 3.2, we are able to efficiently solve Spa-GAP. Theorem 3.5.
Let f be the solution to (22), and let ŵ be the CCB function obtained from Lemma 3.2 based on f. Then ŵ optimally solves the sparse gadget approximation problem (18).

Proof. Since f and ŵ coincide at integer values, and f approximates w at integer values, we know w(i) ≤ ŵ(i) ≤ (1 + ε)w(i) for i ∈ [r]. Thus, ŵ is feasible for objective (18). If κ* is the number of positive-sloped linear pieces of f, then the order of ŵ is κ* by Lemma 3.2, and this must be optimal for (18). If it were not optimal, this would imply that there exists some upper bounding CCB function w′ of order κ′ < κ* that approximates w to within 1 + ε. But by Lemma 3.1, this would imply that the continuous extension of w′ is some f′ ∈ F_r with exactly κ′ positive-sloped linear pieces that is feasible for objective (22), contradicting the optimality of f.

In the last section we showed an efficient strategy for finding the minimum number of linear pieces needed to approximate an SCB integer function. We now consider bounds on the number of needed linear pieces in different cases, and highlight implications for sparsifying hyperedges with SCB splitting functions. In the worst case, we show that we need O(log(k)/ε) gadgets, where k is the size of the hyperedge. Moreover, this is nearly tight for the square root splitting function. Finally, we show that we only need O(ε^(−1/2) log log(1/ε)) gadgets to approximate the clique splitting function. This result is useful for sparsifying co-occurrence graphs and clique expansions of hypergraphs.

O(log(k)/ε) Upper Bound
We begin by showing that a logarithmic number of CB-gadgets is sufficient to approximate any SCB splitting function.

Algorithm 1
FindBest-PL-Approx(w, ε)   (solves (22))
Input: w ∈ S_r, ε ≥ 0
Output: f ∈ F_r optimizing (22)
  L = {g^(0)}, where g^(0)(x) = w(1)·x
  p = max{ i ∈ [r] | g^(0)(i) ≤ (1 + ε)w(i) }
  ℓ = p + 1
  while ℓ ≤ r do
    (g*, p) = FindNext(w, ε, ℓ)
    ℓ ← p + 1
    L ← L ∪ {g*}
    if p = r then
      L ← L ∪ {g^(r)}, where g^(r)(x) = w(r)
    end if
  end while
  Return f defined by f(x) = min_{g ∈ L} g(x)

Algorithm 2
FindNext(w, ε, ℓ)   (solves (26))
Input: w ∈ S_r, ε ≥ 0, ℓ ∈ [r]
Output: g ∈ 𝓛 optimizing (26)
  if w(r) ≤ (1 + ε)w(ℓ) then
    Return (g^(r), r + 1), where g^(r)(x) = w(r)
  else
    j* = ℓ
    while g^(j*+1)(ℓ) ≤ (1 + ε)w(ℓ) do
      j* = j* + 1
    end while
    g*(x) = [(w(j*) − (1 + ε)w(ℓ)) / (j* − ℓ)] (x − ℓ) + (1 + ε)w(ℓ)
    p = max{ i ∈ [r] | g*(i) ≤ (1 + ε)w(i) }
    Return (g*, p)
  end if

Theorem 4.1. Let ε ≥ 0 and let w_e be an SCB splitting function on a k-node hyperedge. There exists a set of O(log_{1+ε} k) CB-gadgets, which can be constructed in O(k log_{1+ε} k) time, whose splitting function ŵ_e satisfies w_e(A) ≤ ŵ_e(A) ≤ (1 + ε)w_e(A) for all A ⊆ e.

Proof. Let r = ⌊k/2⌋, and let w ∈ S_r be the SCB integer function corresponding to w_e, i.e., w(i) = w_e(A) for A ⊆ e such that |A| ∈ {i, k − i}. If we join all points of the form (i, w(i)) for i ∈ [r] by a line, this results in a piecewise linear function f ∈ F_r that is concave and increasing on the interval [0, r]. We first show that there exists a set of O(log_{1+ε} r) linear pieces that approximates f on the entire interval [1, r] to within a factor (1 + ε). Our argument follows similar previous results for approximating a concave function with a logarithmic number of linear pieces [26, 50]. For any value y ∈ [1, r], not necessarily an integer, f(y) lies on a linear piece of f which we will denote by g^(y)(x) = M_y · x + B_y, where M_y ≥ 0 and B_y ≥ 0. If y = i is an integer, it may be the breakpoint between two distinct linear pieces, in which case we use the rightmost line so that g^(y) = g^(i) as in (23), so g^(i)(x) = M_i · x + B_i, where M_i = w(i + 1) − w(i) and B_i = w(i) − M_i · i.
For any z ∈ (y, r), the line g^(y) provides a z/y approximation to f(z) = g^(z)(z), since

    g^(y)(z) = M_y · z + B_y ≤ (z/y)(M_y · y + B_y) = (z/y) f(y) ≤ (z/y) f(z).

Equivalently, the line g^(y) provides a (1 + ε)-approximation for every z ∈ [y, (1 + ε)y]. Thus, it takes J linear pieces to cover the set of intervals [1, (1 + ε)], [(1 + ε), (1 + ε)^2], . . . , [(1 + ε)^(J−1), (1 + ε)^J] for a positive integer J, and overall at most 1 + ⌈log_{1+ε} r⌉ linear pieces to cover all of [0, r]. Since Algorithm 1 finds the smallest set of linear pieces to (1 + ε)-cover the splitting penalties, this smallest set must also have at most O(log_{1+ε} r) linear pieces. Given this piecewise linear approximation, we can use Lemma 3.2 to extract a CCB function ŵ of order J = O(log_{1+ε} r) satisfying w(i) ≤ ŵ(i) ≤ (1 + ε)w(i) for i ∈ {0} ∪ [r]. This ŵ in turn corresponds to a set of J CB-gadgets that (1 + ε)-approximates the splitting function w_e. Computing edge weights for the CB-gadgets using Algorithm 1 and Lemma 3.2 takes only O(r) time, so the total runtime for constructing the combined gadgets is equal to the number of individual edges that must be placed, which is O(k log_{1+ε} k).

Theorem 1.1 on augmented sparsifiers follows as a corollary of Theorem 4.1. Given a hypergraph H = (V, E) where each hyperedge has an SCB splitting function, we can use Theorem 4.1 to expand each e ∈ E into a gadget that has O(log_{1+ε} |e|) auxiliary nodes and O(|e| log_{1+ε} |e|) edges. Since log_{1+ε} n behaves as ε^(−1) log n as ε → 0, Theorem 1.1 follows.

In Appendix B, we show that using a slightly different reduction, we can prove that Theorem 4.1 holds even when we do not require splitting functions to be symmetric or satisfy w_e(∅) = w_e(e) = 0. In Section 6 we use this fact to develop approximation algorithms for cardinality-based decomposable submodular function minimization. Next we show that our upper bound is nearly tight for the square root r-SCB integer function,

    w(i) = √i  for i ∈ {0} ∪ [r].   (28)

For this result, we rely on a result previously shown by Magnanti and Stratila [50] on the number of linear pieces needed to approximate the square root function over a continuous interval.

Lemma 4.2. (Lemma 3 in [50]) Let ε > 0 and φ(x) = √x. Let ψ be a piecewise linear function whose linear pieces are all tangent lines to φ, satisfying ψ(x) ≤ (1 + ε)φ(x) for all x ∈ [l, u] for 0 < l < u. Then ψ contains at least ⌈log_{γ(ε)}(u/l)⌉ linear pieces, where γ(ε) = 1 + 2ε(2 + ε) + 2(1 + ε)√(ε(2 + ε)). There exists a piecewise linear function ψ* of this form with exactly ⌈log_{γ(ε)}(u/l)⌉ linear pieces. As ε → 0, this value behaves as Θ(ε^(−1/2) log(u/l)). (The statement about the existence of ψ* is not included explicitly in the statement of Lemma 3 in [50], but it follows directly from the proof of the lemma, which shows how to construct such an optimal function ψ*.)

Lemma 4.2 is concerned with approximating the square root function for all values on a continuous interval. Therefore, it does not immediately imply any bounds on approximating a discrete set of splitting penalties. In fact, we know that when lower bounding the number of linear pieces needed to approximate any w ∈ S_r, there is no lower bound of the form q(ε)f(r) that holds for all ε > 0, if q is a function such that q(ε) → ∞ as ε → 0. This is simply because we can approximate w by piecewise linear interpolation, leading to an upper bound of O(r) linear pieces even when ε = 0. Therefore, the best we can expect is a lower bound that holds for ε values that may still go to zero as r → ∞, but are bounded in such a way that we do not contradict the O(r) upper bound that holds for all SCB integer functions. We prove such a result for the square root splitting function, using Lemma 4.2 as a black box. When ε falls below the bound we assume in the following theorem statement, forming O(r) linear pieces will be nearly optimal. Theorem 4.3.
Let ε > 0 and let w(i) = √i be the square root r-SCB integer function. If ε ≥ r^(−δ) for some constant δ ∈ (0, 2), then any piecewise linear function providing a (1 + ε)-approximation for w contains Ω(log_{γ(ε)} r) linear pieces, which behaves as Ω(ε^(−1/2) log r) as ε → 0.

Proof. Let L* be the optimal set of linear pieces returned by running Algorithm 1. In order to show |L*| = Ω(log_{γ(ε)} r), we will construct a new set of linear pieces L that has asymptotically the same number of linear pieces as L*, but also provides a (1 + ε)-approximation for all x in an interval [r^β, r] for some constant β < 1. Invoking Lemma 4.2 will then guarantee the final result. Recall that L* includes only two types of linear pieces: either linear pieces g satisfying g(j) = √j for exactly one integer j (see (24)), or linear pieces formed by joining two points of w (see (23)). For the square root splitting function, the latter type of linear piece is of the form

    g^(t)(i) = (√(t + 1) − √t)(i − t) + √t,   (29)

for some positive integer t less than r. This is the linear interpolation of the points (t, √t) and (t + 1, √(t + 1)). Both types of linear pieces bound φ(x) = √x above at integer points, but they may cross below φ at non-integer values of x. To apply Lemma 4.2, we would like to obtain a set of linear pieces that are all tangent lines to φ. We accomplish this by replacing each linear piece in L* with two or three linear pieces that are tangent to φ at some point. For a positive integer j, let g_j denote the line tangent to φ(x) = √x at x = j, which is given by

    g_j(x) = (1/(2√j))(x − j) + √j.   (30)

We form a new set of linear pieces L made up of lines tangent to φ using the following replacements:

• If L* contains a linear piece g that satisfies g(j) = √j for exactly one integer j, add lines g_{j−1}, g_j, and g_{j+1} to L.
• If for an integer t, L* contains the line g^(t) as given by Eq. (29), add lines g_t and g_{t+1} to L.

By Observation 3.1, this replacement can only improve the approximation guarantee at integer points. Therefore, L provides a (1 + ε)-approximation at integer values, is made up strictly of lines that are tangent to φ, and contains at most three times the number of lines in L*. Due to the concavity of φ, if a single line g ∈ L provides a (1 + ε)-approximation at consecutive integers i and i + 1, then g provides the same approximation guarantee for all x ∈ [i, i + 1].
However, if two integers i and i + 1 are not both covered by the same line in L, then L does not necessarily provide a (1 + ε)-approximation for every x ∈ [i, i + 1]. There can be at most |L| intervals of this form, since each interval defines an "intersection" at which one line g ∈ L ceases to be a (1 + ε)-approximation, and another line g′ ∈ L "takes over" as the line providing the approximation. By Lemma 4.2, we can cover an entire interval [i, i + 1] for any integer i using a set of ⌈log_{γ(ε)}((i + 1)/i)⌉ linear pieces that are tangent to φ somewhere in [i, i + 1]. Since 1 + √ε ≤ γ(ε), it in fact takes only one linear piece to cover [i, i + 1] as long as 1 + 1/i ≤ 1 + √ε, i.e., i ≥ 1/√ε. Since ε ≥ r^(−δ), the interval [i, i + 1] can be covered by a single linear piece if i ≥ r^(δ/2). Therefore, for each interval [i, i + 1], with i ≥ r^(δ/2), that is not already covered by a single linear piece in L, we add one more linear piece to L to cover this interval. This at most doubles the size of L. The resulting set L will have at most 6 times as many linear pieces as L*, and is guaranteed to provide a (1 + ε)-approximation for all integers, as well as the entire continuous interval [r^(δ/2), r]. Since δ is a fixed constant strictly less than 2, applying Lemma 4.2 shows that L has at least

    ⌈log_{γ(ε)}(r / r^(δ/2))⌉ = Ω(log_{γ(ε)} r^(1−δ/2)) = Ω(log_{γ(ε)} r)

linear pieces. Therefore, |L*| = Ω(log_{γ(ε)} r) as well.

When approximating the clique expansion splitting function, Algorithm 1 will in fact find a piecewise linear curve with at most O(ε^(−1/2) log log(1/ε)) linear pieces. We prove this by highlighting a different approach for constructing a piecewise linear curve with this many linear pieces, which upper bounds the number of linear pieces in the optimal curve found by Algorithm 1.

Clique splitting penalties for a k-node hyperedge correspond to nonnegative integer values of the continuous function ζ(x) = x · (k − x). As we did in Section 3.3, we want to build a set of linear pieces L that provides an upper bounding (1 + ε)-cover of ζ at integer values in [0, r], where r = ⌊k/2⌋. We start by adding the line g^(0)(x) = (w(1) − w(0))x + w(0) = (k − 1) · x to L, which perfectly covers the first two splitting penalties w(0) = 0 and w(1) = k − 1. In the remainder of our new procedure we will find a set of linear pieces to (1 + ε)-cover ζ at every value of x ∈ [1, k/2], not just at integer values of x.

We apply a greedy procedure similar to Algorithm 1. At each iteration we consider a leftmost endpoint z_i, which is the largest value in [1, k/2] for which we already have a (1 + ε)-approximation. In the first iteration, we have z_1 = 1. We then would like to find a new linear piece that provides a (1 + ε)-approximation for all values from z_i to some z_{i+1}, where the value of z_{i+1} is maximized. We restrict to linear pieces that are tangent to ζ. The line tangent to ζ at t ∈ [1, k/2] is given by

    g_t(x) = kx − 2tx + t^2.   (31)

We find z_{i+1} in two steps:

1. Step 1: Find the maximum value t such that g_t(z_i) = (1 + ε)ζ(z_i).
2. Step 2: Given t, find the maximum z_{i+1} such that g_t(z_{i+1}) = (1 + ε)ζ(z_{i+1}).

After completing these two steps, we add the linear piece g_t to L, knowing that it covers all values in [z_i, z_{i+1}] with a (1 + ε)-approximation. At this point, we will have a cover for all values in [0, z_{i+1}], and we begin a new iteration with z_{i+1} being the largest value covered. We continue until we have covered all values up until z_{i+1} ≥ k/2. If t > k/2, we instead add the constant line tangent to ζ at x = k/2, so that we only include lines that have a nonnegative slope.
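Each of the two steps above reduces to a quadratic equation in one unknown. The following Python sketch (an illustration of ours, taking the larger root in each step as the procedure prescribes) carries out one iteration:

```python
import math

def zeta(x, k):
    """Clique splitting curve: zeta(x) = x * (k - x)."""
    return x * (k - x)

def tangent_line(t, k):
    """Line tangent to zeta at t: g_t(x) = kx - 2tx + t^2, as (slope, intercept)."""
    return (k - 2.0 * t, t * t)

def one_iteration(z, k, eps):
    """Step 1: largest t with g_t(z) = (1+eps)*zeta(z).
    Step 2: largest z_next with g_t(z_next) = (1+eps)*zeta(z_next)."""
    # Step 1 quadratic: t^2 - 2zt - eps*z*k + (1+eps)*z^2 = 0; its larger root
    # simplifies to z + sqrt(eps * z * (k - z)).
    t = z + math.sqrt(eps * z * (k - z))
    # Step 2 quadratic: (1+eps)*z'^2 - (eps*k + 2t)*z' + t^2 = 0; larger root.
    a, b, c = 1.0 + eps, -(eps * k + 2.0 * t), t * t
    z_next = (-b + math.sqrt(b * b - 4.0 * a * c)) / (2.0 * a)
    return t, z_next
```

For example, with k = 1000 and ε = 0.1, a single iteration starting from z = 1 already extends the covered interval to roughly z_next ≈ 110, illustrating how quickly the tangent pieces advance.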
Lemma 4.4.
For any z_i ∈ [1, k/2], the values of t and z_{i+1} given in Steps 1 and 2 are

    t = z_i + √(z_i (k − z_i) ε)   (32)

    z_{i+1} = (2t + kε) / (2(1 + ε)) + (1 / (2(1 + ε))) · (k^2 ε^2 + 4εt(k − t))^(1/2).   (33)

Proof.
The proof simply requires solving two different quadratic equations. For Step 1:

    g_t(z_i) = (1 + ε)ζ(z_i) ⟺ kz_i − 2tz_i + t^2 = (1 + ε)(z_i k − z_i^2) ⟺ t^2 − 2z_i t − εz_i k + (1 + ε)z_i^2 = 0.

Taking the larger solution to maximize t:

    t = (1/2)(2z_i + √(4z_i^2 − 4(1 + ε)z_i^2 + 4εkz_i)) = z_i + √(z_i(k − z_i)ε).

For Step 2:

    g_t(z_{i+1}) = (1 + ε)ζ(z_{i+1}) ⟺ kz_{i+1} − 2tz_{i+1} + t^2 = (1 + ε)(z_{i+1}k − z_{i+1}^2) ⟺ (1 + ε)z_{i+1}^2 + z_{i+1}(−εk − 2t) + t^2 = 0.

We again take the larger solution to this quadratic equation since we want to maximize z_{i+1}:

    z_{i+1} = (1/(2(1 + ε)))(εk + 2t + √(ε^2 k^2 + 4tεk + 4t^2 − 4(1 + ε)t^2)) = (1/(2(1 + ε)))(εk + 2t + √(ε^2 k^2 + 4tε(k − t))).

Algorithm 3  Find a (1 + ε)-cover L for the clique splitting function.
Input: Hyperedge size k, ε ≥ 0
Output: A (1 + ε)-cover L for the clique splitting function
  L = {g^(0)}, where g^(0)(x) = (k − 1)x
  z_1 = 1
  do
    t ← z_i + √(z_i(k − z_i)ε)
    z_{i+1} ← (2t + kε)/(2(1 + ε)) + (1/(2(1 + ε)))(k^2 ε^2 + 4εt(k − t))^(1/2)
    if t > k/2 then
      L ← L ∪ {g_{k/2}}, where g_{k/2}(x) = k^2/4
    else
      L ← L ∪ {g_t}, where g_t(x) = kx − 2tx + t^2
    end if
  while z_{i+1} < k/2
  Return f defined by f(x) = min_{g ∈ L} g(x)

Algorithm 3 summarizes the new procedure for covering the clique splitting function. Since z_1 = 1, if ε ≥ 1, then

    z_2 ≥ (1/(2(1 + ε)))(2kε) = kε/(1 + ε) ≥ k/2,

so after one step we have covered the entire interval [1, k/2]. Therefore, in what follows we assume ε < 1. Theorem 4.5.
For ε < 1, if L is the output from Algorithm 3, then |L| = O(ε^(−1/2) log log(1/ε)).

Proof. We get a loose bound for the value of t in Lemma 4.4 by noting that (k − z_i) ≥ k/2 ≥ z_i:

    t = z_i + √(z_i ε(k − z_i)) ≥ z_i + √(z_i^2 ε) = z_i(1 + √ε).   (34)

Since we assumed ε < 1, we know that

    t/(1 + ε) ≥ z_i(1 + √ε)/(1 + ε) > z_i.   (35)

Therefore, from (33) we see that

    z_{i+1} > z_i + kε/(2(1 + ε)) + (1/(2(1 + ε)))(k^2 ε^2 + 4εt(k − t))^(1/2)   (36)
          > z_i + kε/(2(1 + ε)) + (1/(2(1 + ε)))(k^2 ε^2)^(1/2) = z_i + kε/(1 + ε).   (37)

From this we see that at each iteration, we cover an additional interval of length z_{i+1} − z_i > kε/(1 + ε), and therefore we know it will take at most O(1/ε) iterations to cover all of [1, k/2]. However, the length z_{i+1} − z_i in fact increases significantly with each iteration, allowing the algorithm to cover larger and larger intervals as it progresses. Since z_1 = 1 and z_{i+1} − z_i ≥ kε/(1 + ε), we see that z_j ≥ kε for all j ≥ 3. For the remainder of the proof, we focus on bounding the number of iterations it takes to cover the interval [kε, k/2]. Round j refers to the set of iterations that the algorithm spends to cover the interval

    R_j = [ kε^((1/2)^(j−1)), kε^((1/2)^j) ].   (38)

For example, Round 1 starts with the iteration i such that z_i ≥ kε, and terminates when the algorithm reaches an iteration i′ where z_{i′} ≥ kε^(1/2). A key observation is that it takes less than 4/√ε iterations for the algorithm to finish Round j for any value of j. To see why, observe that from the bound in (36) we have

    z_{i+1} − z_i > kε/(2(1 + ε)) + (1/(2(1 + ε)))(k^2 ε^2 + 4εt(k − t))^(1/2) > (1/(2(1 + ε)))(4εt(k − t))^(1/2) ≥ (1/(2(1 + ε)))(2εz_i k)^(1/2) = (√2/(2(1 + ε))) √(kε) √z_i.

For each iteration i in Round j, we know that z_i ≥ kε^((1/2)^(j−1)), so that

    z_{i+1} − z_i > (√2/(2(1 + ε))) √(kε) √(kε^((1/2)^(j−1))) ≥ (√2/(2(1 + ε))) · k · ε^(1/2 + (1/2)^j) = C · k · ε^(1/2 + (1/2)^j),   (39)

where C = √2/(2(1 + ε)) is a constant larger than 1/4. Since each iteration of Round j covers an interval of length at least C · k · ε^(1/2 + (1/2)^j), and the right endpoint for Round j is kε^((1/2)^j), the maximum number of iterations needed to complete Round j is

    kε^((1/2)^j) / (C · k · ε^(1/2 + (1/2)^j)) = 1/(C√ε).   (40)

Therefore, after p rounds, the algorithm will have performed O(p · ε^(−1/2)) iterations, to cover the interval [1, kε^((1/2)^p)]. Since we set out to cover the interval [1, k/2], we can stop once p satisfies ε^((1/2)^p) ≥ 1/2, which holds as long as p ≥ log_2 log_2(1/ε):

    ε^((1/2)^p) ≥ 1/2 ⟺ (1/2)^p log_2 ε ≥ −1 ⟺ log_2 ε ≥ −2^p ⟺ log_2(1/ε) ≤ 2^p ⟺ log_2 log_2(1/ε) ≤ p.

This means that the number of iterations of Algorithm 3, and therefore the number of linear pieces in L, is bounded above by O(ε^(−1/2) log log(1/ε)).

We obtain a proof of Theorem 1.3 on sparsifying the complete graph as a corollary. Proof of Theorem 1.3.
A complete graph on n nodes can be viewed as a hypergraph with a single n-node hyperedge with a clique expansion splitting function. Theorem 1.3 says that the clique expansion integer function w(i) = i · (n − i) can be covered with O(ε^(−1/2) log log(1/ε)) linear pieces, which is equivalent to saying that the clique expansion splitting function can be modeled using this many CB-gadgets. Each CB-gadget has two auxiliary nodes and (2n + 1) directed edges. This results in an augmented sparsifier for the complete graph with O(nε^(−1/2) log log(1/ε)) edges. This is only meaningful if ε is small enough so that O(ε^(−1/2) log log(1/ε)) is asymptotically less than n, so our sparsifier has O(n + ε^(−1/2) log log(1/ε)) = O(n) nodes. □

Recall from the introduction that a co-occurrence graph is formally defined by a set of nodes V and a set C of subsets of V. In practice, each c ∈ C could represent some type of group interaction involving the nodes in c, or a set of nodes sharing the same attribute. We define the co-occurrence graph G = (V, E) on C to be the graph where nodes i and j share an edge with weight w_ij = Σ_{c ∈ C : i,j ∈ c} w_c, where w_c ≥ 0 is a weight associated with each c ∈ C. The case when w_c = 1 is standard and is an example of the common practice of "one-mode projections" of bipartite graphs or affiliation networks [11, 40, 42, 52, 53, 57, 72]: a graph is formed on the nodes from one side of a bipartite graph by connecting two nodes whenever they share a common neighbor on the other side, where edges are weighted based on the number of shared neighbors.

A co-occurrence graph G has the following co-occurrence cut function:

    cut_G(S) = Σ_{c ∈ C} w_c · |S ∩ c| · |S̄ ∩ c|.   (41)

In this sense, the co-occurrence graph is naturally interpreted as a weighted clique expansion of a hypergraph H = (V, C), which itself is a special case of reducing a submodular, cardinality-based hypergraph to a graph.
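The identity (41) is what makes implicit sparsification attractive: the cut value of a co-occurrence graph can be evaluated group by group, without ever building the dense projection. A minimal sketch (the function names are ours):

```python
def cooccurrence_cut(groups, weights, S):
    """Evaluate cut_G(S) = sum_c w_c * |S ∩ c| * |c \\ S|  (Eq. (41))
    directly from the co-occurrence groups, without materializing the
    dense projected graph."""
    S = set(S)
    total = 0
    for c, w_c in zip(groups, weights):
        inside = sum(1 for v in c if v in S)
        total += w_c * inside * (len(c) - inside)
    return total

def projected_cut(groups, weights, S):
    """Reference implementation: build the weighted one-mode projection
    explicitly, then evaluate the ordinary graph cut."""
    S = set(S)
    w = {}
    for c, w_c in zip(groups, weights):
        c = sorted(c)
        for idx, i in enumerate(c):
            for j in c[idx + 1:]:
                w[(i, j)] = w.get((i, j), 0) + w_c
    return sum(wij for (i, j), wij in w.items() if (i in S) != (j in S))
```

Both functions return the same value, but the first touches each group once (O(Σ_c |c|) work), while the second pays for Θ(|c|^2) edges per group, which is exactly the cost that the sparsifiers in this section avoid.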
However, this type of graph construction is by no means restricted to the literature on hypergraph clustering. In many applications, the first step in a larger experimental pipeline is to construct a graph of this type from a large dataset. The resulting graph is often quite dense, as numerous domains involve large hyperedges [56, 67]. This makes it expensive to form, store, and compute over co-occurrence graphs in practice.

Solving cut problems on these dense co-occurrence graphs arises naturally in many settings. For example, any hypergraph clustering application that relies on a clique expansion involves a graph with a co-occurrence cut function [1, 28–30, 44, 58, 65, 67, 68, 71, 73]. Clustering social networks is another use case, as online platforms have many ways to create groups of users (e.g., events, special interest groups, businesses, organizations, etc.) that can be large in practice. Furthermore, cuts in co-occurrence graphs of students on a university campus (based on, e.g., common classes, living arrangements, or physical proximity) are relevant to preventing the spread of infectious diseases such as COVID-19.

In these cases, it would be more efficient to sparsify the graph without ever forming it explicitly, by sparsifying the large cliques induced by co-occurrence relationships. Although this strategy seems intuitive, it is often ignored in practice. We therefore present several theoretical results that highlight the benefits of this implicit approach to sparsification. Our focus is on results that can be achieved using augmented sparsifiers for cliques, though many of the same benefits could also be achieved with standard sparsification techniques. Let C be a set of nonempty co-occurrence groups on a set of n nodes, V, and let G = (V, E) be the corresponding co-occurrence graph on C. For c ∈ C, let k_c = |c| be the number of nodes in c. For v ∈ V, let d_v be the co-occurrence degree of v: the number of sets c containing v.
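The co-occurrence graph and its cut function (41) can both be computed directly from the groups, which makes the equivalence between the explicit projection and the group-by-group formula easy to check. A minimal sketch (function names are illustrative):

```python
from itertools import combinations
from collections import defaultdict

def cooccurrence_graph(groups, weights=None):
    """Projection edge weights: w_ij = sum of w_c over groups c containing both i and j."""
    w = defaultdict(float)
    for idx, c in enumerate(groups):
        wc = 1.0 if weights is None else weights[idx]
        for i, j in combinations(sorted(c), 2):
            w[(i, j)] += wc
    return w

def cooccurrence_cut(groups, S, weights=None):
    """Cut value via Eq. (41): sum_c w_c * |S ∩ c| * |S-bar ∩ c|."""
    total = 0.0
    for idx, c in enumerate(groups):
        wc = 1.0 if weights is None else weights[idx]
        inside = len(S & set(c))
        total += wc * inside * (len(c) - inside)
    return total

groups = [{0, 1, 2, 3}, {2, 3, 4}]
G = cooccurrence_graph(groups)
S = {0, 2}
# The cut computed edge-by-edge on the projection equals Eq. (41) on the groups.
edge_cut = sum(w for (i, j), w in G.items() if (i in S) != (j in S))
assert edge_cut == cooccurrence_cut(groups, S)
```

Note that the explicit projection materializes Θ(k_c²) edges per group, which is exactly the cost the implicit sparsification discussed below avoids.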
Let d_avg = (1/n) Σ_{v∈V} d_v be the average co-occurrence degree. We re-state and prove Theorem 1.4, first presented in the introduction. The proof holds independent of the weight w_c we associate with each c ∈ C, since we can always scale our graph reduction techniques by an arbitrary positive weight.

Theorem.
Let ε > 0 and f(ε) = ε^{-1/2} log log ε^{-1}. There exists an augmented sparsifier for G with O(n + |C| · f(ε)) nodes and O(n · d_avg · f(ε)) edges. In particular, if d_avg is constant and for some δ > 0 we have Σ_{c∈C} |c|² = Ω(n^{1+δ}), then forming G explicitly takes Ω(n^{1+δ}) time, but an augmented sparsifier for G with O(n f(ε)) nodes and O(n f(ε)) edges can be constructed in O(n f(ε)) time.

Proof. The set c induces a clique in the co-occurrence graph with O(k_c²) edges. Therefore, the runtime for explicitly forming G = (V, E) by expanding cliques and placing all edges equals O(Σ_{c∈C} k_c²) = Ω(n^{1+δ}). By Theorem 1.3, for each c ∈ C we can produce an augmented sparsifier with O(k_c f(ε)) directed edges and O(f(ε)) new auxiliary nodes. Sparsifying each clique in this way will produce an augmented sparsifier Ĝ = (V̂, Ê) where

|Ê| = Σ_{c∈C} O(k_c f(ε)) = O(f(ε) · n · d_avg),   (42)
|V̂| = n + Σ_{c∈C} O(f(ε)) = O(n + |C| f(ε)).   (43)

Observe that n · d_avg = Σ_{v∈V} d_v = Σ_{c∈C} k_c. If d_avg is a constant, this implies that Σ_{c∈C} k_c = O(n), and furthermore that |C| = O(n), since each k_c ≥
1. Therefore |Ê| and |V̂| are both O(n f(ε)). Only O(f(ε)) edge weights need to be computed for each clique, so the overall runtime is just the time it takes to explicitly place the O(n f(ε)) edges.

The above theorem and its proof include the case where |C| = o(n), meaning that C is made up of a sublinear number of large co-occurrence interactions. In this case, our augmented sparsifier will have fewer than O(n f(ε)) nodes. When |C| = ω(n), the average degree will no longer be a constant, and it therefore becomes theoretically beneficial to sparsify each clique in C using standard undirected sparsifiers. For each c ∈ C, standard cut sparsification techniques will produce an ε-cut sparsifier of c with O(k_c ε^{-2}) undirected edges and exactly k_c nodes. If two nodes appear in multiple co-occurrence relationships, the resulting edges can be collapsed into a weighted edge between the nodes, meaning that the number of edges in the resulting sparsifier does not depend on d_avg. We discuss tradeoffs between different sparsification techniques in depth in a later subsection. Regardless of the sparsification technique we apply in practice, implicitly sparsifying a co-occurrence graph will often lead to a significant decrease in runtime compared to forming the entire graph prior to sparsifying it.

We now consider a simple model for co-occurrence graphs with a power-law group size distribution that produces graphs satisfying the conditions of Theorem 1.4 in a range of different parameter settings. Such distributions have been observed for many types of co-occurrence graphs constructed from real-world data [11, 19]. More formally, let V be a set of n nodes, and assume a co-occurrence set c is randomly generated by sampling a size K from a discrete power-law distribution where for k ∈ [1, n]: P[K = k] = C k^{−γ}.
Here, C is a normalizing constant for the distribution, and γ is a parameter of the model. Once K is drawn from this model, a co-occurrence set c is generated by choosing a set of K nodes from V uniformly at random. This procedure can be repeated an arbitrary number of times (drawing all sizes K independently) to produce a set of co-occurrence sets C. This C can then be used to generate a co-occurrence graph G = (V, E). (The end result of this procedure is a type of random intersection graph [12].) We first consider a parameter regime where set sizes are constant on average but large enough to produce a dense co-occurrence graph that is inefficient to explicitly form in practice. This regime has an exponent γ ∈ (2, 3).

Theorem 5.1.
Let C be a set of O(n) co-occurrence sets obtained from the power-law model with γ ∈ (2, 3). The expected degree of each node will be constant and E[Σ_{c∈C} |c|²] = O(n^{4−γ}).

Proof. Let K be the size of a randomly generated co-occurrence set. We compute:

E[K²] = Σ_{k=1}^n k² · P[K = k] = C · Σ_{k=1}^n k^{2−γ} ≤ C · [1 + ∫₁ⁿ x^{2−γ} dx] = C + C n^{3−γ}/(3 − γ) − C/(3 − γ) = O(n^{3−γ}).

Therefore,

E[Σ_{c∈C} |c|²] = Σ_{c∈C} E[K²] = O(n^{4−γ}).

For a node v ∈ V and a randomly generated set c, the probability that v will be selected to be in c is

P[v ∈ c] = Σ_{k=1}^n P[|c| = k] · binom(n−1, k−1)/binom(n, k) = C · Σ_{k=1}^n k^{−γ} · (k/n) = (C/n) · [1 + ∫₁ⁿ x^{1−γ} dx] = O(n^{−1}).

Since there are O(n) co-occurrence sets in C and they each are generated independently, in expectation, v will have a constant degree.

We similarly consider another regime of co-occurrence graphs where the number of co-occurrence sets is asymptotically smaller than n, but the co-occurrence sets are larger on average.

Theorem 5.2.
Let C be a set of O(n^β) co-occurrence sets, where β ∈ (0, 1), obtained from the power-law co-occurrence model with γ = 1 + β. Then the expected degree of each node will be a constant and E[Σ_{c∈C} |c|²] = O(n²).

Proof. Again let K be a random variable representing the co-occurrence set size. We have

E[K²] = C · Σ_{k=1}^n k^{2−γ} = O(n^{3−γ}) ⟹ E[Σ_{c∈C} |c|²] = O(n^{β+3−γ}) = O(n²).

For a node v ∈ V and a randomly generated set c, the probability that v will be in c is

P[v ∈ c] = Σ_{k=1}^n P[|c| = k] · binom(n−1, k−1)/binom(n, k) = (C/n) · Σ_{k=1}^n k^{1−γ} = O(n^{1−γ}) = O(n^{−β}).

Since there are O(n^β) co-occurrence sets in C, the expected degree of v is a constant.

In Theorem 5.2, the exponent of the power-law distribution is assumed to be directly related to the number of co-occurrence sets in C. This assumption is included simply to ensure that we are in fact considering co-occurrence graphs with O(n) nodes. We could alternatively consider a power-law distribution with exponent γ ∈ (1,
2) and generate O(n^β) co-occurrence sets for any β < γ − 1. We simply note that in this regime, the expected average degree will be o(1). Assuming we exclude isolated nodes, this will produce a co-occurrence graph with o(n) nodes in expectation. Our techniques still apply in this setting, and we can produce augmented sparsifiers with O(|C| · f(ε)) nodes and O(n · d_avg · f(ε)) = o(n · f(ε)) edges. When |C| = Ω(n), then d_avg = Ω(1) and the number of edges in our augmented sparsifiers will have worse than linear dependence on n. However, in this regime we can still quickly obtain sparsifiers with O(n ε^{-2}) edges via implicit sparsification by using standard undirected sparsifiers.

More sophisticated models for generating co-occurrence graphs can also be derived from existing models for projections of bipartite graphs [11–13]. These make it possible to set different distributions for node degrees in V and highlight other classes of co-occurrence graphs satisfying the assumptions of Theorem 1.4. Here we have chosen to focus on the simplest model for illustrating classes of power-law co-occurrence graphs that satisfy the assumptions of the theorem.

There are several tradeoffs to consider when using different black-box sparsifiers for implicit co-occurrence sparsification. Standard sparsification techniques involve no auxiliary nodes and have undirected edges, which is beneficial in numerous applications. Also, the number of edges they require is independent of d_avg. Therefore, in cases where the average co-occurrence degree is larger than a constant, we obtain better theoretical improvements using standard sparsifiers. On the other hand, in many settings it is natural to assume the number of co-occurrences each node belongs to is a constant, even if some co-occurrences are very large. In these regimes, our augmented sparsifiers will have fewer edges than traditional sparsifiers due to a better dependence on ε.
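The power-law model above is straightforward to simulate. The following sketch (parameters are illustrative) samples the Theorem 5.1 regime with γ ∈ (2, 3) and O(n) sets, where the average co-occurrence degree stays near a small constant while the cost of explicitly expanding all cliques grows much faster:

```python
import random

def sample_sizes(n, gamma, num_sets, rng):
    """Draw set sizes K in [1, n] with P[K = k] proportional to k**(-gamma)."""
    ks = list(range(1, n + 1))
    wts = [k ** (-gamma) for k in ks]
    return rng.choices(ks, weights=wts, k=num_sets)

rng = random.Random(0)
n, gamma = 2000, 2.5                       # Theorem 5.1 regime: gamma in (2, 3)
sizes = sample_sizes(n, gamma, n, rng)     # O(n) co-occurrence sets
groups = [rng.sample(range(n), k) for k in sizes]

avg_deg = sum(len(c) for c in groups) / n          # average co-occurrence degree
clique_cost = sum(len(c) ** 2 for c in groups)     # cost of forming the projection
# The average degree concentrates near a constant, while the clique-expansion
# cost is superlinear, matching the dense regime targeted by Theorem 1.4.
assert avg_deg < 10
assert clique_cost > 5 * n
```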
Our techniques are also deterministic, and our sparsifiers are very easy to construct in practice. Edge weights for our sparsifiers can be determined in O(f(ε)) time for each co-occurrence group using Algorithm 1 (or Algorithm 3) coupled with Lemma 3.2. The bottleneck in our construction is simply visiting each node in a set c to place edges between it and the auxiliary nodes. Even in cases where there are no asymptotic reductions in theoretical runtime, our techniques provide a simple and highly practical tool for solving cut problems on co-occurrence data.

Appendix B shows how our sparse reduction techniques can be adjusted to apply even when splitting functions are asymmetric and are not required to satisfy w_e(∅) = w_e(e) = 0 (the non-cut ignoring property). Section 3 addresses the special case of symmetric and non-cut ignoring functions, as these assumptions are more natural for hypergraph cut problems [44, 46, 66], and they provide the clearest exposition of our main techniques and results. Furthermore, applying the generalized asymmetric reduction strategy in Appendix B to a symmetric splitting function would introduce twice as many edges as applying the reduction from Section 3 designed explicitly for the symmetric case. Nevertheless, the same asymptotic upper bound of O(ε^{-1} log k) edges holds for approximately modeling the more general splitting function on a k-node hyperedge. By dropping the symmetry and non-cut ignoring assumptions, our techniques lead to the first approximation algorithms for the more general problem of minimizing cardinality-based decomposable submodular functions.

Any submodular function can be minimized in polynomial time [31, 32, 55], but the runtimes for general submodular functions are impractical in most cases. A number of recent papers have developed faster algorithms for minimizing submodular functions that are sums of simpler submodular functions [20, 21, 33, 34, 39, 45, 54, 64].
This is also known as decomposable submodular function minimization (DSFM). Many energy minimization problems from computer vision correspond to DSFM problems [24, 37, 38].

Let f : 2^V → R_+ be a submodular function such that for S ⊆ V,

f(S) = Σ_{e∈E} f_e(S ∩ e),   (44)

where for each e ∈ E, f_e is a simpler submodular function with support only on a subset e ⊆ V. We can assume without loss of generality that every f_e is a non-negative function. The goal of DSFM is to find arg min_S f(S). The terminology used for problems of this form differs depending on the context. We will continue to refer to E as a hyperedge set, V as a node set, f_e as generalized splitting functions, and f as some type of generalized hypergraph cut function. Much previous research explicitly considers the case where each function f_e is given by f_e(S) = g_e(|S|) for some concave function g_e [33, 34, 37, 39, 64]. Unlike existing work on generalized hypergraph cut functions [44, 46, 66], research on DSFM does not typically assume that the functions f_e are symmetric, and also does not assume that f_e(∅) = f_e(e) = 0.

6.2 Notation for Runtime Comparisons

Let n = |V|, R = |E|, µ = Σ_{e∈E} |e|, and let k_avg = µ/R denote the average hyperedge size. Note that

Σ_{e∈E} log |e| ≤ R log n,  Σ_{e∈E} |e| log |e| ≤ µ log n,  max{n, R} ≤ µ ≤ n · R.

We primarily focus on how our techniques enable us to obtain runtimes that are strictly better in terms of number of nodes, number of edges, and average hyperedge size, by producing an approximate solution. We use ˜O notation to hide logarithmic factors of n and R. In order to compare weakly polynomial runtimes, in some cases we restrict to the case where f_e has integer outputs. For this case, we let F_max = max_{S⊆V} f(S), and assume log F_max is small enough that it can also be absorbed by ˜O notation. We also consider strongly polynomial runtimes that can be obtained for arbitrary edge weights.
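A cardinality-based decomposable function in the sense of (44), with f_e(S) = g_e(|S ∩ e|) for concave g_e, can be evaluated directly from its parts. The sketch below (hyperedges and penalties are hypothetical) minimizes such a function by brute force purely for illustration; the point of the reduction in Appendix B is precisely to replace this exponential search with a directed s-t cut:

```python
import math
from itertools import chain, combinations

def dsfm_objective(edges, g_funcs, S):
    """f(S) = sum_e f_e(S ∩ e), with cardinality-based f_e(A) = g_e(|A|)."""
    S = set(S)
    return sum(g(len(S & set(e))) for e, g in zip(edges, g_funcs))

def powerset(V):
    return chain.from_iterable(combinations(V, r) for r in range(len(V) + 1))

# Two hyperedges with concave (hence submodular, cardinality-based) penalties.
edges = [{0, 1, 2}, {1, 2, 3, 4}]
g_funcs = [math.sqrt, math.sqrt]

# Exhaustive minimization over all subsets of V = {0, ..., 4}, illustration only.
best = min(powerset(range(5)), key=lambda S: dsfm_objective(edges, g_funcs, S))
assert dsfm_objective(edges, g_funcs, best) == 0   # empty set is optimal here
```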
Previous research on DSFM has focused largely on runtimes for finding exact solutions. Our goal is to highlight improved runtimes that can be obtained if we are content with solutions that are within a factor (1 + ε) of optimality. Appendix B shows how our reduction techniques enable us to approximately minimize a cardinality-based DSFM problem. This can be accomplished by solving a directed minimum s-t cut problem on a reduced graph with N = O(n + ε^{-1} R log n) nodes and M = O(ε^{-1} µ log n) edges. We use this to obtain the strongly polynomial runtime guarantee of ˜O(min{ε^{-3/2} µ^{3/2}, ε^{-1} µ (n + ε^{-1} R)^{2/3}}) given in Theorem 1.5, which we now prove.
Proof.
The runtime comes from applying the directed s-t cut solver of Goldberg and Rao [27]. Although Goldberg and Rao assume integer edge weights and report weakly polynomial runtimes for exact s-t cut solutions, as long as we are content with approximate solutions, slight adjustments allow us to obtain a strongly polynomial runtime for arbitrary weights.

If ε is a constant greater than one, we can decrease it to equal 1 and get a better approximation with the same asymptotic runtime. In the remainder of the proof, we therefore assume ε ≤ 1. Set ε′ = ε/7, and let Ĝ = (V̂, Ê) be the directed graph resulting from our approximate reduction techniques with parameter ε′. This graph has N = O(n + ε^{-1} R log n) nodes and M = O(ε^{-1} µ log n) edges, and distinguished source and sink nodes s and t, so that the minimum s-t cut corresponds to a (1 + ε′)-approximation for DSFM. Begin by scaling the edge weights so that the minimum s-t cut in Ĝ is at least one. This can be done by finding an augmenting flow path from s to t and scaling edge weights so that the path has a capacity of at least 1 on all edges. If the graph has irrational edge weights, we can perform a standard scaling procedure to turn it into a directed graph with integer edge weights, in a way that guarantees we do not lose much in the approximation factor. This is done by adjusting all edge weights by up to an additive term ε′/M to reach a nearby rational number, producing a graph G̃ = (V̂, Ẽ). Let cut_Ĝ and cut_G̃ denote the cut functions of Ĝ and G̃, respectively. The graphs have the same set of nodes and edges, but may have different edge weights. Let w_ij > 0 denote the weight of edge (i, j) in Ĝ.

(Footnote: In some cases, runtimes for finding solutions to within an additive error of optimality have been considered [20], but these are not directly comparable to our multiplicative approximation guarantees. Furthermore, these runtimes only improve in terms of logarithmic factors when an approximate solution is returned rather than an optimal one.)
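The weight-rounding step can be sanity-checked numerically: once the minimum s-t cut is scaled to be at least 1, rounding each of the M edge weights up by at most ε/M changes any cut value by at most an additive ε, hence by at most a (1 + ε) factor. A toy check with hypothetical weights:

```python
import random

rng = random.Random(1)
eps, M = 0.05, 200
# Weights of the edges crossing some fixed cut S (hypothetical values >= 0.1,
# so the cut value is well above 1 after the scaling step described above).
w = [rng.uniform(0.1, 2.0) for _ in range(M)]
w_rounded = [x + rng.uniform(0, eps / M) for x in w]   # perturb by <= eps/M each

cut, cut_rounded = sum(w), sum(w_rounded)
assert cut <= cut_rounded <= cut + eps    # total additive error at most eps
assert cut_rounded <= (1 + eps) * cut     # multiplicative bound, since cut >= 1
```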
For any S ⊆ V̂, let ∂S be the set of edges cut by S. We have

cut_Ĝ(S) ≤ cut_G̃(S) ≤ Σ_{(i,j)∈∂S} [w_ij + ε′/M] ≤ cut_Ĝ(S) + ε′ ≤ (1 + ε′) cut_Ĝ(S),

where the final inequality uses the fact that every s-t cut in Ĝ has value at least 1. Finally, since all edge weights in G̃ are rational, we can scale them up to be integers. Goldberg and Rao [27] provide a method for finding a (1 + ε′)-approximate minimum s-t cut in a directed graph with N nodes and M edges in time O(M · min{M^{1/2}, N^{2/3}} log(N²/M) log(M/ε′)), which does not depend on the largest edge weight. Overall we performed three levels of approximation: approximately reducing the hypergraph, approximating the cut properties of Ĝ with G̃, and approximating the s-t cut solution in G̃. Since ε′ = ε/
7, the overall approximation factor is (1 + ε′)³ ≤ (1 + ε). Plugging in the appropriate values for M and N yields the runtime guarantee.

Our approximate solution techniques will always be faster than runtimes obtained by methods that perform an exact reduction to a graph s-t cut problem [37, 66], since these introduce O(|e|²) edges for each hyperedge e. Kolmogorov [39] presented an algorithm for minimizing sums of submodular functions based on submodular flows. Although the approach provides a way to also solve more general variants of the problem, the algorithm has a runtime of O((n + µ)³ log F_max) = ˜O(µ³) specifically in the case of cardinality-based functions with integer-valued weights, which is slower than our approximate techniques by at least a factor of µ^{3/2}.

Recently, Ene et al. [20] presented improved runtime analyses for optimization techniques for solving DSFM. The runtimes depend on the time it takes to evaluate oracle functions that correspond to solving a submodular minimization problem at a single splitting function. Let θ_e be the time it takes to evaluate the oracle at e ∈ E, and define θ_max = max_{e∈E} θ_e and θ_avg = (1/R) Σ_{e∈E} θ_e. For a cardinality-based function f_e, such an oracle can be queried in O(|e| log |e|) time [33], and so θ_avg = O((1/R) Σ_{e∈E} |e| log |e|). Combining this oracle with the runtimes presented by Ene et al. [20] produces the fastest known runtimes for cardinality-based DSFM. For methods based on discrete optimization, Ene et al. [20] note that the incremental breadth first search (IBFS) algorithm of Fix et al. [22] can be implemented with a strongly polynomial runtime of O(n² θ_max Σ_{e∈E} |e|), or a weakly polynomial runtime of O(n² θ_max log F_max + n² Σ_{e∈E} |e| θ_e) = ˜O(n² θ_max + n² Σ_{e∈E} |e|²). Among continuous optimization approaches, the best runtime presented by Ene et al. [20] is achieved by the accelerated random coordinate descent method (ACDM) [21].
The method has a weakly polynomial runtime of O(n R θ_avg log(n F_max)) = ˜O(nµ).

The hyperedge e defines the support of the function f_e. Previous research has addressed both the case of functions f_e of large support and the case of small support (see discussion and experiments in [20, 39, 45, 59]). Functions of small support are common in computer vision applications, though the case of large support has also been studied [59]. In hypergraph cut problems, large hyperedges are natural for modeling large-scale multiway interactions (as discussed in Section 5 with respect to co-occurrence data). Table 1 summarizes runtimes for the methods we have considered here, with specialized runtimes highlighted for different regimes. The table shows runtimes for small support, k_avg = O(1), in which case R = Ω(n), and also runtimes for k_avg = Θ(n), which highlights the lowest possible runtime for each method in terms of nodes and hyperedges. For the latter case, Table 1 additionally distinguishes between subcases where R = Ω(n) and n = Ω(R). For subcases of each of these regimes, our sparsification techniques lead to improved runtimes when searching for approximately optimal solutions. When k_avg = O(1), our runtime is the fastest as long as R = o(n²). When k_avg = Θ(n) and R = Ω(n), the ACDM algorithm has the best performance whenever R = Ω(n^{3/2}), though we provide a faster approximate alternative below this threshold. Most importantly, our method provides a significantly faster runtime for the case where n = Ω(R), independent of k_avg. Thus, in hypergraphs with a sublinear (in n) number of large hyperedges (equivalently, a sum of sublinearly many functions of large support), obtaining an approximately optimal solution is much faster than solving the problem exactly with existing techniques.

We have introduced the notion of an augmented cut sparsifier, which approximates a generalized hypergraph cut function with a sparse directed graph on an augmented node set.
Our approach relies on a connection we highlight between graph reduction strategies and piecewise linear approximations of concave functions. Our framework leads to more efficient techniques for approximating hypergraph s-t cut problems via graph reduction, improved sparsifiers for co-occurrence graphs, and fast algorithms for approximately minimizing cardinality-based decomposable submodular functions.

As noted in Section 1.2, an interesting open question is to establish and study analogous notions of augmented spectral sparsification, given that spectral sparsifiers provide a useful generalization of cut sparsifiers in graphs [61]. One way to define such a notion is to apply existing definitions of submodular hypergraph Laplacians [46, 70] to both the original hypergraph and its sparsifier. This requires viewing our augmented sparsifier as a hypergraph with splitting functions of the form w_e(A) = a · min{|A|, |e \ A|, b}, corresponding to hyperedges with cut properties that can be modeled by a cardinality-based gadget. From this perspective, augmented spectral sparsification means approximating a generalized hypergraph cut function with another hypergraph cut function involving simplified splitting functions. While this provides one possible definition for augmented spectral sparsification, it is not clear whether the techniques we have developed can be used to satisfy this definition. Furthermore, it is not clear whether obtaining such a sparsifier would imply any immediate runtime benefits for approximating the spectra of generalized hypergraph Laplacians, or for solving generalized Laplacian systems [25, 43]. We leave these as questions for future work.

While our work provides the optimal reduction strategy in terms of cardinality-based gadgets, this is more restrictive than optimizing over all possible gadgets for approximately modeling hyperedge cut penalties.
Optimizing over a broader space of gadgets poses another interesting direction for future work, but is more challenging in several ways. First of all, it is unclear how to even define an optimal reduction when optimizing over arbitrary gadgets, since it is preferable to avoid both adding new nodes and adding new edges, but the tradeoff between these two goals is not clear. Another challenge is that the best reduction may depend heavily on the splitting function we wish to reduce, which makes developing a general approach difficult. A natural next step would be to at least better understand lower bounds on the number of edges and auxiliary nodes needed to model different cardinality-based splitting functions. While we do not have any concrete results, there are several indications that cardinality-based gadgets may be nearly optimal in many settings. For example, star expansions and clique expansions provide a more efficient way to model linear and quadratic splitting functions respectively, but modeling these functions with cardinality-based gadgets only increases the number of edges by roughly a factor of two.

Finally, we find it interesting that using auxiliary nodes and directed edges makes it possible to sparsify the complete graph using only O(n ε^{-1/2} log log ε^{-1}) edges, whereas standard sparsifiers require O(n ε^{-2}). We would like to better understand whether both directed edges and auxiliary nodes are necessary for making this possible, or whether improved approximations are possible using only one or the other.

A Proofs of Lemmas in Section 3
Proof of Lemma 3.1

Lemma.
Let f̂ be the continuous extension of a function w ∈ S_r, shown in (21). This function is in the class F_r, and has exactly J positive-sloped linear pieces and one linear piece of slope zero.

Proof. Define b_0 = 0 for notational convenience. The first three conditions in Definition 3.1 can be seen by inspection, recalling that 0 < a_j and 0 < b_j ≤ r for all j ∈ [J]. Observe that f̂ is linear over the interval [b_{i−1}, b_i) for i ∈ [J], since for x ∈ [b_{i−1}, b_i),

f̂(x) = Σ_{j=1}^J a_j · min{x, b_j} = Σ_{j=1}^{i−1} a_j b_j + x · Σ_{j=i}^J a_j.

In other words, the i-th linear piece of f̂, defined over x ∈ [b_{i−1}, b_i), is given by

f̂^{(i)}(x) = I_i + S_i x,

where the intercept and slope terms are given by I_i = Σ_{j=1}^{i−1} a_j b_j and S_i = Σ_{j=i}^J a_j. For the first J intervals of the form [b_{i−1}, b_i), the slopes are always positive but strictly decreasing. Thus, there are exactly J positive-sloped linear pieces. The final linear piece is a flat line, since f̂(x) = Σ_{j=1}^J a_j b_j for all x ≥ b_J. The concavity of f̂ follows directly from the fact that it is a continuous and piecewise linear function with decreasing slopes.

Proof of Lemma 3.2

Lemma.
Let f be a function in F_r with J + 1 linear pieces. Let b_i denote the i-th breakpoint of f, and m_i denote the slope of the i-th linear piece of f. Define vectors a, b ∈ R^J where b(i) = b_i and a(i) = a_i = m_i − m_{i+1} for i ∈ [J]. If w is the r-CCB function parameterized by the vectors (a, b), then f is the continuous extension of w.

Proof. Since f is in F_r, it has J positive-sloped linear pieces and one flat linear piece, and therefore it has exactly J breakpoints: 0 < b_1 < b_2 < · · · < b_J. Let b = (b_j) be the vector storing these breakpoints. For convenience we define b_0 = 0, though b_0 is not stored in b. By definition, f is constant for all x ≥ r, which implies that b_J ≤ r. Let f_i = f(b_i). For i ∈ [J], the positive slope of the i-th linear piece of f, which occurs in the range [b_{i−1}, b_i], is given by

m_i = (f_i − f_{i−1}) / (b_i − b_{i−1}).   (45)

The i-th linear piece of f is given by

f^{(i)}(x) = m_i (x − b_{i−1}) + f_{i−1} for x ∈ [b_{i−1}, b_i].   (46)

The last linear piece of f is a flat line over the interval x ∈ [b_J, ∞), i.e., m_{J+1} = 0. Since f has positive and strictly decreasing slopes, we can see that a_i = m_i − m_{i+1} > 0 for i ∈ [J].

Let w be the order-J CCB function constructed from the vectors (a, b), and let f̂ be its resulting continuous extension:

f̂(x) = Σ_{j=1}^J a_j · min{x, b_j}.   (47)

We must check that f̂ = f. By Lemma 3.1, we know that f̂ is in F_r and has exactly J + 1 linear pieces. The functions will be the same, therefore, if they share the same values at breakpoints. Evaluating f̂ at an arbitrary breakpoint b_i gives:

f̂(b_i) = Σ_{j=1}^{i−1} a_j · b_j + b_i · Σ_{j=i}^J a_j = Σ_{j=1}^{i−1} a_j · b_j + b_i · m_i.   (48)

We first confirm that the functions coincide at the first breakpoint:

f̂(b_1) = b_1 · m_1 = b_1 · (f_1 − f_0)/(b_1 − b_0) = b_1 · f_1/b_1 = f_1.

For any fixed i ∈ {2, 3, . . .
, J},

f̂(b_i) − f̂(b_{i−1}) = Σ_{j=1}^{i−1} a_j b_j + b_i m_i − Σ_{j=1}^{i−2} a_j b_j − b_{i−1} m_{i−1} = a_{i−1} b_{i−1} + b_i m_i − b_{i−1} m_{i−1} = (m_{i−1} − m_i) b_{i−1} + b_i m_i − b_{i−1} m_{i−1} = m_i (b_i − b_{i−1}) = f_i − f_{i−1}.

Since f(b_1) = f̂(b_1) and f(b_i) − f(b_{i−1}) = f̂(b_i) − f̂(b_{i−1}) for i ∈ {2, 3, . . . , J}, we have f(b_i) = f̂(b_i) for i ∈ [J]. Therefore, f and f̂ are the same piecewise linear function.

B Sparsification for Generalized Splitting Functions
In Sections 2 and 3 we focused on sparsification techniques for representing splitting functions that are symmetric and penalize only cut hyperedges:

w_e(S) = w_e(e \ S) for all S ⊆ e,
w_e(e) = w_e(∅) = 0.

These assumptions are standard for generalized hypergraph cut problems [44, 46, 66], and lead to the clearest exposition of our main results. In this appendix, we extend our sparse approximation techniques so that they apply even if we remove these restrictions. This will allow us to obtain improved techniques for approximately solving a certain class of decomposable submodular functions (see Section 6). Formally, our goal is to solve

minimize_{S ⊆ V} f(S) = Σ_{e∈E} w_e(S ∩ e),   (49)

where each w_e is a submodular cardinality-based function that is not necessarily symmetric and does not need to equal zero when the hyperedge e is uncut. Our proof strategy for reducing this more general problem to a graph s-t cut problem closely follows the same basic set of steps used in Section 3 for the special case.

B.1 Submodularity Constraints for Cardinality-Based Functions

We first provide a convenient characterization of general cardinality-based submodular functions. By general we mean the splitting function does not need to be symmetric, nor does it need to have a zero penalty when the hyperedge is uncut.
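Lemma B.1 below reduces submodularity of a cardinality-based splitting function to a concavity condition on its penalty sequence w_0, w_1, ..., w_k; for cardinality-based functions this concavity is also sufficient, a standard fact we use in the example. A small checker (illustrative) makes the characterization concrete:

```python
def is_cardinality_submodular(w):
    """w[i] = penalty for any A with |A| = i, for i = 0, ..., k.

    For cardinality-based splitting functions, submodularity is equivalent to
    concavity of the penalty sequence: 2*w[i] >= w[i-1] + w[i+1] for 0 < i < k.
    """
    return all(2 * w[i] >= w[i - 1] + w[i + 1] for i in range(1, len(w) - 1))

# A concave penalty sequence; note that no symmetry is required, and the
# penalty for the uncut hyperedge (w[0], w[k]) need not be zero here.
assert is_cardinality_submodular([0, 2, 3, 3.5, 3.5, 3])
# A convex dip violates the condition:
assert not is_cardinality_submodular([0, 1, 0, 1, 0])
```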
Lemma B.1.
Let w_e be a general submodular cardinality-based splitting function on a k-node hyperedge e, and let w_i denote the penalty for any A ⊆ e with |A| = i. Then for i ∈ {1, 2, . . . , k − 1},

2 w_i ≥ w_{i−1} + w_{i+1}.   (50)

Proof.
Let v_1, v_2, . . . , v_k denote the nodes in the hyperedge. Submodularity means that for all A, B ⊆ e, w(A) + w(B) ≥ w(A ∪ B) + w(A ∩ B). In order to show inequality (50), simply set A = {v_1, v_2, . . . , v_i} and B = {v_2, v_3, . . . , v_{i+1}}, and the result follows.

To simplify our analysis, as we did for the symmetric case, we will define a set of functions that is virtually identical to these splitting functions on k-node hyperedges, but are defined over integers from 0 to k rather than on subsets of a hyperedge.

Definition B.1. A k-GSCB (Generalized Submodular Cardinality-Based) integer function is a function w : {0} ∪ [k] → R_+ satisfying 2 w(i) ≥ w(i −
1) + w(i + 1) for all i ∈ [k − 1].

B.2 Combining Gadgets for Generalized SCB Functions
Our goal is to show how to approximate k-GSCB integer functions using piecewise linear functions with few linear pieces. This in turn corresponds to approximating a hyperedge splitting function with a sparse gadget. In order for this to work for our more general class of splitting functions, we use a slight generalization of an asymmetric gadget we introduced in previous work [66].

Definition B.2.
The asymmetric cardinality-based gadget (ACB-gadget) for a k-node hyperedge e is parameterized by scalars a and b and constructed as follows:

• Introduce an auxiliary vertex v_e.
• For each v ∈ e, introduce a directed edge from v to v_e with weight a · (k − b), and a directed edge from v_e to v with weight a · b.

The ACB-gadget models the following k-GSCB integer function:

w_{a,b}(i) = a · min{i · (k − b), (k − i) · b}.   (51)

To see why, consider where we must place the auxiliary node v_e when solving a minimum s-t cut problem involving the ACB-gadget. If we place i nodes on the s-side, then placing v_e on the s-side has a cut penalty of a b (k − i), whereas placing v_e on the t-side gives a penalty of a i (k − b). To minimize the cut, we choose the smaller of the two options.

Previously we showed that asymmetric splitting functions can be modeled exactly by a combination of k − 1 ACB-gadgets [66]. That construction assumed w_e(∅) = w_e(e) = 0 even for asymmetric splitting functions, but we remove this constraint here. In order to model the cut properties of an arbitrary GSCB splitting function, we define a combined gadget involving multiple ACB-gadgets, as well as edges from each node v ∈ e to the source and sink nodes of the graph. The augmented cut function for the resulting directed graph Ĝ = (V ∪ A ∪ {s, t}, Ê) will then be given by cut_Ĝ(S) = min_{T⊆A} dircut_Ĝ({s} ∪ S ∪ T) for a set S ⊆ V, where dircut_Ĝ is the directed cut function on Ĝ. Finding a minimum s-t cut in Ĝ will solve objective (49), or equivalently, the cardinality-based decomposable submodular function minimization problem.

Definition B.3. A k-CG function (k-node, combined gadget function) ŵ of order J is a k-GSCB integer function that is parameterized by scalars z_0, z_k, and (a_j, b_j) for j ∈ [J]. The function has the form:

ŵ(i) = z_0 · (k − i) + z_k · i + Σ_{j=1}^J a_j min{i · (k − b_j), (k − i) · b_j}.
(52) The scalars parameterizing ˆ w satisfy b j > , a j > for all j ∈ [ J ] b j < b j +1 for all j ∈ [ J − b J < kz ≥ and z k ≥ . Conceptually, the function shown in (52) represents a combination of J ACB-gadgets for ahyperedge e , where additionally for each node v ∈ e we have place a directed edge from a sourcenode s to v of weight z , and an edge from v to a sink node t with weight z k .The continuous extension of the k -CG function (52) is defined to be:ˆ f ( x ) = z · ( k − x ) + z k · x + J (cid:88) j =1 a j min { x · ( k − b j ) , ( k − x ) · b j } for x ∈ [0 , k ]. (53) Lemma B.2.
The continuous extension f̂ of ŵ is nonnegative over the interval [0, k], piecewise linear, concave, and has exactly J + 1 linear pieces.

Proof. Nonnegativity follows quickly from the nonnegativity of z_0 and z_k, the positivity of (a_j, b_j) for j ∈ [J], and the fact that b_J < k. For the other properties, we begin by rewriting the function as

    \hat{f}(x) = z_0 (k - x) + z_k x + \sum_{j=1}^{J} a_j \min\{ x(k - b_j), \, (k - x) b_j \}    (54)
             = k z_0 + x (z_k - z_0) + k \sum_{j=1}^{J} a_j \min\{ x, b_j \} - x \sum_{j=1}^{J} a_j b_j    (55)
             = k z_0 + x (z_k - z_0) + kx \sum_{j : x \le b_j} a_j + k \sum_{j : x > b_j} a_j b_j - x \sum_{j=1}^{J} a_j b_j.    (56)

From (56) we see that f̂ is linear on each interval [b_t, b_{t+1}] (where we set b_0 = 0 and b_{J+1} = k), so f̂ is piecewise linear with breakpoints at b_1, ..., b_J. Crossing a breakpoint b_j decreases the slope by k·a_j > 0, so f̂ is concave and its J + 1 linear pieces are distinct.
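As a quick numerical check of this modeling argument, the following Python sketch (our own illustrative code; the function names are not from the paper) confirms that choosing the cheaper placement of the auxiliary node v_e recovers the ACB function of Eq. (51), and combines J gadgets with per-node source/sink edges as in Eq. (52).

```python
def acb_closed_form(i, k, a, b):
    """Eq. (51): the k-GSCB integer function modeled by one ACB-gadget."""
    return a * min(i * (k - b), (k - i) * b)

def acb_via_placement(i, k, a, b):
    """Minimum cut contribution of one ACB-gadget when i of the k nodes are on the s-side."""
    # v_e on the s-side: the (k - i) edges v_e -> v into t-side nodes are cut,
    # each of weight a*b.
    cost_s = a * b * (k - i)
    # v_e on the t-side: the i edges v -> v_e leaving s-side nodes are cut,
    # each of weight a*(k - b).
    cost_t = a * i * (k - b)
    return min(cost_s, cost_t)

def kcg_value(i, k, z0, zk, gadgets):
    """Eq. (52): J ACB-gadgets plus per-node source/sink edges of weights z0, zk."""
    return z0 * (k - i) + zk * i + sum(
        acb_via_placement(i, k, a, b) for (a, b) in gadgets)

if __name__ == "__main__":
    k = 7
    for (a, b) in [(1.0, 2.0), (0.5, 3.5)]:
        for i in range(k + 1):
            assert abs(acb_closed_form(i, k, a, b)
                       - acb_via_placement(i, k, a, b)) < 1e-12
    # The k-CG values are submodular in the discrete sense:
    # 2*w(i) >= w(i+1) + w(i-1) for i in [k-1], a discrete analogue of concavity.
    w = [kcg_value(i, k, 0.25, 0.5, [(1.0, 2.0), (0.5, 3.5)]) for i in range(k + 1)]
    assert all(2 * w[i] >= w[i + 1] + w[i - 1] - 1e-12 for i in range(1, k))
```

The gadget weights here are arbitrary test values; the check exercises every i ∈ {0} ∪ [k] rather than proving anything beyond the identity min{ab(k − i), ai(k − b)} = a·min{i(k − b), (k − i)b}.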
Lemma B.3. For every function f that is nonnegative, piecewise linear with J + 1 linear pieces, and concave over the interval [0, k], there exists some k-CG function w of order J such that f is the continuous extension of w.

Proof. The function w will be defined by choosing parameters z_0, z_k, and (a_j, b_j) for j ∈ [J]. Let f̂ denote the continuous extension of the function w that we will build. From the proof of Lemma B.2, we know that the parameter b_j will correspond to the jth breakpoint of f̂. Therefore, given f, we set b_j to be the jth breakpoint of the function f, so that the functions match at breakpoints. For convenience, we also set b_0 = 0 and b_{J+1} = k. We then set z_0 = f(0)/k and z_k = f(k)/k, to guarantee that f̂(0) = f(0) and f̂(k) = f(k). In order to set the a_j values, we first compute the slopes of each linear piece of f. Let f_j = f(b_j) for j ∈ {0} ∪ [J + 1]. The jth linear piece of f has slope

    m_j = \frac{f_j - f_{j-1}}{b_j - b_{j-1}}.

Finally, for j ∈ [J] we set a_j = (m_j − m_{j+1})/k. All of our chosen parameters satisfy the conditions of Definition B.3, so it simply remains to check that f and f̂ coincide at breakpoints.

Let t ∈ [J]. Using (56) to evaluate f̂ at breakpoint b_t, we get

    \hat{f}(b_t) = f_0 + \frac{b_t}{k}(f_{J+1} - f_0) + k b_t \sum_{j=t+1}^{J} a_j + k \sum_{j=1}^{t} a_j b_j - b_t \sum_{j=1}^{J} a_j b_j.    (57)

We can simplify several terms using the fact that a_j = (m_j − m_{j+1})/k. First of all,

    k \sum_{j=t+1}^{J} a_j = \sum_{j=t+1}^{J} (m_j - m_{j+1}) = m_{t+1} - m_{J+1}.

Furthermore,

    k \sum_{j=1}^{t} a_j b_j = \sum_{j=1}^{t} (m_j - m_{j+1}) b_j = m_1 b_1 - m_{t+1} b_t + \sum_{j=2}^{t} m_j (b_j - b_{j-1})
                            = (f_1 - f_0) - m_{t+1} b_t + \sum_{j=2}^{t} (f_j - f_{j-1}) = f_t - f_0 - m_{t+1} b_t.

Taking t = J in this last identity gives \sum_{j=1}^{J} a_j b_j = \frac{1}{k}(f_J - f_0 - m_{J+1} b_J). Plugging these into (57), we get

    \hat{f}(b_t) = f_0 + \frac{b_t}{k}(f_{J+1} - f_0) + b_t (m_{t+1} - m_{J+1}) + f_t - f_0 - m_{t+1} b_t - \frac{b_t}{k}(f_J - f_0 - m_{J+1} b_J)
                = \frac{b_t}{k} f_{J+1} - b_t m_{J+1} + f_t - \frac{b_t}{k}(f_J - m_{J+1} b_J)
                = f_t + \frac{b_t}{k}(f_{J+1} - f_J) - b_t m_{J+1} \left(1 - \frac{b_J}{k}\right)
                = f_t + \frac{b_t}{k}(f_{J+1} - f_J) - b_t \left(\frac{f_{J+1} - f_J}{k - b_J}\right)\left(\frac{k - b_J}{k}\right)
                = f_t = f(b_t).

So we see that f = f̂ at all breakpoints. Since both functions are piecewise linear with the same breakpoints and coincide there (as well as at the endpoints 0 and k), they must be the same piecewise linear function.

B.3 Finding the Best Piecewise Approximation
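The reverse-engineering step of Lemma B.3, which this section relies on, can be sketched in a few lines of Python (our own illustrative code with hypothetical names; the breakpoint data in the example is ours, chosen to be nonnegative and concave):

```python
def fit_kcg(k, bks, vals):
    """Recover k-CG parameters from a concave piecewise linear f on [0, k].

    bks  = [b_0 = 0, b_1, ..., b_J, b_{J+1} = k]  (breakpoints, endpoints included)
    vals = [f(b) for b in bks]
    Returns z0, zk, and the list of ACB-gadget parameters (a_j, b_j).
    """
    J = len(bks) - 2
    z0, zk = vals[0] / k, vals[-1] / k          # matches f at 0 and k
    # Slope of the j-th linear piece: m_j = (f_j - f_{j-1}) / (b_j - b_{j-1}).
    m = [(vals[j] - vals[j - 1]) / (bks[j] - bks[j - 1]) for j in range(1, J + 2)]
    # a_j = (m_j - m_{j+1}) / k; positive exactly because f is concave.
    a = [(m[j] - m[j + 1]) / k for j in range(J)]
    return z0, zk, list(zip(a, bks[1:-1]))

def extension(x, k, z0, zk, gadgets):
    """Eq. (53): the continuous extension of the fitted k-CG function."""
    return z0 * (k - x) + zk * x + sum(
        a * min(x * (k - b), (k - x) * b) for (a, b) in gadgets)

if __name__ == "__main__":
    # Example: breakpoints at 2 and 5 on [0, 8], with concave values.
    k, bks, vals = 8, [0, 2, 5, 8], [1.0, 5.0, 8.0, 8.0]
    z0, zk, gadgets = fit_kcg(k, bks, vals)
    for bj, fj in zip(bks, vals):
        assert abs(extension(bj, k, z0, zk, gadgets) - fj) < 1e-9
    # Agreement also holds between breakpoints, since both sides are linear there.
    assert abs(extension(3.5, k, z0, zk, gadgets) - 6.5) < 1e-9
```

Note that concavity of the input is what makes every recovered a_j positive, so the fitted parameters always satisfy the conditions of Definition B.3.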
As we did for symmetric splitting functions, we can quickly find the best piecewise linear (1 + ε)-approximation to a k-GSCB integer function w using a greedy approach. We omit proof details, as they exactly mirror the arguments provided for the symmetric case. The submodularity constraint 2w(i) ≥ w(i + 1) + w(i − 1) for i ∈ [k − 1] can be viewed as a discrete version of concavity, and will ensure that the piecewise linear function returned by such a procedure is also nonnegative and concave. After obtaining the piecewise linear approximation, we can apply Lemma B.3 to reverse engineer a k-CG function of small order that approximates w. We obtain the same asymptotic upper bound on the number of linear pieces needed to approximate w.

Lemma B.4.
Let w be a k-GSCB integer function and ε > 0. There exists a k-CG function ŵ of order J = O(ε^{-1} log k) that satisfies w(i) ≤ ŵ(i) ≤ (1 + ε)·w(i) for every i ∈ {0} ∪ [k].

B.4 Approximating Cardinality-Based Sum of Submodular Functions
Recall that k-CG functions correspond to combinations of ACB-gadgets for a hyperedge e, as well as directed edges between nodes in e and the source and sink nodes in some minimum s-t cut problem. Each ACB-gadget involves one new auxiliary node and 2|e| directed edges, and the number of ACB-gadgets is equal to the order of the k-CG function (the number of linear pieces minus one). Let H = (V, E) be a hypergraph with n = |V| nodes, where each splitting function is submodular, cardinality-based, and is not required to be symmetric or to penalize only cut hyperedges. Finding the minimum cut in H corresponds to minimizing the sum of submodular splitting functions given in (49). For ε > 0, we can preserve cuts in H to within a factor (1 + ε) by introducing a source node s and a sink node t and applying our sparse reduction techniques to each hyperedge, obtaining a directed graph Ĝ = (V ∪ A ∪ {s, t}, Ê), where A is the set of auxiliary nodes, with N = O(n + ε^{-1} Σ_{e ∈ E} log|e|) nodes and M = O(n + ε^{-1} Σ_{e ∈ E} |e| log|e|) edges. Even if the size of each e ∈ E is O(n), we have N = O(n + ε^{-1}|E| log n) and M = O(ε^{-1} log n · Σ_{e ∈ E} |e|).

References

[1] Sameer Agarwal, Kristin Branson, and Serge Belongie. Higher order learning with graphs. In
Proceedings of the 23rd International Conference on Machine Learning, ICML '06, pages 17–24, New York, NY, USA, 2006. ACM.
[2] Kadir Akbudak, Enver Kayaaslan, and Cevdet Aykanat. Hypergraph partitioning based models and methods for exploiting cache locality in sparse matrix-vector multiplication. SIAM Journal on Scientific Computing, 35(3):C237–C262, 2013.
[3] Noga Alon. On the edge-expansion of graphs. Comb. Probab. Comput., 6(2):145–152, June 1997.
[4] Charles J. Alpert and Andrew B. Kahng. Recent directions in netlist partitioning: a survey. Integration, 19(1):1–81, 1995.
[5] Alexandr Andoni, Jiecao Chen, Robert Krauthgamer, Bo Qin, David P. Woodruff, and Qin Zhang. On sketching quadratic forms. In Proceedings of the 2016 ACM Conference on Innovations in Theoretical Computer Science, ITCS '16, pages 311–319, New York, NY, USA, 2016. Association for Computing Machinery.
[6] Grey Ballard, Alex Druinsky, Nicholas Knight, and Oded Schwartz. Hypergraph partitioning for sparse matrix-matrix multiplication. ACM Trans. Parallel Comput., 3(3):18:1–18:34, December 2016.
[7] N. Bansal, O. Svensson, and L. Trevisan. New notions and constructions of sparsification for graphs and hypergraphs. In Proceedings of the 2019 IEEE Annual Symposium on Foundations of Computer Science, FOCS '19, pages 910–928, 2019.
[8] Joshua Batson, Daniel A. Spielman, and Nikhil Srivastava. Twice-Ramanujan sparsifiers. SIAM Review, 56(2):315–334, 2014.
[9] András A. Benczúr and David R. Karger. Approximating s-t minimum cuts in Õ(n²) time. In Proceedings of the Twenty-Eighth Annual ACM Symposium on Theory of Computing, STOC '96, pages 47–55, 1996.
[10] Austin R. Benson, David F. Gleich, and Jure Leskovec. Higher-order organization of complex networks.
Science, 353(6295):163–166, 2016.
[11] Austin R. Benson, Paul Liu, and Hao Yin. A simple bipartite graph projection model for clustering in networks. arXiv preprint: https://arxiv.org/abs/2007.00761, 2020.
[12] Mindaugas Bloznelis et al. Degree and clustering coefficient in sparse random intersection graphs. The Annals of Applied Probability, 23(3):1254–1289, 2013.
[13] Mindaugas Bloznelis and Justinas Petuchovas. Correlation between clustering and degree in affiliation networks. In International Workshop on Algorithms and Models for the Web-Graph, pages 90–104. Springer, 2017.
[14] T. H. Hubert Chan and Zhibin Liang. Generalizing the hypergraph Laplacian via a diffusion process with mediators. In Computing and Combinatorics, pages 441–453. Springer International Publishing, 2018.
[15] Karthekeyan Chandrasekaran, Chao Xu, and Xilin Yu. Hypergraph k-cut in randomized polynomial time. In Proceedings of the 2018 Annual ACM-SIAM Symposium on Discrete Algorithms, SODA '18, pages 1426–1438, USA, 2018. Society for Industrial and Applied Mathematics.
[16] Chandra Chekuri and Chao Xu. Computing minimum cuts in hypergraphs. In Proceedings of the 2017 Annual ACM-SIAM Symposium on Discrete Algorithms, SODA '17, pages 1085–1100, 2017.
[17] Chandra Chekuri and Chao Xu. Minimum cuts and sparsification in hypergraphs. SIAM Journal on Computing, 47(6):2118–2156, 2018.
[18] Julia Chuzhoy. On vertex sparsifiers with Steiner nodes. In Proceedings of the Forty-Fourth Annual ACM Symposium on Theory of Computing, STOC '12, pages 673–688, New York, NY, USA, 2012. Association for Computing Machinery.
[19] Aaron Clauset, Cosma Rohilla Shalizi, and M. E. J. Newman. Power-law distributions in empirical data. SIAM Review, 51(4):661–703, 2009.
[20] Alina Ene, Huy Nguyen, and László A. Végh. Decomposable submodular function minimization: discrete and continuous. In Advances in Neural Information Processing Systems, NeurIPS '17, pages 2870–2880, 2017.
[21] Alina Ene and Huy L. Nguyen. Random coordinate descent methods for minimizing decomposable submodular functions. In Proceedings of the 32nd International Conference on International Conference on Machine Learning - Volume 37, ICML '15, pages 787–795. JMLR.org, 2015.
[22] A. Fix, T. Joachims, S. M. Park, and R. Zabih. Structured learning of sum-of-submodular higher order energy functions. In Proceedings of the IEEE International Conference on Computer Vision, ICCV '13, pages 3104–3111, 2013.
[23] Kyle Fox, Debmalya Panigrahi, and Fred Zhang. Minimum cut and minimum k-cut in hypergraphs via branching contractions. In Proceedings of the 2019 Annual ACM-SIAM Symposium on Discrete Algorithms, SODA '19, pages 881–896, 2019.
[24] D. Freedman and P. Drineas. Energy minimization via graph cuts: settling what is possible. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR '05, 2005.
[25] Kaito Fujii, Tasuku Soma, and Yuichi Yoshida. Polynomial-time algorithms for submodular Laplacian systems. arXiv preprint: 1803.10923, 2018.
[26] Junhao Gan, David F. Gleich, Nate Veldt, Anthony Wirth, and Xin Zhang. Graph clustering in all parameter regimes. In
International Symposium on Mathematical Foundations of Computer Science, MFCS '20, 2020.
[27] Andrew V. Goldberg and Satish Rao. Beyond the flow decomposition barrier. J. ACM, 45(5):783–797, September 1998.
[28] Scott W. Hadley. Approximation techniques for hypergraph partitioning problems. Discrete Applied Mathematics, 59(2):115–127, 1995.
[29] Matthias Hein, Simon Setzer, Leonardo Jost, and Syama Sundar Rangapuram. The total variation on hypergraphs - learning on hypergraphs revisited. In Proceedings of the 26th International Conference on Neural Information Processing Systems, NeurIPS '13, pages 2427–2435, 2013.
[30] Jin Huang, Rui Zhang, and Jeffrey Xu Yu. Scalable hypergraph learning and processing. In Proceedings of the 2015 IEEE International Conference on Data Mining, ICDM '15, pages 775–780, Washington, DC, USA, 2015. IEEE Computer Society.
[31] Satoru Iwata, Lisa Fleischer, and Satoru Fujishige. A combinatorial strongly polynomial algorithm for minimizing submodular functions. J. ACM, 48(4):761–777, July 2001.
[32] Satoru Iwata and James B. Orlin. A simple combinatorial algorithm for submodular function minimization. In Proceedings of the 2009 Annual ACM-SIAM Symposium on Discrete Algorithms, SODA '09, pages 1230–1237, Philadelphia, PA, USA, 2009. Society for Industrial and Applied Mathematics.
[33] Stefanie Jegelka, Francis Bach, and Suvrit Sra. Reflection methods for user-friendly submodular optimization. In Proceedings of the 26th International Conference on Neural Information Processing Systems, NeurIPS '13, pages 1313–1321, 2013.
[34] Stefanie Jegelka, Hui Lin, and Jeff A. Bilmes. On fast approximate submodular minimization. In Advances in Neural Information Processing Systems, NeurIPS '11, pages 460–468, 2011.
[35] G. Karypis, R. Aggarwal, V. Kumar, and S. Shekhar. Multilevel hypergraph partitioning: applications in VLSI domain. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 7(1):69–79, March 1999.
[36] Dmitry Kogan and Robert Krauthgamer. Sketching cuts in graphs and hypergraphs. In Proceedings of the 2015 Conference on Innovations in Theoretical Computer Science, ITCS '15, pages 367–376, New York, NY, USA, 2015. Association for Computing Machinery.
[37] Pushmeet Kohli, Philip H. S. Torr, et al. Robust higher order potentials for enforcing label consistency.
International Journal of Computer Vision, 82(3):302–324, 2009.
[38] V. Kolmogorov and R. Zabin. What energy functions can be minimized via graph cuts? IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(2):147–159, Feb 2004.
[39] Vladimir Kolmogorov. Minimizing a sum of submodular functions. Discrete Appl. Math., 160(15):2246–2258, October 2012.
[40] Silvio Lattanzi and D. Sivakumar. Affiliation networks. In Proceedings of the Forty-First Annual ACM Symposium on Theory of Computing, pages 427–434, 2009.
[41] E. L. Lawler. Cutsets and partitions of hypergraphs. Networks, 3(3):275–285, 1973.
[42] Menghui Li, Jinshan Wu, Dahui Wang, Tao Zhou, Zengru Di, and Ying Fan. Evolving model of weighted networks inspired by scientific collaboration networks. Physica A: Statistical Mechanics and its Applications, 375(1):355–364, 2007.
[43] Pan Li, Niao He, and Olgica Milenkovic. Quadratic decomposable submodular function minimization: Theory and practice. Journal of Machine Learning Research, 21(106):1–49, 2020.
[44] Pan Li and Olgica Milenkovic. Inhomogeneous hypergraph clustering with applications. In Advances in Neural Information Processing Systems 30, NeurIPS '17, pages 2308–2318, 2017.
[45] Pan Li and Olgica Milenkovic. Revisiting decomposable submodular function minimization with incidence relations. In Advances in Neural Information Processing Systems 31, NeurIPS '18, pages 2237–2247, 2018.
[46] Pan Li and Olgica Milenkovic. Submodular hypergraphs: p-Laplacians, Cheeger inequalities and spectral clustering. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of ICML '18, pages 3014–3023. PMLR, 2018.
[47] Anand Louis. Hypergraph Markov operators, eigenvalues and approximation algorithms. In
Proceedings of the Forty-Seventh Annual ACM Symposium on Theory of Computing, STOC '15, pages 713–722, New York, NY, USA, 2015. Association for Computing Machinery.
[48] A. Lubotzky. Ramanujan graphs. Combinatorica, 8:261–278, 1988.
[49] Thomas L. Magnanti and Dan Stratila. Separable concave optimization approximately equals piecewise linear optimization. In IPCO 2004, pages 234–243, 2004.
[50] Thomas L. Magnanti and Dan Stratila. Separable concave optimization approximately equals piecewise-linear optimization. arXiv preprint arXiv:1201.3148, 2012.
[51] Grigorii Aleksandrovich Margulis. Explicit group-theoretical constructions of combinatorial schemes and their application to the design of expanders and concentrators. Problemy Peredachi Informatsii, 24(1):51–60, 1988.
[52] Zachary Neal. The backbone of bipartite projections: Inferring relationships from co-authorship, co-sponsorship, co-attendance and other co-behaviors. Social Networks, 39:84–97, 2014.
[53] M. E. J. Newman, S. H. Strogatz, and D. J. Watts. Random graphs with arbitrary degree distributions and their applications. Phys. Rev. E, 64:026118, Jul 2001.
[54] Robert Nishihara, Stefanie Jegelka, and Michael I. Jordan. On the convergence rate of decomposable submodular function minimization. In Proceedings of the 27th International Conference on Neural Information Processing Systems, NeurIPS '14, pages 640–648, 2014.
[55] James B. Orlin. A faster strongly polynomial time algorithm for submodular function minimization. Mathematical Programming, 118(2):237–251, May 2009.
[56] Pulak Purkait, Tat-Jun Chin, Alireza Sadri, and David Suter. Clustering with hypergraphs: the case for large hyperedges. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(9):1697–1711, 2016.
[57] José J. Ramasco and Steven A. Morris. Social inertia in collaboration networks. Phys. Rev. E, 73:016122, Jan 2006.
[58] J. A. Rodríguez. Laplacian eigenvalues and partition problems in hypergraphs. Applied Mathematics Letters, 22(6):916–921, 2009.
[59] I. Shanu, C. Arora, and P. Singla. Min norm point algorithm for higher order MRF-MAP inference. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR '16, pages 5365–5374, 2016.
[60] Tasuku Soma and Yuichi Yoshida. Spectral sparsification of hypergraphs. In
Proceedings of the 2019 Annual ACM-SIAM Symposium on Discrete Algorithms, SODA '19, pages 2570–2581, 2019.
[61] Daniel A. Spielman and Shang-Hua Teng. Spectral sparsification of graphs. SIAM Journal on Computing, 40(4):981–1025, 2011.
[62] Daniel A. Spielman and Shang-Hua Teng. Nearly linear time algorithms for preconditioning and solving symmetric, diagonally dominant linear systems. SIAM Journal on Matrix Analysis and Applications, 35(3):835–885, 2014.
[63] Domenico De Stefano, Vittorio Fuccella, Maria Prosperina Vitale, and Susanna Zaccarin. The use of different data sources in the analysis of co-authorship networks and scientific performance. Social Networks, 35(3):370–381, 2013.
[64] Peter Stobbe and Andreas Krause. Efficient minimization of decomposable submodular functions. In Proceedings of the 23rd International Conference on Neural Information Processing Systems, NeurIPS '10, pages 2208–2216, 2010.
[65] A. Vannelli and S. W. Hadley. A Gomory-Hu cut tree representation of a netlist partitioning problem. IEEE Transactions on Circuits and Systems, 37(9):1133–1139, Sep. 1990.
[66] Nate Veldt, Austin R. Benson, and Jon Kleinberg. Hypergraph cuts with general splitting functions. arXiv preprint: 2001.02817, 2020.
[67] Nate Veldt, Austin R. Benson, and Jon Kleinberg. Minimizing localized ratio cut objectives in hypergraphs. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (to appear), KDD '20, 2020.
[68] Nate Veldt, Anthony Wirth, and David F. Gleich. Parameterized correlation clustering in hypergraphs and bipartite graphs. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (to appear), KDD '20, 2020.
[69] Duncan J. Watts and Steven H. Strogatz. Collective dynamics of 'small-world' networks. Nature, 393(6684):440, 1998.
[70] Yuichi Yoshida. Cheeger inequalities for submodular transformations. In Proceedings of the 2019 Annual ACM-SIAM Symposium on Discrete Algorithms, SODA '19, pages 2582–2601, 2019.
[71] Dengyong Zhou, Jiayuan Huang, and Bernhard Schölkopf. Learning with hypergraphs: Clustering, classification, and embedding. In Proceedings of the 19th International Conference on Neural Information Processing Systems, NeurIPS '06, pages 1601–1608, 2006.
[72] Tao Zhou, Jie Ren, Matúš Medo, and Yi-Cheng Zhang. Bipartite network projection and personal recommendation. Phys. Rev. E, 76:046115, Oct 2007.
[73] J. Y. Zien, M. D. F. Schlag, and P. K. Chan. Multilevel spectral hypergraph partitioning with arbitrary vertex sizes. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 18(9):1389–1399, Sep. 1999.
[74] Stanislav Živný, David A. Cohen, and Peter G. Jeavons. The expressive power of binary submodular functions. In Rastislav Královič and Damian Niwiński, editors,