The Total Variation on Hypergraphs - Learning on Hypergraphs Revisited
Matthias Hein, Simon Setzer, Leonardo Jost, Syama Sundar Rangapuram
Department of Computer Science, Saarland University
Abstract
Hypergraphs allow one to encode higher-order relationships in data and are thus a very flexible modeling tool. Current learning methods are either based on approximations of the hypergraphs via graphs or on tensor methods which are only applicable under special conditions. In this paper, we present a new learning framework on hypergraphs which fully uses the hypergraph structure. The key element is a family of regularization functionals based on the total variation on hypergraphs.
1 Introduction

Graph-based learning is by now well established in machine learning and is the standard way to deal with data that encode pairwise relationships. Hypergraphs are a natural extension of graphs which also allow one to model higher-order relations in data. It has been recognized in several application areas such as computer vision [1, 2], bioinformatics [3, 4] and information retrieval [5, 6] that such higher-order relations are available and help to improve the learning performance.

Current approaches in hypergraph-based learning can be divided into two categories. The first one uses tensor methods for clustering as the higher-order extension of matrix (spectral) methods for graphs [7, 8, 9]. While tensor methods are mathematically quite appealing, they are limited to so-called $k$-uniform hypergraphs, that is, each hyperedge contains exactly $k$ vertices. Thus, they are not able to model mixed higher-order relationships. The second main approach can deal with arbitrary hypergraphs [10, 11]. The basic idea of this line of work is to approximate the hypergraph via a standard weighted graph. In a second step, one then uses methods developed for graph-based clustering and semi-supervised learning. The two main ways of approximating the hypergraph by a standard graph are the clique and the star expansion, which were compared in [12]. One can summarize [12] by stating that no approximation fully encodes the hypergraph structure. Earlier, [13] had proven that an exact representation of the hypergraph via a graph retaining its cut properties is impossible.

In this paper, we overcome the limitations of both existing approaches. For both clustering and semi-supervised learning the key element, either explicitly or implicitly, is the cut functional. Our aim is to directly work with the cut defined on the hypergraph. We discuss in detail the differences between the hypergraph cut and the cut induced by the clique and star expansion in Section 2.1. Then, in Section 2.2, we introduce the total variation on a hypergraph as the Lovász extension of the hypergraph cut. Based on this, we propose a family of regularization functionals which interpolate between the total variation and a regularization functional enforcing smoother functions on the hypergraph, corresponding to Laplacian-type regularization on graphs. They are the key for the semi-supervised learning method introduced in Section 3. In Section 4, we show in line with recent research [14, 15, 16, 17] that there exists a tight relaxation of the normalized hypergraph cut. In both learning problems, convex optimization problems have to be solved, for which we derive scalable methods in Section 5. The main ingredients of these algorithms are proximal mappings, for which we provide a novel algorithm and analyze its complexity. In the experimental Section 6, we show that fully incorporating hypergraph structure is beneficial. All proofs are moved to the supplementary material.
2 The Total Variation on Hypergraphs

A large class of graph-based algorithms in semi-supervised learning and clustering is based either explicitly or implicitly on the cut. Thus, we first discuss in Section 2.1 the hypergraph cut and the corresponding approximations. In Section 2.2, we introduce, in analogy to graphs, the total variation on hypergraphs as the Lovász extension of the hypergraph cut.
2.1 Hypergraphs and the hypergraph cut

Hypergraphs allow modeling relations which are not only pairwise as in graphs but involve multiple vertices. In this paper, we consider weighted undirected hypergraphs $H = (V, E, w)$ where $V$ is the vertex set with $|V| = n$ and $E$ the set of hyperedges with $|E| = m$. Each hyperedge $e \in E$ corresponds to a subset of vertices, i.e., to an element of $2^V$. The vector $w \in \mathbb{R}^m$ contains for each hyperedge $e$ its non-negative weight $w_e$. In the following, we use the letter $H$ also for the incidence matrix $H \in \mathbb{R}^{|V| \times |E|}$, which for $i \in V$ and $e \in E$ is given by
$$H_{i,e} = \begin{cases} 1 & \text{if } i \in e, \\ 0 & \text{else.} \end{cases}$$
The degree of a vertex $i \in V$ is defined as $d_i = \sum_{e \in E} w_e H_{i,e}$ and the cardinality of an edge $e$ can be written as $|e| = \sum_{j \in V} H_{j,e}$. We would like to emphasize that we do not impose the restriction that the hypergraph is $k$-uniform, i.e., that each hyperedge contains exactly $k$ vertices.

The considered class of hypergraphs contains the set of undirected, weighted graphs, which is equivalent to the set of $2$-uniform hypergraphs. The motivation for the total variation on hypergraphs comes from the correspondence between the cut on a graph and the total variation functional. Thus, we recall the definition of the cut on weighted graphs $G = (V, W)$ with weight matrix $W$. Let $\overline{C} = V \setminus C$ denote the complement of $C$ in $V$. Then, for a partition $(C, \overline{C})$, the cut is defined as
$$\mathrm{cut}_G(C, \overline{C}) = \sum_{i,j \,:\, i \in C,\, j \in \overline{C}} w_{ij}.$$
This standard definition of the cut carries over naturally to a hypergraph $H$:
$$\mathrm{cut}_H(C, \overline{C}) = \sum_{e \in E \,:\, e \cap C \neq \emptyset,\, e \cap \overline{C} \neq \emptyset} w_e. \qquad (1)$$
Thus, the cut functional on a hypergraph is just the sum of the weights of the hyperedges which have vertices both in $C$ and $\overline{C}$. It is not biased towards a particular way the hyperedge is cut, that is, how many vertices of the hyperedge are in $C$ resp. $\overline{C}$. This emphasizes that the vertices in a hyperedge belong together, and we penalize every cut of a hyperedge with the same value.

In order to handle hypergraphs with existing methods developed for graphs, the focus in previous works [11, 12] has been on transforming the hypergraph into a graph. In [11], they suggest using the clique expansion (CE), i.e., every hyperedge $e \in E$ is replaced with a fully connected subgraph where every edge in this subgraph has weight $w_e / |e|$. This leads to the cut functional $\mathrm{cut}_{CE}$,
$$\mathrm{cut}_{CE}(C, \overline{C}) := \sum_{e \in E \,:\, e \cap C \neq \emptyset,\, e \cap \overline{C} \neq \emptyset} \frac{w_e}{|e|}\, |e \cap C|\, |e \cap \overline{C}|. \qquad (2)$$
Note that in contrast to the hypergraph cut (1), the value of $\mathrm{cut}_{CE}$ depends on the way each hyperedge is cut, since the term $|e \cap C|\,|e \cap \overline{C}|$ makes the weights dependent on the partition. In particular, the smallest weight is attained if only a single vertex is split off, whereas the largest weight is attained if the partition of the hyperedge is most balanced. In comparison to the hypergraph cut, this leads to a bias towards cuts that favor splitting off single vertices from a hyperedge, which in our point of view is an undesired property for most applications. We illustrate this with an example in Figure 1, where the minimum hypergraph cut ($\mathrm{cut}_H$) leads to a balanced partition, whereas the minimum clique expansion cut ($\mathrm{cut}_{CE}$) not only cuts an additional hyperedge but is also unbalanced. This is due to its bias towards splitting off single nodes of a hyperedge.
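The difference between the two cut functionals is easy to reproduce numerically. The following sketch evaluates $\mathrm{cut}_H$ and $\mathrm{cut}_{CE}$ for a toy hypergraph; the hyperedges, weights and partitions below are made up for illustration and are not the example of Figure 1.

```python
# Sketch: hypergraph cut (1) vs. clique expansion cut (2) on a made-up example.

def cut_H(edges, weights, C):
    """Sum of weights of hyperedges with vertices on both sides of (C, V\\C)."""
    C = set(C)
    return sum(w for e, w in zip(edges, weights)
               if 0 < len(C & set(e)) < len(e))

def cut_CE(edges, weights, C):
    """Cut of the clique expansion: each hyperedge e becomes a clique with
    edge weight w_e/|e|, giving the term (w_e/|e|) |e ∩ C| |e ∩ C̄|."""
    C = set(C)
    total = 0.0
    for e, w in zip(edges, weights):
        a = len(C & set(e))            # |e ∩ C|
        b = len(e) - a                 # |e ∩ C̄|
        total += w / len(e) * a * b
    return total

edges = [(0, 1, 2, 3), (3, 4, 5, 6), (6, 7)]
weights = [1.0, 1.0, 0.5]
for C in [(0, 1, 2, 3), (0, 1, 2)]:
    print(C, cut_H(edges, weights, C), cut_CE(edges, weights, C))
```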
Another argument against the clique expansion is computational complexity: for large hyperedges the clique expansion leads to (almost) fully connected graphs, which makes computations slow and is prohibitive for large hypergraphs. We omit the discussion of the star graph approximation of hypergraphs discussed in [12], as it is shown there that the star graph expansion is very similar to the clique expansion. Instead, we want to recall the result of Ihler et al. [13], which states that in general there exists no graph with the same vertex set $V$ which has for every partition $(C, \overline{C})$ the same cut value as the hypergraph cut.

Figure 1: Minimum hypergraph cut $\mathrm{cut}_H$ vs. minimum cut of the clique expansion $\mathrm{cut}_{CE}$: for the edge weights of the example, the minimum hypergraph cut $(C_1, \overline{C}_1)$ is perfectly balanced. Although cutting one hyperedge more and being unbalanced, $(C_2, \overline{C}_2)$ is the optimal cut for the clique expansion approximation.

Finally, note that for weighted $2$-uniform hypergraphs it is always possible to find a corresponding graph such that any cut of the graph is equal to the corresponding cut of the hypergraph.
Proposition 2.1. Suppose $H = (V, E, w)$ is a weighted $2$-uniform hypergraph. Then $W \in \mathbb{R}^{|V| \times |V|}$ defined as $W = H \operatorname{diag}(w) H^T$ defines the weight matrix of a graph $G = (V, W)$ where each cut of $G$ has the same value as the corresponding hypergraph cut of $H$.

Proof. The cut value of a partition $(C, \overline{C})$ of $G$ is given as
$$\mathrm{cut}_G(C, \overline{C}) = \sum_{e \in E} |e \cap C|\, |e \cap \overline{C}|\, w_e.$$
For a $2$-uniform hypergraph, the product $|e \cap C|\,|e \cap \overline{C}|$ takes the value $1$ if $e$ is cut by $C$ and zero otherwise; we thus get equivalence to the hypergraph cut.
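Proposition 2.1 can be checked directly with a few lines of linear algebra; the small $2$-uniform incidence matrix below is a made-up example.

```python
# Sketch: for a 2-uniform hypergraph, W = H diag(w) H^T reproduces the cut values.
import numpy as np

# Made-up 2-uniform hypergraph on 4 vertices: edges {0,1}, {1,2}, {2,3}.
Hinc = np.array([[1, 0, 0],
                 [1, 1, 0],
                 [0, 1, 1],
                 [0, 0, 1]], dtype=float)
w = np.array([2.0, 1.0, 3.0])
W = Hinc @ np.diag(w) @ Hinc.T   # off-diagonal entries are the graph weights;
                                 # the diagonal is nonzero but never enters a cut

C = np.array([True, True, False, False])   # partition ({0,1}, {2,3})
cut_G = W[np.ix_(C, ~C)].sum()             # graph cut across the partition
cut_H = sum(we for e, we in zip([(0, 1), (1, 2), (2, 3)], w)
            if 0 < sum(C[i] for i in e) < len(e))
print(cut_G, cut_H)   # both equal 1.0: only the edge {1,2} is cut
```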
2.2 The total variation on hypergraphs

In this section, we define the total variation on hypergraphs. The key technical element is the Lovász extension, which extends a set function, seen as a mapping on $2^V$, to a function on $\mathbb{R}^{|V|}$.

Definition 2.1. Let $\hat{S} : 2^V \to \mathbb{R}$ be a set function with $\hat{S}(\emptyset) = 0$. Let $f \in \mathbb{R}^{|V|}$, let $V$ be ordered such that $f_1 \leq f_2 \leq \ldots \leq f_n$ and define $C_i = \{ j \in V \mid j > i \}$. Then, the Lovász extension $S : \mathbb{R}^{|V|} \to \mathbb{R}$ of $\hat{S}$ is given by
$$S(f) = \sum_{i=1}^{n} f_i \big( \hat{S}(C_{i-1}) - \hat{S}(C_i) \big) = \sum_{i=1}^{n-1} \hat{S}(C_i)(f_{i+1} - f_i) + f_1 \hat{S}(V).$$
Note that for the characteristic function of a set $C \subset V$, we have $S(\mathbf{1}_C) = \hat{S}(C)$.

It is well known that the Lovász extension $S$ is a convex function if and only if $\hat{S}$ is submodular [18]. For graphs $G = (V, W)$, the total variation on graphs is defined as the Lovász extension of the graph cut [18], given as $TV_G : \mathbb{R}^{|V|} \to \mathbb{R}$, $TV_G(f) = \frac{1}{2} \sum_{i,j=1}^{n} w_{ij} |f_i - f_j|$.
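Definition 2.1 translates directly into code. The sketch below evaluates the Lovász extension of an arbitrary set function by sorting $f$ and checks the property $S(\mathbf{1}_C) = \hat{S}(C)$; the small graph is a made-up example.

```python
# Sketch: evaluate the Lovász extension S(f) of a set function via Definition 2.1.
import numpy as np

def lovasz_extension(S_hat, f):
    """S(f) = sum_i f_(i) (S_hat(C_{i-1}) - S_hat(C_i)), with f sorted increasingly
    and C_i the index set of the n - i largest entries."""
    n = len(f)
    val, C = 0.0, set(range(n))            # C_0 = V
    for i in np.argsort(f):                # vertices in increasing order of f
        C_next = C - {int(i)}              # C_i drops the next-smallest vertex
        val += f[i] * (S_hat(C) - S_hat(C_next))
        C = C_next
    return val

# Check S(1_C) = S_hat(C) for the (submodular) cut of a small made-up graph.
W = np.array([[0, 1, 2], [1, 0, 0], [2, 0, 0]], dtype=float)
def cut(C):
    Cbar = [j for j in range(3) if j not in C]
    return W[np.ix_(list(C), Cbar)].sum()

for C in [{0}, {1}, {0, 2}]:
    f = np.array([1.0 if i in C else 0.0 for i in range(3)])
    print(lovasz_extension(cut, f), cut(C))   # the two values agree
```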
Proposition 2.2. The total variation $TV_H : \mathbb{R}^{|V|} \to \mathbb{R}$ on a hypergraph $H = (V, E, w)$, defined as the Lovász extension of the hypergraph cut $\hat{S}(C) = \mathrm{cut}_H(C, \overline{C})$, is a convex function given by
$$TV_H(f) = \sum_{e \in E} w_e \Big( \max_{i \in e} f_i - \min_{j \in e} f_j \Big) = \sum_{e \in E} w_e \max_{i,j \in e} |f_i - f_j|.$$

Proof. Using $C_{i-1} = C_i \cup \{i\}$, the Lovász extension can be written as
$$TV_H(f) = \sum_{i=1}^{n} f_i \big( \mathrm{cut}_H(C_{i-1}, \overline{C}_{i-1}) - \mathrm{cut}_H(C_i, \overline{C}_i) \big) = \sum_{i=1}^{n} f_i \big( \mathrm{cut}(\{i\}, \overline{C}_{i-1}) - \mathrm{cut}(C_i, \{i\}) \big)$$
$$= \sum_{i=1}^{n} f_i \Big( \sum_{\substack{e \in E,\, i \in e \\ e \cap \{1,\ldots,i-1\} \neq \emptyset}} w_e \;-\; \sum_{\substack{e \in E,\, i \in e \\ e \cap \{i+1,\ldots,n\} \neq \emptyset}} w_e \Big) = \sum_{e \in E} w_e \Big( \max_{i \in e} f_i - \min_{j \in e} f_j \Big).$$

It is easy to see that the Lovász extension of the hypergraph cut is a convex function. Since the maximum of convex functions is convex, $-\min_{i \in e} f_i = \max_{i \in e} (-f_i)$, and the hyperedge weights are non-negative, we have a non-negative combination of convex functions, which is convex. Alternatively, one could use that the hypergraph cut is submodular and the Lovász extension of every submodular set function is convex.

Note that the total variation on a hypergraph reduces to the total variation on graphs if $H$ is $2$-uniform (a standard graph). There is an interesting relation of the total variation on hypergraphs to sparsity-inducing group norms. Namely, defining for each edge $e \in E$ the difference operator $D_e : \mathbb{R}^{|V|} \to \mathbb{R}^{|V| \times |V|}$ by $(D_e f)_{ij} = f_i - f_j$ if $i, j \in e$ and $0$ otherwise, $TV_H$ can be written as $TV_H(f) = \sum_{e \in E} w_e \|D_e f\|_\infty$, which can be seen as inducing a group sparse structure on the gradient level. The groups are the hyperedges and are thus typically overlapping. This could potentially lead to extensions of the elastic net on graphs to hypergraphs.

It is known that using the total variation on graphs as a regularization functional in semi-supervised learning (SSL) leads to very spiky solutions for small numbers of labeled points. Thus, one would like to have regularization functionals enforcing more smoothness of the solutions. For graphs this is achieved by using the family of regularization functionals $\Omega_{G,p} : \mathbb{R}^{|V|} \to \mathbb{R}$,
$$\Omega_{G,p}(f) = \frac{1}{2} \sum_{i,j=1}^{n} w_{ij} |f_i - f_j|^p.$$
For $p = 2$ we get the regularization functional of the graph Laplacian, which is the basis of a large class of methods on graphs. In analogy to graphs, we define a corresponding family on hypergraphs.
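As a quick numerical sanity check, the closed form of Proposition 2.2 can be verified together with $TV_H(\mathbf{1}_C) = \mathrm{cut}_H(C, \overline{C})$; the hypergraph below is again a made-up example.

```python
# Sketch: TV_H(f) = sum_e w_e (max_{i in e} f_i - min_{j in e} f_j), Prop. 2.2.
import numpy as np

edges = [(0, 1, 2), (2, 3, 4), (1, 4)]        # made-up hyperedges
w = np.array([1.0, 2.0, 0.5])

def tv_H(f):
    return sum(we * (max(f[list(e)]) - min(f[list(e)]))
               for e, we in zip(edges, w))

def cut_H(C):
    return sum(we for e, we in zip(edges, w)
               if 0 < len(C & set(e)) < len(e))

f = np.array([0.3, -1.2, 0.7, 0.0, 2.1])      # arbitrary vertex function
print(tv_H(f))                                 # total variation of f

C = {0, 1}                                     # TV_H of an indicator equals the cut
print(tv_H(np.array([1.0 if i in C else 0.0 for i in range(5)])), cut_H(C))
```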
Definition 2.2. The regularization functionals $\Omega_{H,p} : \mathbb{R}^{|V|} \to \mathbb{R}$ for a hypergraph $H = (V, E, w)$ are defined for $p \geq 1$ as
$$\Omega_{H,p}(f) = \sum_{e \in E} w_e \Big( \max_{i \in e} f_i - \min_{j \in e} f_j \Big)^p.$$
Lemma 2.1. The functionals $\Omega_{H,p} : \mathbb{R}^{|V|} \to \mathbb{R}$ are convex.

Proof. The $p$-th power of a non-negative convex function is convex for $p \geq 1$, as
$$\big( f(\lambda x + (1 - \lambda) y) \big)^p \leq \big( \lambda f(x) + (1 - \lambda) f(y) \big)^p \leq \lambda f(x)^p + (1 - \lambda) f(y)^p,$$
where the last inequality follows from the convexity of $x^p$ on $\mathbb{R}_+$. Thus, the $p$-th power of $\max_{i \in e} f_i - \min_{j \in e} f_j$ is convex.

Note that $\Omega_{H,1}(f) = TV_H(f)$. If $H$ is a graph, $\Omega_{H,p}$ reduces to the Laplacian regularization $\Omega_{G,p}$. Note that for characteristic functions of sets, $f = \mathbf{1}_C$, it holds that $\Omega_{H,p}(\mathbf{1}_C) = \mathrm{cut}_H(C, \overline{C})$. Thus, the difference between the hypergraph cut and its approximations such as clique and star expansion carries over to $\Omega_{H,p}$ and $\Omega_{G_{CE},p}$, respectively.

3 Semi-supervised learning

With the regularization functionals derived in the last section, we can immediately write down a formulation for two-class semi-supervised learning on hypergraphs similar to the well-known approaches of [19, 20]. Given the label set $L$, we construct the vector $Y \in \mathbb{R}^n$ with $Y_i = 0$ if $i \notin L$ and $Y_i$ equal to the label in $\{-1, 1\}$ if $i \in L$. We propose solving
$$f^* = \arg\min_{f \in \mathbb{R}^{|V|}} \|f - Y\|_2^2 + \lambda\, \Omega_{H,p}(f), \qquad (3)$$
where $\lambda > 0$ is the regularization parameter. In Section 5, we discuss how this convex optimization problem can be solved efficiently for the cases $p = 1$ and $p = 2$. Note that other loss functions than the squared loss could be used. However, the regularizer aims at contracting the function and we use the label set $\{-1, 1\}$, so that $f^* \in [-1, 1]^{|V|}$. Hence, on the interval $[-1, 1]$ the squared loss behaves very similarly to other margin-based loss functions. In general, we recommend using $p = 2$ as it corresponds to Laplacian-type regularization for graphs, which is known to work well. For graphs, $p = 1$ is known to produce spiky solutions for small numbers of labeled points. This is due to the effect that cutting "out" the labeled points leads to a much smaller cut than, e.g., producing a balanced partition. However, in the case where one has only a small number of hyperedges this effect is much smaller, and we will see in the experiments that $p = 1$ also leads to reasonable solutions.
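Before turning to the scalable solvers of Section 5, problem (3) can already be approached with a plain subgradient method; the following sketch, for $p = 2$, is a slow but simple baseline on a made-up hypergraph with two labeled points, not the method of the paper.

```python
# Sketch: naive subgradient descent for (3) with p = 2; the PDHG method of
# Section 5 is the scalable alternative. Hypergraph and labels are made up.
import numpy as np

edges = [(0, 1, 2), (2, 3), (3, 4, 5)]
w = np.array([1.0, 1.0, 1.0])
n, lam = 6, 0.5
Y = np.zeros(n); Y[0], Y[5] = 1.0, -1.0        # two labeled points

def subgrad_omega2(f):
    """A subgradient of Omega_{H,2}(f) = sum_e w_e (max_e f - min_e f)^2."""
    g = np.zeros_like(f)
    for e, we in zip(edges, w):
        e = list(e)
        i, j = e[np.argmax(f[e])], e[np.argmin(f[e])]
        g[i] += 2.0 * we * (f[i] - f[j])        # derivative w.r.t. the max vertex
        g[j] -= 2.0 * we * (f[i] - f[j])        # ... and the min vertex
    return g

f = np.zeros(n)
for k in range(2000):
    g = 2.0 * (f - Y) + lam * subgrad_omega2(f)
    f -= 0.5 / (k + 1) ** 0.5 * g               # diminishing step size
print(np.round(f, 3))                           # labels propagate along hyperedges
```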
4 Balanced hypergraph cuts

In Section 2.1, we discussed the difference between the hypergraph cut (1) and the graph cut of the clique expansion (2) of the hypergraph and gave a simple example in Figure 1 where these cuts yield quite different results. Clearly, this difference carries over to the famous normalized cut criterion, introduced in [21, 22] for clustering of graphs with applications in image segmentation. For a hypergraph, the ratio cut resp. normalized cut can be formulated as
$$\mathrm{RCut}(C, \overline{C}) = \frac{\mathrm{cut}_H(C, \overline{C})}{|C|\,|\overline{C}|}, \qquad \mathrm{NCut}(C, \overline{C}) = \frac{\mathrm{cut}_H(C, \overline{C})}{\mathrm{vol}(C)\, \mathrm{vol}(\overline{C})},$$
where $\mathrm{vol}(C) = \sum_{i \in C} d_i$; the two criteria incorporate different balancing terms. Note that, in contrast to the normalized cut for graphs, the normalized hypergraph cut allows no relaxation into a linear eigenproblem (spectral relaxation). Thus, we follow a recent line of research [14, 15, 16, 17] where it has been shown that the standard spectral relaxation of the normalized cut used in spectral clustering [22] is loose and that a tight, in fact exact, relaxation can be formulated in terms of a nonlinear eigenproblem. Although nonlinear eigenproblems are non-convex, one can compute nonlinear eigenvectors quite efficiently at the price of losing global optimality. However, it has been shown that the potentially non-optimal solutions of the exact relaxation outperform in practice the globally optimal solution of the loose relaxation, often by a large margin. In this section, we extend their approach to hypergraphs and consider general balanced hypergraph cuts $\mathrm{BCut}(C, \overline{C})$ of the form
C, C ) of the form, Bcut(
C, C ) = cut H ( C,C )ˆ S ( C ) , where ˆ S : 2 V → R + is a non-negative, symmetric set function (that is ˆ S ( C ) = ˆ S ( C ) ). For the normalized cut one has ˆ S ( C ) = vol( C ) vol( C ) whereas for the Cheeger cut one has ˆ S ( C ) = min { vol C, vol C } . Otherexamples of balancing functions can be found in [16]. Our following result shows that the balancedhypergraph cut also has an exact relaxation into a continuous nonlinear eigenproblem [14]. Theorem 4.1.
Theorem 4.1. Let $H = (V, E, w)$ be a finite, weighted hypergraph and let $S : \mathbb{R}^{|V|} \to \mathbb{R}$ be the Lovász extension of the symmetric, non-negative set function $\hat{S} : 2^V \to \mathbb{R}$. Then it holds that
$$\min_{f \in \mathbb{R}^{|V|}} \frac{\sum_{e \in E} w_e \big( \max_{i \in e} f_i - \min_{j \in e} f_j \big)}{S(f)} = \min_{C \subset V} \frac{\mathrm{cut}_H(C, \overline{C})}{\hat{S}(C)}.$$
Further, let $f \in \mathbb{R}^{|V|}$ and define $C_t := \{ i \in V \mid f_i > t \}$. Then,
$$\min_{t \in \mathbb{R}} \frac{\mathrm{cut}_H(C_t, \overline{C}_t)}{\hat{S}(C_t)} \leq \frac{\sum_{e \in E} w_e \big( \max_{i \in e} f_i - \min_{j \in e} f_j \big)}{S(f)}.$$
Proof. By Prop. 2.2, the Lovász extension of $\mathrm{cut}_H(C, \overline{C})$ is given by $\sum_{e \in E} w_e \big( \max_{i \in e} f_i - \min_{j \in e} f_j \big)$. Noting that both $\mathrm{cut}_H(C, \overline{C})$ and $\hat{S}(C)$ vanish on the full set $V$, the proof then follows from the recent result of [17], which shows in this case the equivalence between the set problem and the continuous problem written in terms of the Lovász extensions.

The last part of the theorem shows that "optimal thresholding" (turning $f \in \mathbb{R}^{|V|}$ into a partition) among all level sets of any $f \in \mathbb{R}^{|V|}$ can only lead to a better or equal balanced hypergraph cut.

The question remains how to minimize the ratio $Q(f) = \frac{TV_H(f)}{S(f)}$. As discussed in [16], every Lovász extension $S$ can be written as a difference of convex, positively $1$-homogeneous functions (a function $f : \mathbb{R}^d \to \mathbb{R}$ is positively $1$-homogeneous if $f(\alpha x) = \alpha f(x)$ for all $\alpha > 0$), $S = S_1 - S_2$. Moreover, as shown in Prop. 2.2, the total variation $TV_H$ is convex. Thus, we have to minimize a non-negative ratio of a convex and a difference of convex (d.c.) function. We employ the RatioDCA algorithm [16] shown in Algorithm 1. The main part is the convex inner problem.

Algorithm 1 RatioDCA - Minimization of a non-negative ratio of 1-homogeneous d.c. functions
  Objective: $Q(f) = \frac{R_1(f) - R_2(f)}{S_1(f) - S_2(f)}$
  Initialization: $f^0$ random with $\|f^0\|_2 = 1$, $\lambda^0 = Q(f^0)$
  repeat
    $s_1(f^k) \in \partial S_1(f^k)$, $r_2(f^k) \in \partial R_2(f^k)$
    $f^{k+1} = \arg\min_{\|u\|_2 \leq 1} \big\{ R_1(u) - \langle u, r_2(f^k) \rangle + \lambda^k \big( S_2(u) - \langle u, s_1(f^k) \rangle \big) \big\}$
    $\lambda^{k+1} = \big( R_1(f^{k+1}) - R_2(f^{k+1}) \big) / \big( S_1(f^{k+1}) - S_2(f^{k+1}) \big)$
  until $\frac{|\lambda^{k+1} - \lambda^k|}{\lambda^k} < \epsilon$
  Output: eigenvalue $\lambda^{k+1}$ and eigenvector $f^{k+1}$.

In our case $R_1 = TV_H$, $R_2 = 0$, and thus the inner problem reads
$$\min_{\|u\|_2 \leq 1} \big\{ TV_H(u) + \lambda^k \big( S_2(u) - \langle u, s_1(f^k) \rangle \big) \big\}. \qquad (4)$$
For simplicity we restrict ourselves to submodular balancing functions, in which case $S$ is convex and thus $S_2 = 0$. For the general case, see [16]. Note that the balancing functions of ratio/normalized cut and Cheeger cut are submodular. It turns out that the inner problem is very similar to the semi-supervised learning formulation (3). The efficient solution of both problems is discussed next.
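To make the scheme concrete, the following sketch instantiates RatioDCA for the ratio cut balancing $\hat{S}(C) = |C||\overline{C}|$, whose Lovász extension is $S(f) = \frac{1}{2}\sum_{i,j}|f_i - f_j|$, and replaces the inner PDHG solver of Section 5 with a crude projected-subgradient loop; everything here, including the hypergraph, is a schematic stand-in. Optimal thresholding from Theorem 4.1 is applied at the end.

```python
# Sketch: RatioDCA (Algorithm 1) for the ratio cut, S_hat(C) = |C||C_bar|,
# with a crude projected-subgradient inner solver standing in for PDHG.
import numpy as np

edges = [(0, 1, 2), (2, 3), (3, 4, 5), (0, 5)]   # made-up hypergraph
w = np.ones(4); n = 6

def tv_H(f):
    return sum(we * (max(f[list(e)]) - min(f[list(e)])) for e, we in zip(edges, w))

def subgrad_tv(f):
    g = np.zeros(n)
    for e, we in zip(edges, w):
        e = list(e)
        g[e[np.argmax(f[e])]] += we
        g[e[np.argmin(f[e])]] -= we
    return g

def S(f):                                        # Lovász extension of |C||C_bar|
    return 0.5 * np.abs(f[:, None] - f[None, :]).sum()

def inner(s1, lam, iters=500):                   # min_{||u||<=1} TV_H(u) - lam <u, s1>
    u = np.random.randn(n); u /= np.linalg.norm(u)
    for k in range(iters):
        u -= 0.1 / np.sqrt(k + 1) * (subgrad_tv(u) - lam * s1)
        u /= max(1.0, np.linalg.norm(u))
    return u

f = np.random.randn(n); f /= np.linalg.norm(f)
lam = tv_H(f) / S(f)
for _ in range(20):                              # outer RatioDCA loop
    s1 = np.sign(f[:, None] - f[None, :]).sum(axis=1)  # subgradient of S at f
    f = inner(s1, lam)
    lam = tv_H(f) / S(f)

# Optimal thresholding (Theorem 4.1): best level set of f w.r.t. the ratio cut.
def cut_H(C):
    return sum(we for e, we in zip(edges, w) if 0 < len(C & set(e)) < len(e))
best = min(({i for i in range(n) if f[i] > t} for t in np.unique(f)[:-1]),
           key=lambda C: cut_H(C) / (len(C) * (n - len(C))))
print(sorted(best), lam)
```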
5 Algorithms

The problem (3) we want to solve for semi-supervised learning and the inner problem (4) of RatioDCA have a common structure: they are the sum of convex functionals, where one of them is the novel regularizer $\Omega_{H,p}$. We propose to solve these problems using a primal-dual algorithm, denoted PDHG in this paper, which was proposed in [23, 24]. Its main idea is to iteratively solve for each convex term in the objective function a so-called proximal problem. Solving the proximal problem w.r.t. a mapping $g : \mathbb{R}^n \to \mathbb{R}$ and a vector $\tilde{x} \in \mathbb{R}^n$ means to compute the proximal map $\mathrm{prox}_g$ defined by
$$\mathrm{prox}_g(\tilde{x}) = \arg\min_{x \in \mathbb{R}^n} \Big\{ \frac{1}{2} \|x - \tilde{x}\|_2^2 + g(x) \Big\}.$$
The main idea here is that often these proximal problems can be solved efficiently, leading to a fast convergence of the overall algorithm. In order to point out the common structure of PDHG for both (3) and the inner problems of Algorithm 1, we first consider a general optimization problem of the form
$$\min_{f \in \mathbb{R}^n} \{ G(f) + F(Kf) \}, \qquad (5)$$
where $K \in \mathbb{R}^{m \times n}$ and $G : \mathbb{R}^n \to \mathbb{R}$, $F : \mathbb{R}^m \to \mathbb{R}$ are lower-semicontinuous convex functions. Recall that the conjugate function $G^*$ of $G$ is defined as $G^*(x) = \sup_{f \in \mathbb{R}^n} \{ \langle x, f \rangle - G(f) \}$, and similarly for $F^*$. In terms of these conjugate functions, we can write the dual problem of (5) as
$$- \min_{\alpha \in \mathbb{R}^m} \{ G^*(-K^T \alpha) + F^*(\alpha) \}. \qquad (6)$$
The PDHG algorithm for (5) has the following general form; for convergence proofs we refer to [23, 24].

Algorithm 2 PDHG
  Initialization: $f^{(0)} = \bar{f}^{(0)} = 0$, $\theta \in [0, 1]$, $\sigma, \tau > 0$ with $\sigma\tau < 1 / \|K\|^2$
  repeat
    $\alpha^{(k+1)} = \mathrm{prox}_{\sigma F^*}\big( \alpha^{(k)} + \sigma K \bar{f}^{(k)} \big)$
    $f^{(k+1)} = \mathrm{prox}_{\tau G}\big( f^{(k)} - \tau K^T \alpha^{(k+1)} \big)$
    $\bar{f}^{(k+1)} = f^{(k+1)} + \theta \big( f^{(k+1)} - f^{(k)} \big)$
  until relative duality gap $< \epsilon$
  Output: $f^{(k+1)}$.

We will now apply this general setting to the convex optimization problems arising in this paper. First, Table 1 shows how one can choose $G$ in (5) in order to solve (3) and (4), provides the solutions of the corresponding proximal problems, and gives the conjugate functions. Note, however, that smooth convex terms can also be directly exploited [25]. Note that we write the constraint in the inner problem of RatioDCA via the indicator function $\iota_{\|\cdot\|_2 \leq 1}$ defined by $\iota_{\|\cdot\|_2 \leq 1}(x) = 0$ if $\|x\|_2 \leq 1$ and $+\infty$ otherwise. Clearly, both proximal problems have an explicit solution.

  SSL functional (3): $G(f) = \|f - Y\|_2^2$, $\quad \mathrm{prox}_{\tau G}(\tilde{x}) = \frac{1}{1 + 2\tau}\,(\tilde{x} + 2\tau Y)$, $\quad G^*(x) = \frac{1}{4}\|x + 2Y\|_2^2 - \|Y\|_2^2$.
  Inner problem of RatioDCA (4): $G(f) = -\langle s_1(f^k), f \rangle + \iota_{\|\cdot\|_2 \leq 1}(f)$, $\quad \mathrm{prox}_{\tau G}(\tilde{x}) = \frac{\tilde{x} + \tau s_1(f^k)}{\max\{1,\, \|\tilde{x} + \tau s_1(f^k)\|_2\}}$, $\quad G^*(x) = \|x + s_1(f^k)\|_2$.

Table 1: Data terms of the SSL functional (3) (top) and the inner problem of RatioDCA (4) (bottom) with the respective proximal map and conjugate.

Second, we discuss the choice of $F$ and $K$ to incorporate $\Omega_{H,p}$.

PDHG algorithm for $\Omega_{H,1}$. Let $m_e$ denote the number of vertices in hyperedge $e \in E$. The main idea is to write
$$\lambda\, \Omega_{H,1}(f) = F(Kf) := \sum_{e \in E} \big( F_{(e,1)}(K_e f) + F_{(e,2)}(K_e f) \big), \qquad (7)$$
where the rows of the matrices $K_e \in \mathbb{R}^{m_e \times n}$ are the $i$-th standard unit vectors for $i \in e$, and the functionals $F_{(e,j)} : \mathbb{R}^{m_e} \to \mathbb{R}$ are defined as
$$F_{(e,1)}(\alpha^{(e,1)}) = \lambda w_e \max(\alpha^{(e,1)}), \qquad F_{(e,2)}(\alpha^{(e,2)}) = -\lambda w_e \min(\alpha^{(e,2)}).$$
The primal problem thus has the form
$$\min_{f \in \mathbb{R}^n} \Big\{ G(f) + \sum_{e \in E} \big( F_{(e,1)}(K_e f) + F_{(e,2)}(K_e f) \big) \Big\}.$$
In contrast to the function $G$, we need in the PDHG algorithm the proximal maps for the conjugate functions of $F_{(e,j)}$. They are given by
$$F^*_{(e,1)} = \iota_{S_{\lambda w_e}}, \qquad F^*_{(e,2)} = \iota_{-S_{\lambda w_e}},$$
where $S_{\lambda w_e} = \{ x \in \mathbb{R}^{m_e} : \sum_{i=1}^{m_e} x_i = \lambda w_e,\ x_i \geq 0 \}$ is the scaled simplex in $\mathbb{R}^{m_e}$. By (6), the dual problem has the form
$$- \min_{\alpha^{(e,1)}, \alpha^{(e,2)}} \Big\{ G^*\Big( -\sum_{e \in E} K_e^T \big( \alpha^{(e,1)} + \alpha^{(e,2)} \big) \Big) + \sum_{e \in E} \big( \iota_{S_{\lambda w_e}}(\alpha^{(e,1)}) + \iota_{-S_{\lambda w_e}}(\alpha^{(e,2)}) \big) \Big\},$$
where $G^*$ is given as in Table 1. The solutions of the proximal problems for $F^*_{(e,1)}$ and $F^*_{(e,2)}$ are the orthogonal projections onto these simplexes, written here as $P_{S_{\lambda w_e}}$ and $P_{-S_{\lambda w_e}}$, respectively. These projections can be performed in linear time, cf. [26].
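For illustration, here is the standard $O(m_e \log m_e)$ sort-based simplex projection, a simpler stand-in for the linear-time method of [26]; the projection onto $-S_{\lambda w_e}$ follows by negation.

```python
# Sketch: sort-based projection onto S_a = {x : x >= 0, sum(x) = a}; a simple
# O(m log m) stand-in for the linear-time algorithm of [26].
import numpy as np

def project_simplex(v, a):
    u = np.sort(v)[::-1]                      # sort in decreasing order
    css = np.cumsum(u)
    ks = np.arange(1, len(v) + 1)
    rho = np.nonzero(u - (css - a) / ks > 0)[0][-1]   # largest feasible support
    theta = (css[rho] - a) / (rho + 1.0)      # shift so the support sums to a
    return np.maximum(v - theta, 0.0)

def project_neg_simplex(v, a):
    return -project_simplex(-v, a)            # projection onto -S_a by symmetry

x = project_simplex(np.array([0.3, -0.1, 2.0]), 1.0)
print(x, x.sum())                             # non-negative entries, sums to 1
```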
Using the proximal mappings we have presented so far, we obtain Algorithm 3. Below, $c_i = \sum_{e \in E} H_{i,e}$ is the number of hyperedges the vertex $i$ lies in. The bound on the product of the step sizes can be derived as follows:
$$\|K\|^2 = \|K^T K\| = 2\, \Big\| \sum_{e \in E} K_e^T K_e \Big\| = 2 \max_{i=1,\ldots,n} \{ c_i \}.$$

Algorithm 3 PDHG for $\Omega_{H,1}$
  Initialization: $f^{(0)} = \bar{f}^{(0)} = 0$, $\theta \in [0, 1]$, $\sigma, \tau > 0$ with $\sigma\tau < 1 / (2 \max_{i=1,\ldots,n} \{ c_i \})$
  repeat
    $\alpha^{(e,1),k+1} = P_{S_{\lambda w_e}}\big( \alpha^{(e,1),k} + \sigma K_e \bar{f}^{(k)} \big)$, for all $e \in E$
    $\alpha^{(e,2),k+1} = P_{-S_{\lambda w_e}}\big( \alpha^{(e,2),k} + \sigma K_e \bar{f}^{(k)} \big)$, for all $e \in E$
    $f^{(k+1)} = \mathrm{prox}_{\tau G}\big( f^{(k)} - \tau \sum_{e \in E} K_e^T ( \alpha^{(e,1),k+1} + \alpha^{(e,2),k+1} ) \big)$
    $\bar{f}^{(k+1)} = f^{(k+1)} + \theta \big( f^{(k+1)} - f^{(k)} \big)$
  until relative duality gap $< \epsilon$
  Output: $f^{(k+1)}$.

It is important to point out here that the algorithm decouples the problem in the sense that in every iteration we solve subproblems which treat the functionals $G, F_{(e,1)}, F_{(e,2)}$ separately and thus can be solved in an efficient way.
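The following sketch puts Algorithm 3 together for the SSL problem (3) on a made-up hypergraph, with $\theta = 1$ and a fixed iteration count in place of a duality-gap test; `project_simplex` is the sort-based projection from the previous sketch.

```python
# Sketch: Algorithm 3 (PDHG for Omega_{H,1}) applied to the SSL problem (3).
import numpy as np

def project_simplex(v, a):
    u = np.sort(v)[::-1]; css = np.cumsum(u); ks = np.arange(1, len(v) + 1)
    rho = np.nonzero(u - (css - a) / ks > 0)[0][-1]
    return np.maximum(v - (css[rho] - a) / (rho + 1.0), 0.0)

edges = [(0, 1, 2), (2, 3), (3, 4, 5)]          # made-up hypergraph and labels
w = np.ones(3); n, lam = 6, 1.0
Y = np.zeros(n); Y[0], Y[5] = 1.0, -1.0

c = np.zeros(n)                                  # c_i = #hyperedges containing i
for e in edges:
    c[list(e)] += 1
sigma = tau = 0.9 / np.sqrt(2 * c.max())         # sigma*tau < 1/(2 max_i c_i)

f = fbar = np.zeros(n)
a1 = [np.zeros(len(e)) for e in edges]           # dual variables alpha^{(e,1)}
a2 = [np.zeros(len(e)) for e in edges]           # dual variables alpha^{(e,2)}
for k in range(3000):
    div = np.zeros(n)
    for j, e in enumerate(edges):
        e = list(e)
        a1[j] = project_simplex(a1[j] + sigma * fbar[e], lam * w[j])
        a2[j] = -project_simplex(-(a2[j] + sigma * fbar[e]), lam * w[j])
        div[e] += a1[j] + a2[j]                  # accumulates sum_e K_e^T(...)
    f_new = (f - tau * div + 2 * tau * Y) / (1 + 2 * tau)  # prox of ||f - Y||^2
    fbar = 2 * f_new - f                         # over-relaxation with theta = 1
    f = f_new
print(np.round(f, 3))
```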
PDHG algorithm for $\Omega_{H,2}$. We define $G$ and $K_e$ as above. Moreover, we set
$$F_e(\alpha_e) = \lambda w_e \underbrace{\big( \max(\alpha_e) - \min(\alpha_e) \big)^2}_{=:\, h_e(\alpha_e)}. \qquad (8)$$
Hence, the primal problem can be written as
$$\min_{f \in \mathbb{R}^n} \Big\{ G(f) + \sum_{e \in E} F_e(K_e f) \Big\}.$$
In order to formulate the dual problem, we need the conjugate of $F_e$. To this end, we first derive the conjugate function of $h_e$ defined in (8), i.e.,
$$h^*_e(\alpha_e) = \sup_{\phi \in \mathbb{R}^{m_e}} \big\{ \langle \alpha_e, \phi \rangle - (\max(\phi) - \min(\phi))^2 \big\}.$$

Lemma 5.1. Let $\alpha_e \in \mathbb{R}^{m_e}$ and set $t_+ = \sum_{i :\, \alpha_{e,i} > 0} \alpha_{e,i}$ and $t_- = \sum_{i :\, \alpha_{e,i} < 0} \alpha_{e,i}$. It holds that
$$h^*_e(\alpha_e) = \begin{cases} \frac{t_+^2}{4} & \text{if } \langle \alpha_e, \mathbf{1} \rangle = 0, \\ +\infty & \text{otherwise.} \end{cases}$$

Proof. Using the decomposition $\phi = \psi + \gamma \mathbf{1}$, where $\langle \psi, \mathbf{1} \rangle = 0$ and $\gamma \in \mathbb{R}$, we can write
$$\langle \alpha_e, \phi \rangle - (\max(\phi) - \min(\phi))^2 = \gamma \langle \alpha_e, \mathbf{1} \rangle + \langle \alpha_e, \psi \rangle - (\max(\psi) - \min(\psi))^2.$$
Thus, for $\langle \alpha_e, \mathbf{1} \rangle \neq 0$ we have $h^*_e(\alpha_e) = \infty$. Now we consider the case where $\langle \alpha_e, \mathbf{1} \rangle = 0$. We write $I_- = \{ i : \alpha_{e,i} < 0 \}$ and $I_+ = \{ i : \alpha_{e,i} > 0 \}$ and define $t_+ = \sum_{i \in I_+} \alpha_{e,i}$ and $t_- = \sum_{i \in I_-} \alpha_{e,i}$. Note that $\langle \alpha_e, \mathbf{1} \rangle = 0$ implies $t_+ = -t_-$. Let us assume that $a = \min(\phi)$ and $b = \max(\phi)$ are fixed. To maximize $\langle \alpha_e, \phi \rangle - (\max(\phi) - \min(\phi))^2$, it is clearly best to choose $\phi_i = a$ for $i \in I_-$ and $\phi_i = b$ for $i \in I_+$. Consequently,
$$\langle \alpha_e, \phi \rangle - (\max(\phi) - \min(\phi))^2 = t_+ (b - a) - (b - a)^2. \qquad (9)$$
We maximize over the gap $\Delta = b - a$ in the objective $m(\Delta) = t_+ \Delta - \Delta^2$ and obtain the maximizer $\Delta = \frac{t_+}{2}$. Thus, we have $h^*_e(\alpha_e) = \frac{t_+^2}{4}$ if $\langle \alpha_e, \mathbf{1} \rangle = 0$.

With $t_+ = \sum_{i :\, \alpha_{e,i} > 0} \alpha_{e,i}$ and $t_- = \sum_{i :\, \alpha_{e,i} < 0} \alpha_{e,i}$, we thus get
$$F^*_e(\alpha_e) = \lambda w_e\, h^*_e\Big( \frac{\alpha_e}{\lambda w_e} \Big) = \begin{cases} \frac{t_+^2}{4 \lambda w_e} & \text{if } t_+ = -t_-, \\ +\infty & \text{otherwise,} \end{cases} \qquad (10)$$
and we obtain the dual problem
$$- \min_{\alpha_e} \Big\{ G^*\Big( -\sum_{e \in E} K_e^T \alpha_e \Big) + \sum_{e \in E} \frac{(t^e_+)^2}{4 \lambda w_e} + \sum_{e \in E} \iota_{\{0\}}\big( t^e_+ + t^e_- \big) \Big\},$$
where $t^e_+ = \sum_{i :\, \alpha_{e,i} > 0} \alpha_{e,i}$ and $t^e_- = \sum_{i :\, \alpha_{e,i} < 0} \alpha_{e,i}$.

As we have seen in (10), the conjugate functions $F^*_e$ are not indicator functions, and we thus solve the corresponding proximal problems via proximal problems for $F_e$. More specifically, we exploit the fact that
$$\mathrm{prox}_{\sigma F^*_e}(\tilde{\alpha}_e) = \tilde{\alpha}_e - \sigma\, \mathrm{prox}_{\sigma^{-1} F_e}\big( \sigma^{-1} \tilde{\alpha}_e \big), \qquad (11)$$
see [27, Lemma 2.10], and use the following novel result concerning the proximal problem on the right-hand side of (11).
Proposition 5.1. For any $\sigma > 0$ and any $\tilde{\alpha}_e \in \mathbb{R}^{m_e}$, the proximal map
$$\mathrm{prox}_{\sigma^{-1} F_e}(\tilde{\alpha}_e) = \arg\min_{\alpha_e \in \mathbb{R}^{m_e}} \Big\{ \frac{1}{2} \|\alpha_e - \tilde{\alpha}_e\|_2^2 + \frac{1}{\sigma} \lambda w_e \big( \max(\alpha_e) - \min(\alpha_e) \big)^2 \Big\}$$
can be computed with $O(m_e \log m_e)$ arithmetic operations.

We will now derive such an algorithm. To simplify the notation, we consider instead of $\sigma^{-1} F_e$ the function $h : \mathbb{R}^m \to \mathbb{R}$ defined by $h(\alpha) = (\max(\alpha) - \min(\alpha))^2$ and show that $\mathrm{prox}_{\mu h}(\alpha)$, $\mu > 0$, can be computed with $O(m \log m)$ arithmetic operations.

Let us fix $\alpha \in \mathbb{R}^m$. For every pair $r, s \in [\min(\alpha), \max(\alpha)]$ with $r \geq s$, we define $\alpha^{(r,s)}$ by
$$\alpha^{(r,s)}_i = \begin{cases} r & \text{if } \alpha_i \geq r, \\ \alpha_i & \text{if } \alpha_i \in (s, r), \\ s & \text{if } \alpha_i \leq s. \end{cases} \qquad (12)$$
Clearly, if $r = \max(\mathrm{prox}_{\mu h}(\alpha))$ and $s = \min(\mathrm{prox}_{\mu h}(\alpha))$, then $\alpha^{(r,s)} = \mathrm{prox}_{\mu h}(\alpha)$. Hence, the above definition allows us to write the proximal problem in terms of the variables $r, s$, since for
$$(r, s) = \arg\min_{\tilde{r}, \tilde{s}} \Big\{ \underbrace{\tfrac{1}{2} \|\alpha^{(\tilde{r}, \tilde{s})} - \alpha\|_2^2}_{=:\, E_1(\tilde{r}, \tilde{s})} + \underbrace{\mu (\tilde{r} - \tilde{s})^2}_{=:\, E_2(\tilde{r}, \tilde{s})} \Big\} \qquad (13)$$
we have $\mathrm{prox}_{\mu h}(\alpha) = \alpha^{(r,s)}$.

Our goal is now to find a minimizer of (13). To this end, we first order $\alpha$ increasingly, which can be done in $O(m \log m)$ arithmetic operations. W.l.o.g. we assume here that the components of $\alpha$ are pairwise different. Moreover, we introduce the following notation. For $r, s \in [\alpha_1, \alpha_m]$ there exist unique $p, q \in \{1, \ldots, m\}$ characterized by
$$\alpha_{m-p+1} = \min\{ \alpha_i \mid \alpha_i \geq r \} \quad \text{and} \quad \alpha_q = \max\{ \alpha_i \mid \alpha_i \leq s \}.$$
Thus, the directional partial derivatives w.r.t. $r$ and $s$ are given by
$$\frac{\partial E_1}{\partial r^-}(r, s) = \sum_{i=m-p+1}^{m} (\alpha_i - r), \qquad \frac{\partial E_1}{\partial s^+}(r, s) = \sum_{i=1}^{q} (s - \alpha_i). \qquad (14)$$
They tell us how much we increase $E_1$ by decreasing $r$ and increasing $s$, respectively. On the other hand, both of these changes lead to a decrease in the energy $E_2$. More precisely, it holds that
$$\frac{\partial E_2}{\partial r^-}(r, s) = \frac{\partial E_2}{\partial s^+}(r, s) = 2\mu (s - r). \qquad (15)$$
Thus, the main ideas behind our algorithm are as follows. Starting with $r = \max(\alpha)$ and $s = \min(\alpha)$, we decrease $r$ and increase $s$, keeping the two partial derivatives of (14) equal. We stop when the sum of the partial derivatives vanishes. So, the optimal $r, s$ are characterized by the system
$$\sum_{i=m-p+1}^{m} (\alpha_i - r) = \sum_{i=1}^{q} (s - \alpha_i), \qquad (16)$$
$$\sum_{i=m-p+1}^{m} (\alpha_i - r) + 2\mu (s - r) = 0. \qquad (17)$$
We will now generate a sequence of pairs $r^{(k)}, s^{(k)}$ satisfying $r^{(k)} \geq s^{(k)}$ and (16) for each $k$. The corresponding indices needed to calculate the partial derivatives will be denoted by $p^{(k)}, q^{(k)}$. The main procedure is described in the next lemma.

Lemma 5.2. Assume $r^{(k)} \in (\alpha_{m-p^{(k)}}, \alpha_{m-p^{(k)}+1}]$ and $s^{(k)} \in [\alpha_{q^{(k)}}, \alpha_{q^{(k)}+1})$, and that property (16) holds for $(r^{(k)}, s^{(k)})$. Then, we can either choose
$$r^{(k+1)} = r^{(k)} - \frac{q^{(k)}}{p^{(k)}} \big( s^{(k+1)} - s^{(k)} \big) \quad \text{and} \quad s^{(k+1)} = \alpha_{q^{(k)}+1} \qquad (18)$$
or
$$r^{(k+1)} = \alpha_{m-p^{(k)}} \quad \text{and} \quad s^{(k+1)} = s^{(k)} + \frac{p^{(k)}}{q^{(k)}} \big( r^{(k)} - r^{(k+1)} \big) \qquad (19)$$
such that $r^{(k+1)} \in [\alpha_{m-p^{(k)}}, \alpha_{m-p^{(k)}+1})$, $s^{(k+1)} \in (\alpha_{q^{(k)}}, \alpha_{q^{(k)}+1}]$ and (16) holds true for $(r^{(k+1)}, s^{(k+1)})$.

Proof. Property (16) for $(r^{(k+1)}, s^{(k+1)})$ means that
$$\sum_{i=m-p^{(k)}+1}^{m} \big( \alpha_i - r^{(k+1)} \big) = \sum_{i=1}^{q^{(k)}} \big( s^{(k+1)} - \alpha_i \big). \qquad (20)$$
Since by assumption (16) holds for $(r^{(k)}, s^{(k)})$, equation (20) is equivalent to
$$p^{(k)} \big( r^{(k+1)} - r^{(k)} \big) = q^{(k)} \big( s^{(k)} - s^{(k+1)} \big).$$
If we set $(r^{(k+1)}, s^{(k+1)})$ according to (18) but $r^{(k+1)} < \alpha_{m-p^{(k)}}$, then we get
$$r^{(k)} - \frac{q^{(k)}}{p^{(k)}} \big( \alpha_{q^{(k)}+1} - s^{(k)} \big) < \alpha_{m-p^{(k)}} \;\Longrightarrow\; s^{(k)} + \frac{p^{(k)}}{q^{(k)}} \big( r^{(k)} - \alpha_{m-p^{(k)}} \big) < \alpha_{q^{(k)}+1},$$
i.e., we can choose $r^{(k+1)}, s^{(k+1)}$ according to (19), and vice versa.

After each computation of a new pair $(r^{(k+1)}, s^{(k+1)})$, we check whether the left-hand side of (17) is smaller than zero (note that initially the left-hand side of (17) is negative and that it increases in every iteration). If this is not the case, we have found the intervals in which the optimal values $r$ and $s$ lie. Restricted to this domain, the functional $E_1 + E_2$ is differentiable. Hence, we can compute $r, s$ as follows.
Lemma 5.3. Assume that the optimal $r, s$ of (13) fulfill $r \in [\alpha_{m-p}, \alpha_{m-p+1}]$ and $s \in [\alpha_q, \alpha_{q+1}]$. Then it holds that
$$s = \Big( q + 2\mu - \frac{(2\mu)^2}{p + 2\mu} \Big)^{-1} \Big( \frac{2\mu}{p + 2\mu} \sum_{i=m-p+1}^{m} \alpha_i + \sum_{i=1}^{q} \alpha_i \Big), \qquad r = \frac{1}{2\mu} \Big( (q + 2\mu)\, s - \sum_{i=1}^{q} \alpha_i \Big).$$
Proof. When restricted to $[\alpha_i, \alpha_{i+1}] \times [\alpha_j, \alpha_{j+1}]$, the function $(r, s) \mapsto E_1(r, s) + E_2(r, s)$ is a quadratic function in $(r, s)$. We can thus simply set the gradient to zero and solve the corresponding system of linear equations, which yields the above result.

In conclusion, we obtain Algorithm 4 below. Note that after the sorting, the algorithm takes on the order of $m$ steps to compute the proximal map, which proves Proposition 5.1. The corresponding PDHG method is then Algorithm 5, whose proximal subproblems for $F_e$ we solve via Algorithm 4. Note that the bound on the step sizes is now doubled, i.e., less restrictive, since we have defined for each hyperedge one functional $F_e$ and not two as for $p = 1$, i.e.,
$$\|K\|^2 = \|K^T K\| = \Big\| \sum_{e \in E} K_e^T K_e \Big\| = \max_{i=1,\ldots,n} \{ c_i \}.$$

Algorithm 4 - Solution of the proximal problem $\mathrm{prox}_{\mu h}(\alpha)$
  Sort $\alpha \in \mathbb{R}^m$ in increasing order. Initialization: $r^{(0)} = \max(\alpha)$, $s^{(0)} = \min(\alpha)$
  while $\frac{\partial E_1}{\partial r^-}(r^{(k)}, s^{(k)}) < 2\mu \big( r^{(k)} - s^{(k)} \big)$ and $q^{(k)} + 1 \leq m - p^{(k)}$ do
    Find $(r^{(k+1)}, s^{(k+1)})$ according to Lemma 5.2.
  end while
  Compute $r, s$ as described in Lemma 5.3.
  Output: after restoring the original order, set
$$\big( \mathrm{prox}_{\mu h}(\alpha) \big)_i = \begin{cases} r & \text{if } \alpha_i \geq r, \\ \alpha_i & \text{if } \alpha_i \in (s, r), \\ s & \text{if } \alpha_i \leq s, \end{cases} \qquad \text{for } i = 1, \ldots, m.$$
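The following sketch implements Algorithm 4 in an accept-or-advance form suggested by Lemmas 5.2 and 5.3: for the current pair of index intervals it computes the closed-form candidate of Lemma 5.3 and either accepts it or advances to the next breakpoint. The final mean fallback is an added safeguard for very large $\mu$, not part of the pseudocode above.

```python
# Sketch: prox of mu * (max(alpha) - min(alpha))^2 following Lemmas 5.2/5.3.
import numpy as np

def prox_mu_h(alpha, mu):
    a = np.sort(alpha)                         # O(m log m); work on a sorted copy
    m, tm = len(a), 2.0 * mu
    csum = np.concatenate(([0.0], np.cumsum(a)))
    p, q, r, s = 1, 1, a[-1], a[0]             # p top entries -> r, q bottom -> s
    while True:
        top, bot = csum[m] - csum[m - p], csum[q]
        # closed-form candidate on the current segment (Lemma 5.3)
        s_star = ((tm / (p + tm)) * top + bot) / (q + tm - tm ** 2 / (p + tm))
        r_star = ((q + tm) * s_star - bot) / tm
        in_r = (p == m) or (r_star >= a[m - p - 1])
        in_s = (q == m) or (s_star <= a[q])
        if (in_r and in_s) or q + 1 > m - p:   # optimum found (or pointers met)
            r, s = r_star, s_star
            break
        # otherwise advance to the nearest breakpoint, keeping (16) satisfied
        r18 = r - (q / p) * (a[q] - s)         # move (18): s jumps to alpha_{q+1}
        if r18 >= a[m - p - 1]:
            r, s, q = r18, a[q], q + 1
        else:                                   # move (19): r jumps to alpha_{m-p}
            r, s, p = a[m - p - 1], s + (p / q) * (r - a[m - p - 1]), p + 1
    if r < s:                                   # safeguard: range fully contracted
        r = s = a.mean()
    return np.clip(alpha, s, r)                 # clamp in the original order

x = prox_mu_h(np.array([0.0, 0.5, 1.0]), 0.1)
print(x)                                        # [1/7, 0.5, 6/7] for this input
```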
Algorithm 5 PDHG for $\Omega_{H,2}$
  Initialization: $f^{(0)} = \bar{f}^{(0)} = 0$, $\theta \in [0, 1]$, $\sigma, \tau > 0$ with $\sigma\tau < 1 / \max_{i=1,\ldots,n} \{ c_i \}$
  repeat
    $\alpha_e^{(k+1)} = \alpha_e^{(k)} + \sigma K_e \bar{f}^{(k)} - \sigma\, \mathrm{prox}_{\sigma^{-1} F_e}\big( \sigma^{-1} ( \alpha_e^{(k)} + \sigma K_e \bar{f}^{(k)} ) \big)$, for all $e \in E$
    $f^{(k+1)} = \mathrm{prox}_{\tau G}\big( f^{(k)} - \tau \sum_{e \in E} K_e^T \alpha_e^{(k+1)} \big)$
    $\bar{f}^{(k+1)} = f^{(k+1)} + \theta \big( f^{(k+1)} - f^{(k)} \big)$
  until relative duality gap $< \epsilon$
  Output: $f^{(k+1)}$.

6 Experiments

The method of Zhou et al. [11] seems to be the standard algorithm for clustering and SSL on hypergraphs. We compare to it on a selection of UCI datasets summarized in Table 2. Zoo, Mushrooms and 20Newsgroups have been used also in [11] and contain only categorical features; the 20Newsgroups version we use is a modified version by Sam Roweis of the original 20 newsgroups dataset. As in [11], a hyperedge of weight one is created by all data points which have the same value of a categorical feature. For covertype we quantize the numerical features into 10 bins of equal size. Two datasets are created, each with two classes (4,5 and 6,7) of the original dataset.
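A minimal version of this hyperedge construction, one hyperedge per (feature, value) pair, could look as follows; the tiny data matrix is made up.

```python
# Sketch: build hyperedges from categorical data, one hyperedge of weight one
# per (feature, value) pair, as in the experimental setup.
import numpy as np

X = np.array([["a", "x"],                  # made-up categorical data:
              ["a", "y"],                  # 4 points, 2 features
              ["b", "y"],
              ["b", "y"]])

edges, weights = [], []
for j in range(X.shape[1]):                # one group of hyperedges per feature
    for v in np.unique(X[:, j]):
        e = tuple(np.nonzero(X[:, j] == v)[0])
        if len(e) > 1:                     # a singleton hyperedge is never cut
            edges.append(e); weights.append(1.0)
print(edges)   # [(0, 1), (2, 3), (1, 2, 3)]; the singleton (0,) is dropped
```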
Semi-supervised learning (SSL). In [11], they suggest using a regularizer induced by the normalized Laplacian $L_{CE}$ arising from the clique expansion,
$$L_{CE} = I - D_{CE}^{-1/2} H W' H^T D_{CE}^{-1/2},$$
where $D_{CE}$ is a diagonal matrix with entries $d_{CE}(i) = \sum_{e \in E} H_{i,e} w_e$ and $W' \in \mathbb{R}^{|E| \times |E|}$ is a diagonal matrix with entries $w'(e) = w_e / |e|$. The SSL problem can then be formulated as, for $\lambda > 0$,
$$\arg\min_{f \in \mathbb{R}^{|V|}} \big\{ \|f - Y\|_2^2 + \lambda \langle f, L_{CE} f \rangle \big\}.$$
Prop. \ Dataset           Zoo     Mushrooms   Covertype (4,5)   Covertype (6,7)   20Newsgroups
Number of classes         7       2           2                 2                 4
|V|                       101     8124        12240             37877             16242
|E|                       42      112         104               123               100
|E| of Clique Expansion   10201   65999376    143008092         1348219153        53284642

Table 2: Datasets used for SSL and clustering. Note that the clique expansion leads for all datasets to a graph which is close to being fully connected, as all datasets contain large hyperedges. For covertype (6,7) the weight matrix needs over 10GB of memory, while the original hypergraph needs only 4MB.

The advantage of this formulation is that the solution can be found via a linear system. However, as Table 2 indicates, the obvious downside is that $L_{CE}$ is a potentially very dense matrix, and thus one needs in the worst case $|V|^2$ memory and $O(|V|^2)$ computations. This is in contrast to our method, which needs on the order of $\sum_{e \in E} |e| + |V|$ memory. For the largest example (covertype 6,7), where the clique expansion fails due to memory problems, our method takes 30-100s (depending on $\lambda$). We stop our method in all experiments when a small relative duality gap is reached. In the experiments we do 10 trials for different numbers of labeled points. The regularization parameter $\lambda$ is chosen for both methods from a grid of values $10^{-k}$ via 5-fold cross-validation. The resulting errors and standard deviations can be found in Table 3.

Our SSL methods based on $\Omega_{H,p}$, $p = 1, 2$, consistently outperform the clique expansion technique of Zhou et al. [11] on all datasets except 20newsgroups. However, 20newsgroups is a very difficult dataset, as only 10,267 out of the 16,242 data points are distinct, which bounds the minimum possible error away from zero. A method based on pairwise interaction such as the clique expansion can better deal with such label noise, as the large hyperedges for this dataset accumulate the label noise. On all other datasets we observe that incorporating hypergraph structure leads to much better results. As expected, our squared TV functional ($p = 2$) slightly outperforms the total variation ($p = 1$), even though the difference is small. Thus, as $\Omega_{H,2}$ reduces to the standard regularization based on the graph Laplacian, which is known to work well, we recommend $\Omega_{H,2}$ for SSL on hypergraphs.
Table 3: Test error and standard deviation of the SSL methods over 10 runs for varying numbers of labeled points (rows: Zhou et al. [11], $\Omega_{H,1}$, $\Omega_{H,2}$; datasets: Zoo, Mushrooms, covertype (4,5), covertype (6,7), 20newsgroups; columns: number of labeled points).
Clustering. We use the normalized hypergraph cut as the clustering objective. For more than two clusters, we recursively partition the hypergraph until the desired number of clusters is reached. For comparison we use the normalized spectral clustering approach based on the Laplacian $L_{CE}$ (clique expansion) [11]. The first part (first 6 columns) of Table 4 shows the clustering errors (majority vote on each cluster) of both methods, as well as the normalized cuts achieved by these methods on the hypergraph and on the graph resulting from the clique expansion. Moreover, we show results (last 4 columns) which are obtained based on a $k$NN graph (unit weights) built from the Hamming distance (note that we have categorical features), in order to check whether the hypergraph modeling of the problem is actually useful compared to a standard similarity-based graph construction. The number $k$ is chosen as the smallest number for which the graph becomes connected, and we compare results of normalized 1-spectral clustering [14] and standard spectral clustering [22]. Note that the employed hypergraph construction has no free parameter.

                  Clustering Error %   Hypergraph NCut    Graph (CE) NCut    Clustering Error % (kNN)   NCut (kNN)
Dataset           Ours     [11]        Ours      [11]     Ours     [11]     [14]      [22]             [14]      [22]
Mushrooms         10.98    32.25       0.0011    0.0013   0.6991   0.7053   48.2      48.2             1e-4      1e-4
Zoo               16.83    15.84       0.6739    0.6784   5.1315   5.1703   5.94      5.94             1.636     1.636
20-newsgroup      47.77    33.20       0.0176    0.0303   2.3846   1.8492   66.38     66.38            0.1031    0.1034
covertype (4,5)   22.44    22.44       0.0018    0.0022   0.7400   0.6691   22.44     22.44            0.0152    0.02182
covertype (6,7)   8.16     -           8.18e-4   -        0.6882   -        45.85     45.85            0.0041    0.0041

Table 4: Clustering errors and normalized cuts on the hypergraph, on the clique expansion, and on a $k$NN graph; the dashes for covertype (6,7) indicate that the clique expansion runs out of memory. Communication with the authors of [11] could not clarify the difference to their results on 20newsgroups.
Acknowledgments
M.H. would like to acknowledge support by the ERC Starting Grant NOLEPRO, and L.J. acknowledges support by the DFG SPP-1324.
References

[1] Y. Huang, Q. Liu, and D. Metaxas. Video object segmentation by hypergraph cut. In CVPR, pages 1738–1745, 2009.
[2] P. Ochs and T. Brox. Higher order motion models and spectral clustering. In CVPR, pages 614–621, 2012.
[3] S. Klamt, U.-U. Haus, and F. Theis. Hypergraphs and cellular networks. PLoS Computational Biology, 5:e1000385, 2009.
[4] Z. Tian, T. Hwang, and R. Kuang. A hypergraph-based learning algorithm for classifying gene expression and arrayCGH data with prior knowledge. Bioinformatics, 25:2831–2838, 2009.
[5] D. Gibson, J. Kleinberg, and P. Raghavan. Clustering categorical data: an approach based on dynamical systems. VLDB Journal, 8:222–236, 2000.
[6] J. Bu, S. Tan, C. Chen, C. Wang, H. Wu, L. Zhang, and X. He. Music recommendation by unified hypergraph: Combining social media information and music content. In Proc. of the Int. Conf. on Multimedia (MM), pages 391–400, 2010.
[7] A. Shashua, R. Zass, and T. Hazan. Multi-way clustering using super-symmetric non-negative tensor factorization. In ECCV, pages 595–608, 2006.
[8] S. Rota Bulò and M. Pelillo. A game-theoretic approach to hypergraph clustering. In NIPS, pages 1571–1579, 2009.
[9] M. Leordeanu and C. Sminchisescu. Efficient hypergraph clustering. In AISTATS, pages 676–684, 2012.
[10] S. Agarwal, J. Lim, L. Zelnik-Manor, P. Perona, D. J. Kriegman, and S. Belongie. Beyond pairwise clustering. In CVPR, pages 838–845, 2005.
[11] D. Zhou, J. Huang, and B. Schölkopf. Learning with hypergraphs: Clustering, classification, and embedding. In NIPS, pages 1601–1608, 2006.
[12] S. Agarwal, K. Branson, and S. Belongie. Higher order learning with graphs. In ICML, pages 17–24, 2006.
[13] E. Ihler, D. Wagner, and F. Wagner. Modeling hypergraphs by graphs with the same mincut properties. Information Processing Letters, 45:171–175, 1993.
[14] M. Hein and T. Bühler. An inverse power method for nonlinear eigenproblems with applications in 1-spectral clustering and sparse PCA. In NIPS, pages 847–855, 2010.
[15] A. Szlam and X. Bresson. Total variation and Cheeger cuts. In ICML, pages 1039–1046, 2010.
[16] M. Hein and S. Setzer. Beyond spectral clustering - tight relaxations of balanced graph cuts. In NIPS, pages 2366–2374, 2011.
[17] T. Bühler, S. Rangapuram, S. Setzer, and M. Hein. Constrained fractional set programs and their application in local clustering and community detection. In ICML, pages 624–632, 2013.
[18] F. Bach. Learning with submodular functions: A convex optimization perspective. CoRR, abs/1111.6453, 2011.
[19] M. Belkin and P. Niyogi. Semi-supervised learning on manifolds. Machine Learning, 56:209–239, 2004.
[20] D. Zhou, O. Bousquet, T. N. Lal, J. Weston, and B. Schölkopf. Learning with local and global consistency. In NIPS, volume 16, pages 321–328, 2004.
[21] J. Shi and J. Malik. Normalized cuts and image segmentation. IEEE Trans. Patt. Anal. Mach. Intell., 22(8):888–905, 2000.
[22] U. von Luxburg. A tutorial on spectral clustering. Statistics and Computing, 17:395–416, 2007.
[23] E. Esser, X. Zhang, and T. F. Chan. A general framework for a class of first order primal-dual algorithms for convex optimization in imaging science. SIAM Journal on Imaging Sciences, 3(4):1015–1046, 2010.
[24] A. Chambolle and T. Pock. A first-order primal-dual algorithm for convex problems with applications to imaging. J. of Math. Imaging and Vision, 40:120–145, 2011.
[25] L. Condat. A primal-dual splitting method for convex optimization involving Lipschitzian, proximable and linear composite terms. J. Optimization Theory and Applications, 158(2):460–479, 2013.
[26] K. Kiwiel. On linear-time algorithms for the continuous quadratic knapsack problem. J. Opt. Theory Appl., 134(3):549–554, 2007.
[27] P. L. Combettes and V. R. Wajs. Signal recovery by proximal forward-backward splitting. Multiscale Modeling and Simulation, 4(4):1168–1200, 2005.