Hierarchical Clusterings of Unweighted Graphs
Svein Høgemo, Christophe Paul and Jan Arne Telle

August 10, 2020
Abstract
We study the complexity of finding an optimal hierarchical clustering of an unweighted similarity graph under the recently introduced Dasgupta objective function. We introduce a proof technique, called the normalization procedure, that takes any such clustering of a graph G and iteratively improves it until a desired target clustering of G is reached. We use this technique to show both a negative and a positive complexity result. Firstly, we show that in general the problem is NP-complete. Secondly, we consider min-well-behaved graphs, which are graphs H having the property that for any k the graph H^(k), being the join of k copies of H, has an optimal hierarchical clustering that splits each copy of H in the same optimal way. To optimally cluster such a graph H^(k) we thus only need to optimally cluster the smaller graph H. Co-bipartite graphs are min-well-behaved, but otherwise min-well-behaved graphs seem to be scarce. We use the normalization procedure to show that also the cycle on 6 vertices is min-well-behaved.

1 Introduction

Clustering is an unsupervised machine learning technique and one of the most important problems in data-mining [3, 9–11]. Given a data set and a pairwise similarity measure, the task is to partition the data set into clusters so that similar data points belong to the same cluster. In a hierarchical clustering the data set is recursively partitioned into smaller clusters, by means of a rooted binary tree whose leaves are in one-to-one correspondence with the data points. Hierarchical clustering emerged as a central task in the study of phylogenetic trees [2, 12]. Such a clustering is very general, capturing clustering structure at all levels of granularity, with a clustering into two parts given by the root of the tree, and finer clusterings given by lower levels of the tree. Algorithms for hierarchical clustering have been widely used for many years, but it was only recently that an objective function to measure their quality was formalized. In a STOC 2016 paper [7] Dasgupta introduced a natural objective function measuring the global cost of a hierarchical clustering. From now on, this function will be called the Dasgupta Clustering function (DC function). Several follow-ups to Dasgupta's work have appeared; we mention only a couple: in [4], the authors improve the ratio of the approximation algorithm proposed by Dasgupta; in [5], the authors revisit the DC function and propose some axioms that a "good" cost function should satisfy.

In this paper we investigate the complexity of finding the DC-optimal hierarchical clustering for unweighted similarity graphs. Thus, we assume that any pair of data points has been marked as either 'similar' or 'non-similar' and represent this information as an undirected, unweighted graph G whose vertex set V(G) is the set of data points and adjacencies represent similarity. We ask for an HC-tree (a Hierarchical Clustering tree), a rooted binary tree T with leaves in one-to-one correspondence with V(G), such that the DC-cost of T, i.e. the sum over all edges uv of G of the number of leaves of the subtree rooted at the least common ancestor of u and v, is minimized. Dasgupta [7] showed that the edge-weighted version of this problem, with weights representing degree of similarity, is NP-complete. In this paper we focus on unweighted graphs, the hardness of which was left open by Dasgupta [6].
Unweighted graphs naturally appear in this context, for example in the correlation clustering problem [1]. It is also a common approach to transform a similarity matrix into a similarity graph by fixing a threshold value that determines whether two objects are similar or not (see [9] for example). We focus on dense similarity graphs. Such graphs typically appear when there is a fixed threshold for similarity that is set to be very low, for example the existence of email correspondence within a single (small) organization, or the existence of non-zero trade relations between countries. We show that the problem remains NP-complete, already for dense graphs. More precisely, by a reduction building on the one used in [7], we establish the NP-hardness for unweighted n-vertex graphs where every vertex has at least n − 6 neighbors.

If G is the complement of a bipartite graph on color classes A, B then any HC-tree T that splits A and B at the root is optimal, which follows easily from observations in [7] since G[A] and G[B] are complete graphs. Dasgupta showed that minimizing the DC-cost of G is equivalent to maximizing the DC-cost of the complement of G. Thus the previous result can be restated to say that for a bipartite graph any HC-tree splitting the two color classes at the root will have max DC-cost, rendering the result trivial as all edges are now split at the root. In the current paper we will usually take this viewpoint, thus considering unweighted sparse graphs and looking for an HC-tree maximizing the DC-cost, typically splitting pairs of adjacent vertices, now denoting non-similarity, at higher levels of the tree.

As noted, bipartite graphs are then trivial, but what other graphs can be handled efficiently? What about G being a collection of disjoint copies of the same bipartite graph? Maximizing DC-cost is still trivial; in fact G is again bipartite, so at the root we can simply split each copy in the same optimal way. Let us define a more complex property generalizing this behavior. Consider a graph H of max DC-cost W achievable by some HC-tree T, and let the graph H^(k) consist of k disjoint copies of H. If we use T to simultaneously cluster each of the k copies of H then each leaf of T will contain k copies of the same vertex. These vertices induce a stable set, so we can further cluster them in an arbitrary way to get an HC-tree T^(k). Note that this tree will have DC-cost k²W, since each edge of H has k copies in H^(k), and the subtree of T^(k) that splits an edge contains a multiplicative factor k more vertices than the similar subtree of T. We call such H max-well-behaved if for any k the max DC-cost of H^(k) is no higher than k²W, and the complement of H min-well-behaved.

We have argued that any bipartite graph is max-well-behaved, but this is not the case for all H. For a simple example, in Figure 1 we see that complete split graphs are not max-well-behaved. In this paper, as a spin-off of our NP-completeness proof, we initiate the study of well-behaved graphs.

Figure 1: The complete split graph Q_{2,3} is not max-well-behaved. We have DC-cost(Q_{2,3}, T) = 6 · 5 + 1 · 2 = 32, but the HC-tree T′ of Q_{2,3}^(k) with k = 2 (vertices s_1, c_1, ... in one copy and s′_1, c′_1, ... in the other copy) satisfies DC-cost(Q_{2,3}^(2), T′) = 130, which is larger than DC-cost(Q_{2,3}, T) · k² = 128, i.e. the DC-cost of the factorized HC-tree clustering both copies according to T simultaneously.
We introduce a normalization procedure that makes incremental changes to a given HC-tree of some H^(k), while observing monotonicity in the DC-cost, to arrive at a new HC-tree showing that H is well-behaved. We employ this to show that the prism graph (the complement of a 6-cycle) is max-well-behaved, and thus C_6 min-well-behaved, establishing the aforementioned NP-completeness along the way.

2 Preliminaries

We use standard graph-theoretic notation [8]. A hierarchical clustering of a similarity graph G = (V, E) is a full rooted binary tree T, together with a bijection δ from V to L(T), the set of leaves of T. We call such a pair (T, δ) an HC-tree of G. For a node t of T we denote by T[t] the subtree of T rooted at t. The Dasgupta cost function [7] is this (lca means least common ancestor):

    DC-cost(G, (T, δ)) = Σ_{uv ∈ E} w(uv) · |L(T[x])|,  where x is the lca of δ(u) and δ(v),

and an HC-tree of minimum DC-cost (under Dasgupta's objective function) is thus an HC-tree (T*, δ*) that minimizes DC-cost.

Dasgupta shows that any HC-tree with minimum weight for a graph G is also an HC-tree with maximum weight for its complement Ḡ. We consider only unweighted graphs, equivalently w(uv) = 1 for all uv ∈ E and 0 otherwise. For any node t ∈ T, we define G_{(T,δ)}[t] as the subgraph of G induced by δ^{-1}(L(T[t])), the vertices of G mapped to leaves in T[t]. Similarly, for any two nodes t_1, t_2 ∈ T with L[t_1] ∩ L[t_2] = ∅, we define G_{(T,δ)}[t_1, t_2] as the bipartite subgraph of G consisting of all edges with one endpoint in δ^{-1}(L(T[t_1])) and the other endpoint in δ^{-1}(L(T[t_2])). If (T, δ) is inferred from context, we further shorten these to G[t] and G[t_1, t_2]. We can now simplify the Dasgupta cost function on unweighted graphs as follows:

    DC-cost(G, (T, δ)) = Σ_{t ∈ V(T) \ L(T)} |V(G[t])| · |E(G[c_l, c_r])|,  where c_l, c_r are the children of t
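To make the simplified cost function concrete, here is a small Python sketch (our own illustration, not part of the paper) that evaluates it on an HC-tree written as nested pairs, where a leaf is a vertex id and an internal node is the pair (left, right):

def dc_cost(edges, tree):
    # DC-cost of an HC-tree of an unweighted graph, via the simplified formula:
    # for every internal node t, add |V(G[t])| times the number of edges cut at t.
    def leaves(t):
        return {t} if not isinstance(t, tuple) else leaves(t[0]) | leaves(t[1])
    def cost(t):
        if not isinstance(t, tuple):
            return 0
        L, R = leaves(t[0]), leaves(t[1])
        cut = sum(1 for u, v in edges if (u in L and v in R) or (u in R and v in L))
        return (len(L) + len(R)) * cut + cost(t[0]) + cost(t[1])
    return cost(tree)

# The prism of Figure 2: triangles 012 and 345 plus the matching 03, 14, 25.
prism = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (0, 3), (1, 4), (2, 5)]
print(dc_cost(prism, (((0, 5), 1), ((2, 4), 3))))   # 48, the optimum shown in Figure 2

The nested-pair representation is an assumption of this sketch, not the paper's notation; on the prism it reproduces the optimal cost 48 claimed later.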
We start with a simple but useful fact.

Property 1.
Let G, G′ be two edge-disjoint graphs over the same vertex set V(G), and (T, δ) an HC-tree over this vertex set. The DC-cost of the decomposition on their union G_U = (V(G), E(G) ∪ E(G′)) is the sum of the costs on each graph:

    DC-cost(G_U, (T, δ)) = DC-cost(G, (T, δ)) + DC-cost(G′, (T, δ))

Proof.
The cost of (T, δ) on G_U is simply the sum, over every edge e ∈ E(G_U), of the size (i.e. number of vertices) of the subgraph in which e is cut. This is the same as adding together the sums over every edge in G and every edge in G′.

Corollary 1 ([7], Section 4.1). An HC-tree of G with minimum DC-cost is also an HC-tree of Ḡ with maximum DC-cost.

Proof. Ḡ is by definition edge-disjoint from G, therefore DC-cost(G_U, (T, δ)) = DC-cost(G, (T, δ)) + DC-cost(Ḡ, (T, δ)) by Property 1. But the union of G and Ḡ is isomorphic to K_n where n = |V(G)|, and we know that every HC-tree of K_n has the same cost, namely (n³ − n)/3 ([7], Theorem 3). Therefore, for any HC-tree (T, δ), DC-cost(Ḡ, (T, δ)) = (n³ − n)/3 − DC-cost(G, (T, δ)). We conclude that an HC-tree of G with minimum cost is an HC-tree of Ḡ with maximum cost, and vice versa.

Minimizing the DC-cost of a graph is accomplished by the exact same HC-trees that maximize the DC-cost of the complement graph. However, for specific graph classes, like bipartite graphs, it can be easy to find an HC-tree maximizing the DC-cost but hard to minimize the DC-cost, or vice versa. Let us consider a very simple operation to construct sparse graphs. Take G^(k), consisting of k disjoint copies of some graph G. If we are given an HC-tree T for G of minimum DC-cost then any HC-tree for G^(k) hierarchically clustering each copy of G as done in T will have minimum DC-cost. However, maximizing the DC-cost for G^(k) seems harder. Given an HC-tree T of maximum DC-cost for G, we call any HC-tree for G^(k) that hierarchically clusters each copy of G as in T a factorized HC-tree. Let us define this formally:

Definition 1 (Factorized HC-tree). Let G be a graph, (T, δ) an HC-tree of G of maximum DC-cost W, and k a natural number. A factorized HC-tree (T, δ)^(k) of the graph G^(k) is made as follows: Make a copy of (T, δ) and for every node t, let

    G^(k)_{(T,δ)^(k)}[t] = ⋃_{i=1}^{k} G^i_{(T,δ)}[t],

where G^i_{(T,δ)}[t] denotes the copy of G_{(T,δ)}[t] inside the i-th copy of G. This is not a complete HC-tree, since for t ∈ L(T), G^(k)[t] is not a single vertex, but k vertices. But these k vertices are pairwise non-adjacent, therefore any extension of this partial HC-tree will have the same DC-cost k²W and be regarded as a factorized HC-tree.

As previously mentioned, if G is bipartite then for any k the factorized HC-tree for G^(k) will have max DC-cost. We give this property a name.

Definition 2 (Well-behaved graph). Let G be an unweighted graph, and W the maximum DC-cost over HC-trees of G. We call G max-well-behaved, or just well-behaved, if, for any natural number k, the maximum Dasgupta cost over HC-trees of the graph G^(k) is equal to k²W. The complementary graph Ḡ is called min-well-behaved.

So any bipartite graph G is well-behaved and thus computing the max DC-cost of any G^(k) can be reduced to computing the max DC-cost of G, or equivalently, computing the min DC-cost of Ḡ^(k) (the join of k copies of Ḡ) reduces to computing the min DC-cost of Ḡ. We may naturally ask: Is every graph well-behaved? On the contrary, counterexamples abound, even for very small graphs; see Figure 1 for an example.

How to show that some interesting non-bipartite graph G is well-behaved? We need to show that for any value of k no HC-tree of G^(k) has higher DC-cost than the factorized HC-tree. We will show this by what we call a normalization procedure on HC-trees: starting with an arbitrary HC-tree we incrementally, step by step, modify it into the factorized HC-tree and show that at no step does the cost decrease.
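For very small graphs, Definitions 1 and 2 can also be explored mechanically. The following sketch (ours, not the authors') computes the exact maximum DC-cost by dynamic programming over vertex subsets, and a helper builds G^(k), so one can compare the optimum of G^(k) against k²W; the commented example should reproduce the phenomenon of Figure 1.

def max_dc_cost(n, edges):
    # Exact maximum DC-cost over all HC-trees, via
    # opt(S) = max over bipartitions S = A u B of |S|*e(A,B) + opt(A) + opt(B).
    from functools import lru_cache
    @lru_cache(maxsize=None)
    def opt(S):
        verts = [v for v in range(n) if S >> v & 1]
        if len(verts) <= 1:
            return 0
        anchor = 1 << verts[0]          # fix one vertex on the A side
        rest = S ^ anchor
        best, A = 0, rest
        while True:                     # enumerate all submasks A of rest
            Aset = A | anchor
            B = S ^ Aset
            if B:                       # proper bipartition of S
                cut = sum(1 for u, v in edges
                          if (S >> u & 1) and (S >> v & 1)
                          and (Aset >> u & 1) != (Aset >> v & 1))
                best = max(best, len(verts) * cut + opt(Aset) + opt(B))
            if A == 0:
                break
            A = (A - 1) & rest
        return best
    return opt((1 << n) - 1)

def copies(n, edges, k):
    # The graph G^(k): k disjoint copies of an n-vertex graph G.
    return n * k, [(u + i * n, v + i * n) for i in range(k) for u, v in edges]

# Q_{2,3}: clique {0,1} joined to the stable set {2,3,4}. Figure 1 exhibits a
# tree of Q_{2,3}^(2) of cost 130 > 2^2 * 32 = 128, so the second value printed
# should be at least 130 while the first should be 32:
# q23 = [(0,1),(0,2),(0,3),(0,4),(1,2),(1,3),(1,4)]
# print(max_dc_cost(5, q23), max_dc_cost(*copies(5, q23, 2)))

The subset DP runs in O(3^n) split enumerations, so it is only feasible for graphs of roughly ten vertices, which is enough for such sanity checks.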
We formalize the notion of a modification that never decreases the cost:

Definition 3 (Safe operation). An operation that takes an HC-tree of a graph G as input and outputs another HC-tree of the same graph is called safe (for maximization) if the DC-cost of the input is no larger than the DC-cost of the output.

Property 2 (Normalization Procedure). Let G have max HC-tree (T, δ). If there is a procedure that for any k takes as input any HC-tree of G^(k), iteratively applies safe operations, and outputs a factorized HC-tree (T, δ)^(k) of G^(k), then G is well-behaved.

3 The prism

The prism P is the graph on six vertices shown in Figure 2. It is non-bipartite, and its complement is a cycle. P exhibits a high degree of symmetry (it is vertex-transitive), and thus has a limited number of non-isomorphic decompositions. The optimal HC-tree we will base our normalization procedure around is also shown in Figure 2, and has the maximum cost of 48 (note that P has also another optimal HC-tree). To be convinced that this is indeed optimal, note that in a minimum optimal HC-tree (T, δ) of its complement, every subgraph induced by a node in T must be connected if the whole graph is connected. We will show in Section 5 a normalization procedure for the prism as described in Property 2, to establish the following:

Lemma 1.
The prism is max-well-behaved, and thus C_6 is min-well-behaved.

This result is non-trivial, and should be seen in light of, e.g., the five-vertex graph in Figure 1, which is not max-well-behaved even though its complement is just a 3-cycle plus two isolated vertices.
4 NP-hardness

Dasgupta shows that for edge-weighted graphs, finding an HC-tree of maximum DC-cost is NP-hard, by reduction from an NP-complete problem he called NAESAT*:
Definition 4 (NAESAT*). We are given a boolean CNF formula where every clause contains either two or three literals (called '2-clauses' and '3-clauses', respectively), and every variable appears in exactly one 3-clause, and in exactly two 2-clauses with one appearance positive and the other negative. Moreover, no 2-clause nor its copy with polarities reversed is part of any 3-clause. Is there a not-all-equal-satisfying assignment, i.e. one where every clause contains at least one true and one false literal?

Dasgupta first gave a simple reduction from NAE3SAT, where every clause has exactly 3 literals but there is no restriction on how many times each variable appears in the formula, to NAESAT*. In that reduction it follows trivially that no 2-clause nor its copy with polarities reversed will be contained in a 3-clause, so we have included that property in our definition of NAESAT*. We will assume, as Dasgupta [6] does, that if there is a 2-clause C whose literals also appear in a 2-clause C′, but with reversed polarity, then C′ is removed.

Dasgupta's reduction to hierarchical clustering takes as input a NAESAT* formula ϕ on n variables with m = n/3 3-clauses and m′ ≤ n 2-clauses, and builds a graph G with two vertices for each variable x appearing in the formula ϕ: one corresponding to x and one to x̄. For every 2-clause (x̃ ∨ ỹ), where a variable with a tilde above, x̃, is shorthand for 'x or x̄', he adds an edge between x̃ and ỹ, and also an edge between the complementary literals (these 2m′ edges are called the 2-clause edges). For every 3-clause (x̃ ∨ ỹ ∨ z̃), he adds a triangle between x̃, ỹ and z̃, and also a triangle between the complementary literals (these 6m edges are called the 3-clause edges). In addition, he adds one edge between x and x̄ for every variable (these n edges are called the matching edges). He shows that ϕ is in NAESAT* if and only if G has weighted DC-cost at least M (for some fixed M that we do not specify here). Let us see how this comes about. Given a not-all-equal assignment of truth values to the n variables of ϕ, he constructs an HC-tree of G by first splitting V(G) evenly at the root into True literals and False literals, and then splitting all remaining edges at the next level.

This HC-tree cuts all n matching edges at the top since x and x̄ have opposite truth values. Since the assignment is not-all-equal satisfying, all 2m′ 2-clause edges and 4m of the 6m 3-clause edges are also cut at the top, so in total 4m + 2m′ + n edges are cut at the top. The remaining 2m 3-clause edges are cut at the next level. In the weighted reduction, the matching edges are given a large weight, to ensure that any HC-tree of weighted DC-cost M will be a tree that cuts all matching edges at the top. Note that an HC-tree cutting all matching edges at the top will naturally define a truth assignment to the variables of the formula.
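For concreteness, here is a hedged sketch of the construction (the integer encoding of literals is our own choice, not Dasgupta's notation): variable v gets the vertices 2v for x_v and 2v+1 for its negation, and clauses are given as lists of (variable, is_positive) pairs.

def dasgupta_graph(n_vars, three_clauses, two_clauses):
    # Builds the unweighted reduction graph from a NAESAT* formula.
    lit = lambda v, pos: 2 * v + (0 if pos else 1)
    edges = set()
    add = lambda a, b: edges.add((min(a, b), max(a, b)))
    for clause in three_clauses:              # 3-clause edges: two triangles,
        ls = [lit(v, p) for v, p in clause]   # one on the literals and one on
        for i in range(3):                    # their complements
            for j in range(i + 1, 3):
                add(ls[i], ls[j])
                add(ls[i] ^ 1, ls[j] ^ 1)
    for (v1, p1), (v2, p2) in two_clauses:    # 2-clause edges, also in duplicate
        a, b = lit(v1, p1), lit(v2, p2)
        add(a, b)
        add(a ^ 1, b ^ 1)
    for v in range(n_vars):                   # matching edges between x and its negation
        add(2 * v, 2 * v + 1)
    return edges

Flipping the last bit of a vertex id (l ^ 1) moves between a literal and its negation, which is what makes the complemented copies of the clause gadgets one-liners.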
We will show the same result even when all edges have unit weight; this will imply the following:

Theorem 1. Hierarchical clustering of unweighted graphs is NP-hard.

Proof.
Let the graph G constructed by the Dasgupta reduction when given ϕ be unweighted. What is then the cost of the HC-tree described above on G, given some not-all-equal assignment of the underlying Boolean formula ϕ? As described above, in G there are 4m + 2m′ + n edges that are cut at the top and each receive a cost of 2n, and 2m edges that are split at the next level and each receive a cost of n. The total cost is thus W* = 10nm + 4nm′ + 2n². We have already argued that if ϕ is not-all-equal-satisfiable then the DC-cost of G is at least W*, but now we need to argue the converse. If we restrict to HC-trees that split V(G) into two equally big parts, then we see that W* is the maximum possible and it can only be reached if the resulting assignment is not-all-equal satisfying. This is because it will have to cut all matching edges at the top, and furthermore there is no way to cut more than two edges of a triangle in a single split.

It remains to show that an HC-tree not splitting V(G) evenly at the top will have DC-cost less than W*. To this purpose, we partition the edges of G into two subgraphs G′ and G′′, with G′ being the graph containing only the 2m′ 2-clause edges and G′′ containing the 3-clause edges and matching edges. We observe that the 3-clause edges comprise 2m disjoint triangles, and that the matching edges bind together pairs of triangles, as shown in Figure 2. This means that G′′ is a collection of m disjoint prisms. The graph G′ is also easy to describe; every variable appears in either one or two 2-clauses. It will belong to a single 2-clause when there was a 2-clause C whose literals also appeared with reversed polarity in a 2-clause C′ and C′ was removed, otherwise it will belong to two 2-clauses. Thus G′ will be a collection of disjoint components that are 1-regular (single edges) or 2-regular (cycles). Since G′ is a collection of edges and cycles it is easy to see that no HC-tree whose root is an uneven split can cut all its 2m′ edges at the top. From Property 1 we know that for an HC-tree (T, δ) of G we have DC-cost(G, (T, δ)) = DC-cost(G′, (T, δ)) + DC-cost(G′′, (T, δ)). Thus, for an uneven HC-tree
(T, δ) of G to have cost at least W*, DC-cost(G′′, (T, δ)) must be strictly higher than W* − 4nm′, since G′ would contribute less than 4nm′. By the equality n = 3m, we get W* − 4nm′ = 10nm + 2n² = 30m² + 18m² = 48m², so G′′ must contribute more than 48m². But our main Lemma 1, showing that the prism is well-behaved, implies that 48m² is the maximum cost achievable for G′′ being m copies of the prism. It must then be the case that there is no uneven HC-tree of G with cost at least W*.

We conclude that there exists an HC-tree of G with weight at least 10nm + 4nm′ + 2n² if and only if the underlying Boolean formula is not-all-equal satisfiable.

Figure 2: The prism P, made from 3-clause edges and matching edges. By our definition of NAESAT*, every 3-clause in ϕ is represented in G. In the middle and on the right, one possible HC-tree of P with maximum DC-cost, and the top split of this tree.

5 The normalization procedure

We give a normalization procedure for G = P^(k) = P_1 ∪ P_2 ∪ ... ∪ P_k, consisting of k disjoint copies of the prism P. This procedure takes as input an HC-tree for G, performs a series of safe operations, and outputs a factorized HC-tree where every prism is clustered according to the evenly balanced HC-tree T in Figure 2. We could have done this naively by a single bottom-up traversal of the tree, performing some PowerfulBalancing operation on each node t of the tree. For every possible split of a subgraph of a prism at node t, PowerfulBalancing would have to perform a safe operation that changes this split into one that is closer to the desired end goal. However, the number of subgraphs of a prism, and the number of distinct splits of these subgraphs, is very high, 11 and 83 respectively. Thus the naive PowerfulBalancing is not a practical option to try and prove that the prism is well-behaved. Instead, our normalization procedure will lower the number of distinct subgraphs and splits of these subgraphs that appear in a node of the tree before doing the Balancing. In total, we employ 3 subroutines at each node t of the tree:

• Cut Optimization: ensures that every sub-prism split at t involves one of the 6 subgraphs given in Figure 3 and is split according to one of 8 specific splits plus 6 distinct mirror-images.

• Left-Heavy Distribution: ensures that no sub-prism split at t has the subgraph in the right child bigger than the one in the left child, restricting to the 8 distinct splits; Figure 5 depicts these splits.

• Balancing: ensures that every sub-prism split at t is split as evenly as possible.

The normalization procedure will make 2 traversals of the tree: the first is a top-down traversal that will perform Cut Optimization on each node, the second is a bottom-up traversal that on each node will perform Left-Heavy Distribution followed by Balancing.
Algorithm 1: This pseudocode outlines in which manner the subroutines are called on the HC-tree (T, δ).

function Normalize(G: graph, (T, δ): HC-tree, t ∈ V(T))
    if t ∈ L(T) then
        return
    end if
    c_l, c_r ← children of t in T
    δ ← Cut Optimization (cf. Section 5.1) on δ with regards to G[t]
    Normalize(G, (T, δ), c_l)
    Normalize(G, (T, δ), c_r)
    (T, δ) ← Left-Heavy Distribution (cf. Section 5.2) on (T, δ) with regards to G[t]
    (T, δ) ← Balancing Out (cf. Section 5.3) on (T, δ) with regards to G[t]
end function

function Normalization(G: graph, (T, δ): HC-tree)
    r ← root of T
    Normalize(G, (T, δ), r)
end function
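Read as code, the control flow of Algorithm 1 is one pre-order action followed by two post-order actions per node. A minimal Python skeleton, assuming a node type with is_leaf(), left and right and a tree with root (the subroutines are stubs standing in for Sections 5.1-5.3, not implementations):

def cut_optimization(G, T, t): pass          # Section 5.1 (stub)
def left_heavy_distribution(G, T, t): pass   # Section 5.2 (stub)
def balancing_out(G, T, t): pass             # Section 5.3 (stub)

def normalize(G, T, t):
    if t.is_leaf():
        return
    cut_optimization(G, T, t)            # top-down: optimize the split at t first
    normalize(G, T, t.left)
    normalize(G, T, t.right)
    left_heavy_distribution(G, T, t)     # bottom-up: then redistribute ...
    balancing_out(G, T, t)               # ... and balance the split at t

def normalization(G, T):
    normalize(G, T, T.root)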
For every prism P_i in G and every internal node t in T, we define P_i[t] to be the subgraph of P_i that lies inside the cluster at t: P_i[t] = P_i ∩ G[t]. Each step of the procedure works on each of these subgraphs, striving to optimize the way these subgraphs are split.

In the next section we show that after the Cut Optimization is done on all nodes of the tree, every subgraph P_i[t] is one of the six subgraphs S_1, ..., S_6 that are depicted in Figure 3. This means that in the continuation we only have to consider splits involving these subgraphs.

We introduce some symbolic notation to easily talk about these splits. Let t be an internal node in the HC-tree T and let c_l and c_r be its children. Let P_i[t] be any subgraph. If we have done Cut Optimization on (T, δ), we know that P_i[t], P_i[c_l] and P_i[c_r] are isomorphic to some S_a, S_{a_l} and S_{a_r}, respectively. Then we denote the split of P_i at t as S_a → (S_{a_l}, S_{a_r}).

Figure 3: The sub-prisms S_1, ..., S_6 arising from optimal splits.

We must say a few words on what it means for a subtree of an HC-tree to be fully normalized, i.e. after we have performed Balancing on the root of the subtree. The end goal is clear: when we are finished, i.e. when we have performed Balancing on the root r of T, we want every prism being split into two S_3's at the root, and those S_3's split into S_2's and S_1's at the children of the root, as seen in Figure 2. But when dealing with the subtree T[t] for a node t further down the tree, the subgraphs involved can be any S_a. Therefore we define 'fully normalized' as every such S_a in the subtree T[t] being split the same way, for all a. The allowed splits are S_6 → (S_3, S_3), S_5 → (S_3, S_2), S_4 → (S_2, S_2) and S_3 → (S_2, S_1).

The next sections are devoted to proving that in our normalization procedure, both the top-down traversal is a safe operation, performing Cut Optimization on every node, and also the subsequent bottom-up traversal is a safe operation, performing Left-Heavy Distribution followed by Balancing on every node of the tree.

5.1 Cut Optimization

Let G = P^(k) be k disjoint prisms, and let (T, δ) be any HC-tree of G. We look at some node t ∈ T. Every subgraph P_i[t] is split into two subgraphs P_i[c_l] and P_i[c_r], with some r and s vertices, respectively. Not every way to split one graph into two subgraphs with given numbers of vertices is equally good. The optimal split of P_i[t] into subgraphs with r and s vertices is simply the split that cuts the most edges.
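The notion 'the split that cuts the most edges' is easy to make executable; a brute-force sketch (ours, not the paper's procedure) over all r-subsets:

from itertools import combinations

def best_split(vertices, edges, r):
    # Optimal split in the sense of Section 5.1: among all bipartitions of the
    # vertex set into parts of sizes r and |vertices| - r, keep the one cutting
    # the most edges.
    V = set(vertices)
    best, best_cut = None, -1
    for left in combinations(sorted(V), r):
        L = set(left)
        cut = sum(1 for u, v in edges
                  if u in V and v in V and (u in L) != (v in L))
        if cut > best_cut:
            best, best_cut = (L, V - L), cut
    return best, best_cut

# On a whole prism, best_split(range(6), prism, 3) returns a 3-3 split cutting
# 7 edges, i.e. the split S_6 -> (S_3, S_3) of Figure 2.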
Observation 1. Let G and (T, δ) be as above. Let t be an internal node in T with children c_l, c_r, and assume that some P_i[t] is split optimally. Furthermore, let S_1, ..., S_6 be the graphs depicted in Figure 3. Whenever P_i[t] = S_a for some a, then P_i[c_l] = S_{a_l} and P_i[c_r] = S_{a_r} for some a_l, a_r.

Proof. It is not hard to verify via simple counting that the subgraphs S_1, ..., S_6 have the minimal number of edges among the subgraphs of the prism on the same number of vertices. Since, for any S_a, S_b with a + b ≤
6, there exists a split of S_{a+b} into S_a and S_b, this split must cut more edges than any other split of S_{a+b}.

Obtaining an optimal split is thus a matter of simply switching around vertices between P_i[c_l] and P_i[c_r]. Formally, switching vertices u and v in G with respect to (T, δ) can be seen as an operation on δ, yielding a new bijection δ′ with the property that δ(u) = δ′(v), δ(v) = δ′(u), and for every vertex w ≠ u, v, δ(w) = δ′(w). This operation preserves the size of every subgraph of G induced by (T, δ), therefore the only edges affected are the ones incident to u or v. We thus conclude that every split that cuts some S_a optimally, cuts it into S_{a_l}, S_{a_r} for some a_l, a_r.

Figure 4: In Cut Optimization, we obtain an optimal cut from a suboptimal one by switching two vertices, in this case d and e. Note that b and c could also be used.
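As a sketch (with δ represented as a plain dict from vertices to leaves, our own assumption), the switching operation is just:

def switch(delta, u, v):
    # delta' agrees with delta everywhere except that u and v trade leaves;
    # all cluster sizes are preserved, so only edges incident to u or v can
    # change the node at which they are cut.
    d = dict(delta)
    d[u], d[v] = d[v], d[u]
    return d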
Lemma 2. For any node t ∈ T, Cut Optimization on (T[t], δ) is a safe operation.

Proof. From the proof of Observation 1, we see that for every P_i[t] that is isomorphic to some S_a, performing Cut Optimization is a safe operation, as it never decreases the DC-cost of (T, δ). Now, note that we perform this operation on each node of T in top-down fashion. At the root r of T, we have that for every 1 ≤ i ≤ k, P_i[r] = P = S_6, so the operation is safe on r. At any other node t, we have already optimized the cuts in u, the parent of t. By Observation 1, we again have that for every 1 ≤ i ≤ k, there exists some a such that P_i[t] = S_a. Therefore, the operation is also safe on every other node of T.

Figure 5: After Cut Optimization, every split of sub-prisms that cuts at least one edge is one of the splits S_6 → (S_3, S_3), S_6 → (S_4, S_2), S_6 → (S_5, S_1), S_5 → (S_3, S_2), S_5 → (S_4, S_1), S_4 → (S_2, S_2), S_4 → (S_3, S_1) and S_3 → (S_2, S_1), or a mirror image of one of these. After Left-Heavy Distribution the mirror images no longer appear.
5.2 Left-Heavy Distribution

Now we show that also Left-Heavy Distribution is a safe operation on each node. This step is performed after Cut Optimization, therefore we can assume every split in the HC-tree is an optimal one. Furthermore, since this step is done in tandem with the Balancing step, on each node before moving up to its parent, we can assume that when performing Left-Heavy Distribution on some node t in T with children c_l and c_r, then T[c_l] and T[c_r] are already fully normalized.

The goal of the second step, Left-Heavy Distribution, is to ensure that for every i, |P_i[c_l]| ≥ |P_i[c_r]|. The intuition behind this step is clear: if we first split one component unevenly, we would expect more uncut edges in the big part than in the small part. Indeed, this is true for the subgraphs S_1, ..., S_6; S_a does not have more edges than S_{a+1} for any a ∈ {1, ..., 5}. Splitting all components unevenly with the big part on the same side, we give more weight to these remaining edges when they are cut further down in T.

We begin by dividing G[t] into two pieces, G[t]_L and G[t]_R. G[t]_L is the union of all those P_i[t] for which |P_i[c_l]| ≥ |P_i[c_r]| (the left-heavily split subgraphs), while G[t]_R is the union of all those P_i[t] for which |P_i[c_l]| < |P_i[c_r]| (the right-heavily split subgraphs). G[t]_L and G[t]_R are clearly disjoint, since every connected subgraph lies wholly within one of these parts. We make a couple of observations about these two subgraphs below.
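In code, this division is a one-pass classification; a sketch with our own data shapes, where each sub-prism P_i[t] is a vertex set and left_leaves stands for the vertices mapped below the left child c_l:

def heavy_partition(sub_prisms, left_leaves):
    # G[t]_L: sub-prisms with at least half their vertices under the left child;
    # G[t]_R: the rest. Whole sub-prisms are classified, so the parts are disjoint.
    GL, GR = [], []
    for verts in sub_prisms:
        n_left = len(verts & left_leaves)
        (GL if n_left >= len(verts) - n_left else GR).append(verts)
    return GL, GR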
Observation 2. Every edge in G[c_l] is also in G[t]_L, and every edge in G[c_r] except those arising from (3-3)-splits is also in G[t]_R.

Proof. We begin by looking at G[c_l]: As we have performed Cut Optimization on the HC-tree, we can assume that P_i[c_l] is isomorphic to S_{a_l} for some a_l ∈ {0, ..., 6} for every i (where S_0 is the empty subgraph), and equivalently every P_i[c_r] is isomorphic to some S_{a_r}. Now, for any P_i[t], if this subgraph has been put into G[t]_R it is because it has been split right-heavily, i.e. a_l < a_r. Since a_l + a_r is at most 6, it follows that a_l is at most 2. But the optimal subsets of the prism that contain edges all have at least 3 vertices, therefore P_i[c_l] cannot contain any edges.

The proof for G[c_r] is roughly equivalent to the one above, but we have to factor in that there can exist some P_i[c_r] in G[t]_L that is isomorphic to S_3. If this is the case, then we know that P_i[c_l] also must be isomorphic to S_3, therefore P_i[t] is a prism that is split (3-3)-wise.

Observation 3.
Let (T, δ) be an HC-tree, and t a node with children c_l, c_r. We give the children of c_l and c_r the names l_1, l_2 and r_1, r_2, respectively. Furthermore, we give the children of these 4 nodes the names x_1, x_2, x_3, x_4 and y_1, y_2, y_3, y_4, respectively. If T[c_l] and T[c_r] are fully normalized, then for every i ∈ {1, ..., 4}, G[x_i] and G[y_i] have no edges.

Proof. Assume that T[c_l] and T[c_r] are fully normalized. By definition, we know that all the subgraphs in G[c_l] and G[c_r] have been split optimally and as balanced as possible. This means that all the subgraphs in G[l_1], G[l_2], G[r_1] and G[r_2] have at most 3 vertices. These subgraphs are also split optimally and balanced. This means that for any T[x_i] or T[y_i], every subgraph is isomorphic to one of S_0, S_1, S_2, and thus has no edges.

When explaining the operation, we assume that the nodes have the same names as in Observation 3. From here, we identify the nodes that are children of l_1, l_2, r_1 and r_2. We then switch around all the subgraphs that are split right-heavily, so they become left-heavily split. Figure 6 shows this operation. Specifically, we modify (T, δ) into (T′, δ′) such that for each pair of nodes x_i, y_i ∈ T′, we have

    G_{(T′,δ′)}[x_i] = (G_{(T,δ)}[x_i] ∩ G[t]_L) ∪ (G_{(T,δ)}[y_i] ∩ G[t]_R)
    G_{(T′,δ′)}[y_i] = (G_{(T,δ)}[x_i] ∩ G[t]_R) ∪ (G_{(T,δ)}[y_i] ∩ G[t]_L)

Figure 6: The circles beneath each node x_i (or y_i) represent G_{(T,δ)}[x_i] (or G_{(T,δ)}[y_i]); the colored halves represent the sub-prisms that are right-heavily split at t, i.e. the union of all those P_i[t] for which |P_i[c_l]| < |P_i[c_r]|. In the Left-Heavy Distribution operation, we switch each two colored parts with the same number.

Lemma 3.
Left-Heavy Distribution on any node t is a safe operation.

Proof. As implied by Observation 3, none of the subgraphs G[x_i] or G[y_i] have any edges. This means that for every i, any HC-tree of G_{(T′,δ′)}[x_i] or G_{(T′,δ′)}[y_i] has DC-cost zero. When this step is done, every edge in G[t] is cut at one of the nodes t, c_l, c_r, l_1 or l_2. It is also evident that every edge is cut in a subgraph that is at least as big in T′ as it was in T, except the edges in c_r. Following Observation 2, these edges must necessarily come from an S_6 → (S_3, S_3) split at t. The decrease in cost for these edges is therefore matched by the increase in cost for the other S_3 that is split at c_l. It follows that (T′, δ′) has at least as high DC-cost as (T, δ). Note that every subgraph in T′[c_l] and T′[c_r] is still fully normalized, since they are split the same way as before.

5.3 Balancing

Let t be a node of the HC-tree (T, δ) on which we have just performed Left-Heavy Distribution. This means that every split at t is optimal and left-heavy, and also that we have performed Balancing on both its children c_l, c_r, so that T[c_l], T[c_r] are both fully normalized. In the Balancing step we fully normalize T[t]. Since splits at the children are left-heavy, there are 12 possible splits of sub-prisms at t before we perform Balancing. These are the 8 in Figure 5 plus 4 not cutting any edge. 4 of these 12 (the first 4 below) are as even as possible, while 8 are uneven.

• a splits of type S_6 → (S_3, S_3)
• b splits of type S_5 → (S_3, S_2)
• c splits of type S_4 → (S_2, S_2)
• d splits of type S_3 → (S_2, S_1)
• a′ splits of type S_6 → (S_6, ∅)
• b′ splits of type S_6 → (S_5, S_1)
• c′ splits of type S_6 → (S_4, S_2)
• d′ splits of type S_5 → (S_5, ∅)
• e′ splits of type S_5 → (S_4, S_1)
• f′ splits of type S_4 → (S_4, ∅)
• g′ splits of type S_4 → (S_3, S_1)
• h′ splits of type S_3 → (S_3, ∅)

The Balancing step is done as follows: Each uneven split of a sub-prism is modified into the unique even split of the same sub-prism, by way of moving some vertices from the left side over to the right side. Figure 7 shows the details of this operation. In the resulting HC-tree, the sub-prisms are not necessarily split left-heavily in c_l or c_r anymore. This does not affect the cost, as these nodes are the lowest that cut edges. We still flip the left and right side of these sub-prisms to guarantee the behavior of performing Left-Heavy Distribution on the parent of t.

As an example of this type of modification, consider a sub-prism that is split S_5 → (S_4, S_1) before the modification. We will modify it into S_5 → (S_3, S_2). In this case, we move one single vertex from the left side to the right side. To optimize the split, we must pick the one vertex that is not adjacent to the vertex already lying on the right side.
However, note that these movements of vertices from the left subtree to the right subtree also affect the cost of edges belonging to even splits, and thus Figure 7 also shows the effects on even splits.

For every possible split, we have denoted the number of sub-prisms that are split this way at t with a letter as shown above, where the letters a to d are reserved for even splits and the ticked letters a′ through h′ are reserved for uneven splits.

From Observation 3, we know that before the Balancing step at t, every edge in G[t] is cut at one of the nodes t, c_l, c_r, l_1 and l_2 (where the nodes are named as in Figure 6). After the modification, every edge in G[t] is cut at one of the nodes t, c_l and c_r in (T′, δ′). How much is gained and lost for each type of split is shown in Figure 7.

Lemma 4.
In the bottom-up traversal the Balancing operations collectively contribute to making this bottom-up traversal a safe operation.

Proof.
Assume Balancing has been performed at a node t as explained above, with the letters a, ..., d, a′, ..., h′ denoting the numbers of sub-prisms of each of the 12 types before the Balancing. To calculate the change in cost, we must look at the sizes of subgraphs of G[t], with A the number of leaves of the subtree rooted at the left child before Balancing at t and A′ this number after the Balancing at t, and similarly for B, B′, C (remember that (T, δ) is the tree before this step and (T′, δ′) is the modified HC-tree):

• A := |G_{(T,δ)}[c_l]| = 6(a′) + 5(b′ + d′) + 4(c′ + e′ + f′) + 3(a + b + g′ + h′) + 2(c + d)
• A′ := |G_{(T′,δ′)}[c_l]| = 3(a + b + a′ + b′ + c′ + d′ + e′) + 2(c + d + f′ + g′ + h′)
• B := |G_{(T,δ)}[c_r]| = 3(a) + 2(b + c + c′) + 1(d + b′ + e′ + g′)
• B′ := |G_{(T′,δ′)}[c_r]| = 3(a + a′ + b′ + c′) + 2(b + c + d′ + e′ + f′ + g′) + 1(d + h′)
• C := |G_{(T,δ)}[l_1]| ≤ 3(a′ + b′ + d′) + 2(a + b + c + d + c′ + e′ + f′ + g′ + h′)
• N := |G[t]| = A + B = A′ + B′

Figure 7: Every type of split that has some edges modified in the Balancing step, shown after the modification together with its net change (Gain − Loss). Green edges have gained cost and red edges have lost cost. Edges whose cost does not change are not shown.

Back to our example, we see in Figure 7 that in each of the e′ sub-prisms that used to be split S_5 → (S_4, S_1) there are 3 edges that have their cost changed: for two of them a gain of B = (A + B) − A, since these edges used to be on the left side but are now cut at t, while one edge incurs a loss of A − A′, since the left side has shrunk in size. The net gain (Gain minus Loss) for these e′ sub-prisms is thus e′(2B − A + A′).

The net gain for all sub-prisms split at t is found by summing in a similar way the net gain for all the 12 cases. Into this total net gain we now plug the definitions of A, A′, B, B′, C, N given above, to get a large sum of products of pairs of the variables a, ..., d, a′, ..., h′. After a simple but tedious reorganizing of this sum, each pair will be multiplied by a coefficient in this total net gain; these coefficients are shown in Table 1.

In this sum, every coefficient is non-negative, except for two terms: −b′h′ and −c′h′. This means that if G[t] consists of only S_6 → (S_4, S_2)'s (denoted by c′) and S_3 → (S_3, ∅)'s (denoted by h′), then the modified (T′, δ′) actually has lower DC-cost than the original
(T, δ). In other words, not every call to Balancing will be safe. But in every ancestor of t, the c′ S_6 → (S_4, S_2)'s are S_6 → (S_6, ∅)'s, and the h′ S_3 → (S_3, ∅)'s will at some ancestor be involved in one of S_4 → (S_3, S_1), S_5 → (S_3, S_2) or S_6 → (S_3, S_3). The coefficients for these combinations in the sum are 8, 13 and 24, respectively. Therefore, even when including these combinations of sub-prisms, the cost for these sub-prisms must increase more at the ancestors of t than it decreases at t. The same argument can be put forward for the combination −b′h′. This implies that no pair of sub-prisms contributes a lower DC-cost in the finished, factorized HC-tree than at the start of the bottom-up traversal.

      a′   b′   c′   d′   e′   f′   g′   h′
a     24   13    3   16    6    9    3    3
b     13    6    0    9    3    4    1    1
c     16    8    2   10    4    6    2    2
d
a′                                   8
b′     x    2    5    2    3    1    4   -1
c′     x    x    0    6    1    2    1   -1
d′     x    x    x    0    4    0    5    0
e′     x    x    x    x    1    1    2    0
f′     x    x    x    x    x    0    3    0
g′     x    x    x    x    x    x    1    1
h′     x    x    x    x    x    x    x    0

Table 1: The coefficients associated with each pair of variables in the formula for net gain after modification of the HC-tree (T[t], δ). That is, the net gain is equal to 24aa′ + 13ab′ + ... + 1g′h′ + 0h′h′. Note the two negative numbers.

Lemma 5.
The top-down traversal of (T, δ) in which Cut Optimization is performed is a safe operation. The bottom-up traversal of (T, δ) in which Left-Heavy Distribution and Balancing are performed is a safe operation.

Proof. Lemma 2 has already established that the top-down traversal consists of a series of safe operations and is therefore itself a safe operation, i.e. the DC-cost of the HC-tree that was given as input is no higher than the DC-cost of the HC-tree after the top-down traversal. By Lemma 3 the Left-Heavy Distribution on each node is also safe. By Lemma 4, the combined results of all the Balancing operations imply that the bottom-up traversal is also a safe operation, i.e. the DC-cost of the HC-tree resulting from the top-down traversal is no higher than the DC-cost of the HC-tree after the bottom-up traversal.
Lemma 1 (restated).
The prism P is max-well-behaved, and thus C_6 is min-well-behaved.
Proof. We have demonstrated a safe normalization procedure that works for any k and any HC-tree of G = P^(k), as described by Property 2. Safeness of the procedure follows from the safeness of its two steps, both the top-down traversal and the bottom-up traversal, as established by Lemma 5. This means that no HC-tree of G = P^(k) has DC-cost higher than the tree output by the normalization procedure. This output tree is a factorized HC-tree since at its root node r every connected subgraph P_i[r] of G[r] is the prism S_6, and every prism at r is split into two S_3's, which are further split into the independent sets S_2 and S_1, as in Figure 2. This decomposition is thus the factorized HC-tree, of DC-cost 48k².

6 Conclusion

We leave as an open problem the complexity of deciding if a graph is max or min well-behaved. A related question arises if we assume that we are given an HC-tree T of max DC-cost for a graph H and also an integer k, and we ask for an HC-tree of max DC-cost for H^(k). Note that the equivalent min DC-cost version of this problem, where adjacency denotes similarity, instead looks at the join of k copies, i.e. a dense graph where an edge is added between any two vertices from distinct copies. It is not clear to us if these problems on k copies are solvable in polynomial time, even though we assume an optimal HC-tree is given for a single copy.

References

[1] N. Bansal, A. Blum, and S. Chawla. Correlation clustering.
Machine Learning, 56(1-3):89–113, 2004.

[2] P. Buneman. The recovery of trees from measures of dissimilarity. Mathematics in the Archaeological and Historical Sciences, pages 387–395, 1971.

[3] S. Chakrabarti, M. Ester, U. Fayyad, J. Gehrke, J. Han, S. Morishita, G. Piatetsky-Shapiro, and W. Wang. Data mining curriculum: A proposal (version 1.0). Technical report, Intensive Working Group of ACM SIGKDD, 2006.

[4] M. Charikar and V. Chatziafratis. Approximate hierarchical clustering via sparsest cut and spreading metrics. In Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 841–854, 2017.

[5] V. Cohen-Addad, V. Kanade, F. Mallmann-Trenn, and C. Mathieu. Hierarchical clustering: Objective functions and algorithms. Journal of the ACM, 66(4):26:1–26:42, 2019.

[6] S. Dasgupta. Hardness of hierarchical clustering optimization. Private communication, 2019.

[7] S. Dasgupta. A cost function for similarity-based hierarchical clustering. In Annual ACM Symposium on Theory of Computing (STOC), pages 118–127, 2016.

[8] R. Diestel. Graph theory. Springer-Verlag, 2005.

[9] J. Hartigan. Clustering algorithms. John Wiley and Sons, 1975.

[10] T. Hastie, R. Tibshirani, and J. Friedman. The elements of statistical learning: data mining, inference, and prediction. Springer Series in Statistics. Springer, second edition, 2009.

[11] K. Koutroumbas and S. Theodoridis.
Pattern recognition. Academic Press, fourth edition, 2009.

[12] R. Sokal and P. Sneath. Principles of numerical taxonomy. W. H. Freeman, 1963.