Hierarchical Clustering via Sketches and Hierarchical Correlation Clustering
Danny Vainstein, Vaggos Chatziafratis, Gui Citovsky, Anand Rajagopalan, Mohammad Mahdian, Yossi Azar
January 27, 2021
Abstract
Recently, Hierarchical Clustering (HC) has been considered through the lens of optimization. In particular, two maximization objectives have been defined. Moseley and Wang defined the Revenue objective to handle similarity information given by a weighted graph on the data points (w.l.o.g., [0, 1] weights), while Cohen-Addad et al. defined the Dissimilarity objective to handle dissimilarity information. In this paper, we prove structural lemmas for both objectives allowing us to convert any HC tree to a tree with a constant number of internal nodes while incurring an arbitrarily small loss in each objective. Although the best-known approximations are 0.585 and 0.667 respectively, using our lemmas we obtain approximations arbitrarily close to 1, if not all weights are small (i.e., there exist constants ε, δ such that the fraction of weights smaller than δ is at most 1 − ε); such instances encompass many metric-based similarity instances, thereby improving upon prior work. Finally, we introduce Hierarchical Correlation Clustering (HCC) to handle instances that contain similarity and dissimilarity information simultaneously. For HCC, we provide an approximation of 0.4767 and for complementary similarity/dissimilarity weights (analogous to +/− correlation clustering), we again present nearly-optimal approximations.

INTRODUCTION

Clustering is a fundamental problem in unsupervised learning and has been widely and intensively explored. Classically, one considers a set of data points (with some notion of either similarity or dissimilarity between every pair) and then partitions these data points into sets. In order to differentiate between different partitions, many classical flat clustering objectives have been introduced, such as k-means, k-median and k-center. However, what if one would like a more granular view of the clusters (specifically, to understand the relations between data points within a given cluster)?

To explore these questions, the notion of Hierarchical Clustering (HC) has been introduced. One way of studying this notion is through the lens of optimization. Dasgupta [2016] initiated this line of work, inspiring others to consider several different objectives. Two notable objectives that we will consider in our paper are the Revenue and Dissimilarity objectives.

∗ School of Computer Science, Tel-Aviv University and Google Research. Email: [email protected]
† Google Research. Emails: {vaggos, gcitovsky, anandbr, mahdian}@google.com
‡ School of Computer Science, Tel-Aviv University. Email: [email protected]. Research supported in part by the Israel Science Foundation (grant No. 2304/20 and grant No. 1506/16).

The problem is defined as follows. We are given a set of data points with some notion of similarity (or dissimilarity) between every pair of points, defined by a weighted graph G = (V, E, w) such that V is our set of data points, |V| = n and w : E → R_{≥0}. We then define an HC tree as a rooted tree with leaves in bijective correspondence with the original data points. Intuitively, we would expect a "good" HC tree T to split more similar data points towards the leaves of the tree. When we are given similarity weights, this corresponds to larger weights. Thus, Moseley and Wang [2017] proposed to maximize the Revenue objective:

rev_G(T) = \sum_{i<j} w_{ij} (n − |T_{ij}|),

where |T_{ij}| denotes the number of leaves in the subtree rooted at the least common ancestor (LCA) of i and j in T. Conversely, when the weights encode dissimilarity, Cohen-Addad et al. [2018] proposed to maximize the Dissimilarity objective:

dis_G(T) = \sum_{i<j} w_{ij} |T_{ij}|.

For Rev-HC the best-known approximation ratio is 0.585 [Alon et al., 2020], while for Dis-HC the best ratio is 0.667 [Charikar et al., 2019a]. In terms of hardness, both problems have been proven to be APX-hard [Ahmadian et al., 2019, Chatziafratis et al., 2020] and thus do not admit optimal or even arbitrarily close to optimal approximations. Given these results, it seems natural to ask whether this hardness is inherent in the objectives, or rather can be somehow circumvented.
Towards that end, we consider the following question:

Is there a large class of interesting instances that can be shown to have significantly better approximations?

Surprisingly, we show that if we consider instances with weights that are not all small (see Definition 3) then the above holds true. First, we obtain approximations arbitrarily close to optimal (specifically, Efficient Polynomial Time Randomized Approximation Schemes (Efficient-PRAS)) for both the Rev-HC and Dis-HC objectives. Interestingly, in order to do so we first consider a tree's sketch (defined as the tree resulting from removing all its leaves and their corresponding edges). Even though it is well known that the optimal trees for these settings are binary (and therefore contain n − 1 internal nodes), we show that for both objectives there exist trees with constant-sized sketch (i.e., a constant number of nodes and edges) that approximate the optimal values arbitrarily well. We stress that this holds true for any HC instance, and not only when not all input weights are small. We then leverage the seminal work of Goldreich et al. [1998] in order to obtain approximations arbitrarily close to optimal, if not all weights are small.

Second, we show that many interesting, and formerly researched, problems are encapsulated by these types of instances. Specifically, we show that a large family of metric-based similarity instances (as defined by Charikar et al. [2019b] - see Subsection 3.3) are such instances, and thus admit approximations arbitrarily close to optimal. We note that this partially answers an open question raised in their work of whether there exist good approximation algorithms for low dimensions. We also note that our results immediately provide an Efficient-PRAS for similarity instances defined by a Gaussian kernel in high dimensions when the minimal similarity is δ = Ω(1), which was specifically considered by Charikar et al. [2019b]; improving their δ-dependent approximation guarantee to one that is arbitrarily close to optimal. Finally, we show that these results also provide an approximation that is arbitrarily close to optimal for the +/− Hierarchical Correlation Clustering problem (defined next).

Up until now we have only considered instances handling either similarity or dissimilarity information, but not both. In many scenarios, however, both types of information are accessible simultaneously. These scenarios have been tackled within the realm of correlation clustering both in theory (e.g., Bansal et al. [2002], Swamy [2004], Charikar et al. [2005], Ailon et al. [2008], Chawla et al. [2015]) and in practice (e.g., Bonchi et al. [2014], Cohen and Richman [2001]). However, this line of work has been centered around flat clustering. With that in mind, it is natural to ask:

In the presence of mixed information, how can we extend the notion of Correlation Clustering to hierarchies?

In order to answer this question, we introduce the Hierarchical Correlation Clustering (HCC) objective. The objective interpolates naturally between the Rev-HC and Dis-HC objectives. Again, we are given a set of data points; however, in this case every pair of data points i and j is given a similarity weight w^s_{ij} and a dissimilarity weight w^d_{ij}. The objective is then defined as

hcc_G(T) = \sum_{i<j} \big( w^s_{ij} (n − |T_{ij}|) + w^d_{ij} |T_{ij}| \big).

Note that the objective captures the Rev-HC and Dis-HC objectives simply by letting either w^d_{ij} = 0 or w^s_{ij} = 0 respectively. Moreover, it captures the fact that similar points (i.e., large w^s_{ij}) should be separated towards the tree's leaves (yielding a large n − |T_{ij}| coefficient), whereas dissimilar points (i.e., large w^d_{ij}) should be split towards the tree's root (yielding a large |T_{ij}| coefficient).
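Under the same illustrative encoding as before, the HCC objective is a one-pass variant that mixes both weight maps; again this is a sketch of the defined quantity, not an algorithm from the paper.

```python
from itertools import combinations

def hcc(tree, ws, wd, n):
    """hcc_G(T) = sum_{i<j} ws_ij*(n - |T_ij|) + wd_ij*|T_ij|. Setting wd (resp.
    ws) to zero recovers the Rev-HC (resp. Dis-HC) objective. A tree is an int
    (leaf) or a tuple of subtrees; ws/wd map frozenset({i, j}) to [0, 1]."""
    total = 0.0
    def walk(node):
        nonlocal total
        if isinstance(node, int):
            return [node]
        groups = [walk(child) for child in node]
        size = sum(len(g) for g in groups)
        for a, b in combinations(groups, 2):
            for i in a:
                for j in b:
                    key = frozenset((i, j))
                    total += ws.get(key, 0.0) * (n - size) + wd.get(key, 0.0) * size
        return [x for g in groups for x in g]
    walk(tree)
    return total
```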
Finally, we consider the +/− variant of correlation clustering [Bansal et al., 2002] extended to hierarchies as well. We define this objective as the HCC objective restricted to instances that guarantee w^s_{ij} = 1 − w^d_{ij} for all data points i and j. We will refer to this objective as the HCC± objective. This may be motivated by the following folklore example: assume one is given a document classifier f that returns a confidence level in [0, 1] corresponding to how certain it is that two documents are similar. Thus, 1 minus the confidence level may be seen as how confident the classifier is that the two documents are dissimilar. For further comments regarding our formulation and how it relates to the correlation clustering objectives of Bansal et al. [2002] and of Swamy [2004], see Section 6.

Contributions of this paper. With respect to the Rev-HC and Dis-HC objectives:

• We present structural lemmas for the revenue and dissimilarity settings that provide a way of converting optimal trees in both settings such that the resulting trees (1) are of constant sketch size and (2) approximate the respective objectives arbitrarily closely (see Figure 1 for an example). Note that this result holds for any similarity/dissimilarity input graph.

• We use the resulting trees in order to obtain Efficient-PRAS's for revenue or dissimilarity instances with not all small weights (see Definition 3). We note that this includes an Efficient-PRAS for any similarity Gaussian-kernel-based instance with minimal weight δ = Ω(1) (specifically considered by Charikar et al. [2019b]).

• We show that many metric-based similarity instances in fact do not have all small weights, thus admitting Efficient-PRAS's. We note that this partially solves the case where the metric's dimension is constant (raised in Charikar et al. [2019b]).

With respect to the HCC objective:

• We present a 0.4767 approximation for the HCC objective by extending the proof of Alon et al. [2020] to include dissimilarity weights.

• We combine our Revenue and Dissimilarity algorithms to produce an Efficient-PRAS for the HCC± objective.

Techniques. In order to reduce HC trees to trees with constant sketch that approximate the Rev-HC and Dis-HC objectives arbitrarily closely, we use the following techniques. For both objectives the first step is to consider an optimal solution, T, and contract it (i.e., contract some subgraphs of T into single nodes) into an intermediate tree denoted as K(T). Briefly, K(T) is generated by recursively finding a constant-sized set of edges whose removal creates a set of trees, each containing a small and roughly equal number of data points. Thereafter, each such tree is contracted (within T) to a single node. The resulting K(T) guarantees that (1) it contains a constant number of nodes and (2) its structure resembles that of T, which allows us to easily convert it to the final revenue/dissimilarity tree. Note that during this process of contraction, some data points may have been contracted as well (see Figure 2). Next we describe, at a high level, how to convert K(T) to a proper revenue/dissimilarity tree.
Revenue setting. In the revenue setting we convert K(T) to a tree denoted by T^R, such that T^R has a constant-sized sketch and approximates the revenue gained by T up to an arbitrarily small constant factor. In order to do so we replace each contracted node in K(T) with a "star" structure (an auxiliary node with the contracted data points connected as its children) - see Figure 3. Note that there is a trade-off between T^R's internal tree size and the revenue approximation factor guaranteed (see Section 3 for formal details).

Dissimilarity setting. In the dissimilarity setting we convert K(T) to a tree denoted by T^D such that T^D has a constant-sized sketch and approximates the dissimilarity gained by T up to an arbitrarily small constant factor. Instead of replacing each contracted node with a "star" structure as in the revenue case, we replace it with a random "comb" structure (formally defined in Section 4 and depicted in Figure 3). Here too, there exists a trade-off between T^D's size and the approximation factor.

Related Work. HC has been extensively studied and therefore many variations have been considered (for a survey on the subject, see Berkhin [2006]). The work on HC trees began within the realm of phylogenetics [Sneath and Sokal, 1962, Jardine and Sibson, 1968] but has since expanded to many other domains (e.g., genetics, data analysis and text analysis - Alon et al. [1999], Brown et al. [1992], Seo and Shneiderman [2002]).

As stated earlier, Dasgupta elegantly linked the fields of approximation algorithms and HC trees, thereby initiating this line of work. Formally, given an HC tree T, Dasgupta [2016] considered the problem of minimizing its cost,

cost_G(T) = \sum_{i<j} w_{ij} |T_{ij}|.

In his work, Dasgupta showed that recursively finding a sparsest cut results in an O(log^{1.5} n) approximation. This analysis was later improved to O(\sqrt{\log n}) [Charikar and Chatziafratis, 2017, Cohen-Addad et al., 2018]. Charikar and Chatziafratis [2017] also showed that no constant approximation exists (assuming the Small Set Expansion hypothesis).

Later, Moseley and Wang [2017] considered the Rev-HC objective (defined earlier) and showed that Average-Linkage achieves a 1/3 approximation. Charikar et al. [2019a] showed a slightly improved approximation; Ahmadian et al. [2019] then reduced the problem to the Max-Uncut Bisection problem in order to prove a 0.4246 approximation, and Alon et al. [2020] improved this to the current best 0.585 approximation, by proving the existence of a bisection which yields large revenue.

Cohen-Addad et al. [2018] considered the Dis-HC objective (defined earlier). In their work they analyzed the Average-Linkage algorithm and then improved upon it by presenting a simple algorithm with a better guarantee. Charikar et al. [2019a] then showed a further improvement by presenting a more intricate algorithm that achieves a 0.667 approximation.

[Figure 2: Converting an HC tree T to K(T).]

[Figure 3: Converting K(T) to an HC tree for each goal function.]

PRELIMINARIES

We first consider several graph-specific definitions.

Definition 1. Given a tree T and a set of edges F ⊂ E(T), let T − F denote the set of trees that results from removing F from E(T). Furthermore, given a set of nodes U ⊂ V(T), let T − U denote the set of trees that results from removing U (and any edge that has a node in U) from T.

Definition 2. Given a graph G and a subset of nodes U ⊂ V(G), we define the contraction of U as the replacement of U within G with a single node attached to all edges which were formerly attached to U.
As pointed out by Charikar et al. [2019a], the Average-Linkage algorithm generates at least \frac{1}{3}(n − 2) \sum_{i<j} w_{ij} revenue and at least \frac{2}{3} n \sum_{i<j} w_{ij} dissimilarity, yielding the following facts:

Fact 2.1. rev(T_O) ≥ \frac{1}{3}(n − 2) \sum_{i<j} w_{ij}.

Fact 2.2. dis(T_O) ≥ \frac{2}{3} n \sum_{i<j} w_{ij}.

We will also use the following identity, which follows from Dasgupta [2016]'s analysis of clique instances (for any binary tree on n leaves, \sum_{i<j} |T_{ij}| = \frac{1}{3}(n^3 − n)):

Fact 2.3. For any binary HC tree T on n leaves, \sum_{i<j} (n − |T_{ij}|) = \frac{n−2}{3} \binom{n}{2}.

Even though the Rev-HC and Dis-HC objectives are defined for binary trees, we make use of star structures. A star structure is simply a node that contains more than two data points as children (and therefore leaves). We use these star structures as a proxy for any binary tree containing the same set of data points. More formally, by replacing the star structure (within some larger tree) with any binary tree containing the same set of data points and then rooting it in the same place within the original tree, the goal function would only increase. In the revenue case this follows immediately. In the dissimilarity case, however, following the definition of |T_{ij}| plainly, attaching all data points to a single root would trivially yield an optimal tree. Therefore, we instead extend the dissimilarity definition to non-binary trees as follows. Given an HC tree T and internal node v, let |T_v| denote the number of data points contained within the subtree rooted at v (in particular, for any two data points i and j, |T_{ij}| = |T_{lca(i,j)}|). We then define the dissimilarity as

dis_G(T) = \sum_{i<j} w_{ij} (|T_{v_i}| + |T_{v_j}|),

where v_i and v_j denote the children of lca(i, j) containing i and j in their subtrees. We emphasize that for binary HC trees this definition coincides with the classic dissimilarity (since |T_{v_i}| + |T_{v_j}| = |T_{ij}|). Clearly any non-binary node may be replaced with a binary subgraph within the HC tree, thereby only increasing the dissimilarity generated. Therefore, any of our algorithmic results apply to the binary setting (by performing these replacements). Further, all of our approximation results are with respect to optimal binary trees and thus directly apply to the binary setting.

Finally, we will use the following definitions throughout the paper. (Recall that w.l.o.g. we may assume that all weights are in [0, 1].)

Definition 3. An HC instance is said to have not all small weights if there exist constants (with respect to |V|) ρ, τ such that the fraction of weights smaller than τ is at most 1 − ρ.

Definition 4. An algorithm is considered an Efficient-PRAS if for any ε > 0 the algorithm runs in time f(1/ε) · n^{O(1)} and approximates the optimal solution's value up to a factor of 1 − ε with high probability.
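As a quick illustration of Definition 3 and Fact 2.1, the following small sketch (our own, not from the paper) tests the not-all-small condition and evaluates the Average-Linkage revenue lower bound:

```python
def not_all_small(weights, rho, tau):
    """Definition 3: at most a (1 - rho) fraction of the pairwise weights lies below tau."""
    ws = list(weights)
    return sum(1 for w in ws if w < tau) <= (1 - rho) * len(ws)

def revenue_lower_bound(weights, n):
    """Fact 2.1: rev(T_O) >= (n - 2)/3 * (total pairwise weight)."""
    return (n - 2) / 3 * sum(weights)
```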
THE REVENUE CASE

In this section we consider the Rev-HC objective. In Subsection 3.1 we show how to create a tree with constant-sized sketch which approximates the optimal revenue tree up to an arbitrarily small factor (for an overview see Techniques). Note that this result holds for any revenue instance and thus may be of independent interest. We then leverage this and in Subsection 3.2 present an Efficient-PRAS for instances with not all small weights. Finally, in Subsection 3.3 we show that a large family of metric-based similarity instances have weights that are not all small - thereby admitting Efficient-PRAS's. We note that this partially solves an open question raised by Charikar et al. [2019b] regarding constant dimension instances, and immediately provides Efficient-PRAS's for similarity instances defined by a Gaussian kernel in high dimensions when the minimal similarity is δ = Ω(1), which was specifically considered in their work as well.

We begin by first proving the existence of a tree with constant-sized sketch that approximates the optimal tree arbitrarily well.

Theorem 3.1. Let T_O denote the optimal revenue tree and assume it contains n leaves (i.e., data points). Then, for any ε > 0, there exists a tree T^R such that (i) T^R contains Θ(1/ε) internal nodes, each with at most 3εn children, and (ii) rev(T^R) ≥ (1 − ε) rev(T_O).

In order to construct T^R we use a two-step process: we first create an intermediate tree, denoted as K(T) (to be defined), and then convert it to our final tree. In fact, this process may be applied to any binary tree T (in particular, we will apply it to T_O). Before we can define the process that generates K(T_O), we must first present several definitions and lemmas, the first of which was shown by Dasgupta [2016] (it was not explicitly proven there, and we therefore add the proof in the Appendix for completeness).

Lemma 3.2. Given a rooted binary tree T with n data points as leaves, there exists an edge whose removal creates two binary trees, each with at least n/3 data points (and therefore at most 2n/3). Furthermore this edge can be found in polytime.

Lemma 3.3. Given a rooted binary tree T with n data points, there exists a set of edges F such that 1/(3ε) ≤ |F| + 1 ≤ 1/ε and the number of data points in each tree of T − F is at least εn and at most 3εn. Furthermore F can be found in polytime.

Proof of Lemma 3.3. Let n denote the number of data points in T. We define the following recursive algorithm: for any binary tree instance T find the edge given by Lemma 3.2. Remove said edge and continue recursively on both resulting trees. Stop once the input tree has less than 3εn data points.

The algorithm is clearly polynomial. Let F denote the set of resulting edges. Due to our stopping condition, every tree in T − F contains between εn and 3εn data points. Therefore, 1/(3ε) ≤ |F| + 1 ≤ 1/ε for ε < 1/3.

Lemma 3.4. For an arbitrary tree T, let V_3 denote the set of vertices with degree ≥ 3 and L denote its set of leaves. Then, |V_3| ≤ |L| − 2.

Proof. Let T be some tree on n nodes and let ℓ denote some leaf. We prove the claim by induction on n. If n = 1 or n = 2 clearly we are done. Otherwise, traverse T starting at ℓ (i.e., hopping from a node to one of its untravelled neighbours). If during this traversal we arrive at a leaf before we arrive at a node with degree ≥ 3, then |V_3| = 0 and we are done. Otherwise let u denote the first node we traverse with degree ≥ 3. Remove all nodes in the traversal up to but not including u, and denote the new tree by T'. Thus, |V_3| ≤ |V'_3| + 1 and |L'| = |L| − 1. Furthermore, since T' has at most n − 1 nodes, by induction |V_3| ≤ |V'_3| + 1 ≤ |L'| − 1 = |L| − 2.

Definition 5. Given F as defined by Lemma 3.3 we define two sets of nodes: blue and green, denoted by B and G. A blue node is any node connected to an edge of F, or T's root. A green node is any node that is not blue and that has two children, each of which contains a blue node as its descendant.

Next we define the process that, given a binary tree, contracts it compactly; the code sketch after this paragraph illustrates the balanced decomposition of Lemmas 3.2 and 3.3 that drives it. Given an input T, we denote the process' output as K(T), formally defined by Algorithm 1. (See Figure 2 for a pictorial example.) We note that each contracted node might have originally contained data points. We therefore associate every contracted node c with its set of data points, D_c.
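The following Python sketch implements the decomposition of Lemmas 3.2 and 3.3 under our own illustrative encoding (binary trees as nested pairs, leaves as ints); it returns the pieces of T − F rather than F itself, since the pieces are what the contraction consumes, and it assumes εn ≥ 1.

```python
def num_leaves(t):
    return 1 if isinstance(t, int) else num_leaves(t[0]) + num_leaves(t[1])

def split_once(t):
    """Lemma 3.2: remove one edge so both sides keep between 1/3 and 2/3 of the
    leaves. Walk toward the heavier child until the subtree drops to <= 2n/3."""
    n = num_leaves(t)
    removed, rest = t, []            # 'rest' collects the sibling subtrees we peel off
    while num_leaves(removed) > 2 * n / 3:
        a, b = removed
        heavy, light = (a, b) if num_leaves(a) >= num_leaves(b) else (b, a)
        rest.append(light)
        removed = heavy              # stays > n/3 since heavy >= half of > 2n/3
    rest_tree = rest[0]              # re-glue the peeled siblings; shape is
    for piece in rest[1:]:           # irrelevant for leaf counts
        rest_tree = (rest_tree, piece)
    return removed, rest_tree

def decompose(t, eps, n):
    """Lemma 3.3: cut edges until every piece has at most 3*eps*n leaves."""
    if num_leaves(t) < 3 * eps * n:
        return [t]
    a, b = split_once(t)
    return decompose(a, eps, n) + decompose(b, eps, n)
```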
Finally, we define the process that, given any binary tree T, outputs T^R - formally defined by Algorithm 2.

Algorithm 1: Algorithm to convert T to K(T).
  Obtain F as described in Lemma 3.3.
  Color the nodes green or blue as in Definition 5.
  for every tree T_i in T − (B ∪ G) do
    Contract T_i.
  Return the resulting tree as K(T).

Algorithm 2: Algorithm to convert T to T^R.
  K(T) ← Algorithm 1 applied to T.
  for each node c ∈ K(T) and its set of data points D_c do
    Attach a (new) auxiliary node as c's child (in K(T)).
    Attach D_c as the auxiliary node's children.
  Return the resulting tree as T^R.

Remark 3.1. We note that T^R remains binary (except for the auxiliary nodes). This is true since otherwise some contracted node would have had at least 2 children which are colored green/blue (as it may only have a single auxiliary node), and thus a green node would have been contained within the contracted component, in contradiction to the definition of K(T).

In what follows we show that for any binary tree T, (1) T^R has a constant sketch and (2) |T^R_{ij}| is (approximately) upper bounded for any data points i and j (which in turn guarantees that rev(T^R) is close to rev(T_O) when T = T_O).
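A minimal sketch of Algorithm 2's star step, assuming K(T) is given as an adjacency dict plus a map D of contracted data points (both encodings, and the node naming, are ours):

```python
def attach_stars(children, D):
    """For every node c with D[c] nonempty, hang one auxiliary node under c
    whose children are exactly the contracted data points D[c]."""
    out = {v: list(cs) for v, cs in children.items()}
    for c, pts in D.items():
        if pts:
            aux = ("aux", c)                     # fresh auxiliary node id
            out.setdefault(c, []).append(aux)
            out[aux] = [("leaf", p) for p in pts]
    return out
```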
Lemma 3.5. T^R contains Θ(1/ε) internal nodes, each with at most 3εn children.

Proof. We first note that a node is a leaf in T^R if and only if it was a leaf in T (since every contracted connected component either contained data points or has a child following the contraction). Next, we categorize the internal nodes of T^R. These nodes are either colored (green or blue), contracted nodes, or auxiliary nodes. We denote these sets of nodes by G, B, C and A respectively.

It is not hard to see that the second part of our lemma holds. This is due to the fact that by Remark 3.1 every node in G, B and C has at most 2 immediate children, while for nodes in A, by Lemma 3.3 and by A's definition, any such node has at most 3εn children.

In order to show the first part of the lemma we bound each of the four sets of nodes. By the definition of B, |B| ≤ 2/ε. By definition of A, |A| ≤ |C|. Furthermore, every node in C has a parent that is colored green or blue and thus, due to Remark 3.1, |C| ≤ 2(|G| + |B|). Therefore, |A| + |C| ≤ 4(|G| + |B|).

Next we bound |G|. In order to do so, we first simplify T^R in a way that does not affect |G|. Since no auxiliary node contains green nodes in its subtree, we may detach the auxiliary nodes without affecting any green or blue nodes. Furthermore, this removal upholds the fact that any green node's degree is at least 3 (since we did not remove any blue nodes). We then also remove any contracted node which now happens to be a leaf (since these too do not affect the green or blue nodes).

Therefore, in the resulting tree, any leaf must be blue and any green node must have degree at least 3. Thus, if we denote by V_3 the set of vertices with degree ≥ 3 and by L the set of leaves, then |G| ≤ |V_3| ≤ |L| − 2 ≤ |B| − 2, where the second inequality is due to Lemma 3.4. Thus,

|A| + |C| + |G| + |B| ≤ 5(|G| + |B|) ≤ 10|B| ≤ 20/ε.

Now, in order to show the complement (i.e., that T^R contains Ω(1/ε) internal nodes) it is enough to consider Lemma 3.3, thereby concluding the proof.

Lemma 3.6. For any two data points i and j, |T^R_{ij}| ≤ |T_{ij}| + 6εn.

Proof. Consider any three data points i, j and k in T, such that k ∉ T_{ij}. We will show that k ∉ T^R_{ij} for all but at most 6εn such k's. In order to prove our lemma we first introduce the following notation. First, for any node u we denote the set of data points contained in its induced subtree by L(u). Second, we note that any node colored green or blue in T will not be contracted and will therefore appear in V(T^R). Finally, we observe the following, given our contraction process.

Observation 1. Let v ∈ V(T) denote a child of a green/blue node and let v* ∈ V(T^R) denote the node that contracted v in T^R. Then L(v) = L(v*).

Observation 2. Data points i and j appear under the same auxiliary node in T^R if and only if i and j were contained in the same tree of T − (B ∪ G).

Recall that our goal is to show that if k ∉ T_{ij} then k ∉ T^R_{ij}. Towards that end, denote by v_{ij} (resp. v_{ik} and v_{jk}) the corresponding LCAs in T. Since k ∉ T_{ij}, we have v_{ik} = v_{jk} and v_{ij} is a descendant of v_{ik}. Furthermore, let {T^{B∪G}_ℓ} denote the set of trees defined by T − (B ∪ G) and let T^{B∪G}_i (resp. T^{B∪G}_j and T^{B∪G}_k) denote the tree in T − (B ∪ G) containing i (resp. j and k).

We first assume k ∉ T^{B∪G}_i and k ∉ T^{B∪G}_j. Therefore, a green or blue node must be either on the path k → v_{ik}, or on the path v_{ij} → v_{ik}; otherwise there must be a green or blue node on the path i → v_{ij} and on the path j → v_{ij}. We consider each case separately. (See Figure 4.)

[Figure 4: Illustration for the proof of Lemma 3.6 (with v_a = a for a ∈ {i, j, k}).]

Case 1. There exists a blue or green node on the path k → v_{ik}: We further split this case into two subcases. The first is that i and j are part of the same tree of T − (B ∪ G). In this case they will end up under the same auxiliary node and, due to Observation 2, we are guaranteed that k ∉ T^R_{ij}. The second subcase is that i and j are not part of the same tree, and therefore there exists a blue/green node on the path i → j. Thus, the node v_{ik} must be green or blue and, due to Observation 1, i and j's LCA will remain lower than i and k's in T^R. Therefore, k ∉ T^R_{ij}.

Case 2. There exists a blue or green node on the path v_{ij} → v_{ik}: In this case either v_{ik} is green/blue, and due to Observation 1 we are done; or some other node along v_{ij} → v_{ik} is green/blue, and then Observation 1 guarantees that k will not enter the subtree defined by i and j's LCA. Thus, in any case, k ∉ T^R_{ij}.

Case 3. There exists a green or blue node on both paths i → v_{ij} and j → v_{ij}: If v_{ij} is green/blue then Observation 1 guarantees that k will not enter the subtree defined by i and j's LCA. Otherwise, we are guaranteed to have two separate green/blue nodes, one on the path i → v_{ij} and one on the path j → v_{ij}; therefore v_{ij} must be green/blue. Hence, in either case, k ∉ T^R_{ij}.

Thus, we have shown that in all 3 cases, if k ∉ T^{B∪G}_i and k ∉ T^{B∪G}_j then k ∉ T^R_{ij}. Since the number of data points within each of T^{B∪G}_i and T^{B∪G}_j is at most 3εn, at most 6εn such k's may be contained in T^R_{ij}. Therefore, |T^R_{ij}| ≤ |T_{ij}| + 6εn, concluding the proof.

Finally, combining Lemmas 3.5 and 3.6 for T = T_O (i.e., the revenue-optimal solution) with Fact 2.1 is enough to prove Theorem 3.1.

Proof of Theorem 3.1. Lemma 3.5 is enough to prove the first part. We consider the second part. It is a known fact that T_O may be taken to be binary.
Therefore, due to Lemma 3.6 and Fact 2.1, we get

rev(T^R) = \sum_{i<j} w_{ij} (n − |T^R_{ij}|) ≥ \sum_{i<j} w_{ij} (n − |T^O_{ij}| − 6εn) ≥ rev(T_O) − 36ε · rev(T_O) = (1 − 36ε) rev(T_O),

where the last inequality holds since, by Fact 2.1, 6εn \sum_{i<j} w_{ij} ≤ 36ε · rev(T_O) (for n ≥ 4). Rescaling ε concludes the proof.

We now consider instances with weights that are not all small and present an Efficient-PRAS. We complement this result by showing that this is the best one could hope for: the problem remains NP-complete on such instances and thus does not admit an optimal, polynomial-time solution (see Theorem 5.1).

For any ε > 0, let |V| = n and k = ⌈20/ε⌉. Finally, let T^R_ε denote the tree guaranteed by Theorem 3.1 for ε. We may write T^R_ε's revenue as follows. For every one of T^R_ε's internal nodes i, denote by D_i its set of children that are data points. Furthermore, let W_{ij} denote the total weight of the set of (similarity) edges crossing between D_i and D_j. Therefore,

rev(T^R_ε) = \sum_{i ≤ j} W_{ij} \big( n − \sum_{ℓ ∈ S_{ij}} |D_ℓ| \big),

where S_{ij} denotes the set of internal nodes contained in the subtree rooted at the LCA of i and j in T^R_ε's sketch. To search over such solutions we use the graph partitioning property tester of Goldreich et al. [1998], which we denote P_T({α_i}, {β_{ij}}, ε_err, δ): roughly speaking, given target set sizes {α_i} and target crossing weights {β_{ij}} (together with an error parameter ε_err and confidence δ), the tester decides whether a partition of V realizing these values up to additive errors of ε_err·n and ε_err·n² exists, and produces one when it does.

Algorithm 3: Efficient-PRAS for the revenue case.
  Enumerate over all trees T with k internal nodes.
  for each such T do
    for {α_i}_{i ≤ k} ⊂ {i·ε²·n : i ∈ N, i ≤ 1/ε²} do
      for {β_{ij}}_{i ≤ k, j ≤ k} ⊂ {i·ε²·n² : i ∈ N, i ≤ 1/ε²} do
        Run P_T({α_i}, {β_{ij}}, ε_err = ε², δ).
        Compute the revenue given T and P_T's output.
  Return the maximal revenue tree encountered.

Lemma 3.7. For every ε > 0, Algorithm 3 guarantees an approximation factor of 1 − O(ε) − O(ε/(ρτ)).

We note that the error incurred by the property tester is offset by the revenue of the optimal solution.

Theorem 3.8. Algorithm 3 is an Efficient-PRAS.

Proof. Lemma 3.7 guarantees that there exists ε̂ = O(ε + ε/(ρτ)) such that our algorithm is a (1 − ε̂)-approximation. For fixed ε, the property tester runs in time that is exponential in a polynomial of k and 1/ε_err (but independent of n), plus a term linear in n. Further, we call the tester a number of times that depends only on k and ε (once per sketch tree and per grid point). Since ρ and τ are constants and ε_err = ε², the algorithm is an Efficient-PRAS.
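The outer loop of Algorithm 3 can be sketched as follows; enumerate_sketches, run_property_tester and revenue_of are hypothetical placeholders (the second standing in for the Goldreich et al. [1998] tester P_T), and the ε² grid spacing mirrors the pseudocode above.

```python
from itertools import product

def epras_revenue(points, w, k, eps, delta,
                  enumerate_sketches, run_property_tester, revenue_of):
    """Schematic of Algorithm 3 (assumed helpers): enumerate_sketches(k) yields
    the finitely many sketch trees with k internal nodes; run_property_tester
    returns a partition realizing the guessed sizes/crossing weights (or None);
    revenue_of evaluates the tree induced by a sketch and a partition."""
    n = len(points)
    eps_err = eps ** 2
    alpha_grid = [i * eps**2 * n for i in range(int(1 / eps**2) + 1)]    # node sizes
    beta_grid = [i * eps**2 * n**2 for i in range(int(1 / eps**2) + 1)]  # crossing weights
    best_val, best_tree = float("-inf"), None
    for sketch in enumerate_sketches(k):
        pairs = k * (k + 1) // 2
        for alphas in product(alpha_grid, repeat=k):
            for betas in product(beta_grid, repeat=pairs):
                partition = run_property_tester(points, w, sketch,
                                                alphas, betas, eps_err, delta)
                if partition is not None:
                    val = revenue_of(sketch, partition, w, n)
                    if val > best_val:
                        best_val, best_tree = val, (sketch, partition)
    return best_val, best_tree
```

For fixed ε the three loops run a constant number of iterations, so the n-dependence comes only from the tester and the final evaluation.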
We follow the definitions of Charikar et al. [2019b]. Suppose that our data points lie in a metric M with doubling dimension D(M). Define a non-increasing function g : R_{≥0} → [0, 1] with g(0) = 1. For any two data points i and j let d_{ij} denote their distance as defined by our metric. We then define the metric-based similarity weights w_{ij} = g(d_{ij}).

Define A(ε) = A to be the tree generated by the algorithm that adds a constant ε to all weights and then runs Algorithm 3 for (ρ, τ)-weighted instances. We note that A is well defined since the altered weights define a graph with not all small weights for τ = ε and ρ = 1.

The following theorem shows that for a large class of functions g and metrics M, algorithm A is in fact an Efficient-PRAS.

Theorem 3.9. Assume the metric's doubling dimension guarantees D(M) = O(1) and g is scale invariant and ℓ-Lipschitz continuous for ℓ = O(1). Then, A is an Efficient-PRAS for the induced Revenue instance.

Proof. Let w_{ij} = g(d_{ij}) and let w'_{ij} = w_{ij} + ε. Denote by O and O' the trees which generate the maximal revenue with respect to w_{ij} and w'_{ij} respectively. Finally, given an HC tree T, let Rev(T) and Rev'(T) denote the revenue generated by T with respect to w_{ij} and w'_{ij} respectively.

By Theorem 3.8 we are guaranteed that for any constant δ > 0, Rev'(A) ≥ (1 − δ) Rev'(O'). Furthermore, by the definitions of O and O' we have that Rev'(O') ≥ Rev'(O). Therefore,

Rev'(A) ≥ (1 − δ) Rev'(O') ≥ (1 − δ) Rev'(O).   (1)

By Fact 2.3 and since w'_{ij} = w_{ij} + ε, we are guaranteed that for any binary tree T, Rev(T) = Rev'(T) − ε \frac{n−2}{3} \binom{n}{2}. Combining this with equation (1) we get

Rev(A) = Rev'(A) − ε \frac{n−2}{3}\binom{n}{2} ≥ (1 − δ) Rev'(O) − ε \frac{n−2}{3}\binom{n}{2} = (1 − δ) Rev(O) − δε \frac{n−2}{3}\binom{n}{2}.

Let α denote the diameter of the metric. Since the instance is scale invariant we may assume w.l.o.g. that α = 1. By the definition of the doubling dimension D(M) = D, there are m ≤ 2^{D⌈\log(4ℓ)⌉} balls of radius 1/(4ℓ) that cover the entirety of the data. Let x_i denote the number of data points that belong to the i'th ball but not to balls 1, ..., i − 1. Therefore, \sum_{i=1}^{m} x_i = n. On the other hand, by the Cauchy-Schwarz inequality, \sum_{i=1}^{m} x_i^2 ≥ n²/m. Therefore, the number of pairs of data points within the same ball is \sum_{i=1}^{m} \binom{x_i}{2} ≥ \frac{n²}{2m} − \frac{n}{2}. Due to the fact that pairs of points that belong to the same ball are at distance at most 1/(2ℓ), and since the similarity function g is non-increasing, we get

\sum_{i<j} w_{ij} ≥ g(\tfrac{1}{2ℓ}) \sum_{i=1}^{m} \binom{x_i}{2} ≥ g(\tfrac{1}{2ℓ}) \big( \frac{n²}{2m} − \frac{n}{2} \big).   (2)

By Fact 2.1 and equation (2) we are guaranteed that for some c = O(m / g(\tfrac{1}{2ℓ})), we have δε \frac{n−2}{3}\binom{n}{2} ≤ cδε · Rev(O). Combining the above,

Rev(A) ≥ (1 − δ − cδε) Rev(O).

Due to the fact that g(0) = 1 and that g is ℓ-Lipschitz continuous, g(\tfrac{1}{2ℓ}) ≥ 1 − ℓ · \tfrac{1}{2ℓ} = \tfrac{1}{2} = Ω(1). On the other hand, since D = O(1) and ℓ = O(1), we have m = O(1) and thus c = O(1); we may therefore choose ε and δ small enough in order to guarantee an Efficient-PRAS.

THE DISSIMILARITY CASE

In this section we show how to create a tree that approximates the optimal dissimilarity value. This tree is produced by taking K(T_O) for the optimal tree T_O (as defined earlier) and altering it. As opposed to the revenue case, this theorem guarantees O(1/ε²) internal nodes while maintaining a (1 − ε) approximation. Note that this result holds for any dissimilarity instance and thus may be of independent interest. For an overview we refer the reader to our Techniques section.

Theorem 4.1. Let T_O denote the optimal dissimilarity tree and assume it contains n leaves (i.e., data points). Then, for any ε > 0, there exists a tree T^D such that (i) T^D contains O(1/ε²) internal nodes, each with at most 3ε²n children, and (ii) dis(T^D) ≥ (1 − ε) dis(T_O).

In order to obtain T^D given a binary tree T, we use K(T) (as defined in Section 3). We then convert K(T) to T^D by randomly partitioning each contracted node's data points into 1/ε clusters and attaching them in a "comb"-like structure. The process is defined in Algorithm 4 (see Figure 3 for an example), and a code sketch follows it.

Algorithm 4: Algorithm to convert T to T^D.
  K(T) ← Algorithm 1 applied to T.
  for each node c ∈ K(T) and its data points D_c do
    Partition D_c into 1/ε random sets of equal sizes, P = {P_1, ..., P_{1/ε}}.
    for P_i ∈ P do
      Create a new auxiliary node u_i.
      Attach P_i as u_i's children.
      Create a new node ℓ_i, and attach it between c and its parent.
      Attach u_i as ℓ_i's child.
  Return the resulting tree as T^D.

Note that D_c = ∅ if c is the root (since the root is blue) and therefore ℓ_i is indeed only defined for c's that have a parent. Also note that, as in Remark 3.1, T^D remains binary if we disregard the auxiliary nodes.
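The comb gadget of Algorithm 4 can be sketched as follows (the adjacency-dict encoding and node naming are ours; shuffling followed by strided slicing realizes the random equal-size partition):

```python
import random

def comb(parent_children, parent, c, D_c, eps, rng=random):
    """Replace the parent->c edge by a spine l_1, ..., l_m (m ~ 1/eps); each l_i
    carries an auxiliary node u_i whose children are the i-th random group of D_c.
    In the intended regime |D_c| <= 3*eps*n, so each group has <= 3*eps^2*n points."""
    m = max(1, round(1 / eps))
    pts = list(D_c)
    rng.shuffle(pts)
    groups = [pts[i::m] for i in range(m)]      # random near-equal partition
    parent_children[parent].remove(c)
    prev = parent
    for i, g in enumerate(groups):
        li, ui = ("l", c, i), ("u", c, i)       # fresh spine + auxiliary node ids
        parent_children.setdefault(prev, []).append(li)
        parent_children[li] = [ui]
        parent_children[ui] = [("leaf", p) for p in g]
        prev = li
    parent_children[prev].append(c)             # c keeps its own subtree below
    return parent_children
```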
Next we show that T^D is of constant size and that |T^D_{ij}| is (approximately) lower bounded.

Lemma 4.2. T^D contains O(1/ε²) and at least Ω(1/ε) internal nodes, each with at most 3ε²n children.

Lemma 4.3. The resulting tree T^D guarantees, in expectation, |T^D_{ij}| ≥ (1 − ε)|T_{ij}| − 6εn.

We defer the proofs of Lemmas 4.2 and 4.3 to the Appendix. Finally, combining Lemmas 4.2 and 4.3 for T = T_O with Fact 2.2 is enough to prove Theorem 4.1. (For the formal proof, see the Appendix.)

In this section we consider the problem of finding an optimal dissimilarity tree in instances with weights that are not all small and present an Efficient-PRAS. As in the revenue case, we again show that this is the best one could hope for, and complement our result by showing that the problem is NP-complete and thus does not admit an optimal, polynomial solution (see Theorem 5.2 in the Appendix).

Let ε > 0 and let T^D_ε denote the tree guaranteed by Theorem 4.1 for ε. As in the revenue case, for an internal node i of T^D_ε, let D_i denote the set of data points that are i's children and let W_{ij} denote the total weight of the set of (dissimilarity) edges crossing between D_i and D_j. Therefore,

dis(T^D_ε) = \sum_{i,j ∈ S} W_{ij} \big( \sum_{ℓ ∈ S_{ij}} |D_ℓ| \big) + b,

where S denotes the internal nodes of T^D_ε's sketch, the inner sum is over all sets D_ℓ contained in T_{ij} (as defined by T^D_ε's sketch), and b is the dissimilarity gained by pairs of data points within the same "star" structure. Theorem 4.1 guarantees that each |D_i| is small; therefore, since our instance has weights that are not all small (and by Fact 2.2 the optimal solution is large), this dissimilarity is negligible and we may assume b = 0, since we already lose a factor of 1 − ε. Finally, recall that |S| ≤ k.

Our Efficient-PRAS follows as in the revenue case and is therefore deferred to the Appendix (Algorithm 7). The following theorem is proven identically to the revenue case and its proof is therefore omitted.

Theorem 4.4. Algorithm 7 is an Efficient-PRAS for dissimilarity instances with weights that are not all small.

HARDNESS

When considering instances with weights that are not all small, we have so far only shown Efficient-PRAS's. To complement our results, we show that we cannot hope for optimal, polynomial algorithms, assuming the Small Set Expansion (SSE) hypothesis. (For a formal definition of SSE see Charikar and Chatziafratis [2017].) In fact, it is enough to show that these objectives are NP-complete assuming the instances are (1) unweighted and (2) dense (i.e., \sum_{i<j} w_{ij} = Ω(n²)). The following theorems are proven in the Appendix.

Theorem 5.1. The Revenue objective for dense instances is in NPC (assuming SSE).

Theorem 5.2. The Dissimilarity objective for dense instances is in NPC (assuming SSE).

Theorem 5.3. The HCC± objective is in NPC (assuming SSE).

HIERARCHICAL CORRELATION CLUSTERING

In this section we consider the case where the collected data may contain both similarity and dissimilarity information. We first show a worst-case approximation and thereafter show an Efficient-PRAS for HCC±.

Here we consider two separate algorithms which, if combined properly, yield our approximation. The first is a simple greedy algorithm, whereas the second optimizes the Max-Uncut Bisection problem for its topmost cut and then continues with the greedy algorithm (a schematic of the combination appears below). We first show baseline guarantees of the greedy algorithm and then use the work of Alon et al. [2020] in order to obtain guarantees on the second algorithm with respect to the HCC objective. We defer the following proof to the appendix.
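The two algorithms just described, and their randomized combination, can be sketched as follows; max_uncut_bisection and greedy are assumed helpers (the greedy standing in for ALG_GRE, spelled out in the appendix).

```python
import random

def alg_mub(points, ws, wd, max_uncut_bisection, greedy):
    """Schematic of ALG_MUB: split once via Max-Uncut Bisection on the similarity
    weights, then finish each half greedily. max_uncut_bisection(points, ws) is
    assumed to return (L, R) with |L| = |R| = n/2."""
    L, R = max_uncut_bisection(points, ws)
    return (greedy(L, ws, wd), greedy(R, ws, wd))

def alg_combined(points, ws, wd, max_uncut_bisection, greedy, p=0.43):
    """Theorem 6.3's algorithm: run ALG_GRE with probability p, else ALG_MUB."""
    if random.random() < p:
        return greedy(points, ws, wd)
    return alg_mub(points, ws, wd, max_uncut_bisection, greedy)
```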
Proposition 6.1. There exists a greedy algorithm, denoted by ALG_GRE, that returns an HC tree T guaranteeing

hcc(T) ≥ \frac{1}{3}(n − 2) \sum_{i<j} w^s_{ij} + \frac{2}{3} n \sum_{i<j} w^d_{ij}.

Denote by ALG_MUB the algorithm that generates an HC tree by first cutting according to Max-Uncut Bisection based on the similarity weights of the instance and then running ALG_GRE on each of the two resulting sides. Let OPT = OPT_s + OPT_d be the value of the optimum HCC tree, where OPT_s = \sum w^s_{ij}(n − |O_{ij}|) and OPT_d = \sum w^d_{ij} |O_{ij}|, defined such that |O_{ij}| denotes the number of leaves in the subtree rooted at the LCA of i and j in the tree of OPT.

Lemma 6.2. Let T denote the HC tree returned by ALG_MUB. Then

hcc_G(T) ≥ 0.585 · OPT_s + \frac{1}{3} · OPT_d.

Proof. The top split of T is a bisection, which means that |L| = |R| = n/2. For ease of notation let

W^s_L = \sum_{i,j ∈ L} w^s_{ij} and W^d_L = \sum_{i,j ∈ L} w^d_{ij},

and define W^s_R and W^d_R similarly. Notice that for the L side, the greedy step contributes at least \frac{2}{3} · \frac{n}{2} · W^d_L = \frac{n}{3} W^d_L to \sum w^d_{ij} |T_{ij}|, as per Proposition 6.1; similarly for the R side. This means that in the tree T, dissimilarity weight cut by the greedy step is counted with a coefficient of at least \frac{n}{3}, while weight cut at the top split of Max-Uncut Bisection is counted with coefficient n. In any case, we have

\sum w^d_{ij} |T_{ij}| ≥ \frac{n}{3} \sum w^d_{ij} ≥ \frac{1}{3} OPT_d,   (3)

by using the upper bound OPT_d ≤ n \sum w^d_{ij}.

We now deal with OPT_s. Observe that

\sum w^s_{ij} (n − |T_{ij}|) ≥ W^s_L \big( \frac{n}{2} + \frac{n}{6} \big) + W^s_R \big( \frac{n}{2} + \frac{n}{6} \big) ≥ \frac{2n}{3} (W^s_L + W^s_R),

since every edge within L contributes n/2 due to the bisection, plus (up to lower-order terms) an extra n/6 due to the greedy step; the same is true for edges in R. Finally, since we used the 0.8776-approximation algorithm for Max-Uncut Bisection, it follows directly from the analysis of Alon et al. [2020] that

\sum w^s_{ij} (n − |T_{ij}|) ≥ 0.585 · OPT_s.   (4)

The lemma follows by summing eqs. (3) and (4).

Finally, we combine Proposition 6.1 and Lemma 6.2 in order to yield the following theorem (whose proof is deferred to the appendix); a small numeric sketch of the balancing computation follows the theorem statement.

Theorem 6.3. Running ALG_GRE with probability p and otherwise ALG_MUB guarantees an approximation of 0.4767 for the HCC objective, when p = 0.43.
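The arithmetic behind Theorem 6.3 is easy to verify numerically; the sketch below (ours) mixes the two per-objective guarantees (1/3 and 2/3 for ALG_GRE, 0.585 and 1/3 for ALG_MUB) and searches for the best p:

```python
def mixed_guarantee(p):
    on_s = p * (1 / 3) + (1 - p) * 0.585    # coefficient of OPT_s
    on_d = p * (2 / 3) + (1 - p) * (1 / 3)  # coefficient of OPT_d
    return min(on_s, on_d)                  # worst case over the two terms

best_p = max((p / 1000 for p in range(1001)), key=mixed_guarantee)
print(best_p, mixed_guarantee(best_p))      # ~0.43 -> ~0.4767
```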
Here we consider the HCC± objective (as defined in the introduction) and show an Efficient-PRAS. We also complement our results and show that this problem is in fact NP-complete, and thus we cannot hope for an optimal, polynomial solution (see Theorem 5.3 in the Appendix). Let ALG± denote the algorithm that runs Algorithm 3 and Algorithm 7 simultaneously and returns the tree maximizing the HCC± objective. We prove that ALG± is in fact an Efficient-PRAS for the HCC± objective. We defer the theorem's proof to the appendix.

Theorem 6.4. ALG± is an Efficient-PRAS for the HCC± objective.

CONCLUSION

In this paper we show that to optimize the Rev-HC and Dis-HC objectives, it suffices to consider HC trees with constant-sized sketches, thereby greatly simplifying these problems. This result can be applied both in the heuristic setting (since it greatly reduces the range of optimal solutions that need to be considered) and in the approximation setting. Specifically, an approximation algorithm may iterate over all constant-sized trees; thereafter, it need only partition the data points into the leaves of the constant-sized tree - thus reducing our problem to the well-studied realm of graph partitioning problems.

We then consider the family of instances with weights that are not all small. We show Efficient-PRAS's for both the Rev-HC and Dis-HC objectives. Furthermore, we show that this family of instances encompasses many metric-based similarity instances. Finally, we introduce the HCC objective, which we hope will provide a better connection between the realms of correlation and hierarchical clustering. We show a worst-case approximation of 0.4767, and an Efficient-PRAS for the HCC± objective that leverages the algorithms we presented for the Rev-HC and Dis-HC objectives on instances with weights that are not all small.

Acknowledgments

The authors would like to deeply thank Claudio Gentile and Fabio Vitale for their helpful discussions and insights regarding the connection to metric-based similarity instances. We also thank Sara Ahmadian and Alessandro Epasto for interesting discussions during early stages of our work.

References

Sara Ahmadian, Vaggos Chatziafratis, Alessandro Epasto, Euiwoong Lee, Mohammad Mahdian, Konstantin Makarychev, and Grigory Yaroslavtsev. Bisect and conquer: Hierarchical clustering via max-uncut bisection. CoRR, abs/1912.06983, 2019.

Nir Ailon, Moses Charikar, and Alantha Newman. Aggregating inconsistent information: ranking and clustering. Journal of the ACM (JACM), 55(5):1-27, 2008.

Noga Alon, Yossi Azar, and Danny Vainstein. Hierarchical clustering: A 0.585 revenue approximation. In Jacob D. Abernethy and Shivani Agarwal, editors, Conference on Learning Theory, COLT 2020, 9-12 July 2020, Virtual Event [Graz, Austria], volume 125 of Proceedings of Machine Learning Research, pages 153-162. PMLR, 2020. URL http://proceedings.mlr.press/v125/alon20b.html.

U. Alon, N. Barkai, D. A. Notterman, K. Gish, S. Ybarra, D. Mack, and A. J. Levine. Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proceedings of the National Academy of Sciences, 96(12):6745-6750, 1999. ISSN 0027-8424. doi: 10.1073/pnas.96.12.6745.

Nikhil Bansal, Avrim Blum, and Shuchi Chawla. Correlation clustering. In FOCS, page 238, 2002.

Pavel Berkhin. A survey of clustering data mining techniques. Grouping Multidimensional Data, pages 25-71, 2006.

Francesco Bonchi, David Garcia-Soriano, and Edo Liberty. Correlation clustering: from theory to practice. In KDD, page 1972, 2014.

Peter F. Brown, Vincent J. Della Pietra, Peter V. de Souza, Jennifer C. Lai, and Robert L. Mercer. Class-based n-gram models of natural language. Computational Linguistics, 18(4):467-479, 1992.

Moses Charikar and Vaggos Chatziafratis. Approximate hierarchical clustering via sparsest cut and spreading metrics. In Proceedings of the Twenty-Eighth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2017, Barcelona, Spain, January 16-19, pages 841-854, 2017.

Moses Charikar, Venkatesan Guruswami, and Anthony Wirth. Clustering with qualitative information. Journal of Computer and System Sciences, 71(3):360-383, 2005.

Moses Charikar, Vaggos Chatziafratis, and Rad Niazadeh. Hierarchical clustering better than average-linkage. In Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2019, San Diego, California, USA, January 6-9, 2019, pages 2291-2304, 2019a.

Moses Charikar, Vaggos Chatziafratis, Rad Niazadeh, and Grigory Yaroslavtsev. Hierarchical clustering for euclidean data. In The 22nd International Conference on Artificial Intelligence and Statistics, AISTATS 2019, 16-18 April 2019, Naha, Okinawa, Japan, pages 2721-2730, 2019b. URL http://proceedings.mlr.press/v89/charikar19a.html.

Vaggos Chatziafratis, Neha Gupta, and Euiwoong Lee.
Inapproximability for local correlation clustering and dissimilarity hierarchical clustering. arXiv preprint arXiv:2010.01459, 2020. URL https://arxiv.org/abs/2010.01459.

Shuchi Chawla, Konstantin Makarychev, Tselil Schramm, and Grigory Yaroslavtsev. Near optimal LP rounding algorithm for correlation clustering on complete and complete k-partite graphs. In Proceedings of the Forty-Seventh Annual ACM Symposium on Theory of Computing, pages 219-228, 2015.

William Cohen and Jacob Richman. Learning to match and cluster entity names. In ACM SIGIR-2001 Workshop on Mathematical/Formal Methods in Information Retrieval, 2001.

William W. Cohen and Jacob Richman. Learning to match and cluster large high-dimensional data sets for data integration. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 475-480, 2002.

Vincent Cohen-Addad, Varun Kanade, Frederik Mallmann-Trenn, and Claire Mathieu. Hierarchical clustering: Objective functions and algorithms. In Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2018, New Orleans, LA, USA, January 7-10, 2018, pages 378-397, 2018.

Sanjoy Dasgupta. A cost function for similarity-based hierarchical clustering. In Proceedings of the 48th Annual ACM SIGACT Symposium on Theory of Computing, STOC 2016, Cambridge, MA, USA, June 18-21, 2016, pages 118-127, 2016.

Ahmed K. Elmagarmid, Panagiotis G. Ipeirotis, and Vassilios S. Verykios. Duplicate record detection: A survey. IEEE Transactions on Knowledge and Data Engineering, 19(1):1-16, 2006.

Michel X. Goemans and David P. Williamson. Improved approximation algorithms for maximum cut and satisfiability problems using semidefinite programming. J. ACM, 42(6):1115-1145, 1995.

Oded Goldreich, Shafi Goldwasser, and Dana Ron. Property testing and its connection to learning and approximation. J. ACM, 45(4):653-750, 1998.

N. Jardine and R. Sibson. A model for taxonomy. Mathematical Biosciences, 2(3-4):465-482, 1968.

Sungwoong Kim, Sebastian Nowozin, Pushmeet Kohli, and Chang D. Yoo. Higher-order correlation clustering for image segmentation. In Advances in Neural Information Processing Systems, pages 1530-1538, 2011.

Benjamin Moseley and Joshua Wang. Approximation bounds for hierarchical clustering: Average linkage, bisecting k-means, and local search. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, pages 3094-3103, 2017.

Anirudh Ramachandran, Nick Feamster, and Santosh Vempala. Filtering spam with behavioral blacklisting. In Proceedings of the 14th ACM Conference on Computer and Communications Security, pages 342-351, 2007.

Jinwook Seo and Ben Shneiderman. Interactively exploring hierarchical clustering results. IEEE Computer, 35(7):80-86, 2002. doi: 10.1109/MC.2002.1016905. URL https://doi.org/10.1109/MC.2002.1016905.

Peter H. A. Sneath and Robert R. Sokal. Numerical taxonomy. Nature, 193(4818):855-860, 1962.

Chaitanya Swamy. Correlation clustering: maximizing agreements via semidefinite programming. In J. Ian Munro, editor, Proceedings of the Fifteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2004, New Orleans, Louisiana, USA, January 11-14, 2004, pages 526-527. SIAM, 2004. URL http://dl.acm.org/citation.cfm?id=982792.982866.

A DEFERRED PROOFS OF SUBSECTION 3.1

Proof of Lemma 3.2. We first note that the removal of any edge creates two binary trees.
Next we show how to find an edge satisfying the rest of the properties. Given the rooted tree T, we travel down the tree from the root such that we always pick the child that contains more data points in its subtree (compared to the other child, if another child exists). We denote the i'th node along this path that contains exactly two children by u_i for i ∈ {1, 2, ...}. Furthermore, we denote the sets of data points contained by its two children by A_i and B_i such that |A_i| ≥ |B_i|.

Let k* := arg min_i {|B_1| + ··· + |B_i| ≥ n/3}. Since |A_{k*}| + |B_1| + ··· + |B_{k*}| ≤ n, we are guaranteed that |A_{k*}| ≤ 2n/3. On the other hand, since |A_{k*}| ≥ |B_{k*}| and |A_{k*}| + |B_{k*}| = n − (|B_1| + ··· + |B_{k*−1}|) > 2n/3 (by the minimality of k*), we are also guaranteed that |A_{k*}| ≥ n/3.

Therefore, removing the edge between u_{k*} and its child associated with A_{k*} guarantees that the resulting trees each have at least n/3 (and therefore at most 2n/3) data points.

B DEFERRED PROOFS AND DEFINITIONS OF SUBSECTION 3.2

Observation 3. Due to Fact 2.1, if we denote by T_O our optimal solution, then since our instance is (ρ, τ)-weighted we get rev(T_O) ≥ ρ₀τ₀ n³ for some smaller, yet still constant, ρ₀ and τ₀.

Proof of Lemma 3.7. Let T_alg denote the tree returned by Algorithm 3. Furthermore, denote by α_ℓ and β_ij the true values corresponding to T^R_ε, and by S̄_{ij} the set of sketch nodes lying outside the subtree rooted at the LCA of i and j (so that rev(T^R_ε) = \sum_{i ≤ j} \sum_{ℓ ∈ S̄_{ij}} α_ℓ β_{ij}). Then

rev(T_alg) ≥ \sum_{i ≤ j} \sum_{ℓ ∈ S̄_{ij}} (α_ℓ − nε² − nε_err)(β_{ij} − n²ε² − n²ε_err)
≥ \sum_{i ≤ j} \sum_{ℓ ∈ S̄_{ij}} α_ℓ β_{ij} − n³(ε² + ε_err) · k − n³(ε² + ε_err) · 20k
≥ rev(T^R_ε) − n³ (421ε + 21k ε_err),

where the first inequality follows since the grids contain values within nε² (resp. n²ε²) of the exact α_ℓ and β_{ij} and the tester adds at most nε_err (resp. n²ε_err) further error; the second follows since there are at most k sets in the partition and at most 20k relevant pairs (i, j), together with \sum_{i ≤ j} β_{ij} ≤ n² and \sum_ℓ α_ℓ ≤ n; and the last uses k ≤ 20/ε + 1.

Due to Observation 3, Theorem 3.1 and by choosing ε_err = ε², we get

rev(T_alg) ≥ rev(T^R_ε) − n³ · O(ε) ≥ rev(T^R_ε) − O(ε/(ρ₀τ₀)) · rev(T_O) ≥ (1 − O(ε) − O(ε/(ρ₀τ₀))) rev(T_O).

Thus by choosing ε small enough, we get the desired result.

C DEFERRED PROOFS OF SUBSECTION 4.1

Proof of Lemma 4.2. Consider the proof of Lemma 3.5. The only difference between T^R and T^D (with respect to the number of their internal nodes) is that in T^D the contracted nodes are multiplied by 1/ε (and therefore the auxiliary nodes as well). Thus, clearly the lemma holds.

Proof of Lemma 4.3. In order to prove the lemma we consider the following observations. The first is Observation 1, which holds here as well. The second is the following.
Observation 4. Consider any two data points i and j that are contained in the same contracted node in K(T_O), and further assume that they end up under different auxiliary nodes. Then any descendant of the corresponding contracted node (in K(T_O)) is contained in T^D_{ij}.

Consider two data points i and j in T_O and consider some k ∈ T^O_{ij}. As before, we denote the LCAs by v_{ik}, v_{jk} and v_{ij}, and assume without loss of generality that i is clustered first with k, so that v_{ij} = v_{kj}.

We would like to bound the number of k's for which k ∉ T^D_{ij}. As before, let {T^{B∪G}_ℓ} denote the set of trees defined by T_O − (B ∪ G) and let T^{B∪G}_i (resp. T^{B∪G}_j and T^{B∪G}_k) denote the tree in T_O − (B ∪ G) containing i (resp. j and k). If k ∈ T^{B∪G}_i or k ∈ T^{B∪G}_j then, since the number of data points contained in these two trees is at most 6εn in total, we may disregard such k's and incur an additive loss of 6εn. Therefore, we assume k ∉ T^{B∪G}_i and k ∉ T^{B∪G}_j.

Thus, we split into the following cases. The first is the case where v_{jk} is green/blue. Otherwise, v_{jk} has at most one child with a blue descendant; it cannot be the child containing j, since that would mean that k ∈ T^{B∪G}_i. Thus, we may only consider the following final cases: either there exists a green/blue node on the path v_{ik} → v_{ij}, or there exists a green/blue node both on the path k → v_{ik} and on the path i → v_{ik} (since k ∉ T^{B∪G}_i). Otherwise, there exists a green/blue node on the path k → v_{ik} and not on the path i → j.

We prove our lemma for each of these cases.

1. v_{jk} is green/blue: Due to Observation 1 we are guaranteed that k ∈ T^D_{ij}.

2. There exists a green/blue node on the path v_{ik} → v_{ij}: Due to Observation 1 we are guaranteed that k ∈ T^D_{ij}.

3. There exists a green/blue node both on the path k → v_{ik} and on the path i → v_{ik}: In this case v_{ik} is green/blue and therefore, again due to Observation 1, we are guaranteed that k ∈ T^D_{ij}.

4. There exists a green/blue node on the path k → v_{ik} and not on the path i → j: In this case i and j are in the same contracted node in K(T_O). If they end up under different auxiliary nodes, then by Observation 4, k ∈ T^D_{ij}. Since we partitioned the data points in the contracted nodes randomly (under the restriction that the sets are of equal size), the probability that i and j end up under different auxiliary nodes is at least 1 − ε.

Thus, in any case, E[|T^D_{ij}|] ≥ (1 − ε)|T^O_{ij}| − 6εn.

Proof of Theorem 4.1. Lemma 4.2 guarantees the first part. For the second part, denote by T_O the optimal solution and note that T_O may be taken to be binary. Due to Lemma 4.3 and Fact 2.2, we get

E[dis(T^D)] = \sum_{i<j} w_{ij} E[|T^D_{ij}|] ≥ (1 − ε) \sum_{i<j} w_{ij} |T^O_{ij}| − 6εn \sum_{i<j} w_{ij} ≥ (1 − ε) dis(T_O) − 9ε · dis(T_O) = (1 − 10ε) dis(T_O),

where the second inequality uses Fact 2.2 (i.e., n \sum_{i<j} w_{ij} ≤ \frac{3}{2} dis(T_O)). Rescaling ε, we get the desired result.

D DEFERRED ALGORITHMS OF SUBSECTION 4.2

Algorithm 7: Efficient-PRAS for the dense dissimilarity case.
  Enumerate over all trees T with k internal nodes.
  for each such T do
    for {α_i}_{i ≤ k} ⊂ {i·ε²·n : i ∈ N, i ≤ 1/ε²} do
      for {β_{ij}}_{i ≤ k, j ≤ k} ⊂ {i·ε²·n² : i ∈ N, i ≤ 1/ε²} do
        Run P_T({α_i}, {β_{ij}}, ε_err = ε², δ).
        Compute the dissimilarity based on T and P_T's output.
  Return the maximal dissimilarity tree encountered.

E DEFERRED PROOFS OF SECTION 6

Proof of Proposition 6.1. For each vertex v ∈ V, our algorithm maintains a score s(v), initially set to zero. The algorithm removes the node of largest score at each step and recurses on the remaining vertices, hence producing a caterpillar tree (a tree in which every internal node has at least one leaf child).
A similar greedy strategy to the one described below can also produce a tree (not necessarily a caterpillar) in a bottom-up fashion by repeatedly merging node pairs. Notice that the algorithm is deterministic.

For every edge (i, j) of similarity weight w^s_{ij}, decrease s(i) and s(j) by \frac{n−2}{2} w^s_{ij}, and increase every other score s(k) by w^s_{ij}, where k ∈ V \ {i, j}. The intuition behind these assignments is that for a pair i, j of similarity w^s_{ij}, whenever we remove another node k first, k's contribution to the hcc objective increases by w^s_{ij}, as k lies outside of the lowest common ancestor of i, j. Similarly, for every edge (i, j) of dissimilarity w^d_{ij}, we increase s(i) and s(j) by \frac{n−2}{2} w^d_{ij}, and decrease every other score s(k) by w^d_{ij}, where k ∈ V \ {i, j}. Note that with these assignments \sum_v s(v) = 0.

Next, let u ∈ V have the largest score and V' = V \ {u}. Remove u and any adjacent edges from the graph, then recursively construct a tree T' with V' as its leaves (if |V'| = 2, just output the unique binary tree on the two nodes). The final output of the algorithm is a new tree T whose root has one child being u and the other child being the root of T'.

We now prove correctness. Let u be as above and let w^s_u = \sum_{(u,v)} w^s_{uv}, w^d_u = \sum_{(u,v)} w^d_{uv}, W^s = \sum_{(i,j)} w^s_{ij}, W^d = \sum_{(i,j)} w^d_{ij}. Notice that according to the scoring rule of our algorithm,

s(u) = (W^s − w^s_u) − \frac{n−2}{2} w^s_u − (W^d − w^d_u) + \frac{n−2}{2} w^d_u.

Note that, by induction, the tree T' on n − 1 leaves guarantees

hcc(T') ≥ \frac{n−3}{3}(W^s − w^s_u) + \frac{2(n−1)}{3}(W^d − w^d_u).   (5)

Since \sum_v s(v) = 0 and u has the largest score, it follows that s(u) ≥ 0. Therefore,

(W^s − w^s_u) − (W^d − w^d_u) ≥ \frac{n−2}{2} w^s_u − \frac{n−2}{2} w^d_u.

Let hcc_u(T) be the contribution towards the hcc objective of node u in T, and observe that we can easily compute this quantity since u is removed first: hcc_u(T) = (W^s − w^s_u) + n w^d_u, as any dissimilarity edge (u, ·) has a lowest common ancestor subtree of size n, and for every similarity edge (i, j) with i, j ≠ u, the point u lies outside of T_{ij} (contributing 1 to its n − |T_{ij}| coefficient). Multiplying the displayed inequality by 2/3 and rearranging yields

hcc_u(T) = (W^s − w^s_u) + n w^d_u ≥ \frac{n−2}{3} W^s + \frac{2n}{3} W^d − \frac{n−3}{3}(W^s − w^s_u) − \frac{2(n−1)}{3}(W^d − w^d_u).   (6)

Summing up eqs. (5) and (6), and noting that hcc(T) = hcc_u(T) + hcc(T'), concludes the proof.
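For concreteness, here is a direct Python transcription of the greedy just described (our own encoding; O(n⁴) as written and intended only as an illustrative sketch, assuming at least two points):

```python
def greedy_caterpillar(V, ws, wd):
    """ALG_GRE sketch. ws/wd map frozenset({i, j}) -> similarity / dissimilarity
    weight. A similarity pair (i, j) lowers s(i), s(j) by (n-2)/2 * ws_ij and
    raises every other s(k) by ws_ij; dissimilarity does the mirror image."""
    V = list(V)
    order = []
    while len(V) > 2:
        n = len(V)
        s = {v: 0.0 for v in V}
        for a in range(n):
            for b in range(a + 1, n):
                i, j = V[a], V[b]
                key = frozenset((i, j))
                wsij, wdij = ws.get(key, 0.0), wd.get(key, 0.0)
                s[i] += (n - 2) / 2 * (wdij - wsij)
                s[j] += (n - 2) / 2 * (wdij - wsij)
                for k in V:
                    if k != i and k != j:
                        s[k] += wsij - wdij
        u = max(V, key=s.get)        # split off the largest-score vertex at the top
        order.append(u)
        V.remove(u)
    tree = (V[0], V[1])              # the last two points form the deepest cherry
    for u in reversed(order):
        tree = (u, tree)             # caterpillar: each removed u hangs off the spine
    return tree
```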
Proof of Theorem 6.4. There are two cases to consider: either $\sum_e w^d_e \geq \sum_e w^s_e$ or $\sum_e w^d_e \leq \sum_e w^s_e$. We first consider the case $\sum_e w^d_e \geq \sum_e w^s_e$ (the second is handled symmetrically). We rewrite the objective function for an HC tree $T$:
\begin{align*}
hcc^\pm(T) &= \sum_e w^d_e\,|T_e| + \sum_e w^s_e\,(n - |T_e|) \\
&= \sum_e w^d_e\,|T_e| + \sum_e (1 - w^d_e)(n - |T_e|) \\
&= 2\sum_e w^d_e\,|T_e| + \sum_e (n - |T_e|) - n\sum_e w^d_e \\
&= 2\sum_e w^d_e\,|T_e| + \frac{n}{3}\binom{n-1}{2} - n\sum_e w^d_e ,
\end{align*}
where the last equality follows from Fact 2.3. We first observe that a tree that maximizes the dissimilarity instance defined by the weights $w^d_e$ also maximizes the original $HCC^\pm$ objective. Let $O^d$ denote the tree maximizing the dissimilarity objective and let $O$ denote the tree maximizing the $HCC^\pm$ objective. By Theorem 4.4 we know that for any constant $\epsilon > 0$, our algorithm ($ALG$) generates dissimilarity of at least $(1-\epsilon)\sum_e w^d_e\,|O^d_e| = (1-\epsilon)\sum_e w^d_e\,|O_e|$. Therefore, for any $\epsilon > 0$:
\begin{align*}
hcc^\pm(ALG) &= 2\sum_e w^d_e\,|ALG_e| + \frac{n}{3}\binom{n-1}{2} - n\sum_e w^d_e \\
&\geq 2(1-\epsilon)\sum_e w^d_e\,|O_e| + \frac{n}{3}\binom{n-1}{2} - n\sum_e w^d_e \\
&= (1-\epsilon)\sum_e w^d_e\,|O_e| + (1-\epsilon)\sum_e w^d_e\,|O_e| + \frac{n}{3}\binom{n-1}{2} - n\sum_e w^d_e \\
&\geq (1-\epsilon) \Big( 2\sum_e w^d_e\,|O_e| + \frac{n}{3}\binom{n-1}{2} - n\sum_e w^d_e \Big) = (1-\epsilon)\, hcc^\pm(O) ,
\end{align*}
where the last inequality follows from Fact 2.2. The case $\sum_e w^d_e \leq \sum_e w^s_e$ is handled symmetrically (using Theorem 3.8 and Fact 2.1), which concludes the proof.

F HARDNESS RESULTS

Proof of Theorem 5.1. Note that the problem is clearly in NP (given a tree, its revenue can be verified efficiently); therefore we only need to show that it is NP-hard. Ahmadian et al. [2019] showed that the unweighted revenue case is APX-hard under the Small Set Expansion hypothesis. This in turn guarantees that the unweighted revenue problem is NP-hard assuming the Small Set Expansion hypothesis. Next we show how to reduce an unweighted revenue instance to a dense unweighted revenue instance (in polynomial time).

Roughly speaking, we simply add a disconnected clique of size $n$ to the original graph. Formally, let $G = (D, E_D, w)$ denote a general revenue instance with $D = \{d_1, \ldots, d_n\}$. We convert $G$ to a dense instance $G' = (V, E_V, w')$ simply by adding a clique of size $n$ (disconnected from $D$) with similarity weights of size 1. We denote this clique's set of nodes by $L = \{\ell_1, \ldots, \ell_n\}$. Therefore, $w'(\ell_i, \ell_j) = 1$, $w'(d_i, d_j) = w(d_i, d_j)$ and $w'(\ell_i, d_j) = 0$.

Clearly $G'$ is dense. Let $T'$ denote the optimal solution for $G'$. It is known that the optimal tree first cuts the disconnected components of $G'$. Therefore, there exists a node $u$ in $T'$ such that the subtree rooted at $u$ contains the entirety of $L$ and no data points from $D$. Since $D$ is disconnected from $L$, and due to the definition of the revenue objective, taking $u$ and moving it to the top of $T'$ (formally, if $r'$ is the root of $T'$, then we create a new root $r$ and attach $u$ and $r'$ as its immediate children) can only increase $T'$'s revenue. Thus, we may assume w.l.o.g. that the root of $T'$ already separates $L$ from $D$.

Let $v_D$ and $v_L$ denote the immediate children of $T'$'s root containing $D$ and $L$ respectively, and let $T'_D$ denote the subtree rooted at $v_D$. $T'_D$ is clearly optimal for the instance $G$ (since otherwise we could replace $T'_D$ with the optimal tree for $G$, thereby increasing $T'$'s revenue and contradicting its optimality).

Thus, we converted, in polynomial time, the optimal tree for $G'$ into the optimal tree for $G$, proving that the dense revenue problem is NP-hard.
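The construction in the proof above is easy to mechanize; the following is a minimal sketch that builds the dense instance $G'$ from $G$ (the function name and the dictionary representation of weights are illustrative assumptions, not from the paper).

    # Sketch of the densifying reduction from the proof of Theorem 5.1:
    # append a disconnected unit-weight clique of size n to a revenue
    # instance on nodes 0..n-1 (illustrative layout, not from the paper).
    from itertools import combinations

    def densify(n, w):
        """w maps pairs (i, j), i < j < n, to similarity weights; the result
        is an instance on 2n nodes where nodes n..2n-1 form the new clique."""
        w_dense = dict(w)                # original weights stay unchanged
        for i, j in combinations(range(n, 2 * n), 2):
            w_dense[(i, j)] = 1.0        # unit similarities inside the clique
        # cross pairs (d_i, l_j) are left absent, i.e., similarity 0
        return w_dense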
Definition 6. We say that an unweighted graph is complement-dense if its complement graph (i.e., the graph obtained by removing all existing edges and adding all missing edges) is dense.

Lemma F.1. The problem of finding a maximal revenue tree for revenue instances which are complement-dense is NP-complete (assuming the Small Set Expansion hypothesis).

Proof. Note that the problem is clearly in NP (given a tree, its revenue can be verified efficiently); therefore we only need to show that it is NP-hard. As in Theorem 5.1, we reduce an unweighted revenue instance to a complement-dense unweighted revenue instance. Specifically, we do this by adding a disconnected path on $n$ nodes to the original graph. Formally, let $G = (D, E_D, w)$ denote a general revenue instance with $D = \{d_1, \ldots, d_n\}$. We convert $G$ to a complement-dense instance $G' = (V, E_V, w')$ simply by adding a path of size $n$ (disconnected from $D$) with similarity weights of size 1. We denote this path's set of nodes by $L = \{\ell_1, \ldots, \ell_n\}$. Therefore, $w'(\ell_i, \ell_{i+1}) = 1$, $w'(d_i, d_j) = w(d_i, d_j)$ and $w'(\ell_i, d_j) = 0$. Note that $G'$ is clearly complement-dense.

As in the proof of Theorem 5.1, there exists a node $u$ in the optimal solution $T'$ of $G'$ such that the subtree rooted at $u$ contains the entirety of $L$ and no data points from $D$. Again, we may move $u$ and its subtree to the root of $T'$, thereby only increasing the revenue. Thus, given $T'$, we may take its child that contains $D$ as our optimal tree for $G$.

Observation 5. Since the problem of finding a minimal (Dasgupta) cost tree is the dual problem of the revenue problem, the unweighted, complement-dense Dasgupta cost problem is NP-complete (assuming the Small Set Expansion hypothesis).

Proof of Theorem 5.2. Note that the problem is clearly in NP (given a tree, its dissimilarity can be verified efficiently); therefore we only need to show that it is NP-hard. We do this by reducing the unweighted, complement-dense Dasgupta cost problem to this problem.

Roughly speaking, we simply consider the complement graph of the HC instance. Formally, given a complement-dense HC instance $G = (V, E, w)$, we define its complement as $G^c = (V^c, E^c, w^c)$, so that for any edge $e$, $w^c(e) = 1 - w(e)$. Thus,
\[ \min_T cost_G(T) = \min_T \sum_e w(e)\,|T_e| = \min_T \sum_e (1 - w^c(e))\,|T_e| . \]
Dasgupta [2016] proved that for any binary tree $T$ and for any HC instance which is a clique $H$, its cost is fixed and $cost_H(T) = \frac{1}{3}\big(|V(H)|^3 - |V(H)|\big)$. Since the optimal tree for this cost function is in fact binary, we get
\[ \min_T \sum_e (1 - w^c(e))\,|T_e| = \frac{1}{3}\big(|V(G)|^3 - |V(G)|\big) - \max_T \sum_e w^c(e)\,|T_e| . \]
Since $w$ defines a complement-dense instance, $w^c$ defines a dense instance. Thus, we reduced our original problem to $\max_T \sum_e w^c(e)\,|T_e|$ such that $w^c$ is dense, thereby completing the proof.

Proof of Theorem 5.3. The theorem is proven simply by rewriting the $HCC^\pm$