Hierarchical Clustering via Sketches and Hierarchical Correlation Clustering
Danny Vainstein, Vaggos Chatziafratis, Gui Citovsky, Anand Rajagopalan, Mohammad Mahdian, Yossi Azar
January 27, 2021
Abstract
Recently, Hierarchical Clustering (HC) has been considered through the lens of optimization. In particular, two maximization objectives have been defined. Moseley and Wang defined the Revenue objective to handle similarity information given by a weighted graph on the data points (w.l.o.g., [0, 1] weights), while Cohen-Addad et al. defined the Dissimilarity objective to handle dissimilarity information. In this paper, we prove structural lemmas for both objectives allowing us to convert any HC tree to a tree with a constant number of internal nodes while incurring an arbitrarily small loss in each objective. Although the best-known approximations are 0.585 and 0.667 respectively, using our lemmas we obtain approximations arbitrarily close to 1, if not all weights are small (i.e., there exist constants ε, δ such that the fraction of weights smaller than δ is at most 1 − ε); such instances encompass many metric-based similarity instances, thereby improving upon prior work. Finally, we introduce Hierarchical Correlation Clustering (HCC) to handle instances that contain similarity and dissimilarity information simultaneously. For HCC, we provide an approximation of 0.4767 and for complementary similarity/dissimilarity weights (analogous to +/− correlation clustering), we again present nearly-optimal approximations.

INTRODUCTION

Clustering is a fundamental problem in unsupervised learning and has been widely and intensively explored. Classically, one considers a set of data points (with some notion of either similarity or dissimilarity between every pair) and then partitions these data points into sets. In order to differentiate between different partitions, many classical flat clustering objectives have been introduced, such as k-means, k-median and k-center. However, what if one would like a more granular view of the clusters (specifically, to understand the relations between data points within a given cluster)?

To explore these questions, the notion of Hierarchical Clustering (HC) has been introduced. One way of studying this notion is through the lens of optimization. Dasgupta [2016] initiated this line of work, inspiring others to consider several different objectives. Two notable objectives that we will consider in our paper are the Revenue and Dissimilarity objectives.

∗ School of Computer Science, Tel-Aviv University and Google Research. Email: [email protected]
† Google Research. Emails: {vaggos, gcitovsky, anandbr, mahdian}@google.com
‡ School of Computer Science, Tel-Aviv University. Email: [email protected]. Research supported in part by the Israel Science Foundation (grant No. 2304/20 and grant No. 1506/16).

The problem is defined as follows. We are given a set of data points with some notion of similarity (or dissimilarity) between every pair of points, defined by a weighted graph G = (V, E, w) such that V is our set of data points, |V| = n and w : E → R_{≥0}. We then define an HC tree as a rooted tree with leaves in bijective correspondence with the original data points. Intuitively, we would expect a "good" HC tree T to split more similar data points towards the leaves of the tree. When we are given similarity weights, this corresponds to larger weights. Thus, Moseley and Wang [2017] proposed to maximize the Revenue objective:

rev_G(T) = \sum_{i<j} w_{ij} (n − |T_{ij}|),

where |T_{ij}| denotes the number of leaves in the subtree rooted at the least common ancestor (LCA) of i and j in T. Conversely, when the weights encode dissimilarity, Cohen-Addad et al. [2018] proposed to maximize the Dissimilarity objective:

dis_G(T) = \sum_{i<j} w_{ij} |T_{ij}|.

For Rev-HC the best-known approximation ratio is 0.585 [Alon et al., 2020], while for Dis-HC the best ratio is 0.667 [Charikar et al., 2019a]. In terms of hardness, both problems have been proven to be APX-hard [Ahmadian et al., 2019, Chatziafratis et al., 2020] and thus do not admit optimal or even arbitrarily close to optimal approximations. Given these results, it seems natural to ask whether this hardness is inherent in the objectives, or rather can be somehow circumvented.
Towards that end, we consider the following question:

Is there a large class of interesting instances that can be shown to have significantly better approximations?

Surprisingly, we show that if we consider instances with weights that are not all small (see Definition 3) then the above holds true. First, we obtain approximations arbitrarily close to optimal (specifically, Efficient Polynomial Time Randomized Approximation Schemes (Efficient-PRAS)) for both the Rev-HC and Dis-HC objectives. Interestingly, in order to do so we first consider a tree's sketch (defined as the tree resulting from removing all its leaves and their corresponding edges). Even though it is well known that the optimal trees for these settings are binary (and therefore contain n − 1 internal nodes), we show that for both objectives there exist trees with constant-sized sketch (i.e., a constant number of nodes and edges) that approximate the optimal values arbitrarily well. We stress that this holds true for any HC instance, and not only when not all input weights are small. We then leverage the seminal work of Goldreich et al. [1998] in order to obtain approximations arbitrarily close to optimal, if not all weights are small.

Second, we show that many interesting, and formerly researched, problems are encapsulated by these types of instances. Specifically, we show that a large family of metric-based similarity instances (as defined by Charikar et al. [2019b] - see Subsection 3.3) are such instances, and thus admit approximations arbitrarily close to optimal. We note that this partially answers an open question raised in their work of whether there exist good approximation algorithms for low dimensions. We also note that our results immediately provide an Efficient-PRAS for similarity instances defined by a Gaussian kernel in high dimensions when the minimal similarity is δ = Ω(1), which was specifically considered by Charikar et al. [2019b]; improving their δ-dependent approximation guarantee to one that is arbitrarily close to optimal. Finally, we show that these results also provide an approximation that is arbitrarily close to optimal for the +/− Hierarchical Correlation Clustering problem (defined next).

Up until now we have only considered instances handling either similarity or dissimilarity information, but not both. In many scenarios, however, both types of information are accessible simultaneously. These scenarios have been tackled within the realm of correlation clustering both in theory (e.g., Bansal et al. [2002], Swamy [2004], Charikar et al. [2005], Ailon et al. [2008], Chawla et al. [2015]) and in practice (e.g., Bonchi et al. [2014], Cohen and Richman [2001]). However, this line of work has been centered around flat clustering. With that in mind, it is natural to ask:

In the presence of mixed information, how can we extend the notion of Correlation Clustering to hierarchies?

In order to answer this question, we introduce the Hierarchical Correlation Clustering (HCC) objective. The objective interpolates naturally between the Rev-HC and Dis-HC objectives. Again, we are given a set of data points; however, in this case every pair of data points i and j is given a similarity weight w^s_{ij} and a dissimilarity weight w^d_{ij}. The objective is then defined as

hcc_G(T) = \sum_{i<j} \big( w^s_{ij} (n − |T_{ij}|) + w^d_{ij} |T_{ij}| \big).

Note that the objective captures the Rev-HC and Dis-HC objectives simply by letting either w^d_{ij} = 0 or w^s_{ij} = 0 respectively. Moreover, it captures the fact that similar points (i.e., large w^s_{ij}) should be separated towards the tree's leaves (yielding a large n − |T_{ij}| coefficient), whereas dissimilar points (i.e., large w^d_{ij}) should be split towards the tree's root (yielding a large |T_{ij}| coefficient).
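Under the same illustrative encoding as before, the HCC objective is a one-pass variant that mixes both weight maps; again this is a sketch of the defined quantity, not an algorithm from the paper.

```python
from itertools import combinations

def hcc(tree, ws, wd, n):
    """hcc_G(T) = sum_{i<j} ws_ij*(n - |T_ij|) + wd_ij*|T_ij|. Setting wd (resp.
    ws) to zero recovers the Rev-HC (resp. Dis-HC) objective. A tree is an int
    (leaf) or a tuple of subtrees; ws/wd map frozenset({i, j}) to [0, 1]."""
    total = 0.0
    def walk(node):
        nonlocal total
        if isinstance(node, int):
            return [node]
        groups = [walk(child) for child in node]
        size = sum(len(g) for g in groups)
        for a, b in combinations(groups, 2):
            for i in a:
                for j in b:
                    key = frozenset((i, j))
                    total += ws.get(key, 0.0) * (n - size) + wd.get(key, 0.0) * size
        return [x for g in groups for x in g]
    walk(tree)
    return total
```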
Finally, we consider the +/− variant of correlation clustering [Bansal et al., 2002] extended to hierarchies as well. We define this objective as the HCC objective restricted to instances that guarantee w^s_{ij} = 1 − w^d_{ij} for all data points i and j. We will refer to this objective as the HCC± objective. This may be motivated by the following folklore example: assume one is given a document classifier f that returns a confidence level in [0, 1] corresponding to how certain it is that two documents are similar. Thus, 1 minus the confidence level may be seen as how confident the classifier is that the two documents are dissimilar. For further comments regarding our formulation and how it relates to the correlation clustering objectives of Bansal et al. [2002] and of Swamy [2004], see Section 6.

Contributions of this paper. With respect to the Rev-HC and Dis-HC objectives:

• We present structural lemmas for the revenue and dissimilarity settings that provide a way of converting optimal trees in both settings such that the resulting trees (1) are of constant sketch size and (2) approximate the respective objectives arbitrarily closely (see Figure 1 for an example). Note that this result holds for any similarity/dissimilarity input graph.

• We use the resulting trees in order to obtain Efficient-PRAS's for revenue or dissimilarity instances with not all small weights (see Definition 3). We note that this includes an Efficient-PRAS for any similarity Gaussian-kernel-based instance with minimal weight δ = Ω(1) (specifically considered by Charikar et al. [2019b]).

• We show that many metric-based similarity instances in fact do not have all small weights, thus admitting Efficient-PRAS's. We note that this partially solves the case where the metric's dimension is constant (raised in Charikar et al. [2019b]).

With respect to the HCC objective:

• We present a 0.4767 approximation for the HCC objective by extending the proof of Alon et al. [2020] to include dissimilarity weights.

• We combine our Revenue and Dissimilarity algorithms to produce an Efficient-PRAS for the HCC± objective.

Techniques. In order to reduce HC trees to trees with constant sketch that approximate the Rev-HC and Dis-HC objectives arbitrarily closely, we use the following techniques. For both objectives the first step is to consider an optimal solution, T, and contract it (i.e., contract some subgraphs of T into single nodes) into an intermediate tree denoted as K(T). Briefly, K(T) is generated by recursively finding a constant-sized set of edges whose removal creates a set of trees, each containing a small and roughly equal number of data points. Thereafter, each such tree is contracted (within T) to a single node. The resulting K(T) guarantees that (1) it contains a constant number of nodes and (2) its structure resembles that of T, which allows us to easily convert it to the final revenue/dissimilarity tree. Note that during this process of contraction, some data points may have been contracted as well (see Figure 2). Next we describe, at a high level, how to convert K(T) to a proper revenue/dissimilarity tree.
Revenue setting. In the revenue setting we convert K(T) to a tree denoted by T^R, such that T^R has a constant-sized sketch and approximates the revenue gained by T up to an arbitrarily small constant factor. In order to do so we replace each contracted node in K(T) with a "star" structure (an auxiliary node with the contracted data points connected as its children) - see Figure 3. Note that there is a trade-off between T^R's internal tree size and the revenue approximation factor guaranteed (see Section 3 for formal details).

Dissimilarity setting. In the dissimilarity setting we convert K(T) to a tree denoted by T^D such that T^D has a constant-sized sketch and approximates the dissimilarity gained by T up to an arbitrarily small constant factor. Instead of replacing each contracted node with a "star" structure as in the revenue case, we replace it with a random "comb" structure (formally defined in Section 4 and depicted in Figure 3). Here too, there exists a trade-off between T^D's size and the approximation factor.

Related Work. HC has been extensively studied and therefore many variations have been considered (for a survey on the subject, see Berkhin [2006]). The work on HC trees began within the realm of phylogenetics [Sneath and Sokal, 1962, Jardine and Sibson, 1968] but has since expanded to many other domains (e.g., genetics, data analysis and text analysis - Alon et al. [1999], Brown et al. [1992], Seo and Shneiderman [2002]).

As stated earlier, Dasgupta elegantly linked the fields of approximation algorithms and HC trees, thereby initiating this line of work. Formally, given an HC tree T, Dasgupta [2016] considered the problem of minimizing its cost,

cost_G(T) = \sum_{i<j} w_{ij} |T_{ij}|.

In his work, Dasgupta showed that recursively finding a sparsest cut results in an O(log^{1.5} n) approximation. This analysis was later improved to O(\sqrt{\log n}) [Charikar and Chatziafratis, 2017, Cohen-Addad et al., 2018]. Charikar and Chatziafratis [2017] also showed that no constant approximation exists (assuming the Small Set Expansion hypothesis).

Later, Moseley and Wang [2017] considered the Rev-HC objective (defined earlier) and showed that Average-Linkage achieves a 1/3 approximation. Charikar et al. [2019a] showed a slightly improved approximation; Ahmadian et al. [2019] then reduced the problem to the Max-Uncut Bisection problem in order to prove a 0.4246 approximation, and Alon et al. [2020] improved this to the current best 0.585 approximation, by proving the existence of a bisection which yields large revenue.

Cohen-Addad et al. [2018] considered the Dis-HC objective (defined earlier). In their work they analyzed the Average-Linkage algorithm and then improved upon it by presenting a simple algorithm with a better guarantee. Charikar et al. [2019a] then showed a further improvement by presenting a more intricate algorithm that achieves a 0.667 approximation.

[Figure 2: Converting an HC tree T to K(T).]

[Figure 3: Converting K(T) to an HC tree for each goal function.]

PRELIMINARIES

We first consider several graph-specific definitions.

Definition 1. Given a tree T and a set of edges F ⊂ E(T), let T − F denote the set of trees that results from removing F from E(T). Furthermore, given a set of nodes U ⊂ V(T), let T − U denote the set of trees that results from removing U (and any edge that has a node in U) from T.

Definition 2. Given a graph G and a subset of nodes U ⊂ V(G), we define the contraction of U as the replacement of U within G with a single node attached to all edges which were formerly attached to U.
As pointed out by Charikar et al. [2019a], the Average-Linkage algorithm generates at least \frac{1}{3}(n − 2) \sum_{i<j} w_{ij} revenue and at least \frac{2}{3} n \sum_{i<j} w_{ij} dissimilarity, yielding the following facts:

Fact 2.1. rev(T_O) ≥ \frac{1}{3}(n − 2) \sum_{i<j} w_{ij}.

Fact 2.2. dis(T_O) ≥ \frac{2}{3} n \sum_{i<j} w_{ij}.

We will also use the following identity, which follows from Dasgupta [2016]'s analysis of clique instances (for any binary tree on n leaves, \sum_{i<j} |T_{ij}| = \frac{1}{3}(n^3 − n)):

Fact 2.3. For any binary HC tree T on n leaves, \sum_{i<j} (n − |T_{ij}|) = \frac{n−2}{3} \binom{n}{2}.

Even though the Rev-HC and Dis-HC objectives are defined for binary trees, we make use of star structures. A star structure is simply a node that contains more than two data points as children (and therefore leaves). We use these star structures as a proxy for any binary tree containing the same set of data points. More formally, by replacing the star structure (within some larger tree) with any binary tree containing the same set of data points and then rooting it in the same place within the original tree, the goal function would only increase. In the revenue case this follows immediately. In the dissimilarity case, however, following the definition of |T_{ij}| plainly, attaching all data points to a single root would trivially yield an optimal tree. Therefore, we instead extend the dissimilarity definition to non-binary trees as follows. Given an HC tree T and internal node v, let |T_v| denote the number of data points contained within the subtree rooted at v (in particular, for any two data points i and j, |T_{ij}| = |T_{lca(i,j)}|). We then define the dissimilarity as

dis_G(T) = \sum_{i<j} w_{ij} (|T_{v_i}| + |T_{v_j}|),

where v_i and v_j denote the children of lca(i, j) containing i and j in their subtrees. We emphasize that for binary HC trees this definition coincides with the classic dissimilarity (since |T_{v_i}| + |T_{v_j}| = |T_{ij}|). Clearly any non-binary node may be replaced with a binary subgraph within the HC tree, thereby only increasing the dissimilarity generated. Therefore, any of our algorithmic results apply to the binary setting (by performing these replacements). Further, all of our approximation results are with respect to optimal binary trees and thus directly apply to the binary setting.

Finally, we will use the following definitions throughout the paper. (Recall that w.l.o.g. we may assume that all weights are in [0, 1].)

Definition 3. An HC instance is said to have not all small weights if there exist constants (with respect to |V|) ρ, τ such that the fraction of weights smaller than τ is at most 1 − ρ.

Definition 4. An algorithm is considered an Efficient-PRAS if for any ε > 0 the algorithm runs in time f(1/ε) · n^{O(1)} and approximates the optimal solution's value up to a factor of 1 − ε with high probability.
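As a quick illustration of Definition 3 and Fact 2.1, the following small sketch (our own, not from the paper) tests the not-all-small condition and evaluates the Average-Linkage revenue lower bound:

```python
def not_all_small(weights, rho, tau):
    """Definition 3: at most a (1 - rho) fraction of the pairwise weights lies below tau."""
    ws = list(weights)
    return sum(1 for w in ws if w < tau) <= (1 - rho) * len(ws)

def revenue_lower_bound(weights, n):
    """Fact 2.1: rev(T_O) >= (n - 2)/3 * (total pairwise weight)."""
    return (n - 2) / 3 * sum(weights)
```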
THE REVENUE CASE

In this section we consider the Rev-HC objective. In Subsection 3.1 we show how to create a tree with constant-sized sketch which approximates the optimal revenue tree up to an arbitrarily small factor (for an overview see Techniques). Note that this result holds for any revenue instance and thus may be of independent interest. We then leverage this and in Subsection 3.2 present an Efficient-PRAS for instances with not all small weights. Finally, in Subsection 3.3 we show that a large family of metric-based similarity instances have weights that are not all small - thereby admitting Efficient-PRAS's. We note that this partially solves an open question raised by Charikar et al. [2019b] regarding constant dimension instances, and immediately provides Efficient-PRAS's for similarity instances defined by a Gaussian kernel in high dimensions when the minimal similarity is δ = Ω(1), which was specifically considered in their work as well.

We begin by first proving the existence of a tree with constant-sized sketch that approximates the optimal tree arbitrarily well.

Theorem 3.1. Let T_O denote the optimal revenue tree and assume it contains n leaves (i.e., data points). Then, for any ε > 0, there exists a tree T^R such that (i) T^R contains Θ(1/ε) internal nodes, each with at most 3εn children, and (ii) rev(T^R) ≥ (1 − ε) rev(T_O).

In order to construct T^R we use a two-step process: we first create an intermediate tree, denoted as K(T) (to be defined), and then convert it to our final tree. In fact, this process may be applied to any binary tree T (in particular, we will apply it to T_O). Before we can define the process that generates K(T_O), we must first present several definitions and lemmas, the first of which was shown by Dasgupta [2016] (it was not explicitly proven there, and we therefore add the proof in the Appendix for completeness).

Lemma 3.2. Given a rooted binary tree T with n data points as leaves, there exists an edge whose removal creates two binary trees, each with at least n/3 data points (and therefore at most 2n/3). Furthermore this edge can be found in polytime.

Lemma 3.3. Given a rooted binary tree T with n data points, there exists a set of edges F such that 1/(3ε) ≤ |F| + 1 ≤ 1/ε and the number of data points in each tree of T − F is at least εn and at most 3εn. Furthermore F can be found in polytime.

Proof of Lemma 3.3. Let n denote the number of data points in T. We define the following recursive algorithm: for any binary tree instance T find the edge given by Lemma 3.2. Remove said edge and continue recursively on both resulting trees. Stop once the input tree has less than 3εn data points.

The algorithm is clearly polynomial. Let F denote the set of resulting edges. Due to our stopping condition, every tree in T − F contains between εn and 3εn data points. Therefore, 1/(3ε) ≤ |F| + 1 ≤ 1/ε for ε < 1/3.

Lemma 3.4. For an arbitrary tree T, let V_3 denote the set of vertices with degree ≥ 3 and L denote its set of leaves. Then, |V_3| ≤ |L| − 2.

Proof. Let T be some tree on n nodes and let ℓ denote some leaf. We prove the claim by induction on n. If n = 1 or n = 2 clearly we are done. Otherwise, traverse T starting at ℓ (i.e., hopping from a node to one of its untravelled neighbours). If during this traversal we arrive at a leaf before we arrive at a node with degree ≥ 3, then |V_3| = 0 and we are done. Otherwise let u denote the first node we traverse with degree ≥ 3. Remove all nodes in the traversal up to but not including u, and denote the new tree by T'. Thus, |V_3| ≤ |V'_3| + 1 and |L'| = |L| − 1. Furthermore, since T' has at most n − 1 nodes, by induction |V_3| ≤ |V'_3| + 1 ≤ |L'| − 1 = |L| − 2.

Definition 5. Given F as defined by Lemma 3.3 we define two sets of nodes: blue and green, denoted by B and G. A blue node is any node connected to an edge of F, or T's root. A green node is any node that is not blue and that has two children, each of which contains a blue node as its descendant.

Next we define the process that, given a binary tree, contracts it compactly; the code sketch after this paragraph illustrates the balanced decomposition of Lemmas 3.2 and 3.3 that drives it. Given an input T, we denote the process' output as K(T), formally defined by Algorithm 1. (See Figure 2 for a pictorial example.) We note that each contracted node might have originally contained data points. We therefore associate every contracted node c with its set of data points, D_c.
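The following Python sketch implements the decomposition of Lemmas 3.2 and 3.3 under our own illustrative encoding (binary trees as nested pairs, leaves as ints); it returns the pieces of T − F rather than F itself, since the pieces are what the contraction consumes, and it assumes εn ≥ 1.

```python
def num_leaves(t):
    return 1 if isinstance(t, int) else num_leaves(t[0]) + num_leaves(t[1])

def split_once(t):
    """Lemma 3.2: remove one edge so both sides keep between 1/3 and 2/3 of the
    leaves. Walk toward the heavier child until the subtree drops to <= 2n/3."""
    n = num_leaves(t)
    removed, rest = t, []            # 'rest' collects the sibling subtrees we peel off
    while num_leaves(removed) > 2 * n / 3:
        a, b = removed
        heavy, light = (a, b) if num_leaves(a) >= num_leaves(b) else (b, a)
        rest.append(light)
        removed = heavy              # stays > n/3 since heavy >= half of > 2n/3
    rest_tree = rest[0]              # re-glue the peeled siblings; shape is
    for piece in rest[1:]:           # irrelevant for leaf counts
        rest_tree = (rest_tree, piece)
    return removed, rest_tree

def decompose(t, eps, n):
    """Lemma 3.3: cut edges until every piece has at most 3*eps*n leaves."""
    if num_leaves(t) < 3 * eps * n:
        return [t]
    a, b = split_once(t)
    return decompose(a, eps, n) + decompose(b, eps, n)
```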
Finally, we define the process that, given any binary tree T, outputs T^R - formally defined by Algorithm 2.

Algorithm 1: Algorithm to convert T to K(T).
  Obtain F as described in Lemma 3.3.
  Color the nodes green or blue as in Definition 5.
  for every tree T_i in T − (B ∪ G) do
    Contract T_i.
  Return the resulting tree as K(T).

Algorithm 2: Algorithm to convert T to T^R.
  K(T) ← Algorithm 1 applied to T.
  for each node c ∈ K(T) and its set of data points D_c do
    Attach a (new) auxiliary node as c's child (in K(T)).
    Attach D_c as the auxiliary node's children.
  Return the resulting tree as T^R.

Remark 3.1. We note that T^R remains binary (except for the auxiliary nodes). This is true since otherwise some contracted node would have had at least 2 children which are colored green/blue (as it may only have a single auxiliary node), and thus a green node would have been contained within the contracted component, in contradiction to the definition of K(T).

In what follows we show that for any binary tree T, (1) T^R has a constant sketch and (2) |T^R_{ij}| is (approximately) upper bounded for any data points i and j (which in turn guarantees that rev(T^R) is close to rev(T_O) when T = T_O).
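A minimal sketch of Algorithm 2's star step, assuming K(T) is given as an adjacency dict plus a map D of contracted data points (both encodings, and the node naming, are ours):

```python
def attach_stars(children, D):
    """For every node c with D[c] nonempty, hang one auxiliary node under c
    whose children are exactly the contracted data points D[c]."""
    out = {v: list(cs) for v, cs in children.items()}
    for c, pts in D.items():
        if pts:
            aux = ("aux", c)                     # fresh auxiliary node id
            out.setdefault(c, []).append(aux)
            out[aux] = [("leaf", p) for p in pts]
    return out
```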
Lemma 3.5. T^R contains Θ(1/ε) internal nodes, each with at most 3εn children.

Proof. We first note that a node is a leaf in T^R if and only if it was a leaf in T (since every contracted connected component either contained data points or has a child following the contraction). Next, we categorize the internal nodes of T^R. These nodes are either colored (green or blue), contracted nodes, or auxiliary nodes. We denote these sets of nodes by G, B, C and A respectively.

It is not hard to see that the second part of our lemma holds. This is due to the fact that by Remark 3.1 every node in G, B and C has at most 2 immediate children, while for nodes in A, by Lemma 3.3 and by A's definition, any such node has at most 3εn children.

In order to show the first part of the lemma we bound each of the four sets of nodes. By the definition of B, |B| ≤ 2/ε. By definition of A, |A| ≤ |C|. Furthermore, every node in C has a parent that is colored green or blue and thus, due to Remark 3.1, |C| ≤ 2(|G| + |B|). Therefore, |A| + |C| ≤ 4(|G| + |B|).

Next we bound |G|. In order to do so, we first simplify T^R in a way that does not affect |G|. Since no auxiliary node contains green nodes in its subtree, we may detach the auxiliary nodes without affecting any green or blue nodes. Furthermore, this removal upholds the fact that any green node's degree is at least 3 (since we did not remove any blue nodes). We then also remove any contracted node which now happens to be a leaf (since these too do not affect the green or blue nodes).

Therefore, in the resulting tree, any leaf must be blue and any green node must have degree at least 3. Thus, if we denote by V_3 the set of vertices with degree ≥ 3 and by L the set of leaves, then |G| ≤ |V_3| ≤ |L| − 2 ≤ |B| − 2, where the second inequality is due to Lemma 3.4. Thus,

|A| + |C| + |G| + |B| ≤ 5(|G| + |B|) ≤ 10|B| ≤ 20/ε.

Now, in order to show the complement (i.e., that T^R contains Ω(1/ε) internal nodes) it is enough to consider Lemma 3.3, thereby concluding the proof.

Lemma 3.6. For any two data points i and j, |T^R_{ij}| ≤ |T_{ij}| + 6εn.

Proof. Consider any three data points i, j and k in T, such that k ∉ T_{ij}. We will show that k ∉ T^R_{ij} for all but at most 6εn such k's. In order to prove our lemma we first introduce the following notation. First, for any node u we denote the set of data points contained in its induced subtree by L(u). Second, we note that any node colored green or blue in T will not be contracted and will therefore appear in V(T^R). Finally, we observe the following, given our contraction process.

Observation 1. Let v ∈ V(T) denote a child of a green/blue node and let v* ∈ V(T^R) denote the node that contracted v in T^R. Then L(v) = L(v*).

Observation 2. Data points i and j appear under the same auxiliary node in T^R if and only if i and j were contained in the same tree of T − (B ∪ G).

Recall that our goal is to show that if k ∉ T_{ij} then k ∉ T^R_{ij}. Towards that end, denote by v_{ij} (resp. v_{ik} and v_{jk}) the corresponding LCAs in T. Since k ∉ T_{ij}, we have v_{ik} = v_{jk} and v_{ij} is a descendant of v_{ik}. Furthermore, let {T^{B∪G}_ℓ} denote the set of trees defined by T − (B ∪ G) and let T^{B∪G}_i (resp. T^{B∪G}_j and T^{B∪G}_k) denote the tree in T − (B ∪ G) containing i (resp. j and k).

We first assume k ∉ T^{B∪G}_i and k ∉ T^{B∪G}_j. Therefore, a green or blue node must be either on the path k → v_{ik}, or on the path v_{ij} → v_{ik}; otherwise there must be a green or blue node on the path i → v_{ij} and on the path j → v_{ij}. We consider each case separately. (See Figure 4.)

[Figure 4: Illustration for the proof of Lemma 3.6 (with v_a = a for a ∈ {i, j, k}).]

Case 1. There exists a blue or green node on the path k → v_{ik}: We further split this case into two subcases. The first is that i and j are part of the same tree of T − (B ∪ G). In this case they will end up under the same auxiliary node and, due to Observation 2, we are guaranteed that k ∉ T^R_{ij}. The second subcase is that i and j are not part of the same tree, and therefore there exists a blue/green node on the path i → j. Thus, the node v_{ik} must be green or blue and, due to Observation 1, i and j's LCA will remain lower than i and k's in T^R. Therefore, k ∉ T^R_{ij}.

Case 2. There exists a blue or green node on the path v_{ij} → v_{ik}: In this case either v_{ik} is green/blue, and due to Observation 1 we are done; or some other node along v_{ij} → v_{ik} is green/blue, and then Observation 1 guarantees that k will not enter the subtree defined by i and j's LCA. Thus, in any case, k ∉ T^R_{ij}.

Case 3. There exists a green or blue node on both paths i → v_{ij} and j → v_{ij}: If v_{ij} is green/blue then Observation 1 guarantees that k will not enter the subtree defined by i and j's LCA. Otherwise, we are guaranteed to have two separate green/blue nodes, one on the path i → v_{ij} and one on the path j → v_{ij}; therefore v_{ij} must be green/blue. Hence, in either case, k ∉ T^R_{ij}.

Thus, we have shown that in all 3 cases, if k ∉ T^{B∪G}_i and k ∉ T^{B∪G}_j then k ∉ T^R_{ij}. Since the number of data points within each of T^{B∪G}_i and T^{B∪G}_j is at most 3εn, at most 6εn such k's may be contained in T^R_{ij}. Therefore, |T^R_{ij}| ≤ |T_{ij}| + 6εn, concluding the proof.

Finally, combining Lemmas 3.5 and 3.6 for T = T_O (i.e., the revenue-optimal solution) with Fact 2.1 is enough to prove Theorem 3.1.

Proof of Theorem 3.1. Lemma 3.5 is enough to prove the first part. We consider the second part. It is a known fact that T_O may be taken to be binary.
Therefore, due to Lemma 3.6 and Fact 2.1, we get

rev(T^R) = \sum_{i<j} w_{ij} (n − |T^R_{ij}|) ≥ \sum_{i<j} w_{ij} (n − |T^O_{ij}| − 6εn) ≥ rev(T_O) − 36ε · rev(T_O) = (1 − 36ε) rev(T_O),

where the last inequality holds since, by Fact 2.1, 6εn \sum_{i<j} w_{ij} ≤ 36ε · rev(T_O) (for n ≥ 4). Rescaling ε concludes the proof.

We now consider instances with weights that are not all small and present an Efficient-PRAS. We complement this result by showing that this is the best one could hope for: the problem remains NP-complete on such instances and thus does not admit an optimal, polynomial-time solution (see Theorem 5.1).

For any ε > 0, let |V| = n and k = ⌈20/ε⌉. Finally, let T^R_ε denote the tree guaranteed by Theorem 3.1 for ε. We may write T^R_ε's revenue as follows. For every one of T^R_ε's internal nodes i, denote by D_i its set of children that are data points. Furthermore, let W_{ij} denote the total weight of the set of (similarity) edges crossing between D_i and D_j. Therefore,

rev(T^R_ε) = \sum_{i ≤ j} W_{ij} \big( n − \sum_{ℓ ∈ S_{ij}} |D_ℓ| \big),

where S_{ij} denotes the set of internal nodes contained in the subtree rooted at the LCA of i and j in T^R_ε's sketch. To search over such solutions we use the graph partitioning property tester of Goldreich et al. [1998], which we denote P_T({α_i}, {β_{ij}}, ε_err, δ): roughly speaking, given target set sizes {α_i} and target crossing weights {β_{ij}} (together with an error parameter ε_err and confidence δ), the tester decides whether a partition of V realizing these values up to additive errors of ε_err·n and ε_err·n² exists, and produces one when it does.

Algorithm 3: Efficient-PRAS for the revenue case.
  Enumerate over all trees T with k internal nodes.
  for each such T do
    for {α_i}_{i ≤ k} ⊂ {i·ε²·n : i ∈ N, i ≤ 1/ε²} do
      for {β_{ij}}_{i ≤ k, j ≤ k} ⊂ {i·ε²·n² : i ∈ N, i ≤ 1/ε²} do
        Run P_T({α_i}, {β_{ij}}, ε_err = ε², δ).
        Compute the revenue given T and P_T's output.
  Return the maximal revenue tree encountered.

Lemma 3.7. For every ε > 0, Algorithm 3 guarantees an approximation factor of 1 − O(ε) − O(ε/(ρτ)).

We note that the error incurred by the property tester is offset by the revenue of the optimal solution.

Theorem 3.8. Algorithm 3 is an Efficient-PRAS.

Proof. Lemma 3.7 guarantees that there exists ε̂ = O(ε + ε/(ρτ)) such that our algorithm is a (1 − ε̂)-approximation. For fixed ε, the property tester runs in time that is exponential in a polynomial of k and 1/ε_err (but independent of n), plus a term linear in n. Further, we call the tester a number of times that depends only on k and ε (once per sketch tree and per grid point). Since ρ and τ are constants and ε_err = ε², the algorithm is an Efficient-PRAS.
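The outer loop of Algorithm 3 can be sketched as follows; enumerate_sketches, run_property_tester and revenue_of are hypothetical placeholders (the second standing in for the Goldreich et al. [1998] tester P_T), and the ε² grid spacing mirrors the pseudocode above.

```python
from itertools import product

def epras_revenue(points, w, k, eps, delta,
                  enumerate_sketches, run_property_tester, revenue_of):
    """Schematic of Algorithm 3 (assumed helpers): enumerate_sketches(k) yields
    the finitely many sketch trees with k internal nodes; run_property_tester
    returns a partition realizing the guessed sizes/crossing weights (or None);
    revenue_of evaluates the tree induced by a sketch and a partition."""
    n = len(points)
    eps_err = eps ** 2
    alpha_grid = [i * eps**2 * n for i in range(int(1 / eps**2) + 1)]    # node sizes
    beta_grid = [i * eps**2 * n**2 for i in range(int(1 / eps**2) + 1)]  # crossing weights
    best_val, best_tree = float("-inf"), None
    for sketch in enumerate_sketches(k):
        pairs = k * (k + 1) // 2
        for alphas in product(alpha_grid, repeat=k):
            for betas in product(beta_grid, repeat=pairs):
                partition = run_property_tester(points, w, sketch,
                                                alphas, betas, eps_err, delta)
                if partition is not None:
                    val = revenue_of(sketch, partition, w, n)
                    if val > best_val:
                        best_val, best_tree = val, (sketch, partition)
    return best_val, best_tree
```

For fixed ε the three loops run a constant number of iterations, so the n-dependence comes only from the tester and the final evaluation.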
We follow the definitions of Charikar et al. [2019b]. Suppose that our data points lie in a metric M with doubling dimension D(M). Define a non-increasing function g : R_{≥0} → [0, 1] with g(0) = 1. For any two data points i and j let d_{ij} denote their distance as defined by our metric. We then define the metric-based similarity weights w_{ij} = g(d_{ij}).

Define A(ε) = A to be the tree generated by the algorithm that adds a constant ε to all weights and then runs Algorithm 3 for (ρ, τ)-weighted instances. We note that A is well defined since the altered weights define a graph with not all small weights for τ = ε and ρ = 1.

The following theorem shows that for a large class of functions g and metrics M, algorithm A is in fact an Efficient-PRAS.

Theorem 3.9. Assume the metric's doubling dimension guarantees D(M) = O(1) and g is scale invariant and ℓ-Lipschitz continuous for ℓ = O(1). Then, A is an Efficient-PRAS for the induced Revenue instance.

Proof. Let w_{ij} = g(d_{ij}) and let w'_{ij} = w_{ij} + ε. Denote by O and O' the trees which generate the maximal revenue with respect to w_{ij} and w'_{ij} respectively. Finally, given an HC tree T, let Rev(T) and Rev'(T) denote the revenue generated by T with respect to w_{ij} and w'_{ij} respectively.

By Theorem 3.8 we are guaranteed that for any constant δ > 0, Rev'(A) ≥ (1 − δ) Rev'(O'). Furthermore, by the definitions of O and O' we have that Rev'(O') ≥ Rev'(O). Therefore,

Rev'(A) ≥ (1 − δ) Rev'(O') ≥ (1 − δ) Rev'(O).   (1)

By Fact 2.3 and since w'_{ij} = w_{ij} + ε, we are guaranteed that for any binary tree T, Rev(T) = Rev'(T) − ε \frac{n−2}{3} \binom{n}{2}. Combining this with equation (1) we get

Rev(A) = Rev'(A) − ε \frac{n−2}{3}\binom{n}{2} ≥ (1 − δ) Rev'(O) − ε \frac{n−2}{3}\binom{n}{2} = (1 − δ) Rev(O) − δε \frac{n−2}{3}\binom{n}{2}.

Let α denote the diameter of the metric. Since the instance is scale invariant we may assume w.l.o.g. that α = 1. By the definition of the doubling dimension D(M) = D, there are m ≤ 2^{D⌈\log(4ℓ)⌉} balls of radius 1/(4ℓ) that cover the entirety of the data. Let x_i denote the number of data points that belong to the i'th ball but not to balls 1, ..., i − 1. Therefore, \sum_{i=1}^{m} x_i = n. On the other hand, by the Cauchy-Schwarz inequality, \sum_{i=1}^{m} x_i^2 ≥ n²/m. Therefore, the number of pairs of data points within the same ball is \sum_{i=1}^{m} \binom{x_i}{2} ≥ \frac{n²}{2m} − \frac{n}{2}. Due to the fact that pairs of points that belong to the same ball are at distance at most 1/(2ℓ), and since the similarity function g is non-increasing, we get

\sum_{i<j} w_{ij} ≥ g(\tfrac{1}{2ℓ}) \sum_{i=1}^{m} \binom{x_i}{2} ≥ g(\tfrac{1}{2ℓ}) \big( \frac{n²}{2m} − \frac{n}{2} \big).   (2)

By Fact 2.1 and equation (2) we are guaranteed that for some c = O(m / g(\tfrac{1}{2ℓ})), we have δε \frac{n−2}{3}\binom{n}{2} ≤ cδε · Rev(O). Combining the above,

Rev(A) ≥ (1 − δ − cδε) Rev(O).

Due to the fact that g(0) = 1 and that g is ℓ-Lipschitz continuous, g(\tfrac{1}{2ℓ}) ≥ 1 − ℓ · \tfrac{1}{2ℓ} = \tfrac{1}{2} = Ω(1). On the other hand, since D = O(1) and ℓ = O(1), we have m = O(1) and thus c = O(1); we may therefore choose ε and δ small enough in order to guarantee an Efficient-PRAS.

THE DISSIMILARITY CASE

In this section we show how to create a tree that approximates the optimal dissimilarity value. This tree is produced by taking K(T_O) for the optimal tree T_O (as defined earlier) and altering it. As opposed to the revenue case, this theorem guarantees O(1/ε²) internal nodes while maintaining a (1 − ε) approximation. Note that this result holds for any dissimilarity instance and thus may be of independent interest. For an overview we refer the reader to our Techniques section.

Theorem 4.1. Let T_O denote the optimal dissimilarity tree and assume it contains n leaves (i.e., data points). Then, for any ε > 0, there exists a tree T^D such that (i) T^D contains O(1/ε²) internal nodes, each with at most 3ε²n children, and (ii) dis(T^D) ≥ (1 − ε) dis(T_O).

In order to obtain T^D given a binary tree T, we use K(T) (as defined in Section 3). We then convert K(T) to T^D by randomly partitioning each contracted node's data points into 1/ε clusters and attaching them in a "comb"-like structure. The process is defined in Algorithm 4 (see Figure 3 for an example), and a code sketch follows it.

Algorithm 4: Algorithm to convert T to T^D.
  K(T) ← Algorithm 1 applied to T.
  for each node c ∈ K(T) and its data points D_c do
    Partition D_c into 1/ε random sets of equal sizes, P = {P_1, ..., P_{1/ε}}.
    for P_i ∈ P do
      Create a new auxiliary node u_i.
      Attach P_i as u_i's children.
      Create a new node ℓ_i, and attach it between c and its parent.
      Attach u_i as ℓ_i's child.
  Return the resulting tree as T^D.

Note that D_c = ∅ if c is the root (since the root is blue) and therefore ℓ_i is indeed only defined for c's that have a parent. Also note that, as in Remark 3.1, T^D remains binary if we disregard the auxiliary nodes.
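The comb gadget of Algorithm 4 can be sketched as follows (the adjacency-dict encoding and node naming are ours; shuffling followed by strided slicing realizes the random equal-size partition):

```python
import random

def comb(parent_children, parent, c, D_c, eps, rng=random):
    """Replace the parent->c edge by a spine l_1, ..., l_m (m ~ 1/eps); each l_i
    carries an auxiliary node u_i whose children are the i-th random group of D_c.
    In the intended regime |D_c| <= 3*eps*n, so each group has <= 3*eps^2*n points."""
    m = max(1, round(1 / eps))
    pts = list(D_c)
    rng.shuffle(pts)
    groups = [pts[i::m] for i in range(m)]      # random near-equal partition
    parent_children[parent].remove(c)
    prev = parent
    for i, g in enumerate(groups):
        li, ui = ("l", c, i), ("u", c, i)       # fresh spine + auxiliary node ids
        parent_children.setdefault(prev, []).append(li)
        parent_children[li] = [ui]
        parent_children[ui] = [("leaf", p) for p in g]
        prev = li
    parent_children[prev].append(c)             # c keeps its own subtree below
    return parent_children
```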
Next we show that T^D is of constant size and that |T^D_{ij}| is (approximately) lower bounded.

Lemma 4.2. T^D contains O(1/ε²) and at least Ω(1/ε) internal nodes, each with at most 3ε²n children.

Lemma 4.3. The resulting tree T^D guarantees, in expectation, |T^D_{ij}| ≥ (1 − ε)|T_{ij}| − 6εn.

We defer the proofs of Lemmas 4.2 and 4.3 to the Appendix. Finally, combining Lemmas 4.2 and 4.3 for T = T_O with Fact 2.2 is enough to prove Theorem 4.1. (For the formal proof, see the Appendix.)

In this section we consider the problem of finding an optimal dissimilarity tree in instances with weights that are not all small and present an Efficient-PRAS. As in the revenue case, we again show that this is the best one could hope for, and complement our result by showing that the problem is NP-complete and thus does not admit an optimal, polynomial solution (see Theorem 5.2 in the Appendix).

Let ε > 0 and let T^D_ε denote the tree guaranteed by Theorem 4.1 for ε. As in the revenue case, for an internal node i of T^D_ε, let D_i denote the set of data points that are i's children and let W_{ij} denote the total weight of the set of (dissimilarity) edges crossing between D_i and D_j. Therefore,

dis(T^D_ε) = \sum_{i,j ∈ S} W_{ij} \big( \sum_{ℓ ∈ S_{ij}} |D_ℓ| \big) + b,

where S denotes the internal nodes of T^D_ε's sketch, the inner sum is over all sets D_ℓ contained in T_{ij} (as defined by T^D_ε's sketch), and b is the dissimilarity gained by pairs of data points within the same "star" structure. Theorem 4.1 guarantees that each |D_i| is small; therefore, since our instance has weights that are not all small (and by Fact 2.2 the optimal solution is large), this dissimilarity is negligible and we may assume b = 0, since we already lose a factor of 1 − ε. Finally, recall that |S| ≤ k.

Our Efficient-PRAS follows as in the revenue case and is therefore deferred to the Appendix (Algorithm 7). The following theorem is proven identically to the revenue case and its proof is therefore omitted.

Theorem 4.4. Algorithm 7 is an Efficient-PRAS for dissimilarity instances with weights that are not all small.

HARDNESS

When considering instances with weights that are not all small, we have so far only shown Efficient-PRAS's. To complement our results, we show that we cannot hope for optimal, polynomial algorithms, assuming the Small Set Expansion (SSE) hypothesis. (For a formal definition of SSE see Charikar and Chatziafratis [2017].) In fact, it is enough to show that these objectives are NP-complete assuming the instances are (1) unweighted and (2) dense (i.e., \sum_{i<j} w_{ij} = Ω(n²)). The following theorems are proven in the Appendix.

Theorem 5.1. The Revenue objective for dense instances is in NPC (assuming SSE).

Theorem 5.2. The Dissimilarity objective for dense instances is in NPC (assuming SSE).

Theorem 5.3. The HCC± objective is in NPC (assuming SSE).

HIERARCHICAL CORRELATION CLUSTERING

In this section we consider the case where the collected data may contain both similarity and dissimilarity information. We first show a worst-case approximation and thereafter show an Efficient-PRAS for HCC±.

Here we consider two separate algorithms which, if combined properly, yield our approximation. The first is a simple greedy algorithm, whereas the second optimizes the Max-Uncut Bisection problem for its topmost cut and then continues with the greedy algorithm (a schematic of the combination appears below). We first show baseline guarantees of the greedy algorithm and then use the work of Alon et al. [2020] in order to obtain guarantees on the second algorithm with respect to the HCC objective. We defer the following proof to the appendix.
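The two algorithms just described, and their randomized combination, can be sketched as follows; max_uncut_bisection and greedy are assumed helpers (the greedy standing in for ALG_GRE, spelled out in the appendix).

```python
import random

def alg_mub(points, ws, wd, max_uncut_bisection, greedy):
    """Schematic of ALG_MUB: split once via Max-Uncut Bisection on the similarity
    weights, then finish each half greedily. max_uncut_bisection(points, ws) is
    assumed to return (L, R) with |L| = |R| = n/2."""
    L, R = max_uncut_bisection(points, ws)
    return (greedy(L, ws, wd), greedy(R, ws, wd))

def alg_combined(points, ws, wd, max_uncut_bisection, greedy, p=0.43):
    """Theorem 6.3's algorithm: run ALG_GRE with probability p, else ALG_MUB."""
    if random.random() < p:
        return greedy(points, ws, wd)
    return alg_mub(points, ws, wd, max_uncut_bisection, greedy)
```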
Proposition 6.1. There exists a greedy algorithm, denoted by ALG_GRE, that returns an HC tree T guaranteeing

hcc(T) ≥ \frac{1}{3}(n − 2) \sum_{i<j} w^s_{ij} + \frac{2}{3} n \sum_{i<j} w^d_{ij}.

Denote by ALG_MUB the algorithm that generates an HC tree by first cutting according to Max-Uncut Bisection based on the similarity weights of the instance and then running ALG_GRE on each of the two resulting sides. Let OPT = OPT_s + OPT_d be the value of the optimum HCC tree, where OPT_s = \sum w^s_{ij}(n − |O_{ij}|) and OPT_d = \sum w^d_{ij} |O_{ij}|, defined such that |O_{ij}| denotes the number of leaves in the subtree rooted at the LCA of i and j in the tree of OPT.

Lemma 6.2. Let T denote the HC tree returned by ALG_MUB. Then

hcc_G(T) ≥ 0.585 · OPT_s + \frac{1}{3} · OPT_d.

Proof. The top split of T is a bisection, which means that |L| = |R| = n/2. For ease of notation let

W^s_L = \sum_{i,j ∈ L} w^s_{ij} and W^d_L = \sum_{i,j ∈ L} w^d_{ij},

and define W^s_R and W^d_R similarly. Notice that for the L side, the greedy step contributes at least \frac{2}{3} · \frac{n}{2} · W^d_L = \frac{n}{3} W^d_L to \sum w^d_{ij} |T_{ij}|, as per Proposition 6.1; similarly for the R side. This means that in the tree T, dissimilarity weight cut by the greedy step is counted with a coefficient of at least \frac{n}{3}, while weight cut at the top split of Max-Uncut Bisection is counted with coefficient n. In any case, we have

\sum w^d_{ij} |T_{ij}| ≥ \frac{n}{3} \sum w^d_{ij} ≥ \frac{1}{3} OPT_d,   (3)

by using the upper bound OPT_d ≤ n \sum w^d_{ij}.

We now deal with OPT_s. Observe that

\sum w^s_{ij} (n − |T_{ij}|) ≥ W^s_L \big( \frac{n}{2} + \frac{n}{6} \big) + W^s_R \big( \frac{n}{2} + \frac{n}{6} \big) ≥ \frac{2n}{3} (W^s_L + W^s_R),

since every edge within L contributes n/2 due to the bisection, plus (up to lower-order terms) an extra n/6 due to the greedy step; the same is true for edges in R. Finally, since we used the 0.8776-approximation algorithm for Max-Uncut Bisection, it follows directly from the analysis of Alon et al. [2020] that

\sum w^s_{ij} (n − |T_{ij}|) ≥ 0.585 · OPT_s.   (4)

The lemma follows by summing eqs. (3) and (4).

Finally, we combine Proposition 6.1 and Lemma 6.2 in order to yield the following theorem (whose proof is deferred to the appendix); a small numeric sketch of the balancing computation follows the theorem statement.

Theorem 6.3. Running ALG_GRE with probability p and otherwise ALG_MUB guarantees an approximation of 0.4767 for the HCC objective, when p = 0.43.
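The arithmetic behind Theorem 6.3 is easy to verify numerically; the sketch below (ours) mixes the two per-objective guarantees (1/3 and 2/3 for ALG_GRE, 0.585 and 1/3 for ALG_MUB) and searches for the best p:

```python
def mixed_guarantee(p):
    on_s = p * (1 / 3) + (1 - p) * 0.585    # coefficient of OPT_s
    on_d = p * (2 / 3) + (1 - p) * (1 / 3)  # coefficient of OPT_d
    return min(on_s, on_d)                  # worst case over the two terms

best_p = max((p / 1000 for p in range(1001)), key=mixed_guarantee)
print(best_p, mixed_guarantee(best_p))      # ~0.43 -> ~0.4767
```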
Here we consider the HCC± objective (as defined in the introduction) and show an Efficient-PRAS. We also complement our results and show that this problem is in fact NP-complete, and thus we cannot hope for an optimal, polynomial solution (see Theorem 5.3 in the Appendix). Let ALG± denote the algorithm that runs Algorithm 3 and Algorithm 7 simultaneously and returns the tree maximizing the HCC± objective. We prove that ALG± is in fact an Efficient-PRAS for the HCC± objective. We defer the theorem's proof to the appendix.

Theorem 6.4. ALG± is an Efficient-PRAS for the HCC± objective.

CONCLUSION

In this paper we show that to optimize the Rev-HC and Dis-HC objectives, it suffices to consider HC trees with constant-sized sketches, thereby greatly simplifying these problems. This result can be applied both in the heuristic setting (since it greatly reduces the range of optimal solutions that need to be considered) and in the approximation setting. Specifically, an approximation algorithm may iterate over all constant-sized trees; thereafter, it need only partition the data points into the leaves of the constant-sized tree - thus reducing our problem to the well-studied realm of graph partitioning problems.

We then consider the family of instances with weights that are not all small. We show Efficient-PRAS's for both the Rev-HC and Dis-HC objectives. Furthermore, we show that this family of instances encompasses many metric-based similarity instances. Finally, we introduce the HCC objective, which we hope will provide a better connection between the realms of correlation and hierarchical clustering. We show a worst-case approximation of 0.4767, and an Efficient-PRAS for the HCC± objective that leverages the algorithms we presented for the Rev-HC and Dis-HC objectives on instances with weights that are not all small.

Acknowledgments

The authors would like to deeply thank Claudio Gentile and Fabio Vitale for their helpful discussions and insights regarding the connection to metric-based similarity instances. We also thank Sara Ahmadian and Alessandro Epasto for interesting discussions during early stages of our work.

References

Sara Ahmadian, Vaggos Chatziafratis, Alessandro Epasto, Euiwoong Lee, Mohammad Mahdian, Konstantin Makarychev, and Grigory Yaroslavtsev. Bisect and conquer: Hierarchical clustering via max-uncut bisection. CoRR, abs/1912.06983, 2019.

Nir Ailon, Moses Charikar, and Alantha Newman. Aggregating inconsistent information: ranking and clustering. Journal of the ACM (JACM), 55(5):1-27, 2008.

Noga Alon, Yossi Azar, and Danny Vainstein. Hierarchical clustering: A 0.585 revenue approximation. In Jacob D. Abernethy and Shivani Agarwal, editors, Conference on Learning Theory, COLT 2020, 9-12 July 2020, Virtual Event [Graz, Austria], volume 125 of Proceedings of Machine Learning Research, pages 153-162. PMLR, 2020. URL http://proceedings.mlr.press/v125/alon20b.html.

U. Alon, N. Barkai, D. A. Notterman, K. Gish, S. Ybarra, D. Mack, and A. J. Levine. Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proceedings of the National Academy of Sciences, 96(12):6745-6750, 1999. ISSN 0027-8424. doi: 10.1073/pnas.96.12.6745.

Nikhil Bansal, Avrim Blum, and Shuchi Chawla. Correlation clustering. In FOCS, page 238, 2002.

Pavel Berkhin. A survey of clustering data mining techniques. Grouping Multidimensional Data, pages 25-71, 2006.

Francesco Bonchi, David Garcia-Soriano, and Edo Liberty. Correlation clustering: from theory to practice. In KDD, page 1972, 2014.

Peter F. Brown, Vincent J. Della Pietra, Peter V. de Souza, Jennifer C. Lai, and Robert L. Mercer. Class-based n-gram models of natural language. Computational Linguistics, 18(4):467-479, 1992.

Moses Charikar and Vaggos Chatziafratis. Approximate hierarchical clustering via sparsest cut and spreading metrics. In Proceedings of the Twenty-Eighth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2017, Barcelona, Spain, January 16-19, pages 841-854, 2017.

Moses Charikar, Venkatesan Guruswami, and Anthony Wirth. Clustering with qualitative information. Journal of Computer and System Sciences, 71(3):360-383, 2005.

Moses Charikar, Vaggos Chatziafratis, and Rad Niazadeh. Hierarchical clustering better than average-linkage. In Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2019, San Diego, California, USA, January 6-9, 2019, pages 2291-2304, 2019a.

Moses Charikar, Vaggos Chatziafratis, Rad Niazadeh, and Grigory Yaroslavtsev. Hierarchical clustering for euclidean data. In The 22nd International Conference on Artificial Intelligence and Statistics, AISTATS 2019, 16-18 April 2019, Naha, Okinawa, Japan, pages 2721-2730, 2019b. URL http://proceedings.mlr.press/v89/charikar19a.html.

Vaggos Chatziafratis, Neha Gupta, and Euiwoong Lee.
Inapproximability for local correlation clustering and dissimilarity hierarchical clustering. arXiv preprint arXiv:2010.01459, 2020. URL https://arxiv.org/abs/2010.01459.

Shuchi Chawla, Konstantin Makarychev, Tselil Schramm, and Grigory Yaroslavtsev. Near optimal LP rounding algorithm for correlation clustering on complete and complete k-partite graphs. In Proceedings of the Forty-Seventh Annual ACM Symposium on Theory of Computing, pages 219-228, 2015.

William Cohen and Jacob Richman. Learning to match and cluster entity names. In ACM SIGIR-2001 Workshop on Mathematical/Formal Methods in Information Retrieval, 2001.

William W. Cohen and Jacob Richman. Learning to match and cluster large high-dimensional data sets for data integration. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 475-480, 2002.

Vincent Cohen-Addad, Varun Kanade, Frederik Mallmann-Trenn, and Claire Mathieu. Hierarchical clustering: Objective functions and algorithms. In Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2018, New Orleans, LA, USA, January 7-10, 2018, pages 378-397, 2018.

Sanjoy Dasgupta. A cost function for similarity-based hierarchical clustering. In Proceedings of the 48th Annual ACM SIGACT Symposium on Theory of Computing, STOC 2016, Cambridge, MA, USA, June 18-21, 2016, pages 118-127, 2016.

Ahmed K. Elmagarmid, Panagiotis G. Ipeirotis, and Vassilios S. Verykios. Duplicate record detection: A survey. IEEE Transactions on Knowledge and Data Engineering, 19(1):1-16, 2006.

Michel X. Goemans and David P. Williamson. Improved approximation algorithms for maximum cut and satisfiability problems using semidefinite programming. J. ACM, 42(6):1115-1145, 1995.

Oded Goldreich, Shafi Goldwasser, and Dana Ron. Property testing and its connection to learning and approximation. J. ACM, 45(4):653-750, 1998.

N. Jardine and R. Sibson. A model for taxonomy. Mathematical Biosciences, 2(3-4):465-482, 1968.

Sungwoong Kim, Sebastian Nowozin, Pushmeet Kohli, and Chang D. Yoo. Higher-order correlation clustering for image segmentation. In Advances in Neural Information Processing Systems, pages 1530-1538, 2011.

Benjamin Moseley and Joshua Wang. Approximation bounds for hierarchical clustering: Average linkage, bisecting k-means, and local search. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, pages 3094-3103, 2017.

Anirudh Ramachandran, Nick Feamster, and Santosh Vempala. Filtering spam with behavioral blacklisting. In Proceedings of the 14th ACM Conference on Computer and Communications Security, pages 342-351, 2007.

Jinwook Seo and Ben Shneiderman. Interactively exploring hierarchical clustering results. IEEE Computer, 35(7):80-86, 2002. doi: 10.1109/MC.2002.1016905. URL https://doi.org/10.1109/MC.2002.1016905.

Peter H. A. Sneath and Robert R. Sokal. Numerical taxonomy. Nature, 193(4818):855-860, 1962.

Chaitanya Swamy. Correlation clustering: maximizing agreements via semidefinite programming. In J. Ian Munro, editor, Proceedings of the Fifteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2004, New Orleans, Louisiana, USA, January 11-14, 2004, pages 526-527. SIAM, 2004. URL http://dl.acm.org/citation.cfm?id=982792.982866.

A DEFERRED PROOFS OF SUBSECTION 3.1

Proof of Lemma 3.2. We first note that the removal of any edge creates two binary trees.
Next we show how to find an edge satisfying the rest of the properties. Given the rooted tree T, we travel down the tree from the root such that we always pick the child that contains more data points in its subtree (compared to the other child, if another child exists). We denote the i'th node along this path that contains exactly two children by u_i for i ∈ {1, 2, ...}. Furthermore, we denote the sets of data points contained by its two children by A_i and B_i such that |A_i| ≥ |B_i|.

Let k* := arg min_i {|B_1| + ··· + |B_i| ≥ n/3}. Since |A_{k*}| + |B_1| + ··· + |B_{k*}| ≤ n, we are guaranteed that |A_{k*}| ≤ 2n/3. On the other hand, since |A_{k*}| ≥ |B_{k*}| and |A_{k*}| + |B_{k*}| = n − (|B_1| + ··· + |B_{k*−1}|) > 2n/3 (by the minimality of k*), we are also guaranteed that |A_{k*}| ≥ n/3.

Therefore, removing the edge between u_{k*} and its child associated with A_{k*} guarantees that the resulting trees each have at least n/3 (and therefore at most 2n/3) data points.

B DEFERRED PROOFS AND DEFINITIONS OF SUBSECTION 3.2

Observation 3. Due to Fact 2.1, if we denote by T_O our optimal solution, then since our instance is (ρ, τ)-weighted we get rev(T_O) ≥ ρ₀τ₀ n³ for some smaller, yet still constant, ρ₀ and τ₀.

Proof of Lemma 3.7. Let T_alg denote the tree returned by Algorithm 3. Furthermore, denote by α_ℓ and β_ij the true values corresponding to T^R_ε, and by S̄_{ij} the set of sketch nodes lying outside the subtree rooted at the LCA of i and j (so that rev(T^R_ε) = \sum_{i ≤ j} \sum_{ℓ ∈ S̄_{ij}} α_ℓ β_{ij}). Then

rev(T_alg) ≥ \sum_{i ≤ j} \sum_{ℓ ∈ S̄_{ij}} (α_ℓ − nε² − nε_err)(β_{ij} − n²ε² − n²ε_err)
≥ \sum_{i ≤ j} \sum_{ℓ ∈ S̄_{ij}} α_ℓ β_{ij} − n³(ε² + ε_err) · k − n³(ε² + ε_err) · 20k
≥ rev(T^R_ε) − n³ (421ε + 21k ε_err),

where the first inequality follows since the grids contain values within nε² (resp. n²ε²) of the exact α_ℓ and β_{ij} and the tester adds at most nε_err (resp. n²ε_err) further error; the second follows since there are at most k sets in the partition and at most 20k relevant pairs (i, j), together with \sum_{i ≤ j} β_{ij} ≤ n² and \sum_ℓ α_ℓ ≤ n; and the last uses k ≤ 20/ε + 1.

Due to Observation 3, Theorem 3.1 and by choosing ε_err = ε², we get

rev(T_alg) ≥ rev(T^R_ε) − n³ · O(ε) ≥ rev(T^R_ε) − O(ε/(ρ₀τ₀)) · rev(T_O) ≥ (1 − O(ε) − O(ε/(ρ₀τ₀))) rev(T_O).

Thus by choosing ε small enough, we get the desired result.

C DEFERRED PROOFS OF SUBSECTION 4.1

Proof of Lemma 4.2. Consider the proof of Lemma 3.5. The only difference between T^R and T^D (with respect to the number of their internal nodes) is that in T^D the contracted nodes are multiplied by 1/ε (and therefore the auxiliary nodes as well). Thus, clearly the lemma holds.

Proof of Lemma 4.3. In order to prove the lemma we consider the following observations. The first is Observation 1, which holds here as well. The second is the following.
Observation 4. Consider any two data points i and j that are contained in the same contracted node in K(T_O), and further assume that they end up under different auxiliary nodes. Then any descendant of the corresponding contracted node (in K(T_O)) is contained in T^D_{ij}.

Consider two data points i and j in T_O and consider some k ∈ T^O_{ij}. As before, we denote the LCAs by v_{ik}, v_{jk} and v_{ij}, and assume without loss of generality that i is clustered first with k, so that v_{ij} = v_{kj}.

We would like to bound the number of k's for which k ∉ T^D_{ij}. As before, let {T^{B∪G}_ℓ} denote the set of trees defined by T_O − (B ∪ G) and let T^{B∪G}_i (resp. T^{B∪G}_j and T^{B∪G}_k) denote the tree in T_O − (B ∪ G) containing i (resp. j and k). If k ∈ T^{B∪G}_i or k ∈ T^{B∪G}_j then, since the number of data points contained in these two trees is at most 6εn in total, we may disregard such k's and incur an additive loss of 6εn. Therefore, we assume k ∉ T^{B∪G}_i and k ∉ T^{B∪G}_j.

Thus, we split into the following cases. The first is the case where v_{jk} is green/blue. Otherwise, v_{jk} has at most one child with a blue descendant; it cannot be the child containing j, since that would mean that k ∈ T^{B∪G}_i. Thus, we may only consider the following final cases: either there exists a green/blue node on the path v_{ik} → v_{ij}, or there exists a green/blue node both on the path k → v_{ik} and on the path i → v_{ik} (since k ∉ T^{B∪G}_i). Otherwise, there exists a green/blue node on the path k → v_{ik} and not on the path i → j.

We prove our lemma for each of these cases.

1. v_{jk} is green/blue: Due to Observation 1 we are guaranteed that k ∈ T^D_{ij}.

2. There exists a green/blue node on the path v_{ik} → v_{ij}: Due to Observation 1 we are guaranteed that k ∈ T^D_{ij}.

3. There exists a green/blue node both on the path k → v_{ik} and on the path i → v_{ik}: In this case v_{ik} is green/blue and therefore, again due to Observation 1, we are guaranteed that k ∈ T^D_{ij}.

4. There exists a green/blue node on the path k → v_{ik} and not on the path i → j: In this case i and j are in the same contracted node in K(T_O). If they end up under different auxiliary nodes, then by Observation 4, k ∈ T^D_{ij}. Since we partitioned the data points in the contracted nodes randomly (under the restriction that the sets are of equal size), the probability that i and j end up under different auxiliary nodes is at least 1 − ε.

Thus, in any case, E[|T^D_{ij}|] ≥ (1 − ε)|T^O_{ij}| − 6εn.

Proof of Theorem 4.1. Lemma 4.2 guarantees the first part. For the second part, denote by T_O the optimal solution and note that T_O may be taken to be binary. Due to Lemma 4.3 and Fact 2.2, we get

E[dis(T^D)] = \sum_{i<j} w_{ij} E[|T^D_{ij}|] ≥ (1 − ε) \sum_{i<j} w_{ij} |T^O_{ij}| − 6εn \sum_{i<j} w_{ij} ≥ (1 − ε) dis(T_O) − 9ε · dis(T_O) = (1 − 10ε) dis(T_O),

where the second inequality uses Fact 2.2 (i.e., n \sum_{i<j} w_{ij} ≤ \frac{3}{2} dis(T_O)). Rescaling ε, we get the desired result.

D DEFERRED ALGORITHMS OF SUBSECTION 4.2

Algorithm 7: Efficient-PRAS for the dense dissimilarity case.
  Enumerate over all trees T with k internal nodes.
  for each such T do
    for {α_i}_{i ≤ k} ⊂ {i·ε²·n : i ∈ N, i ≤ 1/ε²} do
      for {β_{ij}}_{i ≤ k, j ≤ k} ⊂ {i·ε²·n² : i ∈ N, i ≤ 1/ε²} do
        Run P_T({α_i}, {β_{ij}}, ε_err = ε², δ).
        Compute the dissimilarity based on T and P_T's output.
  Return the maximal dissimilarity tree encountered.

E DEFERRED PROOFS OF SECTION 6

Proof of Proposition 6.1. For each vertex v ∈ V, our algorithm maintains a score s(v), initially set to zero. The algorithm removes the node of largest score at each step and recurses on the remaining vertices, hence producing a caterpillar tree (a tree in which every internal node has at least one leaf child).
A similar greedy strategy to the one described below can also produce a tree (not necessarily a caterpillar) in a bottom-up fashion by repeatedly merging node pairs. Notice that the algorithm is deterministic.

For every edge (i, j) of similarity weight w^s_{ij}, decrease s(i) and s(j) by \frac{n−2}{2} w^s_{ij}, and increase every other score s(k) by w^s_{ij}, where k ∈ V \ {i, j}. The intuition behind these assignments is that for a pair i, j of similarity w^s_{ij}, whenever we remove another node k first, k's contribution to the hcc objective increases by w^s_{ij}, as k lies outside of the lowest common ancestor of i, j. Similarly, for every edge (i, j) of dissimilarity w^d_{ij}, we increase s(i) and s(j) by \frac{n−2}{2} w^d_{ij}, and decrease every other score s(k) by w^d_{ij}, where k ∈ V \ {i, j}. Note that with these assignments \sum_v s(v) = 0.

Next, let u ∈ V have the largest score and V' = V \ {u}. Remove u and any adjacent edges from the graph, then recursively construct a tree T' with V' as its leaves (if |V'| = 2, just output the unique binary tree on the two nodes). The final output of the algorithm is a new tree T whose root has one child being u and the other child being the root of T'.

We now prove correctness. Let u be as above and let w^s_u = \sum_{(u,v)} w^s_{uv}, w^d_u = \sum_{(u,v)} w^d_{uv}, W^s = \sum_{(i,j)} w^s_{ij}, W^d = \sum_{(i,j)} w^d_{ij}. Notice that according to the scoring rule of our algorithm,

s(u) = (W^s − w^s_u) − \frac{n−2}{2} w^s_u − (W^d − w^d_u) + \frac{n−2}{2} w^d_u.

Note that, by induction, the tree T' on n − 1 leaves guarantees

hcc(T') ≥ \frac{n−3}{3}(W^s − w^s_u) + \frac{2(n−1)}{3}(W^d − w^d_u).   (5)

Since \sum_v s(v) = 0 and u has the largest score, it follows that s(u) ≥ 0. Therefore,

(W^s − w^s_u) − (W^d − w^d_u) ≥ \frac{n−2}{2} w^s_u − \frac{n−2}{2} w^d_u.

Let hcc_u(T) be the contribution towards the hcc objective of node u in T, and observe that we can easily compute this quantity since u is removed first: hcc_u(T) = (W^s − w^s_u) + n w^d_u, as any dissimilarity edge (u, ·) has a lowest common ancestor subtree of size n, and for every similarity edge (i, j) with i, j ≠ u, the point u lies outside of T_{ij} (contributing 1 to its n − |T_{ij}| coefficient). Multiplying the displayed inequality by 2/3 and rearranging yields

hcc_u(T) = (W^s − w^s_u) + n w^d_u ≥ \frac{n−2}{3} W^s + \frac{2n}{3} W^d − \frac{n−3}{3}(W^s − w^s_u) − \frac{2(n−1)}{3}(W^d − w^d_u).   (6)

Summing up eqs. (5) and (6), and noting that hcc(T) = hcc_u(T) + hcc(T'), concludes the proof.
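For concreteness, here is a direct Python transcription of the greedy just described (our own encoding; O(n⁴) as written and intended only as an illustrative sketch, assuming at least two points):

```python
def greedy_caterpillar(V, ws, wd):
    """ALG_GRE sketch. ws/wd map frozenset({i, j}) -> similarity / dissimilarity
    weight. A similarity pair (i, j) lowers s(i), s(j) by (n-2)/2 * ws_ij and
    raises every other s(k) by ws_ij; dissimilarity does the mirror image."""
    V = list(V)
    order = []
    while len(V) > 2:
        n = len(V)
        s = {v: 0.0 for v in V}
        for a in range(n):
            for b in range(a + 1, n):
                i, j = V[a], V[b]
                key = frozenset((i, j))
                wsij, wdij = ws.get(key, 0.0), wd.get(key, 0.0)
                s[i] += (n - 2) / 2 * (wdij - wsij)
                s[j] += (n - 2) / 2 * (wdij - wsij)
                for k in V:
                    if k != i and k != j:
                        s[k] += wsij - wdij
        u = max(V, key=s.get)        # split off the largest-score vertex at the top
        order.append(u)
        V.remove(u)
    tree = (V[0], V[1])              # the last two points form the deepest cherry
    for u in reversed(order):
        tree = (u, tree)             # caterpillar: each removed u hangs off the spine
    return tree
```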
Proof of Theorem 6.4. There are two cases to consider: either $\sum_e w^d_e \geq \sum_e w^s_e$ or $\sum_e w^d_e \leq \sum_e w^s_e$. We first consider the case $\sum_e w^d_e \geq \sum_e w^s_e$ (the second is handled symmetrically). We rewrite the objective function for an HC tree $T$:
\begin{align*}
hcc^\pm(T) &= \sum_e w^d_e\,|T_e| + \sum_e w^s_e\,(n - |T_e|) \\
&= \sum_e w^d_e\,|T_e| + \sum_e (1 - w^d_e)(n - |T_e|) \\
&= 2\sum_e w^d_e\,|T_e| + \sum_e (n - |T_e|) - n\sum_e w^d_e \\
&= 2\sum_e w^d_e\,|T_e| + \frac{n}{3}\binom{n-1}{2} - n\sum_e w^d_e ,
\end{align*}
where the last equality follows from Fact 2.3. We first observe that a tree that maximizes the dissimilarity instance defined by the weights $w^d_e$ also maximizes the original $HCC^\pm$ objective. Let $O^d$ denote the tree maximizing the dissimilarity objective and let $O$ denote the tree maximizing the $HCC^\pm$ objective. By Theorem 4.4 we know that for any constant $\epsilon > 0$, our algorithm ($ALG$) generates dissimilarity of at least $(1-\epsilon)\sum_e w^d_e\,|O^d_e| = (1-\epsilon)\sum_e w^d_e\,|O_e|$. Therefore, for any $\epsilon > 0$:
\begin{align*}
hcc^\pm(ALG) &= 2\sum_e w^d_e\,|ALG_e| + \frac{n}{3}\binom{n-1}{2} - n\sum_e w^d_e \\
&\geq 2(1-\epsilon)\sum_e w^d_e\,|O_e| + \frac{n}{3}\binom{n-1}{2} - n\sum_e w^d_e \\
&= (1-\epsilon)\sum_e w^d_e\,|O_e| + (1-\epsilon)\sum_e w^d_e\,|O_e| + \frac{n}{3}\binom{n-1}{2} - n\sum_e w^d_e \\
&\geq (1-\epsilon) \Big( 2\sum_e w^d_e\,|O_e| + \frac{n}{3}\binom{n-1}{2} - n\sum_e w^d_e \Big) = (1-\epsilon)\, hcc^\pm(O) ,
\end{align*}
where the last inequality follows from Fact 2.2. The case $\sum_e w^d_e \leq \sum_e w^s_e$ is handled symmetrically (using Theorem 3.8 and Fact 2.1), which concludes the proof.

F HARDNESS RESULTS

Proof of Theorem 5.1. Note that the problem is clearly in NP (given a tree, its revenue can be verified efficiently); therefore we only need to show that it is NP-hard. Ahmadian et al. [2019] showed that the unweighted revenue case is APX-hard under the Small Set Expansion hypothesis. This in turn guarantees that the unweighted revenue problem is NP-hard assuming the Small Set Expansion hypothesis. Next we show how to reduce an unweighted revenue instance to a dense unweighted revenue instance (in polynomial time).

Roughly speaking, we simply add a disconnected clique of size $n$ to the original graph. Formally, let $G = (D, E_D, w)$ denote a general revenue instance with $D = \{d_1, \ldots, d_n\}$. We convert $G$ to a dense instance $G' = (V, E_V, w')$ simply by adding a clique of size $n$ (disconnected from $D$) with similarity weights of size 1. We denote this clique's set of nodes by $L = \{\ell_1, \ldots, \ell_n\}$. Therefore, $w'(\ell_i, \ell_j) = 1$, $w'(d_i, d_j) = w(d_i, d_j)$ and $w'(\ell_i, d_j) = 0$.

Clearly $G'$ is dense. Let $T'$ denote the optimal solution for $G'$. It is known that the optimal tree first cuts the disconnected components of $G'$. Therefore, there exists a node $u$ in $T'$ such that the subtree rooted at $u$ contains the entirety of $L$ and no data points from $D$. Since $D$ is disconnected from $L$, and due to the definition of the revenue objective, taking $u$ and moving it to the top of $T'$ (formally, if $r'$ is the root of $T'$, then we create a new root $r$ and attach $u$ and $r'$ as its immediate children) can only increase $T'$'s revenue. Thus, we may assume w.l.o.g. that the root of $T'$ already separates $L$ from $D$.

Let $v_D$ and $v_L$ denote the immediate children of $T'$'s root containing $D$ and $L$ respectively, and let $T'_D$ denote the subtree rooted at $v_D$. $T'_D$ is clearly optimal for the instance $G$ (since otherwise we could replace $T'_D$ with the optimal tree for $G$, thereby increasing $T'$'s revenue and contradicting its optimality).

Thus, we converted, in polynomial time, the optimal tree for $G'$ into the optimal tree for $G$, proving that the dense revenue problem is NP-hard.
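The construction in the proof above is easy to mechanize; the following is a minimal sketch that builds the dense instance $G'$ from $G$ (the function name and the dictionary representation of weights are illustrative assumptions, not from the paper).

    # Sketch of the densifying reduction from the proof of Theorem 5.1:
    # append a disconnected unit-weight clique of size n to a revenue
    # instance on nodes 0..n-1 (illustrative layout, not from the paper).
    from itertools import combinations

    def densify(n, w):
        """w maps pairs (i, j), i < j < n, to similarity weights; the result
        is an instance on 2n nodes where nodes n..2n-1 form the new clique."""
        w_dense = dict(w)                # original weights stay unchanged
        for i, j in combinations(range(n, 2 * n), 2):
            w_dense[(i, j)] = 1.0        # unit similarities inside the clique
        # cross pairs (d_i, l_j) are left absent, i.e., similarity 0
        return w_dense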
Definition 6. We say that an unweighted graph is complement-dense if its complement graph (i.e., the graph obtained by removing all existing edges and adding all missing edges) is dense.

Lemma F.1. The problem of finding a maximal revenue tree for revenue instances which are complement-dense is NP-complete (assuming the Small Set Expansion hypothesis).

Proof. Note that the problem is clearly in NP (given a tree, its revenue can be verified efficiently); therefore we only need to show that it is NP-hard. As in Theorem 5.1, we reduce an unweighted revenue instance to a complement-dense unweighted revenue instance. Specifically, we do this by adding a disconnected path on $n$ nodes to the original graph. Formally, let $G = (D, E_D, w)$ denote a general revenue instance with $D = \{d_1, \ldots, d_n\}$. We convert $G$ to a complement-dense instance $G' = (V, E_V, w')$ simply by adding a path of size $n$ (disconnected from $D$) with similarity weights of size 1. We denote this path's set of nodes by $L = \{\ell_1, \ldots, \ell_n\}$. Therefore, $w'(\ell_i, \ell_{i+1}) = 1$, $w'(d_i, d_j) = w(d_i, d_j)$ and $w'(\ell_i, d_j) = 0$. Note that $G'$ is clearly complement-dense.

As in the proof of Theorem 5.1, there exists a node $u$ in the optimal solution $T'$ of $G'$ such that the subtree rooted at $u$ contains the entirety of $L$ and no data points from $D$. Again, we may move $u$ and its subtree to the root of $T'$, thereby only increasing the revenue. Thus, given $T'$, we may take its child that contains $D$ as our optimal tree for $G$.

Observation 5. Since the problem of finding a minimal (Dasgupta) cost tree is the dual problem of the revenue problem, the unweighted, complement-dense Dasgupta cost problem is NP-complete (assuming the Small Set Expansion hypothesis).

Proof of Theorem 5.2. Note that the problem is clearly in NP (given a tree, its dissimilarity can be verified efficiently); therefore we only need to show that it is NP-hard. We do this by reducing the unweighted, complement-dense Dasgupta cost problem to this problem.

Roughly speaking, we simply consider the complement graph of the HC instance. Formally, given a complement-dense HC instance $G = (V, E, w)$, we define its complement as $G^c = (V^c, E^c, w^c)$, so that for any edge $e$, $w^c(e) = 1 - w(e)$. Thus,
\[ \min_T cost_G(T) = \min_T \sum_e w(e)\,|T_e| = \min_T \sum_e (1 - w^c(e))\,|T_e| . \]
Dasgupta [2016] proved that for any binary tree $T$ and for any HC instance which is a clique $H$, its cost is fixed and $cost_H(T) = \frac{1}{3}\big(|V(H)|^3 - |V(H)|\big)$. Since the optimal tree for this cost function is in fact binary, we get
\[ \min_T \sum_e (1 - w^c(e))\,|T_e| = \frac{1}{3}\big(|V(G)|^3 - |V(G)|\big) - \max_T \sum_e w^c(e)\,|T_e| . \]
Since $w$ defines a complement-dense instance, $w^c$ defines a dense instance. Thus, we reduced our original problem to $\max_T \sum_e w^c(e)\,|T_e|$ such that $w^c$ is dense, thereby completing the proof.

Proof of Theorem 5.3. The theorem is proven simply by rewriting the $HCC^\pm$