Max-Sum Diversification, Monotone Submodular Functions and Semi-metric Spaces

Abstract

In many applications, such as web-based search, document summarization, and facility location, the returned results should form a subset of documents that is both representative and diversified. The goal of this study is to select a good-"quality", bounded-size subset of a given set of items while maintaining their diversity relative to a semi-metric distance function. This problem was first studied by Borodin et al. [1], but a crucial property used throughout their proof is the triangle inequality. In this modified proof, we relax the triangle inequality and relate the approximation ratio of the max-sum diversification problem to the parameter of the relaxed triangle inequality, both in the normal form of the problem (i.e., a uniform matroid) and for an arbitrary matroid.


arXiv [cs.LG], Nov

Sepehr Abbasi Zadeh and Mehrdad Ghadiri
{sabbasizadeh, ghadiri}@ce.sharif.edu
School of Computer Engineering, Sharif University of Technology, Iran


Introduction

In many search applications, the search engine must guess the correct results for a given query; it is therefore important to deliver a diversified and representative set of documents to the user. Diversification can be viewed as a trade-off between having more relevant results and having more diverse results among the top results for a given query [3]. "Jaguar" is a cliché example in the diversification literature [2, 4, 9], but it illustrates the point perfectly, as it has several meanings, including a car, an animal, and a football team. A set of good-"quality" results should cover all of these meanings. The paper by Borodin et al. [1] measures the quality of the results with a monotone submodular function and defines diversity as the sum of distances between the selected objects. Since they assume the distances to be metric, they ask in their conclusion section:

For a relaxed version of the triangle inequality can we relate the approximation ratio to the parameter of a relaxed triangle inequality?

In this study we answer this question. We call a distance that satisfies this relaxed triangle inequality a semi-metric. A semi-metric distance on a set of items is just like a metric distance, but the triangle inequality is relaxed with a parameter $\alpha \ge 1$ (i.e., $d(u, v) \le \alpha\,(d(v, w) + d(w, u))$). Answering this question makes the method applicable to algorithms that are defined on semi-metric spaces, e.g., [5, 7, 8]. IBM's Query by Image Content system is another well-known example of semi-metric usage in practice: the distance it uses does not satisfy the triangle inequality [6]. By modifying the analysis of the algorithms proposed in [1], we show that these algorithms still achieve a $2\alpha$-approximation when there is no matroid constraint and a $2\alpha^2$-approximation for an arbitrary matroid constraint. In other words, the modified analyses generalize the previous ones, as they are consistent with the previous approximation ratios for $\alpha = 1$ (i.e., a metric distance).

Problem 1. Max-Sum Diversification

Let $U$ be the underlying ground set, and let $d(\cdot, \cdot)$ be a semi-metric distance function on $U$. The goal of the problem is to find a subset $S \subseteq U$ that

$$\text{maximizes } f(S) + \lambda \sum_{\{u,v\} : u,v \in S} d(u, v) \quad \text{subject to } |S| = p,$$

where $p$ is a given constant and $\lambda$ is a parameter specifying a trade-off between the distance and the submodular function. We give a $2\alpha$-approximation for this problem.

We first introduce our notation, following [1]. For any $S \subseteq U$, we let $d(S) = \sum_{\{u,v\} : u,v \in S} d(u, v)$. For any two disjoint sets $S$ and $T$, we also define $d(S, T) = d(S \cup T) - d(S) - d(T)$. Let $\phi(S)$ be the value of the objective function and let $u$ be an element of $U - S$. We define the marginal gain of the distance function as $d_u(S) = \sum_{v \in S} d(u, v)$ and, similarly, the marginal gain of the weight function as $f_u(S) = f(S + u) - f(S)$. The total marginal gain is then $\phi_u(S) = f_u(S) + \lambda d_u(S)$. Let

$$f'_u(S) = \tfrac{1}{2} f_u(S), \qquad \phi'_u(S) = f'_u(S) + \lambda d_u(S).$$

Starting with an empty set $S$, the greedy algorithm (Algorithm 1) adds in each iteration an element $u$ from $U - S$ that maximizes $\phi'_u(S)$.

Lemma 1.

Given an $\alpha$-relaxed triangle inequality semi-metric distance function $d(\cdot, \cdot)$ and two disjoint sets $X$ and $Y$, we have the following inequality:

$$\alpha\,(|X| - 1)\, d(X, Y) \ge |Y|\, d(X).$$

Algorithm 1

Greedy algorithm
  Input
    $U$: set of ground elements
    $p$: size of final set
  Output
    $S$: set of selected elements with size $p$
  $S = \emptyset$
  while $|S| < p$ do
    find $u \in U \setminus S$ maximizing $\phi'_u(S)$
    $S = S \cup \{u\}$
  end while
  return $S$

Proof.

Consider $u, v \in X$ and an arbitrary $w \in Y$. By the relaxed triangle inequality,

$$\alpha\,(d(v, w) + d(w, u)) \ge d(u, v).$$

Summing over all $w \in Y$, we get

$$\alpha\,(d(\{v\}, Y) + d(\{u\}, Y)) \ge |Y|\, d(u, v),$$

and summing over all pairs $\{u, v\} \subseteq X$, in which each element of $X$ appears $|X| - 1$ times, gives

$$\alpha\,(|X| - 1)\, d(X, Y) \ge |Y|\, d(X).$$

Theorem 1.

Algorithm 1 achieves a $2\alpha$-approximation for solving Problem 1 with an $\alpha$-relaxed distance $d(\cdot, \cdot)$ and a monotone submodular function $f$.

Proof. Let $G_i$ be the greedy solution at the end of step $i$, $i < p$, and let $G$ be the greedy solution at the end of the algorithm. Suppose that $O$ is the optimal solution, and let $A = O \cap G_i$, $B = G_i \setminus A$ and $C = O \setminus A$. Obviously the algorithm achieves the optimal solution when $p = 1$; thus we assume $p > 1$. We now consider two different cases: $|C| = 1$ and $|C| > 1$.

If $|C| = 1$, then $i = p - 1$. Let $C = \{v\}$ and let $u$ be the element that the algorithm takes in the next (last) step. Since $u$ maximizes $\phi'$ over $U \setminus G_i$, we have

$$\phi'_u(G_i) \ge \phi'_v(G_i), \quad\text{i.e.,}\quad f'_u(G_i) + \lambda d_u(G_i) \ge f'_v(G_i) + \lambda d_v(G_i),$$

thus

$$\phi_u(G_i) = f_u(G_i) + \lambda d_u(G_i) \ge f'_u(G_i) + \lambda d_u(G_i) \ge f'_v(G_i) + \lambda d_v(G_i) \ge \tfrac{1}{2}\,\phi_v(G_i);$$

as a result, $\phi(G) \ge \tfrac{1}{2}\,\phi(O) \ge \tfrac{1}{2\alpha}\,\phi(O)$.

Now consider $|C| > 1$. By Lemma 1 we have the following inequalities:

$$\alpha\,(|C| - 1)\, d(B, C) \ge |B|\, d(C) \qquad (1)$$

$$\alpha\,(|C| - 1)\, d(A, C) \ge |A|\, d(C) \qquad (2)$$

$$\alpha\,(|A| - 1)\, d(A, C) \ge |C|\, d(A) \qquad (3)$$

$A$ and $C$ are two disjoint sets with $A \cup C = O$; thus

$$d(A, C) + d(A) + d(C) = d(O). \qquad (4)$$

We can assume that $p > 1$ and $|C| > 1$ (the algorithm is optimal for $p = 1$). The following multipliers are then applied to (1), (2), (3) and (4), respectively:

$$\frac{1}{\alpha(|C| - 1)}, \qquad \frac{|C| - |B|}{\alpha p(|C| - 1)}, \qquad \frac{i}{\alpha p(p - 1)}, \qquad \frac{i|C|}{\alpha p(p - 1)}.$$

Adding the results and using $|A| + |C| = p$ and $|A| + |B| = i$, we have

$$d(B, C) + d(A, C) - \frac{i|C|\left(1 - \frac{1}{\alpha}\right)}{p(p - 1)}\, d(A, C) - \frac{i|C|\,(p - |C|)}{\alpha p(p - 1)(|C| - 1)}\, d(C) \ge \frac{i|C|}{\alpha p(p - 1)}\, d(O).$$

Since $p > |C|$ and $\alpha \ge 1$, the two subtracted terms are non-negative, so

$$d(A, C) + d(B, C) \ge \frac{i|C|}{\alpha p(p - 1)}\, d(O);$$

thus, substituting $x$ for $\frac{1}{\alpha}$ (so that $0 < x \le 1$),

$$d(C, G_i) \ge \frac{x\, i\, |C|}{p(p - 1)}\, d(O).$$

By the submodularity of $f'(\cdot)$ we get

$$\sum_{v \in C} f'_v(G_i) \ge f'(C \cup G_i) - f'(G_i);$$

also, the monotonicity of $f'(\cdot)$ gives

$$f'(C \cup G_i) - f'(G_i) \ge f'(O) - f'(G).$$

Consequently,

$$\sum_{v \in C} f'_v(G_i) \ge f'(O) - f'(G).$$

Therefore

$$\sum_{v \in C} \phi'_v(G_i) = \sum_{v \in C} \left[f'_v(G_i) + \lambda d(\{v\}, G_i)\right] = \sum_{v \in C} f'_v(G_i) + \lambda d(C, G_i) \ge \left[f'(O) - f'(G)\right] + \frac{\lambda x\, i\, |C|}{p(p - 1)}\, d(O).$$

Let $u_{i+1}$ be the element taken at step $(i + 1)$; then

$$\phi'_{u_{i+1}}(G_i) \ge \frac{1}{p}\left[f'(O) - f'(G)\right] + \frac{\lambda x\, i}{p(p - 1)}\, d(O).$$

Summing over all $i$ from $0$ to $p - 1$, we have

$$\phi'(G) = \sum_{i=0}^{p-1} \phi'_{u_{i+1}}(G_i) \ge \left[f'(O) - f'(G)\right] + \frac{\lambda x}{2}\, d(O),$$

that is,

$$f'(G) + \lambda d(G) \ge f'(O) - f'(G) + \frac{\lambda x}{2}\, d(O),$$

and, since $f' = \tfrac{1}{2} f$,

$$\phi(G) = f(G) + \lambda d(G) \ge \tfrac{1}{2}\left[f(O) + x\lambda d(O)\right] \ge \tfrac{x}{2}\left[f(O) + \lambda d(O)\right] = \tfrac{1}{2\alpha}\,\phi(O). \qquad ⊓⊔$$

Problem 2. Max-Sum Diversification for Matroids

Let $U$ be the underlying ground set, and let $\mathcal{F}$ be the set of independent subsets of $U$ such that $M = \langle U, \mathcal{F} \rangle$ is a matroid. Let $d(\cdot, \cdot)$ be a semi-metric distance function on $U$ and $f(\cdot)$ be a non-negative monotone submodular set function measuring the weight of the subsets of $U$. This problem aims to find a subset $S \in \mathcal{F}$ that

$$\text{maximizes } f(S) + \lambda \sum_{\{u,v\} : u,v \in S} d(u, v),$$

where $\lambda$ is a parameter specifying a trade-off between the two objectives. Again, $\phi(S)$ is the value of the objective function. Because $\phi(\cdot)$ is monotone, $S$ should be a basis of the matroid $M$. We give a $2\alpha^2$-approximation for this problem.
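To make the objective $\phi$ and the greedy rule of Algorithm 1 concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes a toy instance invented for illustration: points on a line with squared Euclidean distance (which satisfies the relaxed triangle inequality with $\alpha = 2$) and a coverage function as the monotone submodular $f$. The names `objective`, `greedy`, `points`, and `topics` are hypothetical.

```python
from itertools import combinations


def pairwise_distance_sum(S, d):
    """d(S): sum of d(u, v) over all unordered pairs {u, v} in S."""
    return sum(d(u, v) for u, v in combinations(S, 2))


def objective(S, f, d, lam):
    """phi(S) = f(S) + lambda * d(S)."""
    return f(S) + lam * pairwise_distance_sum(S, d)


def greedy(U, p, f, d, lam):
    """Algorithm 1: add the element maximizing phi'_u(S) = f_u(S)/2 + lambda*d_u(S)."""
    S = []
    while len(S) < p:
        def gain(u):
            f_u = f(S + [u]) - f(S)           # marginal gain of f
            d_u = sum(d(u, v) for v in S)     # marginal gain of the distance
            return 0.5 * f_u + lam * d_u
        candidates = [u for u in U if u not in S]
        S.append(max(candidates, key=gain))
    return S


# Hypothetical toy instance (assumption, not from the paper).
points = {"a": 0.0, "b": 1.0, "c": 2.0, "d": 5.0}
topics = {"a": {1}, "b": {1, 2}, "c": {3}, "d": {2}}


def sq_dist(u, v):
    """Squared Euclidean distance: a semi-metric with alpha = 2."""
    return (points[u] - points[v]) ** 2


def coverage(S):
    """Number of topics covered by S (monotone and submodular)."""
    covered = set()
    for u in S:
        covered |= topics[u]
    return float(len(covered))


p, lam, alpha = 2, 1.0, 2.0
selected = greedy(list(points), p, coverage, sq_dist, lam)
greedy_value = objective(selected, coverage, sq_dist, lam)

# Brute-force optimum, feasible only because the instance is tiny.
best = max((list(T) for T in combinations(points, p)),
           key=lambda T: objective(T, coverage, sq_dist, lam))
optimal_value = objective(best, coverage, sq_dist, lam)
```

On this instance the greedy set is `["b", "d"]` with value 18, against a brute-force optimum of 27 for `["a", "d"]`, which is within the factor $2\alpha = 4$ stated in Theorem 1.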

Without loss of generality, we assume that the rank of the matroid is greater than one. Let

$$\{x, y\} = \operatorname*{argmax}_{\{x,y\} \in \mathcal{F}} \left[f(\{x, y\}) + \lambda d(x, y)\right].$$

We now consider the following local search algorithm:

Algorithm 2

Local Search algorithm
  Input
    $U$: set of ground elements
    $M = \langle U, \mathcal{F} \rangle$: a matroid on $U$
    $S$: a basis of $M$ containing both $x$ and $y$
  Output
    $S$
  while there exist $u \in U - S$ and $v \in S$ such that $S + u - v \in \mathcal{F}$ and $\phi(S + u - v) > \phi(S)$ do
    $S = S + u - v$
  end while
  return $S$

Theorem 2.

Algorithm 2 achieves an approximation ratio of $2\alpha^2$ for max-sum diversification with a matroid constraint.

As the algorithm is optimal when the rank of the matroid is two, we assume that the rank of the matroid is greater than two. The notation is as before; $O$ and $S$ are the optimal solution and the solution at the end of the local search algorithm, respectively. Let $A = O \cap S$, $B = S - A$ and $C = O - A$. We utilize the following two lemmas from [1].

Lemma 2.

For any two sets $X, Y \in \mathcal{F}$ with $|X| = |Y|$, there is a bijective mapping $g : X \to Y$ such that $X - x + g(x) \in \mathcal{F}$ for any $x \in X$.

Since both $S$ and $O$ are bases of the matroid, they have the same cardinality; consequently, $B$ and $C$ have the same cardinality, too. Let $g : B \to C$ be the bijective mapping resulting from Lemma 2, such that $S - b + g(b) \in \mathcal{F}$ for any $b \in B$. Let $B = \{b_1, b_2, \ldots, b_t\}$, and let $c_i = g(b_i)$ for all $i$. As claimed before, since the algorithm is optimal for $t = 1$, we assume $t \ge 2$.

Lemma 3. $\sum_{i=1}^{t} f(S - b_i + c_i) \ge (t - 2)\, f(S) + f(O)$.

Now we prove two lemmas regarding our semi-metric distance function.
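Both lemmas use only the $\alpha$-relaxed triangle inequality. As a small numeric illustration (not part of the proof), the Python sketch below computes the smallest $\alpha$ that works on a finite point set and then evaluates both sides of the inequality of Lemma 4 for $t = 3$. The point sets `B` and `C`, the pairing by position, and the helper names are assumptions made for this example; the distance is squared Euclidean distance on the line.

```python
from itertools import combinations, permutations


def sq_dist(u, v):
    """Squared Euclidean distance on the real line."""
    return (u - v) ** 2


def smallest_alpha(points, d):
    """Smallest alpha with d(u, v) <= alpha * (d(v, w) + d(w, u)) on all distinct triples."""
    return max(d(u, v) / (d(v, w) + d(w, u))
               for u, v, w in permutations(points, 3))


def cross_distance(X, Y, d):
    """d(X, Y) for disjoint X and Y: sum of d(x, y) over x in X, y in Y."""
    return sum(d(x, y) for x in X for y in Y)


def set_distance(X, d):
    """d(X): sum of d(u, v) over unordered pairs in X."""
    return sum(d(u, v) for u, v in combinations(X, 2))


# Hypothetical data: b_i is paired with c_i by position, t = 3.
B = [0.0, 2.0, 4.0]
C = [1.0, 3.0, 5.0]

alpha = smallest_alpha(B + C, sq_dist)
matched = sum(sq_dist(b, c) for b, c in zip(B, C))
lhs = alpha * (cross_distance(B, C, sq_dist) - matched)   # alpha * (d(B,C) - sum d(b_i,c_i))
rhs = set_distance(C, sq_dist)                            # d(C)
```

Here `alpha` evaluates to 2 (attained when $w$ is the midpoint of $u$ and $v$), the left-hand side is 108 and the right-hand side is 24, so the inequality holds with room to spare on this instance.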

Lemma 4. If $t > 2$, then $\alpha \left(d(B, C) - \sum_{i=1}^{t} d(b_i, c_i)\right) \ge d(C)$.

Proof.

For any $b_i, c_j, c_k$, we have

$$\alpha\,(d(b_i, c_j) + d(b_i, c_k)) \ge d(c_j, c_k).$$

Summing up these inequalities over all $i, j, k$ with $i \ne j$, $i \ne k$, $j \ne k$, each $d(b_i, c_j)$ with $i \ne j$ is counted $(t - 2)$ times, and each $d(c_i, c_j)$ with $i \ne j$ is counted $(t - 2)$ times. Therefore

$$\alpha\,(t - 2)\left[d(B, C) - \sum_{i=1}^{t} d(b_i, c_i)\right] \ge (t - 2)\, d(C),$$

and the lemma follows.

Lemma 5. $\sum_{i=1}^{t} d(S - b_i + c_i) \ge (t - 2)\, d(S) + \frac{1}{\alpha^2}\, d(O)$.

Proof.

$$\begin{aligned}
\sum_{i=1}^{t} d(S - b_i + c_i) &= \sum_{i=1}^{t} \left[d(S) + d(c_i, S - b_i) - d(b_i, S - b_i)\right] \\
&= t\, d(S) + \sum_{i=1}^{t} d(c_i, S - b_i) - \sum_{i=1}^{t} d(b_i, S - b_i) \\
&= t\, d(S) + \sum_{i=1}^{t} d(c_i, S) - \sum_{i=1}^{t} d(c_i, b_i) - \sum_{i=1}^{t} d(b_i, S - b_i) \\
&= t\, d(S) + d(C, S) - \sum_{i=1}^{t} d(c_i, b_i) - d(A, B) - 2\, d(B).
\end{aligned}$$

There are two cases. If $t > 2$, then by Lemma 4

$$d(C, S) - \sum_{i=1}^{t} d(c_i, b_i) = d(A, C) + d(B, C) - \sum_{i=1}^{t} d(c_i, b_i) \ge d(A, C) + \frac{1}{\alpha}\, d(C).$$

We know that $d(S) = d(A) + d(B) + d(A, B)$; thus we have

$$2\, d(S) - d(A, B) - 2\, d(B) \ge d(A).$$

Therefore

$$\begin{aligned}
\sum_{i=1}^{t} d(S - b_i + c_i) &= t\, d(S) + d(C, S) - \sum_{i=1}^{t} d(c_i, b_i) - d(A, B) - 2\, d(B) \\
&\ge (t - 2)\, d(S) + d(A, C) + \frac{1}{\alpha}\, d(C) + d(A) \\
&\ge (t - 2)\, d(S) + \frac{1}{\alpha}\, d(O),
\end{aligned}$$

which is at least $(t - 2)\, d(S) + \frac{1}{\alpha^2}\, d(O)$ because $\alpha \ge 1$.

If $t = 2$, then since the rank of the matroid is greater than two, $A \ne \emptyset$. Let $z$ be an element in $A$; then we have

$$\begin{aligned}
2\, d(S) + d(C, S) - \sum_{i=1}^{t} d(c_i, b_i) - d(A, B) - 2\, d(B)
&= d(A, C) + d(B, C) - \sum_{i=1}^{t} d(c_i, b_i) + 2\, d(A) + d(A, B) \\
&\ge d(A, C) + d(c_1, b_2) + d(c_2, b_1) + d(A) + d(z, b_1) + d(z, b_2) \\
&\ge d(A, C) + d(A) + \frac{1}{\alpha}\, d(c_1, z) + \frac{1}{\alpha}\, d(c_2, z) \\
&\ge d(A, C) + d(A) + \frac{1}{\alpha^2}\, d(c_1, c_2) \\
&\ge \frac{1}{\alpha^2}\left(d(A, C) + d(A) + d(C)\right) \\
&\ge \frac{1}{\alpha^2}\, d(O).
\end{aligned}$$

Therefore

$$\sum_{i=1}^{t} d(S - b_i + c_i) = t\, d(S) + d(C, S) - \sum_{i=1}^{t} d(c_i, b_i) - d(A, B) - 2\, d(B) \ge (t - 2)\, d(S) + \frac{1}{\alpha^2}\, d(O).$$

This completes the proof.

Now we can complete the proof of Theorem 2.
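For reference while reading the proof, here is a minimal Python sketch of the local search of Algorithm 2. It is an illustration under stated assumptions, not the authors' code: the matroid is given by a hypothetical independence oracle `is_independent`, the example uses a partition matroid (at most one item per category) with a modular weight function and squared Euclidean distance ($\alpha = 2$), and all item data are invented.

```python
from itertools import combinations


def phi(S, f, d, lam):
    """Objective phi(S) = f(S) + lam * d(S)."""
    return f(S) + lam * sum(d(u, v) for u, v in combinations(sorted(S), 2))


def initial_basis(U, is_independent, f, d, lam):
    """Best independent pair {x, y}, greedily extended to a basis of the matroid."""
    pairs = [frozenset(pr) for pr in combinations(U, 2)
             if is_independent(frozenset(pr))]
    S = set(max(pairs, key=lambda P: phi(P, f, d, lam)))
    for u in U:
        if u not in S and is_independent(S | {u}):
            S.add(u)
    return S


def improving_swap(U, S, is_independent, f, d, lam):
    """Return S + u - v for some feasible, strictly improving swap, or None."""
    for u in U:
        if u in S:
            continue
        for v in sorted(S):
            T = (S - {v}) | {u}
            if is_independent(T) and phi(T, f, d, lam) > phi(S, f, d, lam):
                return T
    return None


def local_search(U, is_independent, f, d, lam):
    """Algorithm 2: swap one element at a time while phi strictly improves."""
    S = initial_basis(U, is_independent, f, d, lam)
    T = improving_swap(U, S, is_independent, f, d, lam)
    while T is not None:
        S = T
        T = improving_swap(U, S, is_independent, f, d, lam)
    return S


# Hypothetical instance: name -> (position, category, weight).
items = {
    "a": (0.0, 0, 1.0), "b": (1.0, 0, 2.0),
    "c": (2.0, 1, 1.0), "d": (6.0, 1, 1.0),
    "e": (3.0, 2, 1.0), "f": (3.5, 2, 3.0),
}


def is_independent(S):
    """Partition matroid oracle: at most one item per category."""
    cats = [items[u][1] for u in S]
    return len(cats) == len(set(cats))


def weight(S):
    """Modular (hence monotone submodular) weight function."""
    return sum(items[u][2] for u in S)


def sq_dist(u, v):
    """Squared Euclidean distance between item positions."""
    return (items[u][0] - items[v][0]) ** 2


result = local_search(list(items), is_independent, weight, sq_dist, lam=1.0)
result_value = phi(result, weight, sq_dist, 1.0)
```

On this instance the search starts from the basis `{a, d, e}` and stops after one swap at `{a, d, f}`, whose objective value is 59.5. The fixed iteration order (`sorted`) only makes the run reproducible; Algorithm 2 itself allows any improving swap.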

Proof.

Since $S$ is a locally optimal solution, we have $\phi(S) \ge \phi(S - b_i + c_i)$ for all $i$. Therefore, for all $i$ we have

$$f(S) + \lambda d(S) \ge f(S - b_i + c_i) + \lambda d(S - b_i + c_i).$$

Summing up over all $i$, we have

$$t f(S) + \lambda t\, d(S) \ge \sum_{i=1}^{t} f(S - b_i + c_i) + \lambda \sum_{i=1}^{t} d(S - b_i + c_i).$$

By Lemma 3 we know

$$t f(S) + \lambda t\, d(S) \ge (t - 2)\, f(S) + f(O) + \lambda \sum_{i=1}^{t} d(S - b_i + c_i).$$

Then by Lemma 5 we have

$$t f(S) + \lambda t\, d(S) \ge (t - 2)\, f(S) + f(O) + \lambda (t - 2)\, d(S) + \frac{\lambda}{\alpha^2}\, d(O).$$

Therefore,

$$2 f(S) + 2\lambda d(S) \ge f(O) + \frac{\lambda}{\alpha^2}\, d(O).$$

Since $\alpha \ge 1$,

$$2 f(S) + 2\lambda d(S) \ge f(O) + \frac{\lambda}{\alpha^2}\, d(O) \ge \frac{1}{\alpha^2}\,\phi(O), \quad\text{hence}\quad \phi(S) \ge \frac{1}{2\alpha^2}\,\phi(O). \qquad ⊓⊔$$

Conclusion

In this study we answer a question posed in [1] about the existence of a bound for the max-sum diversification problem with semi-metric distances: we give a $2\alpha$-approximation when there is no matroid constraint and a $2\alpha^2$-approximation for an arbitrary matroid constraint. One interesting open question is whether similar results can be proved for a non-monotone submodular function.

References

1. A. Borodin, H. C. Lee, and Y. Ye. Max-sum diversification, monotone submodular functions and dynamic updates. In Proceedings of the 31st symposium on Principles of Database Systems, pages 155–166. ACM, 2012.
2. C. Chekuri, M. H. Goldwasser, P. Raghavan, and E. Upfal. Web search using automatic classification. In Proceedings of the Sixth International Conference on the World Wide Web, 1997.
3. H. Chen and D. R. Karger. Less is more: probabilistic models for retrieving fewer relevant documents. In Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval, pages 429–436. ACM, 2006.
4. C. L. Clarke, M. Kolla, G. V. Cormack, O. Vechtomova, A. Ashkan, S. Büttcher, and I. MacKinnon. Novelty and diversity in information retrieval evaluation. In Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval, pages 659–666. ACM, 2008.
5. R. Fagin, R. Kumar, and D. Sivakumar. Comparing top k lists. SIAM Journal on Discrete Mathematics, 17(1):134–160, 2003.
6. R. Fagin and L. Stockmeyer. Relaxing the triangle inequality in pattern matching. International Journal of Computer Vision, 30(3):219–231, 1998.
7. L. O'Callaghan, A. Meyerson, R. Motwani, N. Mishra, and S. Guha. Streaming-data algorithms for high-quality clustering. In ICDE, page 0685. IEEE, 2002.
8. R. C. Veltkamp. Shape matching: Similarity measures and algorithms. In Shape Modeling and Applications, SMI 2001 International Conference on, pages 188–197. IEEE, 2001.
9. X. Wang and C. Zhai. Learn from web search logs to organize search results. In