Moses Charikar
Princeton University
Network
Latest external collaboration on country level. Dive into details by clicking on the dots.
Publication
Featured researches published by Moses Charikar.
symposium on the theory of computing | 2002
Moses Charikar
(MATH) A locality sensitive hashing scheme is a distribution on a family
international colloquium on automata languages and programming | 2002
Moses Charikar; Kevin C. Chen
\F
symposium on the theory of computing | 2000
Andrei Z. Broder; Moses Charikar; Alan M. Frieze; Michael Mitzenmacher
of hash functions operating on a collection of objects, such that for two objects <i>x,y</i>, <b>Pr</b><sub><i>h</i></sub>εF[<i>h</i>(<i>x</i>) = <i>h</i>(<i>y</i>)] = sim(<i>x,y</i>), where <i>sim</i>(<i>x,y</i>) ε [0,1] is some similarity function defined on the collection of objects. Such a scheme leads to a compact representation of objects so that similarity of objects can be estimated from their compact sketches, and also leads to efficient algorithms for approximate nearest neighbor search and clustering. Min-wise independent permutations provide an elegant construction of such a locality sensitive hashing scheme for a collection of subsets with the set similarity measure <i>sim</i>(<i>A,B</i>) = \frac{|A &Pgr; B|}{|A &Pgr B|}.(MATH) We show that rounding algorithms for LPs and SDPs used in the context of approximation algorithms can be viewed as locality sensitive hashing schemes for several interesting collections of objects. Based on this insight, we construct new locality sensitive hashing schemes for:<ol><li>A collection of vectors with the distance between → \over <i>u</i> and → \over <i>v</i> measured by Ø(→ \over <i>u</i>, → \over <i>v</i>)/π, where Ø(→ \over <i>u</i>, → \over <i>v</i>) is the angle between → \over <i>u</i>) and → \over <i>v</i>). This yields a sketching scheme for estimating the cosine similarity measure between two vectors, as well as a simple alternative to minwise independent permutations for estimating set similarity.</li><li>A collection of distributions on <i>n</i> points in a metric space, with distance between distributions measured by the Earth Mover Distance (<b>EMD</b>), (a popular distance measure in graphics and vision). Our hash functions map distributions to points in the metric space such that, for distributions <i>P</i> and <i>Q</i>, <b>EMD</b>(<i>P,Q</i>) &xie; <b>E</b><sub>hε\F</sub> [<i>d</i>(<i>h</i>(<i>P</i>),<i>h</i>(<i>Q</i>))] &xie; <i>O</i>(log <i>n</i> log log <i>n</i>). <b>EMD</b>(<i>P, Q</i>).</li></ol>.
symposium on the theory of computing | 2002
Moses Charikar; Sudipto Guha; Éva Tardos; David B. Shmoys
We present a 1-pass algorithm for estimating the most frequent items in a data stream using limited storage space. Our method relies on a data structure called a COUNT SKETCH, which allows us to reliably estimate the frequencies of frequent items in the stream. Our algorithm achieves better space bounds than the previously known best algorithms for this problem for several natural distributions on the item frequencies. In addition, our algorithm leads directly to a 2-pass algorithm for the problem of estimating the items with the largest (absolute) change in frequency between two data streams. To our knowledge, this latter problem has not been previously studied in the literature.
Journal of the ACM | 2008
Nir Ailon; Moses Charikar; Alantha Newman
We define and study the notion of min-wise independent families of permutations. We say that F?Sn (the symmetric group) is min-wise independent if for any set X?n and any x?X, when ? is chosen at random in F we havePr(min{?(X)}=?(x))=1|X| . In other words we require that all the elements of any fixed set X have an equal chance to become the minimum element of the image of X under ?. Our research was motivated by the fact that such a family (under some relaxations) is essential to the algorithm used in practice by the AltaVista web index software to detect and filter near-duplicate documents. However, in the course of our investigation we have discovered interesting and challenging theoretical questions related to this concept?we present the solutions to some of them and we list the rest as open problems.
SIAM Journal on Computing | 2004
Moses Charikar; Chandra Chekuri; Tomás Feder; Rajeev Motwani
We present the first constant-factor approximation algorithm for the metric k-median problem. The k-median problem is one of the most well-studied clustering problems, i.e., those problems in which the aim is to partition a given set of points into clusters so that the points within a cluster are relatively close with respect to some measure. For the metric k-median problem, we are given n points in a metric space. We select k of these to be cluster centers and then assign each point to its closest selected center. If point j is assigned to a center i, the cost incurred is proportional to the distance between i and j. The goal is to select the k centers that minimize the sum of the assignment costs. We give a 62/3-approximation algorithm for this problem. This improves upon the best previously known result of O(log k log log k), which was obtained by refining and derandomizing a randomized O(log n log log n)-approximation algorithm of Bartal.
foundations of computer science | 1999
Moses Charikar; Sudipto Guha
We address optimization problems in which we are given contradictory pieces of input information and the goal is to find a globally consistent solution that minimizes the extent of disagreement with the respective inputs. Specifically, the problems we address are rank aggregation, the feedback arc set problem on tournaments, and correlation and consensus clustering. We show that for all these problems (and various weighted versions of them), we can obtain improved approximation factors using essentially the same remarkably simple algorithm. Additionally, we almost settle a long-standing conjecture of Bang-Jensen and Thomassen and show that unless NP⊆BPP, there is no polynomial time algorithm for the problem of minimum feedback arc set in tournaments.
symposium on the theory of computing | 1997
Moses Charikar; Chandra Chekuri; Tomás Feder; Rajeev Motwani
Motivated by applications such as document and image classification in information retrieval, we consider the problem of clustering dynamic point sets in a metric space. We propose a model called incremental clustering which is based on a careful analysis of the requirements of the information retrieval application, and which should also be useful in other applications. The goal is to efficiently maintain clusters of small diameter as new points are inserted. We analyze several natural greedy algorithms and demonstrate that they perform poorly. We propose new deterministic and randomized incremental clustering algorithms which have a provably good performance, and which we believe should also perform well in practice. We complement our positive results with lower bounds on the performance of incremental algorithms. Finally, we consider the dual clustering problem where the clusters are of fixed diameter, and the goal is to minimize the number of clusters.
Journal of Algorithms | 1999
Moses Charikar; Chandra Chekuri; To yat Cheung; Zuo Dai; Ashish Goel; Sudipto Guha; Ming Li
We present improved combinatorial approximation algorithms for the uncapacitated facility location and k-median problems. Two central ideas in most of our results are cost scaling and greedy improvement. We present a simple greedy local search algorithm which achieves an approximation ratio of 2.414+/spl epsiv/ in O/spl tilde/(n/sup 2///spl epsiv/) time. This also yields a bicriteria approximation tradeoff of (1+/spl gamma/, 1+2//spl gamma/) for facility cost versus service cost which is better than previously known tradeoffs and close to the best possible. Combining greedy improvement and cost scaling with a recent primal dual algorithm for facility location due to K. Jain and V. Vazirani (1999), we get an approximation ratio of 1.853 in O/spl tilde/(n/sup 3/) time. This is already very close to the approximation guarantee of the best known algorithm which is LP-based. Further combined with the best known LP-based algorithm for facility location, we get a very slight improvement in the approximation factor for facility location, achieving 1.728. We present improved approximation algorithms for capacitated facility location and a variant. We also present a 4-approximation for the k-median problem, using similar ideas, building on the 6-approximation of Jain and Vazirani. The algorithm runs in O/spl tilde/(n/sup 3/) time.
IEEE Transactions on Information Theory | 2005
Moses Charikar; Eric Lehman; Ding Liu; Rina Panigrahy; Manoj Prabhakaran; Amit Sahai; Abhi Shelat
Motivated by applications such as document and image classification in information retrieval, we consider the problem of clustering dynamic point sets in a metric space. We propose a model called incremental clustering which is based on a careful analysis of the requirements of the information retrieval application, and which should also be useful in other applications. The goal is to efficiently maintain clusters of small diameter as new points are inserted. We analyze several natural greedy algorithms and demonstrate that they perform poorly. We propose new deterministic and randomized incremental clustering algorithms which have a provably good performance, and which we believe should also perform well in practice. We complement our positive results with lower bounds on the performance of incremental algorithms. Finally, we consider the dual clustering problem where the clusters are of fixed diameter, and the goal is to minimize the number of clusters.