Moses Charikar | Researchain

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Moses Charikar is active.

Explore More

Publication

Featured researches published by Moses Charikar.

symposium on the theory of computing | 2002

Similarity estimation techniques from rounding algorithms

Moses Charikar

(MATH) A locality sensitive hashing scheme is a distribution on a family

international colloquium on automata languages and programming | 2002

Finding Frequent Items in Data Streams

Moses Charikar; Kevin C. Chen

symposium on the theory of computing | 2000

Min-Wise Independent Permutations

Andrei Z. Broder; Moses Charikar; Alan M. Frieze; Michael Mitzenmacher

of hash functions operating on a collection of objects, such that for two objects x,y, PrhεF[h(x) = h(y)] = sim(x,y), where sim(x,y) ε [0,1] is some similarity function defined on the collection of objects. Such a scheme leads to a compact representation of objects so that similarity of objects can be estimated from their compact sketches, and also leads to efficient algorithms for approximate nearest neighbor search and clustering. Min-wise independent permutations provide an elegant construction of such a locality sensitive hashing scheme for a collection of subsets with the set similarity measure sim(A,B) = \frac{|A &Pgr; B|}{|A &Pgr B|}.(MATH) We show that rounding algorithms for LPs and SDPs used in the context of approximation algorithms can be viewed as locality sensitive hashing schemes for several interesting collections of objects. Based on this insight, we construct new locality sensitive hashing schemes for:<ol><li>A collection of vectors with the distance between → \over u and → \over v measured by Ø(→ \over u, → \over v)/π, where Ø(→ \over u, → \over v) is the angle between → \over u) and → \over v). This yields a sketching scheme for estimating the cosine similarity measure between two vectors, as well as a simple alternative to minwise independent permutations for estimating set similarity.</li><li>A collection of distributions on n points in a metric space, with distance between distributions measured by the Earth Mover Distance (EMD), (a popular distance measure in graphics and vision). Our hash functions map distributions to points in the metric space such that, for distributions P and Q, EMD(P,Q) &xie; Ehε\F [d(h(P),h(Q))] &xie; O(log n log log n). EMD(P, Q).</li></ol>.

symposium on the theory of computing | 2002

A constant-factor approximation algorithm for the k -median problem

Moses Charikar; Sudipto Guha; Éva Tardos; David B. Shmoys

We present a 1-pass algorithm for estimating the most frequent items in a data stream using limited storage space. Our method relies on a data structure called a COUNT SKETCH, which allows us to reliably estimate the frequencies of frequent items in the stream. Our algorithm achieves better space bounds than the previously known best algorithms for this problem for several natural distributions on the item frequencies. In addition, our algorithm leads directly to a 2-pass algorithm for the problem of estimating the items with the largest (absolute) change in frequency between two data streams. To our knowledge, this latter problem has not been previously studied in the literature.

Journal of the ACM | 2008

Aggregating inconsistent information: Ranking and clustering

Nir Ailon; Moses Charikar; Alantha Newman

We define and study the notion of min-wise independent families of permutations. We say that F?Sn (the symmetric group) is min-wise independent if for any set X?n and any x?X, when ? is chosen at random in F we havePr(min{?(X)}=?(x))=1|X| . In other words we require that all the elements of any fixed set X have an equal chance to become the minimum element of the image of X under ?. Our research was motivated by the fact that such a family (under some relaxations) is essential to the algorithm used in practice by the AltaVista web index software to detect and filter near-duplicate documents. However, in the course of our investigation we have discovered interesting and challenging theoretical questions related to this concept?we present the solutions to some of them and we list the rest as open problems.

SIAM Journal on Computing | 2004

Incremental Clustering and Dynamic Information Retrieval

Moses Charikar; Chandra Chekuri; Tomás Feder; Rajeev Motwani

We present the first constant-factor approximation algorithm for the metric k-median problem. The k-median problem is one of the most well-studied clustering problems, i.e., those problems in which the aim is to partition a given set of points into clusters so that the points within a cluster are relatively close with respect to some measure. For the metric k-median problem, we are given n points in a metric space. We select k of these to be cluster centers and then assign each point to its closest selected center. If point j is assigned to a center i, the cost incurred is proportional to the distance between i and j. The goal is to select the k centers that minimize the sum of the assignment costs. We give a 62/3-approximation algorithm for this problem. This improves upon the best previously known result of O(log k log log k), which was obtained by refining and derandomizing a randomized O(log n log log n)-approximation algorithm of Bartal.

foundations of computer science | 1999

Improved combinatorial algorithms for the facility location and k-median problems

Moses Charikar; Sudipto Guha

We address optimization problems in which we are given contradictory pieces of input information and the goal is to find a globally consistent solution that minimizes the extent of disagreement with the respective inputs. Specifically, the problems we address are rank aggregation, the feedback arc set problem on tournaments, and correlation and consensus clustering. We show that for all these problems (and various weighted versions of them), we can obtain improved approximation factors using essentially the same remarkably simple algorithm. Additionally, we almost settle a long-standing conjecture of Bang-Jensen and Thomassen and show that unless NP⊆BPP, there is no polynomial time algorithm for the problem of minimum feedback arc set in tournaments.

symposium on the theory of computing | 1997

Incremental clustering and dynamic information retrieval

Moses Charikar; Chandra Chekuri; Tomás Feder; Rajeev Motwani

Motivated by applications such as document and image classification in information retrieval, we consider the problem of clustering dynamic point sets in a metric space. We propose a model called incremental clustering which is based on a careful analysis of the requirements of the information retrieval application, and which should also be useful in other applications. The goal is to efficiently maintain clusters of small diameter as new points are inserted. We analyze several natural greedy algorithms and demonstrate that they perform poorly. We propose new deterministic and randomized incremental clustering algorithms which have a provably good performance, and which we believe should also perform well in practice. We complement our positive results with lower bounds on the performance of incremental algorithms. Finally, we consider the dual clustering problem where the clusters are of fixed diameter, and the goal is to minimize the number of clusters.

Journal of Algorithms | 1999

Approximation Algorithms for Directed Steiner Problems

Moses Charikar; Chandra Chekuri; To yat Cheung; Zuo Dai; Ashish Goel; Sudipto Guha; Ming Li

We present improved combinatorial approximation algorithms for the uncapacitated facility location and k-median problems. Two central ideas in most of our results are cost scaling and greedy improvement. We present a simple greedy local search algorithm which achieves an approximation ratio of 2.414+/spl epsiv/ in O/spl tilde/(n/sup 2///spl epsiv/) time. This also yields a bicriteria approximation tradeoff of (1+/spl gamma/, 1+2//spl gamma/) for facility cost versus service cost which is better than previously known tradeoffs and close to the best possible. Combining greedy improvement and cost scaling with a recent primal dual algorithm for facility location due to K. Jain and V. Vazirani (1999), we get an approximation ratio of 1.853 in O/spl tilde/(n/sup 3/) time. This is already very close to the approximation guarantee of the best known algorithm which is LP-based. Further combined with the best known LP-based algorithm for facility location, we get a very slight improvement in the approximation factor for facility location, achieving 1.728. We present improved approximation algorithms for capacitated facility location and a variant. We also present a 4-approximation for the k-median problem, using similar ideas, building on the 6-approximation of Jain and Vazirani. The algorithm runs in O/spl tilde/(n/sup 3/) time.

IEEE Transactions on Information Theory | 2005