Sudipto Guha | Researchain

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Sudipto Guha is active.

Explore More

Publication

Featured researches published by Sudipto Guha.

international conference on management of data | 1998

CURE: an efficient clustering algorithm for large databases

Sudipto Guha; Rajeev Rastogi; Kyuseok Shim

Clustering, in data mining, is useful for discovering groups and identifying interesting distributions in the underlying data. Traditional clustering algorithms either favor clusters with spherical shapes and similar sizes, or are very fragile in the presence of outliers. We propose a new clustering algorithm called CURE that is more robust to outliers, and identifies clusters having non-spherical shapes and wide variances in size. CURE achieves this by representing each cluster by a certain fixed number of points that are generated by selecting well scattered points from the cluster and then shrinking them toward the center of the cluster by a specified fraction. Having more than one representative point per cluster allows CURE to adjust well to the geometry of non-spherical shapes and the shrinking helps to dampen the effects of outliers. To handle large databases, CURE employs a combination of random sampling and partitioning. A random sample drawn from the data set is first partitioned and each partition is partially clustered. The partial clusters are then clustered in a second pass to yield the desired clusters. Our experimental results confirm that the quality of clusters produced by CURE is much better than those found by existing algorithms. Furthermore, they demonstrate that random sampling and partitioning enable CURE to not only outperform existing algorithms but also to scale well for large databases without sacrificing clustering quality.

IEEE Transactions on Knowledge and Data Engineering | 2003

Clustering data streams: Theory and practice

Sudipto Guha; Adam Meyerson; Nina Mishra; Rajeev Motwani; Liadan O'Callaghan

The data stream model has recently attracted attention for its applicability to numerous types of data, including telephone records, Web documents, and clickstreams. For analysis of such data, the ability to process the data in a single pass, or a small number of passes, while using little memory, is crucial. We describe such a streaming algorithm that effectively clusters large data streams. We also provide empirical evidence of the algorithms performance on synthetic and real data streams.

Information Systems | 2000

ROCK: a robust clustering algorithm for categorical attributes

Sudipto Guha; Rajeev Rastogi; Kyuseok Shim

Abstract Clustering, in data mining, is useful to discover distribution patterns in the underlying data. Clustering algorithms usually employ a distance metric based (e.g., euclidean) similarity measure in order to partition the database such that data points in the same partition are more similar than points in different partitions. In this paper, we study clustering algorithms for data with boolean and categorical attributes. We show that traditional clustering algorithms that use distances between points for clustering are not appropriate for boolean and categorical attributes. Instead, we propose a novel concept of links to measure the similarity/proximity between a pair of data points. We develop a robust hierarchical clustering algorithm ROCK that employs links and not distances when merging clusters. Our methods naturally extend to non-metric similarity measures that are relevant in situations where a domain expert/similarity table is the only source of knowledge. In addition to presenting detailed complexity results for ROCK, we also conduct an experimental study with real-life as well as synthetic data sets to demonstrate the effectiveness of our techniques. For data with categorical attributes, our findings indicate that ROCK not only generates better quality clusters than traditional algorithms, but it also exhibits good scalability properties.

symposium on discrete algorithms | 1998

Greedy strikes back: improved facility location algorithms

Sudipto Guha; Samir Khuller

A fundamental facility location problem is to choose the location of facilities, such as industrial plants and warehouses, to minimize the cost of satisfying the demand for some commodity. There are associated costs for locating the facilities, as well as transportation costs for distributing the commodities. We assume that the transportation costs form a metric. This problem is commonly referred to as theuncapacitated facility locationproblem. Application to bank account location and clustering, as well as many related pieces of work, are discussed by Cornuejols, Nemhauser, and Wolsey. Recently, the first constant factor approximation algorithm for this problem was obtained by Shmoys, Tardos, and Aardal. We show that a simple greedy heuristic combined with the algorithm by Shmoys, Tardos, and Aardal, can be used to obtain an approximation guarantee of 2.408. We discuss a few variants of the problem, demonstrating better approximation factors for restricted versions of the problem. We also show that the problem is max SNP-hard. However, the inapproximability constants derived from the max SNP hardness are very close to one. By relating this problem to Set Cover, we prove a lower bound of 1.463 on the best possible approximation ratio, assumingNP?DTIMEnO(loglogn)].

international conference on data engineering | 2002

Streaming-data algorithms for high-quality clustering

Liadan O'Callaghan; Nina Mishra; Adam Meyerson; Sudipto Guha; Rajeev Motwani

Streaming data analysis has recently attracted attention in numerous applications including telephone records, Web documents and click streams. For such analysis, single-pass algorithms that consume a small amount of memory are critical. We describe such a streaming algorithm that effectively clusters large data streams. We also provide empirical evidence of the algorithms performance on synthetic and real data streams.

Information Systems | 2001

Cure: an efficient clustering algorithm for large databases

Sudipto Guha; Rajeev Rastogi; Kyuseok Shim

Abstract Clustering, in data mining, is useful for discovering groups and identifying interesting distributions in the underlying data. Traditional clustering algorithms either favor clusters with spherical shapes and similar sizes, or are very fragile in the presence of outliers. We propose a new clustering algorithm called CURE that is more robust to outliers, and identifies clusters having non-spherical shapes and wide variances in size. CURE achieves this by representing each cluster by a certain fixed number of points that are generated by selecting well scattered points from the cluster and then shrinking them toward the center of the cluster by a specified fraction. Having more than one representative point per cluster allows CURE to adjust well to the geometry of non-spherical shapes and the shrinking helps to dampen the effects of outliers. To handle large databases, CURE employs a combination of random sampling and partitioning . A random sample drawn from the data set is first partitioned and each partition is partially clustered. The partial clusters are then clustered in a second pass to yield the desired clusters. Our experimental results confirm that the quality of clusters produced by CURE is much better than those found by existing algorithms. Furthermore, they demonstrate that random sampling and partitioning enable CURE to not only outperform existing algorithms but also to scale well for large databases without sacrificing clustering quality.

symposium on the theory of computing | 2002

A constant-factor approximation algorithm for the k -median problem

Moses Charikar; Sudipto Guha; Éva Tardos; David B. Shmoys

We present the first constant-factor approximation algorithm for the metric k-median problem. The k-median problem is one of the most well-studied clustering problems, i.e., those problems in which the aim is to partition a given set of points into clusters so that the points within a cluster are relatively close with respect to some measure. For the metric k-median problem, we are given n points in a metric space. We select k of these to be cluster centers and then assign each point to its closest selected center. If point j is assigned to a center i, the cost incurred is proportional to the distance between i and j. The goal is to select the k centers that minimize the sum of the assignment costs. We give a 62/3-approximation algorithm for this problem. This improves upon the best previously known result of O(log k log log k), which was obtained by refining and derandomizing a randomized O(log n log log n)-approximation algorithm of Bartal.

foundations of computer science | 1999

Improved combinatorial algorithms for the facility location and k-median problems

Moses Charikar; Sudipto Guha

We present improved combinatorial approximation algorithms for the uncapacitated facility location and k-median problems. Two central ideas in most of our results are cost scaling and greedy improvement. We present a simple greedy local search algorithm which achieves an approximation ratio of 2.414+/spl epsiv/ in O/spl tilde/(n/sup 2///spl epsiv/) time. This also yields a bicriteria approximation tradeoff of (1+/spl gamma/, 1+2//spl gamma/) for facility cost versus service cost which is better than previously known tradeoffs and close to the best possible. Combining greedy improvement and cost scaling with a recent primal dual algorithm for facility location due to K. Jain and V. Vazirani (1999), we get an approximation ratio of 1.853 in O/spl tilde/(n/sup 3/) time. This is already very close to the approximation guarantee of the best known algorithm which is LP-based. Further combined with the best known LP-based algorithm for facility location, we get a very slight improvement in the approximation factor for facility location, achieving 1.728. We present improved approximation algorithms for capacitated facility location and a variant. We also present a 4-approximation for the k-median problem, using similar ideas, building on the 6-approximation of Jain and Vazirani. The algorithm runs in O/spl tilde/(n/sup 3/) time.

Journal of Algorithms | 1999

Approximation Algorithms for Directed Steiner Problems

Moses Charikar; Chandra Chekuri; To yat Cheung; Zuo Dai; Ashish Goel; Sudipto Guha; Ming Li

We obtain the first non-trivial approximation algorithms for the Steiner Tree problem and the Generalized Steiner Tree problem in general directed graphs. Essentially no approximation algorithms were known for these problems. For the Directed Steiner Tree problem, we design a family of algorithms which achieve an approximation ratio of O(k^\epsilon) in time O(kn^{1/\epsilon}) for any fixed (\epsilon < 0), where k is the number of terminals to be connected. For the Directed Generalized Steiner Tree Problem, we give an algorithm which achieves an approximation ratio of O(k^{2/3}\log^{1/3} k), where k is the number of pairs to be connected. Related problems including the Group Steiner tree problem, the Node Weighted Steiner tree problem and several others can be reduced in an approximation preserving fashion to the problems we solve, giving the first non-trivial approximations to those as well.

symposium on the theory of computing | 2001

Data-streams and histograms

Sudipto Guha; Nick Koudas; Kyuseok Shim

Histograms have been used widely to capture data distribution, to represent the data by a small number of step functions. Dynamic programming algorithms which provide optimal construction of these histograms exist, albeit running in quadratic time and linear space. In this paper we provide linear time construction of 1 + ε approximation of optimal histograms, running in polylogarithmic space. Our results extend to the context of data-streams, and in fact generalize to give 1 + ε approximation of several problems in data-streams which require partitioning the index set into intervals. The only assumptions required are that the cost of an interval is monotonic under inclusion (larger interval has larger cost) and that the cost can be computed or approximated in small space. This exhibits a nice class of problems for which we can have near optimal data-stream algorithms.

Explore More