Tapas Kanungo
IBM
Publications
Featured research published by Tapas Kanungo.
IEEE Transactions on Pattern Analysis and Machine Intelligence | 2002
Tapas Kanungo; David M. Mount; Nathan S. Netanyahu; Christine D. Piatko; Ruth Silverman; Angela Y. Wu
In k-means clustering, we are given a set of n data points in d-dimensional space R^d and an integer k, and the problem is to determine a set of k points in R^d, called centers, so as to minimize the mean squared distance from each data point to its nearest center. A popular heuristic for k-means clustering is Lloyd's (1982) algorithm. We present a simple and efficient implementation of Lloyd's k-means clustering algorithm, which we call the filtering algorithm. This algorithm is easy to implement, requiring a kd-tree as the only major data structure. We establish the practical efficiency of the filtering algorithm in two ways. First, we present a data-sensitive analysis of the algorithm's running time, which shows that the algorithm runs faster as the separation between clusters increases. Second, we present a number of empirical studies both on synthetically generated data and on real data sets from applications in color quantization, data compression, and image segmentation.
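For orientation, here is a minimal sketch of the plain Lloyd iteration that the filtering algorithm accelerates; this is the baseline assign-then-recenter loop, not the authors' kd-tree implementation.

```python
# A minimal sketch of Lloyd's k-means iteration: assign each point to its
# nearest center, then move each center to the mean of its assigned points.
import numpy as np

def lloyd(points: np.ndarray, k: int, iters: int = 50, seed: int = 0) -> np.ndarray:
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)].astype(float)
    for _ in range(iters):
        # Squared distances from every point to every center: shape (n, k).
        d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for j in range(k):
            members = points[labels == j]
            if len(members):  # keep the old center if a cluster empties
                centers[j] = members.mean(axis=0)
    return centers
```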
international world wide web conferences | 2003
Stephen Dill; Nadav Eiron; David Gibson; Daniel Gruhl; Ramanathan V. Guha; Anant Jhingran; Tapas Kanungo; Sridhar Rajagopalan; Andrew Tomkins; John A. Tomlin; Jason Y. Zien
This paper describes Seeker, a platform for large-scale text analytics, and SemTag, an application written on the platform to perform automated semantic tagging of large corpora. We apply SemTag to a collection of approximately 264 million web pages, and generate approximately 434 million automatically disambiguated semantic tags, published to the web as a label bureau providing metadata regarding the 434 million annotations. To our knowledge, this is the largest scale semantic tagging effort to date. We describe the Seeker platform, discuss the architecture of the SemTag application, describe a new disambiguation algorithm specialized to support ontological disambiguation of large-scale data, evaluate the algorithm, and present our final results with information about acquiring and making use of the semantic tags. We argue that automated large-scale semantic tagging of ambiguous content can bootstrap and accelerate the creation of the semantic web.
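As a rough illustration of the disambiguation task, the sketch below picks the taxonomy node whose stored context is most similar to the words surrounding an ambiguous label. The bag-of-words representation and cosine similarity here are illustrative stand-ins, not the paper's actual disambiguation algorithm.

```python
# A minimal sketch of context-based label disambiguation (illustrative only):
# compare the words around a spotted label against a reference context kept
# for each candidate taxonomy node, and return the best-matching node.
from collections import Counter
import math

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def disambiguate(context_words: list[str],
                 node_contexts: dict[str, Counter]) -> str:
    ctx = Counter(context_words)
    return max(node_contexts, key=lambda node: cosine(ctx, node_contexts[node]))
```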
symposium on computational geometry | 2002
Tapas Kanungo; David M. Mount; Nathan S. Netanyahu; Christine D. Piatko; Ruth Silverman; Angela Y. Wu
In k-means clustering we are given a set of n data points in d-dimensional space R^d and an integer k, and the problem is to determine a set of k points in R^d, called centers, to minimize the mean squared distance from each data point to its nearest center. No exact polynomial-time algorithms are known for this problem. Although asymptotically efficient approximation algorithms exist, these algorithms are not practical due to the extremely high constant factors involved. There are many heuristics that are used in practice, but we know of no bounds on their performance. We consider the question of whether there exists a simple and practical approximation algorithm for k-means clustering. We present a local improvement heuristic based on swapping centers in and out. We prove that this yields a (9+ε)-approximation algorithm. We show that the approximation factor is almost tight, by giving an example for which the algorithm achieves an approximation factor of (9-ε). To establish the practical value of the heuristic, we present an empirical study that shows that, when combined with Lloyd's algorithm, this heuristic performs quite well in practice.
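The swap heuristic itself is easy to state. Below is a minimal sketch of single-swap local search under the assumptions above; the candidate set and the brute-force cost evaluation are illustrative simplifications of what the paper analyzes.

```python
# A minimal sketch of single-swap local search for k-means: repeatedly try
# replacing one current center with one candidate point, and keep any swap
# that lowers the total squared-distance cost.
import numpy as np

def cost(points: np.ndarray, centers: np.ndarray) -> float:
    d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return float(d2.min(axis=1).sum())

def swap_local_search(points: np.ndarray, centers: np.ndarray,
                      candidates: np.ndarray) -> np.ndarray:
    improved = True
    while improved:
        improved = False
        best = cost(points, centers)
        for i in range(len(centers)):
            for c in candidates:
                trial = centers.copy()
                trial[i] = c
                trial_cost = cost(points, trial)
                if trial_cost < best:
                    centers, best, improved = trial, trial_cost, True
    return centers
```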
document recognition and retrieval | 2003
Song Mao; Azriel Rosenfeld; Tapas Kanungo
Document structure analysis can be regarded as a syntactic analysis problem. The order and containment relations among the physical or logical components of a document page can be described by an ordered tree structure and can be modeled by a tree grammar which describes the page at the component level in terms of regions or blocks. This paper provides a detailed survey of past work on document structure analysis algorithms and summarizes the limitations of past approaches. In particular, we survey past work on document physical layout representations and algorithms, document logical structure representations and algorithms, and performance evaluation of document structure analysis algorithms. In the last section, we summarize this work and point out its limitations.
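To make the ordered-tree view concrete, here is a small illustrative sketch (the structure and labels are hypothetical, not any particular paper's grammar): child order encodes reading order and nesting encodes containment.

```python
# A minimal sketch of a document page as an ordered tree of regions.
from dataclasses import dataclass, field

@dataclass
class Region:
    label: str                       # e.g. "page", "column", "paragraph"
    bbox: tuple[int, int, int, int]  # (x0, y0, x1, y1) in pixels
    children: list["Region"] = field(default_factory=list)

page = Region("page", (0, 0, 2550, 3300), [
    Region("title", (300, 200, 2250, 330)),
    Region("column", (150, 400, 1250, 3100), [
        Region("paragraph", (150, 400, 1250, 900)),
        Region("paragraph", (150, 950, 1250, 1600)),
    ]),
    Region("column", (1300, 400, 2400, 3100)),
])
```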
symposium on computational geometry | 2000
Tapas Kanungo; David M. Mount; Nathan S. Netanyahu; Christine D. Piatko; Ruth Silverman; Angela Y. Wu
K-means clustering is a very popular clustering technique which is used in numerous applications. Given a set of n data points in R^d and an integer k, the problem is to determine a set of k points in R^d, called centers, so as to minimize the mean squared distance from each data point to its nearest center. A popular heuristic for k-means clustering is Lloyd's algorithm. In this paper, we present a simple and efficient implementation of Lloyd's k-means clustering algorithm, which we call the filtering algorithm. This algorithm is very easy to implement. It differs from most other approaches in that it precomputes a kd-tree data structure for the data points rather than the center points. We establish the practical efficiency of the filtering algorithm in two ways. First, we present a data-sensitive analysis of the algorithm's running time. Second, we have implemented the algorithm and performed a number of empirical studies, both on synthetically generated data and on real data from applications in color quantization, compression, and segmentation.
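The heart of a filtering-style implementation is a pruning test over kd-tree cells: a candidate center can be discarded for a cell if even the cell vertex most favorable to it is still closer to the currently closest candidate. The sketch below is written from that general idea; the authors' actual implementation may differ in details.

```python
# A minimal sketch of the candidate-pruning test for a filtering-style
# k-means algorithm: z can be pruned for the box [cell_lo, cell_hi] if the
# box vertex extremal in direction (z - z_star) is still closer to z_star.
import numpy as np

def is_pruned(cell_lo: np.ndarray, cell_hi: np.ndarray,
              z: np.ndarray, z_star: np.ndarray) -> bool:
    u = z - z_star
    # Vertex of the box most favorable to z relative to z_star.
    v = np.where(u > 0, cell_hi, cell_lo)
    return np.sum((v - z) ** 2) >= np.sum((v - z_star) ** 2)
```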
international conference on document analysis and recognition | 1993
Tapas Kanungo; Robert M. Haralick; Ihsin T. Phillips
Two sources of document degradation are modeled: i) perspective distortion that occurs while photocopying or scanning thick, bound documents, and ii) degradation due to perturbations in the optical scanning and digitization process: speckle, blur, jitter, and thresholding. Perspective distortion is modeled by studying the underlying perspective geometry of the optical system of photocopiers and scanners. An illumination model is described to account for the nonlinear intensity change occurring across a page in a perspective-distorted document. The optical distortion process is modeled morphologically. First, a distance transform on the foreground is performed, followed by a random inversion of binary pixels where the probability of flip is a function of the distance of the pixel to the boundary of the foreground. Correlating the flipped pixels is modeled by a morphological closing operation.
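A minimal sketch of the distance-based pixel-flip stage follows; the exponential flip-probability form and its parameters are assumptions for illustration, not the paper's exact parameterization.

```python
# A minimal sketch of distance-dependent pixel flipping plus morphological
# closing: flip probability decays with distance to the foreground/background
# boundary, and closing correlates the flipped pixels.
import numpy as np
from scipy import ndimage

def degrade(binary: np.ndarray, a: float = 0.5, decay: float = 1.0,
            seed: int = 0) -> np.ndarray:
    rng = np.random.default_rng(seed)
    # Distance of each pixel to the nearest pixel of the opposite color.
    d_fg = ndimage.distance_transform_edt(binary)      # inside foreground
    d_bg = ndimage.distance_transform_edt(1 - binary)  # inside background
    d = np.maximum(d_fg, d_bg)
    p_flip = a * np.exp(-decay * d ** 2)               # assumed flip model
    flipped = np.where(rng.random(binary.shape) < p_flip, 1 - binary, binary)
    return ndimage.binary_closing(flipped, structure=np.ones((3, 3))).astype(binary.dtype)
```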
Journal of Web Semantics | 2003
Stephen Dill; Nadav Eiron; David Gibson; Daniel Gruhl; Ramanathan V. Guha; Anant Jhingran; Tapas Kanungo; Kevin S. McCurley; Sridhar Rajagopalan; Andrew Tomkins; John A. Tomlin; Jason Y. Zien
This paper describes Seeker, a platform for large-scale text analytics, and SemTag, an application written on the platform to perform automated semantic tagging of large corpora. We apply SemTag to a collection of approximately 264 million web pages, and generate approximately 434 million automatically disambiguated semantic tags, published to the web as a label bureau providing metadata regarding the 434 million annotations. To our knowledge, this is the largest scale semantic tagging effort to date. We describe the Seeker platform, discuss the architecture of the SemTag application, describe a new disambiguation algorithm specialized to support ontological disambiguation of large-scale data, evaluate the algorithm, and present our final results with information about acquiring and making use of the semantic tags. We argue that automated large-scale semantic tagging of ambiguous content can bootstrap and accelerate the creation of the semantic web.
IEEE Transactions on Pattern Analysis and Machine Intelligence | 2000
Tapas Kanungo; Robert M. Haralick; Henry S. Baird; Werner Stuezle; David Madigan
Printing, photocopying, and scanning processes degrade the image quality of a document. Statistical models of these degradation processes are crucial for document image understanding research. In this paper, we present a statistical methodology that can be used to validate local degradation models. This method is based on a nonparametric, two-sample permutation test. Another standard statistical device, the power function, is then used to choose between algorithm variables such as distance functions. Since the validation and the power function procedures are independent of the model, they can be used to validate any other degradation model. A method for comparing any two models is also described. It uses p-values associated with the estimated models to select the model that is closer to the real world.
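For reference, here is a generic two-sample permutation test of the kind the methodology builds on; the difference-in-means statistic is a common illustrative choice, not necessarily the paper's.

```python
# A minimal sketch of a nonparametric two-sample permutation test: compare
# the observed difference in means against its distribution under random
# relabelings of the pooled sample.
import numpy as np

def permutation_test(x: np.ndarray, y: np.ndarray, n_perm: int = 10_000,
                     seed: int = 0) -> float:
    rng = np.random.default_rng(seed)
    observed = abs(x.mean() - y.mean())
    pooled = np.concatenate([x, y])
    count = 0
    for _ in range(n_perm):
        perm = rng.permutation(pooled)
        stat = abs(perm[:len(x)].mean() - perm[len(x):].mean())
        count += stat >= observed
    return (count + 1) / (n_perm + 1)  # p-value with add-one correction
```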
IEEE Transactions on Image Processing | 1995
Tapas Kanungo; Mysore Y. Jaisimha; John Palmer; Robert M. Haralick
We present a methodology for the quantitative performance evaluation of detection algorithms in computer vision. A common method is to generate a variety of input images by varying the image parameters and evaluate the performance of the algorithm as its parameters vary. Operating curves that relate the probability of misdetection and false alarm are generated for each parameter setting. Such an analysis does not integrate the performance of the numerous operating curves. We outline a methodology for summarizing many operating curves into a few performance curves. This methodology is adapted from the human psychophysics literature and is general to any detection algorithm. The central concept is to measure the effect of variables in terms of the equivalent effect of a critical signal variable, which in turn facilitates the determination of the breakdown point of the algorithm. We demonstrate the methodology by comparing the performance of two line detection algorithms.
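The "equivalent effect" idea can be illustrated with a small sketch: calibrate detection probability as a function of the critical signal variable, then express a nuisance variable's effect as the signal level producing the same detection probability. The calibration values and the interpolation step below are illustrative, not the paper's exact procedure.

```python
# A minimal sketch of mapping an observed detection rate back to an
# equivalent signal strength via an inverted psychometric curve.
import numpy as np

# Calibration: detection probability measured at known signal strengths.
signal_levels = np.array([0.1, 0.2, 0.4, 0.6, 0.8, 1.0])
detect_prob = np.array([0.05, 0.20, 0.55, 0.80, 0.93, 0.99])

def equivalent_signal(observed_prob: float) -> float:
    # Invert the (monotone) curve by linear interpolation.
    return float(np.interp(observed_prob, detect_prob, signal_levels))

# E.g. if added noise drops the detection rate to 0.55, its effect is
# equivalent to reducing the signal strength to about 0.4.
print(equivalent_signal(0.55))
```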
IEEE Transactions on Pattern Analysis and Machine Intelligence | 2001
Song Mao; Tapas Kanungo
While numerous page segmentation algorithms have been proposed in the literature, there is a lack of comparative evaluation of these algorithms. In the existing performance evaluation methods, two crucial components are usually missing: 1) automatic training of algorithms with free parameters and 2) statistical and error analysis of experimental results. We use the following five-step methodology to quantitatively compare the performance of page segmentation algorithms: 1) first, we create mutually exclusive training and test data sets with groundtruth, 2) we then select a meaningful and computable performance metric, 3) an optimization procedure is then used to search automatically for the optimal parameter values of the segmentation algorithms on the training data set, 4) the segmentation algorithms are then evaluated on the test data set, and, finally, 5) a statistical and error analysis is performed to give the statistical significance of the experimental results. In particular, instead of the ad hoc and manual approach typically used in the literature for training algorithms, we pose the automatic training of algorithms as an optimization problem and use the simplex algorithm to search for the optimal parameter value. A paired-model statistical analysis and an error analysis are then conducted to provide confidence intervals for the experimental results of the algorithms. This methodology is applied to the evaluation of five page segmentation algorithms, of which three are representative research algorithms and the other two are well-known commercial products, on 978 images from the University of Washington III data set. It is found that the performance indices of the Voronoi, Docstrum, and Caere segmentation algorithms are not significantly different from each other, but they are significantly better than that of ScanSoft's segmentation algorithm, which, in turn, is significantly better than that of X-Y cut.
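Step 3, training free parameters with the simplex method, can be sketched as below; `segment` and `error_metric` are hypothetical placeholders for a real page segmenter and the chosen performance metric, and the toy data exists only to make the sketch runnable.

```python
# A minimal sketch of simplex (Nelder-Mead) training of a segmentation
# algorithm's free parameters on a training set.
import numpy as np
from scipy.optimize import minimize

def segment(image, params):
    # Placeholder segmenter: a real one would return predicted page zones.
    return image > params[0]

def error_metric(prediction, groundtruth):
    # Placeholder metric: fraction of mislabeled pixels.
    return float(np.mean(prediction != groundtruth))

def training_error(params, images, groundtruths):
    # Mean error over the training set, as a function of the parameters.
    return float(np.mean([error_metric(segment(im, params), gt)
                          for im, gt in zip(images, groundtruths)]))

# Toy training set: random grayscale "pages" with known binary groundtruth.
rng = np.random.default_rng(0)
images = [rng.random((32, 32)) for _ in range(5)]
groundtruths = [im > 0.6 for im in images]

result = minimize(training_error, x0=np.array([0.5]),
                  args=(images, groundtruths), method="Nelder-Mead")
print(result.x)  # learned parameter; evaluate on a held-out test set next
```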