Raghu Krishnapuram | Researchain

Publication

Featured research published by Raghu Krishnapuram.

IEEE International Conference on Fuzzy Systems | 2003

Fuzzy co-clustering of documents and keywords

Krishna Kummamuru; Ajay Dhawale; Raghu Krishnapuram

Conventional clustering algorithms such as K-means and SAHN (also known as AHC) have been well studied and used in the information retrieval community for clustering text documents. More recently, efforts have been made to cluster documents and words simultaneously. The FCCM algorithm due to Oh et al. is a fuzzy clustering algorithm that maximizes the co-occurrence of categorical attributes (keywords) and the individual patterns (documents) in clusters. However, this algorithm poses certain problems when the number of documents or the number of words is very large. In this paper, we modify the FCCM algorithm so that it can be used to cluster large text corpora. Our experiments show that the modified algorithm is scalable and produces meaningful clusters. We also show the relation between FCCM and the Spherical K-Means (SKM) algorithm and introduce the Spherical Fuzzy c-Means (SFCM) algorithm.
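As a point of reference for the Spherical K-Means (SKM) algorithm mentioned above, here is a minimal sketch of the standard textbook form: k-means on unit-length vectors with cosine similarity as the closeness criterion. It is not the modified FCCM or the SFCM algorithm from the paper, and the toy document vectors, function names, and parameters are all illustrative assumptions.

```python
import math
import random

def normalize(v):
    """Scale a vector to unit Euclidean length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def dot(a, b):
    """Dot product; equals cosine similarity when both vectors have unit length."""
    return sum(x * y for x, y in zip(a, b))

def spherical_kmeans(docs, k, iters=20, seed=0):
    """Cluster term-weight vectors by cosine similarity.

    docs: list of equal-length numeric vectors (hypothetical toy input).
    Returns (labels, centroids); every centroid has unit length.
    """
    rng = random.Random(seed)
    points = [normalize(d) for d in docs]
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: pick the centroid with the highest cosine similarity.
        labels = [max(range(k), key=lambda j: dot(p, centroids[j])) for p in points]
        # Update step: sum the members of each cluster, then renormalize.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                summed = [sum(col) for col in zip(*members)]
                centroids[j] = normalize(summed)
    return labels, centroids

# Toy corpus: rows are documents, columns are keyword weights (made-up values).
docs = [
    [3, 1, 0, 0],
    [2, 2, 0, 0],
    [0, 0, 4, 1],
    [0, 0, 1, 3],
]
labels, centroids = spherical_kmeans(docs, k=2)
```

On this toy input the first two documents share keywords only with each other, as do the last two, so the two pairs end up in separate clusters.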

Lecture Notes in Computer Science | 2003

Automatic taxonomy generation: issues and possibilities

Raghu Krishnapuram; Krishna Kummamuru

Automatic taxonomy generation deals with organizing text documents in terms of an unknown labeled hierarchy. The main issues here are (i) how to identify documents that have similar content, (ii) how to discover the hierarchical structure of the topics and subtopics, and (iii) how to find appropriate labels for each of the topics and subtopics. In this paper, we review several approaches to automatic taxonomy generation to provide an insight into the issues involved. We also describe how fuzzy hierarchies can overcome some of the problems associated with traditional crisp taxonomies.

Conference on Information and Knowledge Management | 2001

A clustering algorithm for asymmetrically related data with applications to text mining

K. Krishna; Raghu Krishnapuram

Clustering techniques find a collection of subsets of a data set such that the collection satisfies a criterion that is dependent on a relation defined on the data set. The underlying relation is traditionally assumed to be symmetric. However, there exist many practical scenarios where the underlying relation is asymmetric. One example of an asymmetric relation in text analysis is the inclusion relation, i.e., the inclusion of the meaning of a block of text in the meaning of another block. In this paper, we consider the general problem of clustering of asymmetrically related data and propose an algorithm to cluster such data. To demonstrate its usefulness, we consider two applications in text mining: (1) summarization of short documents, and (2) generation of a concept hierarchy from a set of documents. Our experiments show that the performance of the proposed algorithm is superior to that of more traditional algorithms.

International Conference on Machine Learning | 2005

A model for handling approximate, noisy or incomplete labeling in text classification

Ganesh Ramakrishnan; Krishna Prasad Chitrapura; Raghu Krishnapuram; Pushpak Bhattacharyya

We introduce a Bayesian model, BayesANIL, that is capable of estimating uncertainties associated with the labeling process. Given a labeled or partially labeled training corpus of text documents, the model estimates the joint distribution of training documents and class labels by using a generalization of the Expectation Maximization algorithm. The estimates can be used in standard classification models to reduce error rates. Since uncertainties in the labeling are taken into account, the model provides an elegant mechanism to deal with noisy labels. We provide an intuitive modification to the EM iterations by re-estimating the empirical distribution in order to reinforce feature values in unlabeled data and to reduce the influence of noisily labeled examples. Considerable improvement in the classification accuracies of two popular classification algorithms on standard labeled data-sets with and without artificially introduced noise, as well as in the presence and absence of unlabeled data, indicates that this may be a promising method to reduce the burden of manual labeling.

International Conference on Data Mining | 2005

On learning asymmetric dissimilarity measures

Krishna Kummamuru; Raghu Krishnapuram; Rakesh Agrawal

Many practical applications require distance measures to be asymmetric and context-sensitive. We introduce context-sensitive learnable asymmetric dissimilarity (CLAD) measures, which are defined to be a weighted sum of a fixed number of dissimilarity measures where the associated weights depend on the point from which the dissimilarity is measured. The parameters used in defining the measure capture the global relationships among the features. We provide an algorithm to learn the dissimilarity measure automatically from a set of user-specified comparisons of the form "x is closer to y than to z" and study its performance. The experimental results show that the proposed algorithm outperforms other approaches due to the context-sensitive nature of the CLAD measures.
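To make the definition above concrete, the following sketch computes a weighted sum of base dissimilarity measures in which the weights are chosen by the point the dissimilarity is measured from. It only illustrates why such a measure is asymmetric. It does not implement the learning algorithm from the paper, and the base measures (per-feature squared differences), the hand-set `toy_weights` function, and all other names are hypothetical.

```python
from typing import Callable, List, Sequence

Vector = Sequence[float]

def component_dissimilarities(x: Vector, y: Vector) -> List[float]:
    """One base measure per feature: the squared difference (an illustrative choice)."""
    return [(a - b) ** 2 for a, b in zip(x, y)]

def asymmetric_dissimilarity(
    x: Vector, y: Vector, weights_at: Callable[[Vector], List[float]]
) -> float:
    """Weighted sum of base measures, with weights selected by the source point x."""
    weights = weights_at(x)
    return sum(w * g for w, g in zip(weights, component_dissimilarities(x, y)))

def toy_weights(x: Vector) -> List[float]:
    """Hypothetical hand-set weights: points with a large first feature
    mostly care about feature 0, all other points mostly about feature 1."""
    return [0.9, 0.1] if x[0] >= 0.5 else [0.1, 0.9]

def is_closer(x: Vector, y: Vector, z: Vector) -> bool:
    """Check a comparison of the form 'x is closer to y than to z'."""
    return (asymmetric_dissimilarity(x, y, toy_weights)
            < asymmetric_dissimilarity(x, z, toy_weights))

x = [1.0, 0.0]
y = [0.0, 2.0]
z = [1.0, 3.0]
d_xy = asymmetric_dissimilarity(x, y, toy_weights)  # weights taken at x
d_yx = asymmetric_dissimilarity(y, x, toy_weights)  # weights taken at y
```

Here `d_xy` and `d_yx` differ because the two directions use different weight vectors, even though the base measures themselves are symmetric. In a learned measure, comparisons such as `is_closer(x, z, y)` would serve as training constraints rather than as checks on fixed weights.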

Knowledge Discovery and Data Mining | 2004

Learning spatially variant dissimilarity (SVaD) measures

Krishna Kummamuru; Raghu Krishnapuram; Rakesh Agrawal

Clustering algorithms typically operate on a feature vector representation of the data and find clusters that are compact with respect to an assumed (dis)similarity measure between the data points in feature space. This makes the type of clusters identified highly dependent on the assumed similarity measure. Building on recent work in this area, we formally define a class of spatially varying dissimilarity measures and propose algorithms to learn the dissimilarity measure automatically from the data. The idea is to identify clusters that are compact with respect to the unknown spatially varying dissimilarity measure. Our experiments show that the proposed algorithms are more stable and achieve better accuracy on various textual data sets when compared with similar algorithms proposed in the literature.

International Conference on Data Engineering | 2004

EShopMonitor: a Web content monitoring tool

Neeraj Agrawal; Rema Ananthanarayanan; Rahul Gupta; Sachindra Joshi; Raghu Krishnapuram; Sumit Negi

Data presented on commerce sites runs into thousands of pages, and is typically delivered from multiple back-end sources. This makes it difficult to identify incorrect, anomalous, or interesting data such as $9.99 air fares, missing links, drastic changes in prices and addition of new products or promotions. We describe a system that monitors Web sites automatically and generates various types of reports so that the content of the site can be monitored and the quality maintained. The solution designed and implemented by us consists of a site crawler that crawls dynamic pages, an information miner that learns to extract useful information from the pages based on examples provided by the user, and a reporter that can be configured by the user to answer specific queries. The tool can also be used for identifying price trends and new products or promotions at competitor sites. A pilot run of the tool has been successfully completed at the ibm.com site.

ACM Conference on Hypertext | 2004

Automatic categorization of web sites based on source types

Shourya Roy; Sachindra Joshi; Raghu Krishnapuram

An important issue with the Web is verification of the accuracy, currency and authenticity of the information associated with Web sites. One way to address this problem is to identify the source or sponsor of the Web site. However, source identification is non-trivial because the source of a Web site cannot always be determined by the URL or content of the site. In this paper, we propose a method for source identification that uses various types of inbound, outbound and internal interactions that arise due to hyperlinks between and within Web sites.

IEEE International Conference on Fuzzy Systems | 2002

Fuzzy targeting of customers based on product attributes

V. Jain; Krishna Kummamuru; Raghu Krishnapuram; V. Agarwal

Extending Database Technology | 2009

Efficient skyline retrieval with arbitrary similarity measures

Deepak P; Prasad M. Deshpande; Debapriyo Majumdar; Raghu Krishnapuram

A skyline query returns a set of objects that are not dominated by other objects. An object is said to dominate another if it is closer to the query than the latter on all factors under consideration. In this paper, we consider the case where the similarity measures may be arbitrary and do not necessarily come from a metric space. We first explore middleware algorithms, analyze how skyline retrieval for non-metric spaces can be done on the middleware backend, and lay down a necessary and sufficient stopping condition for middleware-based skyline algorithms. We develop the Balanced Access Algorithm (BAA), which is provably more IO-friendly than the state-of-the-art algorithm for skyline query processing on middleware, and show that BAA outperforms the latter by orders of magnitude. We also show that without prior knowledge about data distributions, it is unlikely to have a middleware algorithm that is more IO-friendly than BAA. In fact, we empirically show that BAA is very close to the absolute lower bound of IO costs for middleware algorithms. Further, we explore the non-middleware setting and devise an online algorithm for skyline retrieval which uses a recently proposed value space index over non-metric spaces (AL-Tree [10]). The AL-Tree based algorithm is able to prune subspaces and efficiently maintain candidate sets, leading to better performance. We compare our algorithms to existing ones which can work with arbitrary similarity measures and show that our approaches are better in terms of computational and disk access costs, leading to significantly better response times.
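The dominance relation described above can be illustrated with a naive quadratic-time skyline computation. This is only a baseline sketch, not the Balanced Access Algorithm or the AL-Tree based algorithm from the paper. It assumes each object is already represented by its distance to the query on every factor (computed with whatever measures apply, metric or not), and it follows the abstract's wording that dominance means strictly closer on all factors. The object names and distance values are made up.

```python
from typing import Dict, List, Tuple

def dominates(a: Tuple[float, ...], b: Tuple[float, ...]) -> bool:
    """True if a is strictly closer to the query than b on every factor."""
    return all(x < y for x, y in zip(a, b))

def skyline(distances: Dict[str, Tuple[float, ...]]) -> List[str]:
    """Return the names of all objects that no other object dominates.

    distances maps an object name to its per-factor distances to the query.
    Every pair of objects is compared, so the cost is quadratic.
    """
    result = []
    for name, dist in distances.items():
        dominated = any(
            dominates(other_dist, dist)
            for other_name, other_dist in distances.items()
            if other_name != name
        )
        if not dominated:
            result.append(name)
    return result

# Hypothetical per-factor distances to a query (smaller means closer).
candidates = {
    "A": (1.0, 4.0),
    "B": (2.0, 2.0),
    "C": (3.0, 5.0),  # farther than A and B on both factors
    "D": (4.0, 1.0),
}
result = skyline(candidates)
```

Only `C` is dropped, because both `A` and `B` are closer to the query on every factor. `A`, `B`, and `D` each win on at least one factor against every other object, so none of them is dominated.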
