Publication


Featured research published by Sugato Basu.


Knowledge Discovery and Data Mining | 2004

A probabilistic framework for semi-supervised clustering

Sugato Basu; Mikhail Bilenko; Raymond J. Mooney

Unsupervised clustering can be significantly improved using supervision in the form of pairwise constraints, i.e., pairs of instances labeled as belonging to same or different clusters. In recent years, a number of algorithms have been proposed for enhancing clustering quality by employing such supervision. Such methods use the constraints to either modify the objective function, or to learn the distance measure. We propose a probabilistic model for semi-supervised clustering based on Hidden Markov Random Fields (HMRFs) that provides a principled framework for incorporating supervision into prototype-based clustering. The model generalizes a previous approach that combines constraints and Euclidean distance learning, and allows the use of a broad range of clustering distortion measures, including Bregman divergences (e.g., Euclidean distance and I-divergence) and directional similarity measures (e.g., cosine similarity). We present an algorithm that performs partitional semi-supervised clustering of data by minimizing an objective function derived from the posterior energy of the HMRF model. Experimental results on several text data sets demonstrate the advantages of the proposed framework.
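The HMRF objective described above combines a clustering distortion term with penalties for violated constraints. A minimal sketch, assuming squared Euclidean distortion and a uniform penalty weight `w` (the paper supports general Bregman divergences, directional measures, and per-constraint weights; the function name is ours):

```python
import numpy as np

def hmrf_objective(X, labels, centers, must_link, cannot_link, w=1.0):
    """Clustering distortion plus penalties for violated pairwise constraints.

    Simplified stand-in for the HMRF posterior energy: squared Euclidean
    distortion, uniform penalty weight w for each violated constraint.
    """
    distortion = sum(float(np.sum((X[i] - centers[labels[i]]) ** 2))
                     for i in range(len(X)))
    penalty = sum(w for i, j in must_link if labels[i] != labels[j])
    penalty += sum(w for i, j in cannot_link if labels[i] == labels[j])
    return distortion + penalty
```

An EM-style algorithm would then alternate between reassigning labels and re-estimating centers so as to reduce this objective.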


International Conference on Machine Learning | 2004

Integrating constraints and metric learning in semi-supervised clustering

Mikhail Bilenko; Sugato Basu; Raymond J. Mooney

Semi-supervised clustering employs a small amount of labeled data to aid unsupervised learning. Previous work in the area has utilized supervised data in one of two approaches: 1) constraint-based methods that guide the clustering algorithm towards a better grouping of the data, and 2) distance-function learning methods that adapt the underlying similarity metric used by the clustering algorithm. This paper provides new methods for the two approaches as well as presents a new semi-supervised clustering algorithm that integrates both of these techniques in a uniform, principled framework. Experimental results demonstrate that the unified approach produces better clusters than both individual approaches as well as previously proposed semi-supervised clustering algorithms.
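The metric-learning half of the unified approach adapts the distance function from the supervised pairs. A hypothetical sketch (not the paper's exact estimator): fit diagonal metric weights so that features on which must-link pairs differ widely get small weights, pulling those pairs together under the learned distance.

```python
import numpy as np

def learn_diag_metric(X, must_link, eps=1e-6):
    """Illustrative diagonal-metric update from must-link pairs: weight each
    feature inversely to its spread within same-cluster pairs."""
    spread = np.full(X.shape[1], eps)
    for i, j in must_link:
        spread += (X[i] - X[j]) ** 2
    return 1.0 / spread

def metric_dist(x, y, weights):
    """Squared distance under the learned diagonal metric."""
    d = x - y
    return float(np.sum(weights * d * d))
```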


Very Large Data Bases | 2009

PLANET: massively parallel learning of tree ensembles with MapReduce

Biswanath Panda; Joshua Seth Herbach; Sugato Basu; Roberto J. Bayardo

Classification and regression tree learning on massive datasets is a common data mining task at Google, yet many state-of-the-art tree learning algorithms require training data to reside in memory on a single machine. While more scalable implementations of tree learning have been proposed, they typically require specialized parallel computing architectures. In contrast, the majority of Google's computing infrastructure is based on commodity hardware. In this paper, we describe PLANET: a scalable distributed framework for learning tree models over large datasets. PLANET defines tree learning as a series of distributed computations, and implements each one using the MapReduce model of distributed computation. We show how this framework supports scalable construction of classification and regression trees, as well as ensembles of such models. We discuss the benefits and challenges of using a MapReduce compute cluster for tree learning, and demonstrate the scalability of this approach by applying it to a real world learning task from the domain of computational advertising.
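The core idea, that evaluating candidate splits for a regression-tree node reduces to summing per-shard label statistics, can be sketched as a map and reduce pair (function names and data layout are our assumptions, not the paper's):

```python
from collections import defaultdict

def map_split_stats(rows, candidate_splits):
    """Mapper: emit partial (count, sum, sum-of-squares) label statistics
    for the left branch of each candidate (feature, threshold) split."""
    stats = defaultdict(lambda: [0, 0.0, 0.0])
    for x, y in rows:
        for feat, thr in candidate_splits:
            if x[feat] < thr:
                s = stats[(feat, thr)]
                s[0] += 1
                s[1] += y
                s[2] += y * y
    return stats

def reduce_split_stats(shard_stats):
    """Reducer: merge the partial statistics produced by all mappers."""
    total = defaultdict(lambda: [0, 0.0, 0.0])
    for shard in shard_stats:
        for key, (n, s, ss) in shard.items():
            t = total[key]
            t[0] += n
            t[1] += s
            t[2] += ss
    return total
```

From the merged tuples a controller can score variance reduction for every candidate split without the data ever residing on a single machine.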


Knowledge Discovery and Data Mining | 2005

Model-based overlapping clustering

Arindam Banerjee; Chase Krumpelman; Joydeep Ghosh; Sugato Basu; Raymond J. Mooney

While the vast majority of clustering algorithms are partitional, many real world datasets have inherently overlapping clusters. Several approaches to finding overlapping clusters have come from work on analysis of biological datasets. In this paper, we interpret an overlapping clustering model proposed by Segal et al. [23] as a generalization of Gaussian mixture models, and we extend it to an overlapping clustering model based on mixtures of any regular exponential family distribution and the corresponding Bregman divergence. We provide the necessary algorithm modifications for this extension, and present results on synthetic data as well as subsets of 20-Newsgroups and EachMovie datasets.
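The exponential-family/Bregman correspondence the model builds on can be made concrete with the two examples the abstract names. A sketch (function names are ours):

```python
import numpy as np

def bregman(phi, grad_phi, x, y):
    """Generic Bregman divergence:
    d_phi(x, y) = phi(x) - phi(y) - <grad phi(y), x - y>."""
    return phi(x) - phi(y) - float(np.dot(grad_phi(y), x - y))

def sq_euclidean(x, y):
    """phi(v) = ||v||^2 recovers squared Euclidean distance."""
    return bregman(lambda v: float(np.dot(v, v)), lambda v: 2 * v, x, y)

def i_divergence(x, y):
    """phi(v) = sum(v log v) recovers the I-divergence (unnormalized KL)."""
    return bregman(lambda v: float(np.sum(v * np.log(v))),
                   lambda v: np.log(v) + 1, x, y)
```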


Knowledge Discovery and Data Mining | 2009

Predicting bounce rates in sponsored search advertisements

D. Sculley; Robert G. Malkin; Sugato Basu; Roberto J. Bayardo

This paper explores an important and relatively unstudied quality measure of a sponsored search advertisement: bounce rate. The bounce rate of an ad can be informally defined as the fraction of users who click on the ad but almost immediately move on to other tasks. A high bounce rate can lead to poor advertiser return on investment, and suggests search engine users may be having a poor experience following the click. In this paper, we first provide quantitative analysis showing that bounce rate is an effective measure of user satisfaction. We then address the question, can we predict bounce rate by analyzing the features of the advertisement? An affirmative answer would allow advertisers and search engines to predict the effectiveness and quality of advertisements before they are shown. We propose solutions to this problem involving large-scale learning methods that leverage features drawn from ad creatives in addition to their keywords and landing pages.
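The informal definition above can be sketched as a one-line computation. The 5-second cutoff is an assumed illustration, not the paper's operational definition:

```python
def bounce_rate(dwell_times_sec, threshold_sec=5.0):
    """Fraction of ad clicks whose post-click dwell time falls below a
    threshold, i.e. users who clicked but almost immediately moved on."""
    if not dwell_times_sec:
        return 0.0
    bounces = sum(1 for t in dwell_times_sec if t < threshold_sec)
    return bounces / len(dwell_times_sec)
```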


Knowledge Discovery and Data Mining | 2007

iLink: search and routing in social networks

Jeffrey Davitz; Jiye Yu; Sugato Basu; David Gutelius; Alexandra Harris

The growth of Web 2.0 and fundamental theoretical breakthroughs have led to an avalanche of interest in social networks. This paper focuses on the problem of modeling how social networks accomplish tasks through peer production style collaboration. We propose a general interaction model for the underlying social networks and then a specific model (iLink) for social search and message routing. A key contribution here is the development of a general learning framework for making such online peer production systems work at scale. The iLink model has been used to develop a system for FAQ generation in a social network (FAQtory), and experience with its application in the context of a full-scale learning-driven workflow application (CALO) is reported. We also discuss methods of adapting iLink technology for use in military knowledge sharing portals and other message routing systems. Finally, the paper shows the connection of iLink to SQM, a theoretical model for social search that is a generalization of Markov Decision Processes and the popular PageRank model.


Knowledge Discovery and Data Mining | 2001

Evaluating the novelty of text-mined rules using lexical knowledge

Sugato Basu; Raymond J. Mooney; Krupakar V. Pasupuleti; Joydeep Ghosh

In this paper, we present a new method of estimating the novelty of rules discovered by data-mining methods using WordNet, a lexical knowledge base of English words. We assess the novelty of a rule by the average semantic distance in a knowledge hierarchy between the words in the antecedent and the consequent of the rule: the greater the average distance, the greater the novelty of the rule. The novelty of rules extracted by the DiscoTEX text-mining system from Amazon.com book descriptions was evaluated both by human subjects and by our algorithm. By computing correlation coefficients between pairs of human ratings and between human and automatic ratings, we found that the automatic scoring of rules based on our novelty measure correlates with human judgments about as well as human judgments correlate with one another.
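The distance measure can be sketched on a toy hypernym hierarchy standing in for WordNet (the hierarchy, words, and function names below are illustrative, not the paper's data):

```python
# Toy hypernym hierarchy (child -> parent); a stand-in for WordNet.
HIERARCHY = {
    "dog": "mammal", "cat": "mammal", "mammal": "animal",
    "rose": "flower", "flower": "plant",
    "animal": "organism", "plant": "organism",
}

def path_to_root(word):
    path = [word]
    while path[-1] in HIERARCHY:
        path.append(HIERARCHY[path[-1]])
    return path

def distance(a, b):
    """Edges from a to b through their lowest common ancestor."""
    pa, pb = path_to_root(a), path_to_root(b)
    for i, node in enumerate(pa):
        if node in pb:
            return i + pb.index(node)
    return len(pa) + len(pb)  # no common ancestor

def novelty(antecedent, consequent):
    """Average semantic distance between the words on the two rule sides:
    semantically distant sides make a more novel rule."""
    pairs = [(a, c) for a in antecedent for c in consequent]
    return sum(distance(a, c) for a, c in pairs) / len(pairs)
```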


Knowledge Discovery and Data Mining | 2010

User browsing models: relevance versus examination

Ramakrishnan Srikant; Sugato Basu; Ni Wang; Daryl Pregibon

There has been considerable work on user browsing models for search engine results, both organic and sponsored. The click-through rate (CTR) of a result is the product of the probability of examination (will the user look at the result) times the perceived relevance of the result (probability of a click given examination). Past papers have assumed that when the CTR of a result varies based on the pattern of clicks in prior positions, this variation is solely due to changes in the probability of examination. We show that, for sponsored search results, a substantial portion of the change in CTR when conditioned on prior clicks is in fact due to a change in the relevance of results for that query instance, not just due to a change in the probability of examination. We then propose three new user browsing models, which attribute CTR changes solely to changes in relevance, solely to changes in examination (with an enhanced model of user behavior), or to both changes in relevance and examination. The model that attributes all the CTR change to relevance yields substantially better predictors of CTR than models that attribute all the change to examination, and does only slightly worse than the model that attributes CTR change to both relevance and examination. For predicting relevance, the model that attributes all the CTR change to relevance again does better than the model that attributes the change to examination. Surprisingly, we also find that one model might do better than another in predicting CTR, but worse in predicting relevance. Thus it is essential to evaluate user browsing models with respect to accuracy in predicting relevance, not just CTR.
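The factorization at the heart of the paper is simple arithmetic: the same observed CTR change can be attributed to examination or to relevance. A sketch with hypothetical numbers:

```python
def ctr(p_examination, p_click_given_exam):
    """CTR factored as P(examination) * P(click | examination)."""
    return p_examination * p_click_given_exam

# The same CTR drop after a click in an earlier position can be explained
# two ways (numbers are hypothetical):
baseline        = ctr(0.50, 0.20)  # unconditioned CTR
exam_model      = ctr(0.35, 0.20)  # examination-only explanation
relevance_model = ctr(0.50, 0.14)  # relevance-only explanation
```

Both explanations reproduce the observed conditional CTR equally well, which is why the models must be compared on how well they predict relevance, not just CTR.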


IEEE Computer Society Annual Symposium on VLSI | 2003

Joint minimization of power and area in scan testing by scan cell reordering

Shalini Ghosh; Sugato Basu; Nur A. Touba

This paper describes a technique for reordering scan cells to minimize power dissipation that is also capable of reducing the area overhead of the circuit compared to a random ordering of the scan cells. For a given test set, our proposed greedy algorithm finds the (locally) optimal scan cell ordering for a given value of λ, a trade-off parameter the designer can use to specify the relative importance of area overhead minimization versus power minimization. The strength of our algorithm lies in the fact that we use a novel dynamic minimum transition fill (MT-fill) technique to fill the unspecified bits in the test vector. Experiments performed on the ISCAS-89 benchmark suite show a reduction in power (70% for s13207, λ = 500) as well as a reduction in layout area (6.72% for s13207, λ = 500).
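The MT-fill idea, assigning each unspecified bit the value of the last specified bit so the vector shifted through the scan chain toggles as little as possible, can be sketched as follows (a simplified static version of the paper's dynamic technique; leading X's copy the first specified bit):

```python
def mt_fill(vector):
    """Minimum-transition fill: each 'X' copies the last specified bit."""
    bits = list(vector)
    last = next((b for b in bits if b != 'X'), '0')
    for i, b in enumerate(bits):
        if b == 'X':
            bits[i] = last
        else:
            last = b
    return ''.join(bits)

def transitions(vector):
    """Adjacent bit flips, a proxy for switching power in the scan chain."""
    return sum(1 for a, b in zip(vector, vector[1:]) if a != b)
```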


International Test Conference | 2004

Reducing power consumption in memory ECC checkers

Shalini Ghosh; Sugato Basu; Nur A. Touba

A method is proposed for reducing power consumption in memory ECC checker circuitry that provides SEC-DED. The degrees of freedom in selecting the parity check matrix are used to minimize power with little or no impact on area and delay. The power minimization method is applied to two popular SEC-DED codes: standard Hamming codes and odd-column-weight Hsiao codes. Experiments on actual memory traces of SPEC and MediaBench benchmarks indicate that considering power in addition to area and delay when selecting the parity check matrix can result in power reductions of up to 27% for Hsiao codes and up to 41% for Hamming codes.
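The lever being optimized is the parity check matrix itself: each 1 in the matrix feeds an XOR input in the checker, so different valid matrices toggle different amounts of logic. A sketch with an illustrative Hamming-style matrix (not the power-optimized matrix from the paper):

```python
import numpy as np

def check_bits(data, H):
    """Check bits as XORs of data bits selected by the parity check matrix H
    (rows = check bits, columns = data bits). Each 1 in H costs an XOR
    input, so a sparser or better-balanced H toggles less logic."""
    return (H @ np.asarray(data)) % 2

# One valid choice of data-bit columns for a Hamming-style SEC code over
# 4 data bits; illustrative only.
H = np.array([[1, 1, 0, 1],
              [1, 0, 1, 1],
              [0, 1, 1, 1]])
```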

Collaboration


Dive into Sugato Basu's collaborations.

Top Co-Authors

Raymond J. Mooney
University of Texas at Austin

Nur A. Touba
University of Texas at Austin

Shalini Ghosh
University of Texas at Austin

Joydeep Ghosh
University of Texas at Austin