Srujana Merugu | Researchain

Publication

Featured research published by Srujana Merugu.

international conference on data mining | 2005

A scalable collaborative filtering framework based on co-clustering

Thomas George; Srujana Merugu

Collaborative filtering-based recommender systems have become extremely popular due to the increase in Web-based activities such as e-commerce and online content distribution. Current collaborative filtering (CF) techniques such as correlation and SVD based methods provide good accuracy, but are computationally expensive and can be deployed only in static off-line settings. However, a number of practical scenarios require dynamic real-time collaborative filtering that can allow new users, items and ratings to enter the system at a rapid rate. In this paper, we consider a novel CF approach based on a proposed weighted co-clustering algorithm (Banerjee et al., 2004) that involves simultaneous clustering of users and items. We design incremental and parallel versions of the co-clustering algorithm and use it to build an efficient real-time CF framework. Empirical evaluation demonstrates that our approach provides an accuracy comparable to that of the correlation and matrix factorization based approaches at a much lower computational cost.
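
As a rough illustration of how co-clustering can drive rating prediction, the Python sketch below predicts a rating as the co-cluster mean plus user and item offsets. The prediction rule, the fixed cluster assignments, and all names (`mean_by`, `build_predictor`, the toy `ratings`) are assumptions made for illustration and are not stated in the abstract; the incremental and parallel training described there is not shown.

```python
from collections import defaultdict

def mean_by(ratings, key):
    """Average rating values grouped by key(user, item)."""
    sums, counts = defaultdict(float), defaultdict(int)
    for (user, item), value in ratings.items():
        k = key(user, item)
        sums[k] += value
        counts[k] += 1
    return {k: sums[k] / counts[k] for k in sums}

def build_predictor(ratings, user_cluster, item_cluster):
    """Return predict(user, item) for fixed (hypothetical) cluster assignments."""
    co_mean = mean_by(ratings, lambda u, i: (user_cluster[u], item_cluster[i]))
    user_mean = mean_by(ratings, lambda u, i: u)
    item_mean = mean_by(ratings, lambda u, i: i)
    ucl_mean = mean_by(ratings, lambda u, i: user_cluster[u])
    icl_mean = mean_by(ratings, lambda u, i: item_cluster[i])
    global_mean = sum(ratings.values()) / len(ratings)

    def predict(user, item):
        # Assumes both the user and the item have at least one known rating.
        g, h = user_cluster[user], item_cluster[item]
        base = co_mean.get((g, h), global_mean)      # empty co-cluster: fall back
        user_offset = user_mean[user] - ucl_mean[g]  # user vs. their cluster
        item_offset = item_mean[item] - icl_mean[h]  # item vs. its cluster
        return base + user_offset + item_offset

    return predict

# Toy data (hypothetical): keys are (user, item) pairs, values are ratings.
ratings = {("a", "x"): 5, ("a", "y"): 4, ("b", "x"): 4,
           ("b", "z"): 3, ("c", "x"): 1, ("c", "z"): 2}
user_cluster = {"a": 0, "b": 0, "c": 1}
item_cluster = {"x": 0, "y": 0, "z": 1}

predict = build_predictor(ratings, user_cluster, item_cluster)
print(round(predict("b", "y"), 3))
```

Because prediction only needs a handful of stored averages, updating them when a new rating arrives is cheap, which is one plausible reason such a scheme suits real-time settings.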

knowledge discovery and data mining | 2004

A generalized maximum entropy approach to Bregman co-clustering and matrix approximation

Arindam Banerjee; Inderjit S. Dhillon; Joydeep Ghosh; Srujana Merugu; Dharmendra S. Modha

Co-clustering is a powerful data mining technique with varied applications such as text clustering, microarray analysis and recommender systems. Recently, an information-theoretic co-clustering approach applicable to empirical joint probability distributions was proposed. In many situations, co-clustering of more general matrices is desired. In this paper, we present a substantially generalized co-clustering framework wherein any Bregman divergence can be used in the objective function, and various conditional expectation based constraints can be considered based on the statistics that need to be preserved. Analysis of the co-clustering problem leads to the minimum Bregman information principle, which generalizes the maximum entropy principle, and yields an elegant meta algorithm that is guaranteed to achieve local optimality. Our methodology yields new algorithms and also encompasses several previously known clustering and co-clustering algorithms based on alternate minimization.
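
For readers unfamiliar with Bregman divergences, the sketch below computes the standard definition d_phi(x, y) = phi(x) - phi(y) - <x - y, grad phi(y)> for two well-known choices of the convex function phi. The function and variable names are illustrative assumptions; this shows only the divergence itself, not the co-clustering meta algorithm from the paper.

```python
import math

def bregman(phi, grad_phi, x, y):
    """Bregman divergence d_phi(x, y) for vectors given as lists of floats."""
    inner = sum((xi - yi) * gi for xi, yi, gi in zip(x, y, grad_phi(y)))
    return phi(x) - phi(y) - inner

# phi(v) = ||v||^2 yields the squared Euclidean distance.
def sq_norm(v):
    return sum(t * t for t in v)

def sq_norm_grad(v):
    return [2 * t for t in v]

# phi(v) = sum v log v yields the generalized I-divergence
# (equal to the KL divergence when both vectors sum to 1).
def neg_entropy(v):
    return sum(t * math.log(t) for t in v)

def neg_entropy_grad(v):
    return [math.log(t) + 1 for t in v]

squared_euclidean = bregman(sq_norm, sq_norm_grad, [1.0, 2.0], [3.0, 5.0])
i_divergence = bregman(neg_entropy, neg_entropy_grad, [0.5, 0.5], [0.25, 0.75])
print(squared_euclidean, i_divergence)
```

Swapping `phi` changes the loss without changing the surrounding code, which mirrors the abstract's point that any Bregman divergence can be used in the objective function.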

symposium on principles of database systems | 2009

A web of concepts

Nilesh N. Dalvi; Ravi Kumar; Bo Pang; Raghu Ramakrishnan; Andrew Tomkins; Philip Bohannon; S. Sathiya Keerthi; Srujana Merugu

We make the case for developing a web of concepts by starting with the current view of the web (comprised of hyperlinked pages, or documents, each seen as a bag of words), extracting concept-centric metadata, and stitching it together to create a semantically rich aggregate view of all the information available on the web for each concept instance. The goal of building and maintaining such a web of concepts presents many challenges, but also offers the promise of enabling many powerful applications, including novel search and information discovery paradigms. We present the goal, motivate it with example usage scenarios and some analysis of Yahoo! logs, and discuss the challenges in building and leveraging such a web of concepts. We place this ambitious research agenda in the context of the state of the art in the literature, and describe various ongoing efforts at Yahoo! Research that are related.

knowledge discovery and data mining | 2005

A distributed learning framework for heterogeneous data sources

Srujana Merugu; Joydeep Ghosh

We present a probabilistic model-based framework for distributed learning that takes into account privacy restrictions and is applicable to scenarios where the different sites have diverse, possibly overlapping subsets of features. Our framework decouples data privacy issues from knowledge integration issues by requiring the individual sites to share only privacy-safe probabilistic models of the local data, which are then integrated to obtain a global probabilistic model based on the union of the features available at all the sites. We provide a mathematical formulation of the model integration problem using the maximum likelihood and maximum entropy principles and describe iterative algorithms that are guaranteed to converge to the optimal solution. For certain commonly occurring special cases involving hierarchically ordered feature sets or conditional independence, we obtain closed form solutions and use these to propose an efficient alternative scheme by recursive decomposition of the model integration problem. To address interpretability concerns, we also present a modified formulation where the global model is assumed to belong to a specified parametric family. Finally, to highlight the generality of our framework, we provide empirical results for various learning tasks such as clustering and classification on different kinds of datasets consisting of continuous vector, categorical and directional attributes. The results show that high quality global models can be obtained without much loss of privacy.

international conference on management of data | 2009

Purple SOX extraction management system

Philip Bohannon; Srujana Merugu; Cong Yu; Vipul Agarwal; Pedro DeRose; Arun Shankar Iyer; Ankur Jain; Vinay Kakade; Mridul Muralidharan; Raghu Ramakrishnan; Warren Shen

We describe the Purple SOX (PSOX) EMS, a prototype Extraction Management System currently being built at Yahoo!. The goal of the PSOX EMS is to manage a large number of sophisticated extraction pipelines across different application domains, at the web scale and with minimum human involvement. Three key value propositions are described: extensibility, the ability to swap in and out extraction operators; explainability, the ability to track the provenance of extraction results; and social feedback support, the facility for gathering and reconciling multiple, potentially conflicting sources.

international conference on machine learning | 2004

An information theoretic analysis of maximum likelihood mixture estimation for exponential families

Arindam Banerjee; Inderjit S. Dhillon; Joydeep Ghosh; Srujana Merugu

An important task in unsupervised learning is maximum likelihood mixture estimation (MLME) for exponential families. In this paper, we prove a mathematical equivalence between this MLME problem and the rate distortion problem for Bregman divergences. We also present new theoretical results in rate distortion theory for Bregman divergences. Further, an analysis of the problems as a trade-off between compression and preservation of information is presented that yields the information bottleneck method as an interesting special case.

web search and data mining | 2011

Collective extraction from heterogeneous web lists

Ashwin Machanavajjhala; Arun Shankar Iyer; Philip Bohannon; Srujana Merugu

Automatic extraction of structured records from inconsistently formatted lists on the web is challenging: different lists present disparate sets of attributes with variations in the ordering of attributes; many lists contain additional attributes and noise that can confuse the extraction process; and formatting within a list may be inconsistent due to missing attributes or manual formatting on some sites. We present a novel solution to this extraction problem that is based on i) collective extraction from multiple lists simultaneously and ii) careful exploitation of a small database of seed entities. Our approach addresses the layout homogeneity within the individual lists, content redundancy across some snippets from different sources, and the noisy attribute rendering process. We experimentally evaluate variants of this algorithm on real world data sets and show that our approach is a promising direction for extraction from noisy lists, requiring mild and thus inexpensive supervision suitable for extraction from the tail of the web.

conference on information and knowledge management | 2010

The anatomy of a click: modeling user behavior on web information systems

Kunal Punera; Srujana Merugu

The ultimate goal of information retrieval science continues to be providing relevant information to users while placing minimal cognitive load on them. The retrieval and presentation of relevant information (say, search results) as well as any dynamic system behavior (e.g., search engine re-ranking) depends acutely on estimating user intent. Hence, it is critical to use all the available information about user behavior at any stage of a search-session to accurately infer the user intent. However, the simplistic interfaces provided by search engines in order to minimize the user cognitive effort, and intrinsic limits imposed by privacy concerns, latency requirements, and other web instrumentation challenges, result in only a subset of user actions that are predictive of the search intent being captured. In this paper, we present a dynamic Bayesian network (DBN) that models user interaction with general web information systems, taking into account both observed (clicks etc.) and hidden (result examinations etc.) user actions. Our model goes beyond the ranked list information access paradigm and gives a solution where arbitrary context information can be incorporated in a principled fashion. To account for heterogeneity in user behavior as well as information access tasks, we further propose a bi-clustering algorithm that partitions users and tasks, and learns separate models for each bicluster. We instantiate this general DBN model for a typical static search interface comprising a single query box and a ranked list of search results using a set of seven common user actions and various predictive state attributes. Experimental results on real-world web search log data indicate that one can obtain superior predictive performance on various session properties (such as click positions and reformulations) compared to simpler instantiations of the DBN.

international world wide web conferences | 2011

Smart news feeds for social networks using scalable joint latent factor models

Himabindu Lakkaraju; Angshu Rai; Srujana Merugu

Social networks such as Facebook and Twitter offer a huge opportunity to tap the collective wisdom (both published and yet to be published) of all the participating users in order to address the information needs of individual users in a highly contextualized fashion using rich user-specific information. Realizing this opportunity, however, requires addressing two key limitations of current social networks: (a) difficulty in discovering relevant content beyond the immediate neighborhood, and (b) lack of support for information filtering based on semantics, content source and linkage. We propose a scalable framework for constructing smart news feeds based on predicting user-post relevance using multiple signals such as text content and attributes of users and posts, and various user-user, post-post and user-post relations (e.g. friend, comment, author relations). Our solution comprises two steps, where the first step ensures scalability by selecting a small set of user-post dyads with potentially interesting interactions using inverted feature indexes. The second step models the interactions associated with the selected dyads via a joint latent factor model, which assumes that the user/post content and relationships can be effectively captured by a common latent representation of the users and posts. Experiments on a Facebook dataset using the proposed model lead to improved precision/recall on relevant posts indicating potential for constructing superior quality news feeds.
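
The two-step structure described above can be sketched loosely in Python: an inverted index narrows the posts considered for a user, and a latent-vector dot product scores the survivors. Everything here is a simplifying assumption for illustration (the feature sets, the precomputed latent vectors, the dot-product scoring, and the names `build_inverted_index`, `candidate_posts` and `rank`); the joint latent factor model in the paper is learned from data and is not reproduced.

```python
from collections import defaultdict

def build_inverted_index(post_features):
    """Map each feature to the set of posts that carry it."""
    index = defaultdict(set)
    for post, features in post_features.items():
        for feature in features:
            index[feature].add(post)
    return index

def candidate_posts(user_features, index):
    """Step 1: keep only posts sharing at least one feature with the user."""
    candidates = set()
    for feature in user_features:
        candidates |= index.get(feature, set())
    return candidates

def rank(user_vec, post_vecs, candidates):
    """Step 2: score candidates by dot product, highest score first."""
    scores = {p: sum(a * b for a, b in zip(user_vec, post_vecs[p]))
              for p in candidates}
    return sorted(scores, key=lambda p: (-scores[p], p))

# Toy data (hypothetical): features per post and 2-d latent vectors.
post_features = {"p1": {"ml", "python"}, "p2": {"cooking"}, "p3": {"ml"}}
post_vecs = {"p1": [1.0, 0.0], "p2": [0.0, 1.0], "p3": [0.5, 0.5]}
user_vec = [0.2, 0.9]

index = build_inverted_index(post_features)
candidates = candidate_posts({"ml"}, index)
feed = rank(user_vec, post_vecs, candidates)
print(feed)
```

Restricting scoring to the candidate set avoids evaluating every user-post pair, which is the scalability argument the abstract makes for the first step.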

knowledge discovery and data mining | 2005

Gene classification: issues and challenges for relational learning

Claudia Perlich; Srujana Merugu

We present ongoing research that applies statistical relational learning techniques, in particular, propositionalization, to the challenging and interesting real-world domain of functional gene classification of the yeast genome Saccharomyces cerevisiae. The main objective of this paper is to identify and describe the structural and statistical properties of this domain and examine how they conflict with the assumptions of relational learning approaches. Such properties are, in fact, shared by many relational application domains and potential solutions will be of interest far beyond the particular genetic application. We show in the last part some preliminary experimental results on potential approaches to overcome such limitations by extending the existing automated feature construction strategies to accommodate the specific domain properties.