Publication


Featured research published by Bee-Chung Chen.


Knowledge Discovery and Data Mining | 2009

Regression-based latent factor models

Deepak Agarwal; Bee-Chung Chen

We propose a novel latent factor model to accurately predict response for large scale dyadic data in the presence of features. Our approach is based on a model that predicts response as a multiplicative function of row and column latent factors that are estimated through separate regressions on known row and column features. In fact, our model provides a single unified framework to address both cold and warm start scenarios that are commonplace in practical applications like recommender systems, online advertising, web search, etc. We provide scalable and accurate model fitting methods based on Iterated Conditional Mode and Monte Carlo EM algorithms. We show our model induces a stochastic process on the dyadic space with kernel (covariance) given by a polynomial function of features. Methods that generalize our procedure to estimate factors in an online fashion for dynamic applications are also considered. Our method is illustrated on benchmark datasets and a novel content recommendation application that arises in the context of Yahoo! Front Page. We report significant improvements over several commonly used methods on all datasets.
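
A rough sketch of the regression-prior idea described above (not the authors' code; the variable names, dimensions, and random weights below are invented for illustration). The point is that a brand-new row or column can still be scored through the feature regressions, while warm-start entities use their fitted factors:

```python
# Minimal sketch of a regression-based latent factor model (hypothetical names/shapes).
import numpy as np

rng = np.random.default_rng(0)
d, p, q = 5, 20, 30          # latent dimension, row-feature dim, column-feature dim
G = rng.normal(size=(d, p))  # regression weights mapping row features -> row factors
H = rng.normal(size=(d, q))  # regression weights mapping column features -> column factors

def predict(x_i, z_j, u_i=None, v_j=None):
    """Predict the dyadic response for row i, column j.

    Cold start: fall back to the feature-based prior means G @ x_i and H @ z_j.
    Warm start: use fitted factors u_i, v_j (estimated via ICM / Monte Carlo EM in the paper).
    """
    u = u_i if u_i is not None else G @ x_i
    v = v_j if v_j is not None else H @ z_j
    return float(u @ v)

x_new, z_new = rng.normal(size=p), rng.normal(size=q)
print(predict(x_new, z_new))   # pure cold-start prediction from features alone
```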


International World Wide Web Conferences | 2009

Spatio-temporal models for estimating click-through rate

Deepak Agarwal; Bee-Chung Chen; Pradheep Elango

We propose novel spatio-temporal models to estimate click-through rates in the context of content recommendation. We track article CTR at a fixed location over time through a dynamic Gamma-Poisson model and combine information from correlated locations through dynamic linear regressions, significantly improving on per-location models. Our models adjust for user fatigue through an exponential tilt to the first-view CTR (probability of click on first article exposure) that is based only on user-specific repeat-exposure features. We illustrate our approach on data obtained from a module (Today Module) published regularly on Yahoo! Front Page and demonstrate significant improvement over commonly used baseline methods. Large-scale simulation experiments to study the performance of our models under different scenarios provide encouraging results. Throughout, all modeling assumptions are validated via rigorous exploratory data analysis.
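
A minimal sketch of the dynamic Gamma-Poisson idea (the discount factor, prior values, and counts below are assumptions made for illustration; the paper's full model also borrows strength across correlated locations via dynamic linear regressions):

```python
# Dynamic Gamma-Poisson CTR tracker, in the spirit of the paper (illustrative only).
def update_ctr(alpha, gamma, clicks, views, discount=0.95):
    """One time step: decay the old Gamma(alpha, gamma) posterior toward the prior,
    then absorb the new click/view counts. The posterior mean alpha/gamma estimates CTR."""
    alpha = discount * alpha + clicks
    gamma = discount * gamma + views
    return alpha, gamma

alpha, gamma = 1.0, 100.0          # weak prior: CTR around 1%
for clicks, views in [(12, 1000), (30, 1500), (8, 900)]:
    alpha, gamma = update_ctr(alpha, gamma, clicks, views)
    print(f"estimated CTR = {alpha / gamma:.4f}")
```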


Knowledge Discovery and Data Mining | 2011

Localized factor models for multi-context recommendation

Deepak Agarwal; Bee-Chung Chen; Bo Long

Combining correlated information from multiple contexts can significantly improve predictive accuracy in recommender problems. Such information from multiple contexts is often available in the form of several incomplete matrices spanning a set of entities like users, items, features, and so on. Existing methods simultaneously factorize these matrices by sharing a single set of factors for entities across all contexts. We show that such a strategy may introduce significant bias in estimates and propose a new model that ameliorates this issue by positing local, context-specific factors for entities. To avoid over-fitting in contexts with sparse data, the local factors are connected through a shared global model. This sharing of parameters allows information to flow across contexts through multivariate regressions among local factors, instead of enforcing exactly the same factors for an entity everywhere. Model fitting is done in an EM framework; we show that the E-step can be computed through a fast multi-resolution Kalman filter algorithm that ensures scalability. Experiments on benchmark and real-world Yahoo! datasets clearly illustrate the usefulness of our approach. Our model significantly improves predictive accuracy, especially in cold-start scenarios.
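
The sketch below illustrates, under assumed shapes and names, the core idea of tying context-specific local factors to a shared global factor through a regression, so that sparse contexts fall back to the shared model (the paper fits this with EM plus a multi-resolution Kalman filter; none of that machinery is shown here):

```python
# Local factors as regressions of a shared global factor (hypothetical dimensions/names).
import numpy as np

rng = np.random.default_rng(1)
d = 5
u_global = rng.normal(size=d)                       # one global factor per entity
B = {c: rng.normal(size=(d, d)) for c in ("news", "movies", "ads")}  # per-context regressions

def local_factor(context, observed_correction=None):
    """Context-specific factor = regression of the global factor, optionally corrected
    by data observed in that context (zero correction when the context is sparse)."""
    u_local = B[context] @ u_global
    if observed_correction is not None:
        u_local = u_local + observed_correction
    return u_local

print(local_factor("news"))                         # sparse context: falls back to the shared model
```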


Knowledge Discovery and Data Mining | 2011

Click shaping to optimize multiple objectives

Deepak Agarwal; Bee-Chung Chen; Pradheep Elango; Xuanhui Wang

Recommending interesting content to engage users is important for web portals (e.g., AOL, MSN, Yahoo!, and many others). Existing approaches typically recommend articles to optimize for a single objective, i.e., the number of clicks. However, a click is only the starting point of a user's journey, and subsequent downstream utilities such as time spent and revenue are also important. In this paper, we call the problem of recommending links to jointly optimize for clicks and post-click downstream utilities click shaping. We propose a multi-objective programming approach in which multiple objectives are modeled in a constrained optimization framework. Such a formulation can naturally incorporate various application-driven requirements. We study several variants that model different requirements as constraints and discuss some of the subtleties involved. We conduct our experiments on a large dataset from a real system by using a newly proposed unbiased evaluation methodology [17]. Through extensive experiments we quantify the tradeoff between different objectives under various constraints. Our experimental results show interesting characteristics of different formulations, and our findings may provide valuable guidance to the design of recommendation engines for web portals.
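
One way to picture click shaping is as a small linear program over serving fractions. The toy numbers and the exact formulation below are assumptions for illustration (they are not the paper's model), but they show the click-versus-time-spent trade-off being resolved under a constraint:

```python
# Toy click-shaping LP: x[i] is the fraction of traffic served article i; maximize expected
# clicks per view subject to a floor on expected post-click time spent per view.
from scipy.optimize import linprog

ctr            = [0.030, 0.020, 0.010]               # clicks per view
secs_per_click = [20.0, 60.0, 120.0]                 # post-click time spent per click
time_per_view  = [c * s for c, s in zip(ctr, secs_per_click)]
min_time       = 1.0                                 # required expected seconds per view

res = linprog(
    c=[-r for r in ctr],                             # maximize clicks = minimize -clicks
    A_ub=[[-t for t in time_per_view]], b_ub=[-min_time],
    A_eq=[[1.0, 1.0, 1.0]], b_eq=[1.0],              # serving fractions sum to one
    bounds=[(0.0, 1.0)] * 3,
)
print(res.x, -res.fun)                               # traffic split and clicks per view
```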


International Conference on Data Engineering | 2006

Learning from Aggregate Views

Bee-Chung Chen; Lei Chen; Raghu Ramakrishnan; David R. Musicant

In this paper, we introduce a new class of data mining problems called learning from aggregate views. In contrast to the traditional problem of learning from a single table of training examples, the new goal is to learn from multiple aggregate views of the underlying data, without access to the un-aggregated data. We motivate this new problem, present a general problem framework, develop learning methods for RFA (Restriction-Free Aggregate) views defined using COUNT, SUM, AVG and STDEV, and offer theoretical and experimental results that characterize the proposed methods.
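
As a simplified illustration of the setting (not the paper's RFA-view algorithms): per-class COUNT, AVG, and STDEV views are already sufficient statistics for a Gaussian naive Bayes classifier, so a model can be trained without ever touching the un-aggregated rows. The view contents below are invented:

```python
# Learning from an aggregate view: a Gaussian naive Bayes classifier built from
# (class, feature) -> (COUNT, AVG, STDEV) aggregates only.
import math

view = {
    ("spam", "len"): (400, 120.0, 30.0),   # hypothetical aggregate view over message length
    ("ham",  "len"): (600,  80.0, 25.0),
}

def log_gaussian(x, mean, std):
    return -0.5 * math.log(2 * math.pi * std**2) - (x - mean) ** 2 / (2 * std**2)

def classify(x_len):
    scores = {}
    for cls in ("spam", "ham"):
        n, avg, std = view[(cls, "len")]
        prior = n / sum(view[(c, "len")][0] for c in ("spam", "ham"))
        scores[cls] = math.log(prior) + log_gaussian(x_len, avg, std)
    return max(scores, key=scores.get)

print(classify(110))   # -> "spam" for a long message under this toy view
```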


International Conference on Data Engineering | 2006

Toward a Query Language for Network Attack Data

Bee-Chung Chen; Vinod Yegneswaran; Paul Barford; Raghu Ramakrishnan

The growing sophistication and diversity of malicious activity in the Internet presents a serious challenge for network security analysts. In this paper, we describe our efforts to develop a database and query language for network attack data from firewalls, intrusion detection systems and honeynets. Our first step toward this objective is to develop a prototype database and query interface to identify coordinated scanning activity in network attack data. We have created a set of aggregate views and templatized SQL queries that consider timing, persistence, targeted services, spatial dispersion and temporal dispersion, thereby enabling us to evaluate coordinated scanning along these dimensions. We demonstrate the utility of the interface by conducting a case study on a set of firewall and intrusion detection system logs from Dshield.org. We show that the interface is able to identify general characteristics of coordinated activity as well as instances of unusual activity that would otherwise be difficult to mine from the data. These results highlight the potential for developing a more generalized query language for a broad class of network intrusion data. The case study also exposes some of the challenges we face in extending our system to more generalized queries over potentially vast quantities of data.
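
A hypothetical example of what a templatized query in such an interface might look like. The schema, column names, and thresholds are invented; only the analysis dimensions (timing, targeted services, spatial dispersion) come from the paper:

```python
# Templatized SQL query (as a Python string) flagging sources that scan many distinct
# targets on the same port within a time window — i.e. spatial dispersion plus targeted service.
TEMPLATE = """
SELECT src_ip,
       dst_port,
       COUNT(DISTINCT dst_ip) AS targets,
       MIN(ts) AS first_seen,
       MAX(ts) AS last_seen
FROM firewall_log
WHERE ts BETWEEN :start AND :end
GROUP BY src_ip, dst_port
HAVING COUNT(DISTINCT dst_ip) >= :min_targets
ORDER BY targets DESC
"""

params = {"start": "2006-01-01", "end": "2006-01-08", "min_targets": 50}
# e.g. cursor.execute(TEMPLATE, params) against the prototype attack database
```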


ACM Transactions on Knowledge Discovery From Data | 2009

Bellwether analysis: Searching for cost-effective query-defined predictors in large databases

Bee-Chung Chen; Raghu Ramakrishnan; Jude W. Shavlik; Pradeep Tamma

How to mine massive datasets is a challenging problem with great potential value. Motivated by this challenge, much effort has concentrated on developing scalable versions of machine learning algorithms. However, the cost of mining large datasets is not just computational; preparing the datasets into the “right form” so that learning algorithms can be applied is usually costly, due to the human labor that is typically required and the large number of choices in data preparation, which include selecting different subsets of data and aggregating data at different granularities. We make the key observation that, for a number of practically motivated problems, these choices can be defined using database queries and analyzed in an automatic and systematic manner. Specifically, we propose a new class of data-mining problem, called bellwether analysis, in which the goal is to find a few query-defined predictors (e.g., first-week sales of an item in Peoria, IL) that can be used to accurately predict the result of a target query (e.g., first-year worldwide sales of the item) from a large number of queries that define candidate predictors. To make a prediction for a new item, the data needed to generate such predictors has to be collected (e.g., by selling the new item in Peoria, IL for a week and collecting the sales data). A useful predictor is one that has high prediction accuracy and a low data-collection cost. We call such a cost-effective predictor a bellwether. This article introduces bellwether analysis, which integrates database query processing and predictive modeling into a single framework, and provides scalable algorithms for large datasets that cannot fit in main memory. Through a series of extensive experiments, we show that bellwethers do exist in real-world databases, and that our computation techniques achieve good efficiency on large datasets.
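
A minimal sketch of the bellwether search loop, assuming a hypothetical candidate interface (feature extraction, a cross-validated error score, a data-collection cost); the article's actual contribution is making this search scale to datasets that cannot fit in main memory:

```python
# Skeleton of a bellwether search (hypothetical candidate API; illustrative only).
def find_bellwether(candidates, items, error_budget):
    """Each candidate is a query-defined predictor, e.g. ("Peoria, IL", first week of sales).
    Return the cheapest candidate whose prediction error on historical items is acceptable."""
    best = None
    for cand in candidates:
        X = [cand.feature(item) for item in items]   # e.g. first-week sales in one region
        y = [item.target for item in items]          # e.g. first-year worldwide sales
        err = cand.fit_and_score(X, y)               # cross-validated prediction error
        if err <= error_budget and (best is None or cand.cost < best.cost):
            best = cand                              # cost-effective predictor so far
    return best
```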


Very Large Data Bases | 2009

Adversarial-knowledge dimensions in data privacy

Bee-Chung Chen; Kristen LeFevre; Raghu Ramakrishnan

Privacy is an important issue in data publishing. Many organizations distribute non-aggregate personal data for research, and they must take steps to ensure that an adversary cannot predict sensitive information pertaining to individuals with high confidence. This problem is further complicated by the fact that, in addition to the published data, the adversary may also have access to other resources (e.g., public records and social networks relating individuals), which we call adversarial knowledge. A robust privacy framework should allow publishing organizations to analyze data privacy by means of not only data dimensions (data that a publishing organization has), but also adversarial-knowledge dimensions (information not in the data). In this paper, we first describe a general framework for reasoning about privacy in the presence of adversarial knowledge. Within this framework, we propose a novel multidimensional approach to quantifying adversarial knowledge. This approach allows the publishing organization to investigate privacy threats and enforce privacy requirements in the presence of various types and amounts of adversarial knowledge. Our main technical contributions include a multidimensional privacy criterion that is more intuitive and flexible than previous approaches to modeling background knowledge. In addition, we identify an important congregation property of the adversarial-knowledge dimensions. Based on this property, we provide algorithms for measuring disclosure and sanitizing data that improve computational efficiency by several orders of magnitude over the best known techniques.
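
A deliberately simplified illustration of one adversarial-knowledge dimension: knowledge about other individuals in the target's group tightens the worst-case disclosure bound. The group values, the single knowledge type, and the worst-case formula below are assumptions for illustration; the paper's multidimensional criterion and congregation-based algorithms are far more general:

```python
def worst_case_disclosure(group_values, target_value, k):
    """Worst-case probability that the target has `target_value`, given that the adversary
    knows the target is in this group and also learns the sensitive values of up to k other
    individuals in it (assumed, in the worst case, to all differ from `target_value`)."""
    n = len(group_values)
    count = sum(v == target_value for v in group_values)
    remaining = max(n - k, 1)
    return min(1.0, count / remaining)

group = ["flu", "flu", "cancer", "flu", "hepatitis"]
print(worst_case_disclosure(group, "cancer", k=0))   # 0.20 with no extra knowledge
print(worst_case_disclosure(group, "cancer", k=2))   # ~0.33 once two non-cancer values are known
```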


Web Search and Data Mining | 2010

fLDA: matrix factorization through latent Dirichlet allocation

Deepak Agarwal; Bee-Chung Chen


Very Large Data Bases | 2007

Privacy skyline: privacy with multidimensional adversarial knowledge

Bee-Chung Chen; Kristen LeFevre; Raghu Ramakrishnan

Collaboration


Dive into Bee-Chung Chen's collaboration.

Top Co-Authors

Kristen LeFevre

University of Wisconsin-Madison

Lei Chen

University of Wisconsin-Madison
