Mikhail Bilenko | Researchain

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Mikhail Bilenko is active.

Explore More

Publication

Featured researches published by Mikhail Bilenko.

knowledge discovery and data mining | 2003

Adaptive duplicate detection using learnable string similarity measures

Mikhail Bilenko; Raymond J. Mooney

The problem of identifying approximately duplicate records in databases is an essential step for data cleaning and data integration processes. Most existing approaches have relied on generic or manually tuned distance metrics for estimating the similarity of potential duplicates. In this paper, we present a framework for improving duplicate detection using trainable measures of textual similarity. We propose to employ learnable text distance functions for each database field, and show that such measures are capable of adapting to the specific notion of similarity that is appropriate for the fields domain. We present two learnable text similarity measures suitable for this task: an extended variant of learnable string edit distance, and a novel vector-space based measure that employs a Support Vector Machine (SVM) for training. Experimental results on a range of datasets show that our framework can improve duplicate detection accuracy over traditional techniques.

knowledge discovery and data mining | 2004

A probabilistic framework for semi-supervised clustering

Sugato Basu; Mikhail Bilenko; Raymond J. Mooney

Unsupervised clustering can be significantly improved using supervision in the form of pairwise constraints, i.e., pairs of instances labeled as belonging to same or different clusters. In recent years, a number of algorithms have been proposed for enhancing clustering quality by employing such supervision. Such methods use the constraints to either modify the objective function, or to learn the distance measure. We propose a probabilistic model for semi-supervised clustering based on Hidden Markov Random Fields (HMRFs) that provides a principled framework for incorporating supervision into prototype-based clustering. The model generalizes a previous approach that combines constraints and Euclidean distance learning, and allows the use of a broad range of clustering distortion measures, including Bregman divergences (e.g., Euclidean distance and I-divergence) and directional similarity measures (e.g., cosine similarity). We present an algorithm that performs partitional semi-supervised clustering of data by minimizing an objective function derived from the posterior energy of the HMRF model. Experimental results on several text data sets demonstrate the advantages of the proposed framework.

international conference on machine learning | 2004

Integrating constraints and metric learning in semi-supervised clustering

Mikhail Bilenko; Sugato Basu; Raymond J. Mooney

Semi-supervised clustering employs a small amount of labeled data to aid unsupervised learning. Previous work in the area has utilized supervised data in one of two approaches: 1) constraint-based methods that guide the clustering algorithm towards a better grouping of the data, and 2) distance-function learning methods that adapt the underlying similarity metric used by the clustering algorithm. This paper provides new methods for the two approaches as well as presents a new semi-supervised clustering algorithm that integrates both of these techniques in a uniform, principled framework. Experimental results demonstrate that the unified approach produces better clusters than both individual approaches as well as previously proposed semi-supervised clustering algorithms.

IEEE Intelligent Systems | 2003

Adaptive name matching in information integration

Mikhail Bilenko; Raymond J. Mooney; William W. Cohen; Pradeep Ravikumar; Stephen E. Fienberg

Identifying approximately duplicate database records that refer to the same entity is essential for information integration. The authors compare and describe methods for combining and learning textual similarity measures for name matching.

international acm sigir conference on research and development in information retrieval | 2007

Studying the use of popular destinations to enhance web search interaction

Ryen W. White; Mikhail Bilenko; Silviu Cucerzan

We present a novel Web search interaction feature which, for a given query, provides links to websites frequently visited by other users with similar information needs. These popular destinations complement traditional search results, allowing direct navigation to authoritative resources for the query topic. Destinations are identified using the history of search and browsing behavior of many users over an extended time period, whose collective behavior provides a basis for computing source authority. We describe a user study which compared the suggestion of destinations with the previously proposed suggestion of related queries, as well as with traditional, unaided Web search. Results show that search enhanced by destination suggestions outperforms other systems for exploratory tasks, with best performance obtained from mining past user behavior at query-level granularity.

international conference on data mining | 2006

Adaptive Blocking: Learning to Scale Up Record Linkage

Mikhail Bilenko; Beena Kamath; Raymond J. Mooney

Many data mining tasks require computing similarity between pairs of objects. Pairwise similarity computations are particularly important in record linkage systems, as well as in clustering and schema mapping algorithms. Because the number of object pairs grows quadratically with the size of the dataset, computing similarity between all pairs is impractical and becomes prohibitive for large datasets and complex similarity functions. Blocking methods alleviate this problem by efficiently selecting approximately similar object pairs for subsequent distance computations, leaving out the remaining pairs as dissimilar. Previously proposed blocking methods require manually constructing an index- based similarity function or selecting a set of predicates, followed by hand-tuning of parameters. In this paper, we introduce an adaptive framework for automatically learning blocking functions that are efficient and accurate. We describe two predicate-based formulations of learnable blocking functions and provide learning algorithms for training them. The effectiveness of the proposed techniques is demonstrated on real and simulated datasets, on which they prove to be more accurate than non-adaptive blocking methods.

knowledge discovery and data mining | 2011

Scaling up machine learning: parallel and distributed approaches

Ron Bekkerman; Mikhail Bilenko; John Langford

This tutorial gives a broad view of modern approaches for scaling up machine learning and data mining methods on parallel/distributed platforms. Demand for scaling up machine learning is task-specific: for some tasks it is driven by the enormous dataset sizes, for others by model complexity or by the requirement for real-time prediction. Selecting a task-appropriate parallelization platform and algorithm requires understanding their benefits, trade-offs and constraints. This tutorial focuses on providing an integrated overview of state-of-the-art platforms and algorithm choices. These span a range of hardware options (from FPGAs and GPUs to multi-core systems and commodity clusters), programming frameworks (including CUDA, MPI, MapReduce, and DryadLINQ), and learning settings (e.g., semi-supervised and online learning). The tutorial is example-driven, covering a number of popular algorithms (e.g., boosted trees, spectral clustering, belief propagation) and diverse applications (e.g., recommender systems and object recognition in vision). The tutorial is based on (but not limited to) the material from our upcoming Cambridge U. Press edited book which is currently in production. Visit the tutorial website at http://hunch.net/~large_scale_survey/

international conference on data mining | 2005

Adaptive product normalization: using online learning for record linkage in comparison shopping

Mikhail Bilenko; S. Basil; Mehran Sahami

The problem of record linkage focuses on determining whether two object descriptions refer to the same underlying entity. Addressing this problem effectively has many practical applications, e.g., elimination of duplicate records in databases and citation matching for scholarly articles. In this paper, we consider a new domain where the record linkage problem is manifested: Internet comparison shopping. We address the resulting linkage setting that requires learning a similarity function between record pairs from streaming data. The learned similarity function is subsequently used in clustering to determine which records are co-referent and should be linked. We present an online machine learning method for addressing this problem, where a composite similarity function based on a linear combination of basis functions is learned incrementally. We illustrate the efficacy of this approach on several real-world datasets from an Internet comparison shopping site, and show that our method is able to effectively learn various distance functions for product data with differing characteristics. We also provide experimental results that show the importance of considering multiple performance measures in record linkage evaluation.

knowledge discovery and data mining | 2011

Predictive client-side profiles for personalized advertising

Mikhail Bilenko; Matthew Richardson

Personalization is ubiquitous in modern online applications as it provides significant improvements in user experience by adapting it to inferred user preferences. However, there are increasing concerns related to issues of privacy and control of the user data that is aggregated by online systems to power personalized experiences. These concerns are particularly significant for user profile aggregation in online advertising. This paper describes a practical, learning-driven client-side personalization approach for keyword advertising platforms, an emerging application previously not addressed in literature. Our approach relies on storing user-specific information entirely within the users control (in a browser cookie or browser local storage), thus allowing the user to view, edit or purge it at any time (e.g., via a dedicated webpage). We develop a principled, utility-based formulation for the problem of iteratively updating user profiles stored client-side, which relies on calibrated prediction of future user activity. While optimal profile construction is NP-hard for pay-per-click advertising with bid increments, it can be efficiently solved via a greedy approximation algorithm guaranteed to provide a near-optimal solution due to the fact that keyword profile utility is submodular: it exhibits the property of diminishing returns with increasing profile size. We empirically evaluate client-side keyword profiles for keyword advertising on a large-scale dataset from a major search engine. Experiments demonstrate that predictive client-side personalization allows ad platforms to retain almost all of the revenue gains from personalization even if they give users the freedom to opt out of behavior tracking backed by server-side storage. Additionally, we show that advertisers can potentially increase their return on investment significantly by utilizing bid increments for keyword profiles in their ad campaigns.

knowledge discovery and data mining | 2015

Scaling Up Stochastic Dual Coordinate Ascent

Kenneth Tran; Saghar Hosseini; Lin Xiao; Thomas William Finley; Mikhail Bilenko

Stochastic Dual Coordinate Ascent (SDCA) has recently emerged as a state-of-the-art method for solving large-scale supervised learning problems formulated as minimization of convex loss functions. It performs iterative, random-coordinate updates to maximize the dual objective. Due to the sequential nature of the iterations, it is typically implemented as a single-threaded algorithm limited to in-memory datasets. In this paper, we introduce an asynchronous parallel version of the algorithm, analyze its convergence properties, and propose a solution for primal-dual synchronization required to achieve convergence in practice. In addition, we describe a method for scaling the algorithm to out-of-memory datasets via multi-threaded deserialization of block-compressed data. This approach yields sufficient pseudo-randomness to provide the same convergence rate as random-order in-memory access. Empirical evaluation demonstrates the efficiency of the proposed methods and their ability to fully utilize computational resources and scale to out-of-memory datasets.

Explore More