Pavel Berkhin | Researchain

Publication

Featured research published by Pavel Berkhin.

Grouping Multidimensional Data | 2006

A Survey of Clustering Data Mining Techniques

Pavel Berkhin

Clustering is the division of data into groups of similar objects. In clustering, some details are disregarded in exchange for data simplification. Clustering can be viewed as a data modeling technique that provides for concise summaries of the data. Clustering is therefore related to many disciplines and plays an important role in a broad range of applications. The applications of clustering usually deal with large datasets and data with many attributes. Exploration of such data is a subject of data mining. This survey concentrates on clustering algorithms from a data mining perspective.

Internet Mathematics | 2005

A Survey on PageRank Computing

Pavel Berkhin

This survey reviews the research related to PageRank computing. Components of a PageRank vector serve as authority weights for web pages independent of their textual content, solely based on the hyperlink structure of the web. PageRank is typically used as a web search ranking component. This defines the importance of the model and the data structures that underlie PageRank processing. Computing even a single PageRank is a difficult computational task. Computing many PageRanks is a much more complex challenge. Recently, significant effort has been invested in building sets of personalized PageRank vectors. PageRank is also used in many diverse applications other than ranking. We are interested in the theoretical foundations of the PageRank formulation, in the acceleration of PageRank computing, in the effects of particular aspects of web graph structure on the optimal organization of computations, and in PageRank stability. We also review alternative models that lead to authority indices similar to PageRank and the role of such indices in applications other than web search. We also discuss link-based search personalization and outline some aspects of PageRank infrastructure from associated measures of convergence to link preprocessing.
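As an illustration of what "computing a single PageRank" involves, here is a minimal power-iteration sketch on a toy graph. It is not taken from the survey: the function name `pagerank`, the damping value 0.85, the uniform teleportation, and the uniform redistribution of rank from dangling pages are all assumptions chosen for the example.

```python
def pagerank(links, damping=0.85, tol=1e-10, max_iter=200):
    """Power iteration on a dict {page: [pages it links to]} (illustrative only)."""
    # Collect every page that appears as a source or as a link target.
    nodes = sorted(set(links) | {v for targets in links.values() for v in targets})
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(max_iter):
        # Assumption: rank held by pages without out-links is spread uniformly.
        dangling = sum(rank[u] for u in nodes if not links.get(u))
        new = {u: (1.0 - damping) / n + damping * dangling / n for u in nodes}
        for u in nodes:
            targets = links.get(u, [])
            for v in targets:
                new[v] += damping * rank[u] / len(targets)
        delta = sum(abs(new[u] - rank[u]) for u in nodes)
        rank = new
        if delta < tol:
            break
    return rank


ranks = pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"]})
```

In this toy graph page `c` is linked from both `a` and `b`, so it ends up with a larger weight than `b`. Replacing the uniform teleportation term with a distribution concentrated on a few pages would give a personalized variant.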

Internet Mathematics | 2006

Bookmark-Coloring Algorithm for Personalized PageRank Computing

Pavel Berkhin

We introduce a novel bookmark-coloring algorithm (BCA) that computes authority weights over the web pages utilizing the web hyperlink structure. The computed vector (BCV) is similar to the PageRank vector defined for a page-specific teleportation. Meanwhile, BCA is very fast, and BCV is sparse. BCA also has important algebraic properties. If several BCVs corresponding to a set of pages (called hub) are known, they can be leveraged in computing arbitrary BCV via a straightforward algebraic process and hub BCVs can be efficiently computed and encoded.
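The abstract does not spell out the procedure, so the following is only a rough push-style sketch of the general idea: a unit of "paint" starts at the bookmarked page, each page retains a fraction of the paint it receives, and the rest is passed along its out-links. The name `bookmark_coloring`, the retention fraction `alpha`, the threshold `eps`, and the handling of dangling pages are assumptions made for the example, not details confirmed by the text above.

```python
from collections import defaultdict, deque


def bookmark_coloring(links, bookmark, alpha=0.15, eps=1e-8):
    """Spread one unit of paint from `bookmark`; return the retained amounts."""
    retained = defaultdict(float)  # sparse result vector
    pending = defaultdict(float)   # paint that has arrived but is not yet processed
    pending[bookmark] = 1.0
    queue = deque([bookmark])
    queued = {bookmark}
    while queue:
        u = queue.popleft()
        queued.discard(u)
        w = pending.pop(u, 0.0)
        retained[u] += alpha * w          # the page keeps a fraction alpha
        targets = links.get(u, [])
        if not targets:
            continue                      # assumption: paint at dangling pages is dropped
        share = (1.0 - alpha) * w / len(targets)
        for v in targets:
            pending[v] += share
            # Amounts below eps are left unprocessed, which keeps the result sparse.
            if v not in queued and pending[v] >= eps:
                queue.append(v)
                queued.add(v)
    return dict(retained)


bcv = bookmark_coloring({"a": ["b"], "b": ["a"]}, "a")
```

The threshold is what makes the computation fast and the output sparse in this sketch: pages that only ever receive negligible paint never enter the queue and never appear in the result.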

Grouping Multidimensional Data | 2006

Clustering with Entropy-Like k-Means Algorithms

Marc Teboulle; Pavel Berkhin; Inderjit S. Dhillon; Yuqiang Guan; Jacob Kogan

The aim of this chapter is to demonstrate that many results attributed to the classical k-means clustering algorithm with the squared Euclidean distance can be extended to many other distance-like functions. We focus on entropy-like distances based on Bregman [88] and Csiszar [119] divergences, which have previously been shown to be useful in various optimization and clustering contexts. Further, the chapter reviews various versions of the classical k-means and BIRCH clustering algorithms with squared Euclidean distance and considers modifications of these algorithms with the proposed families of distance-like functions. Numerical experiments with some of these modifications are reported.
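To make the idea of swapping the distance concrete, the sketch below runs a Lloyd-style k-means loop with the Kullback-Leibler divergence, one example of a Bregman divergence, in place of squared Euclidean distance. It is an illustration under stated assumptions rather than the chapter's algorithm: the names `kl` and `kl_kmeans`, the small smoothing constant `tiny`, and the toy data are hypothetical. The centroid update stays the arithmetic mean, which is the minimizer of the summed Bregman divergence from the cluster's points to a centroid.

```python
import math


def kl(p, q, tiny=1e-12):
    """KL divergence from p to q for probability vectors (tiny avoids log(0))."""
    return sum(pi * math.log((pi + tiny) / (qi + tiny)) for pi, qi in zip(p, q) if pi > 0)


def mean(points):
    """Coordinate-wise arithmetic mean of a non-empty list of vectors."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def kl_kmeans(points, centroids, max_iter=100):
    """Alternate KL-based assignment and mean update until assignments stop changing."""
    centroids = [list(c) for c in centroids]
    assign = None
    for _ in range(max_iter):
        new_assign = [
            min(range(len(centroids)), key=lambda j: kl(p, centroids[j]))
            for p in points
        ]
        if new_assign == assign:
            break
        assign = new_assign
        for j in range(len(centroids)):
            members = [p for p, a in zip(points, assign) if a == j]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[j] = mean(members)
    return assign, centroids


points = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
assign, centroids = kl_kmeans(points, [[0.9, 0.1], [0.1, 0.9]])
```

Only the `kl` function is specific to the chosen divergence; the rest of the loop is the same as in the squared Euclidean case.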

Knowledge Discovery and Data Mining | 2000

Automating exploratory data analysis for efficient data mining

Jonathan D. Becher; Pavel Berkhin; Edmund Freeman

Having access to large data sets for the purpose of predictive data mining does not guarantee good models, even when the size of the training data is virtually unlimited. Instead, careful data preprocessing is required, including data cleansing, handling missing values, attribute representation and encoding, and generating derived attributes. In particular, the selection of the most appropriate subset of attributes to include is a critical step in building an accurate and efficient model. We describe an automated approach to the exploration, preprocessing, and selection of the optimal attribute subset whose goal is to simplify the KDD process and dramatically shorten the time to build a model. Our implementation finds inappropriate and suspicious attributes, performs target dependency analysis, determines optimal attribute encoding, generates new derived attributes, and provides a flexible approach to attribute selection. We present results generated by an industrial KDD environment called the Accrue Decision Series on several real-world Web data sets.

Archive | 2006