Pavel Berkhin
Yahoo!
Publication
Featured research published by Pavel Berkhin.
Grouping Multidimensional Data | 2006
Pavel Berkhin
Clustering is the division of data into groups of similar objects. In clustering, some details are disregarded in exchange for data simplification. Clustering can be viewed as a data modeling technique that provides for concise summaries of the data. Clustering is therefore related to many disciplines and plays an important role in a broad range of applications. The applications of clustering usually deal with large datasets and data with many attributes. Exploration of such data is a subject of data mining. This survey concentrates on clustering algorithms from a data mining perspective.
Internet Mathematics | 2005
Pavel Berkhin
This survey reviews the research related to PageRank computing. Components of a PageRank vector serve as authority weights for web pages independent of their textual content, solely based on the hyperlink structure of the web. PageRank is typically used as a web search ranking component. This defines the importance of the model and the data structures that underlie PageRank processing. Computing even a single PageRank is a difficult computational task. Computing many PageRanks is a much more complex challenge. Recently, significant effort has been invested in building sets of personalized PageRank vectors. PageRank is also used in many diverse applications other than ranking. We are interested in the theoretical foundations of the PageRank formulation, in the acceleration of PageRank computing, in the effects of particular aspects of web graph structure on the optimal organization of computations, and in PageRank stability. We also review alternative models that lead to authority indices similar to PageRank and the role of such indices in applications other than web search. We also discuss link-based search personalization and outline some aspects of PageRank infrastructure, from associated measures of convergence to link preprocessing.
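As a rough illustration of what "computing a single PageRank" involves, the following is a minimal power-iteration sketch. It is not taken from the survey: the function name `pagerank`, the adjacency-dictionary input, the damping value of 0.85, and the uniform redistribution of rank from pages without out-links are all assumptions made for the example.

```python
def pagerank(links, damping=0.85, tol=1e-10, max_iter=200):
    """Power iteration on a small graph given as {page: [linked pages]}.

    Assumptions of this sketch (not from the survey): uniform teleportation,
    and pages without out-links spread their rank uniformly over all pages.
    """
    nodes = sorted(set(links) | {v for targets in links.values() for v in targets})
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(max_iter):
        # Rank currently held by pages that have no out-links.
        dangling = sum(rank[u] for u in nodes if not links.get(u))
        # Teleportation term plus the redistributed dangling rank.
        new = {u: (1.0 - damping) / n + damping * dangling / n for u in nodes}
        for u in nodes:
            targets = links.get(u, [])
            for v in targets:
                new[v] += damping * rank[u] / len(targets)
        diff = sum(abs(new[u] - rank[u]) for u in nodes)
        rank = new
        if diff < tol:  # L1 change between iterations as the convergence measure
            break
    return rank


ranks = pagerank({"a": ["b"], "b": ["c"], "c": ["a"]})
```

On a web-scale graph the same iteration runs over sparse link data rather than Python dictionaries, which is where the data-structure and acceleration questions mentioned in the abstract arise.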
Internet Mathematics | 2006
Pavel Berkhin
We introduce a novel bookmark-coloring algorithm (BCA) that computes authority weights over the web pages utilizing the web hyperlink structure. The computed vector (BCV) is similar to the PageRank vector defined for a page-specific teleportation. At the same time, BCA is very fast, and BCV is sparse. BCA also has important algebraic properties. If several BCVs corresponding to a set of pages (called a hub) are known, they can be leveraged in computing an arbitrary BCV via a straightforward algebraic process, and hub BCVs can be efficiently computed and encoded.
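The general idea of spreading a unit of weight outward from one bookmarked page can be sketched as follows. This is only a loose illustration under stated assumptions, not the paper's algorithm: the function name `bookmark_coloring`, the retention fraction `alpha`, the cutoff `eps`, and the rule that a page without out-links keeps everything it receives are all hypothetical choices for the example.

```python
from collections import deque


def bookmark_coloring(links, bookmark, alpha=0.15, eps=1e-8):
    """Spread one unit of "paint" from `bookmark` over {page: [linked pages]}.

    Assumptions of this sketch: each page keeps a fraction `alpha` of the
    paint it receives and passes the rest evenly along its out-links;
    amounts below `eps` are not propagated further; a page with no
    out-links keeps all the paint it receives.
    """
    retained = {}             # the sparse result: only reached pages appear
    pending = {bookmark: 1.0}  # paint waiting to be processed, per page
    queue = deque([bookmark])  # a page is queued exactly while it is pending
    while queue:
        page = queue.popleft()
        paint = pending.pop(page)
        targets = links.get(page, [])
        if not targets:
            retained[page] = retained.get(page, 0.0) + paint
            continue
        retained[page] = retained.get(page, 0.0) + alpha * paint
        if paint < eps:
            continue  # too little paint left: drop the remainder
        share = (1.0 - alpha) * paint / len(targets)
        for t in targets:
            if t not in pending:
                queue.append(t)
            pending[t] = pending.get(t, 0.0) + share
    return retained


bcv = bookmark_coloring({"a": ["b", "c"], "b": ["a"], "c": []}, "a")
```

Because paint below the cutoff is dropped, the returned weights sum to slightly less than one, and pages never reached from the bookmark do not appear at all, which is one way such a vector ends up sparse.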
Grouping Multidimensional Data | 2006
Marc Teboulle; Pavel Berkhin; Inderjit S. Dhillon; Yuqiang Guan; Jacob Kogan
The aim of this chapter is to demonstrate that many results attributed to the classical k-means clustering algorithm with the squared Euclidean distance can be extended to many other distance-like functions. We focus on entropy-like distances based on Bregman [88] and Csiszar [119] divergences, which have previously been shown to be useful in various optimization and clustering contexts. Further, the chapter reviews various versions of the classical k-means and BIRCH clustering algorithms with squared Euclidean distance and considers modifications of these algorithms with the proposed families of distance-like functions. Numerical experiments with some of these modifications are reported.
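To make the idea of swapping the distance concrete, here is a minimal batch k-means sketch in which the divergence is a parameter. It is an illustration under assumptions rather than the chapter's algorithms: the names `bregman_kmeans`, `squared_euclidean`, and `generalized_kl` are hypothetical, initial centroids are supplied by the caller, and the centroid update uses the arithmetic mean, which is the minimizer for Bregman divergences when the centroid is the second argument. Csiszar divergences are not covered by this sketch.

```python
import math


def squared_euclidean(x, c):
    return sum((xi - ci) ** 2 for xi, ci in zip(x, c))


def generalized_kl(x, c):
    # Bregman divergence generated by sum(x * log x); needs positive entries.
    return sum(xi * math.log(xi / ci) - xi + ci for xi, ci in zip(x, c))


def bregman_kmeans(points, centroids, divergence, max_iter=100):
    """Batch k-means with a pluggable divergence(x, centroid)."""
    centroids = [list(c) for c in centroids]
    assignment = None
    for _ in range(max_iter):
        # Assignment step: nearest centroid under the chosen divergence.
        new_assignment = [
            min(range(len(centroids)), key=lambda j: divergence(x, centroids[j]))
            for x in points
        ]
        if new_assignment == assignment:
            break  # partition is stable
        assignment = new_assignment
        # Update step: arithmetic mean of each non-empty cluster.
        for j in range(len(centroids)):
            members = [x for x, a in zip(points, assignment) if a == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids


points = [(1.0, 1.0), (1.2, 0.9), (5.0, 5.0), (5.1, 4.8)]
labels, centers = bregman_kmeans(points, [(1.0, 1.0), (5.0, 5.0)], generalized_kl)
```

Passing `squared_euclidean` instead of `generalized_kl` recovers the classical algorithm, so the two runs can be compared on the same data.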
Knowledge Discovery and Data Mining | 2000
Jonathan D. Becher; Pavel Berkhin; Edmund Freeman
Having access to large data sets for the purpose of predictive data mining does not guarantee good models, even when the size of the training data is virtually unlimited. Instead, careful data preprocessing is required, including data cleansing, handling missing values, attribute representation and encoding, and generating derived attributes. In particular, the selection of the most appropriate subset of attributes to include is a critical step in building an accurate and efficient model. We describe an automated approach to the exploration, preprocessing, and selection of the optimal attribute subset whose goal is to simplify the KDD process and dramatically shorten the time to build a model. Our implementation finds inappropriate and suspicious attributes, performs target dependency analysis, determines optimal attribute encoding, generates new derived attributes, and provides a flexible approach to attribute selection. We present results generated by an industrial KDD environment called the Accrue Decision Series on several real-world Web data sets.
Archive | 2006
Zhichen Xu; Pavel Berkhin; Daniel E. Rose; Jiangchang Mao; David Ku; Qi Lu; Eckart Walther; Chung-Man Tam
Very Large Data Bases | 2006
Zoltán Gyöngyi; Pavel Berkhin; Hector Garcia-Molina; Jan O. Pedersen
Archive | 2007
John Canny; Shi Zhong; Scott Gaffney; Chad Brower; Pavel Berkhin; George H. John
Archive | 2006
Pavel Berkhin; Usama M. Fayyad; Prabhakar Raghavan; Andrew Tomkins
Archive | 2006
Usama M. Fayyad; Pavel Berkhin; Andrew Tomkins; Rajesh Parekh; Jignashu Parikh; David Wellspring Sculley II