Paul Heymann
Stanford University
Publication
Featured research published by Paul Heymann.
web search and data mining | 2008
Paul Heymann; Georgia Koutrika; Hector Garcia-Molina
Social bookmarking is a recent phenomenon which has the potential to give us a great deal of data about pages on the web. One major question is whether that data can be used to augment systems like web search. To answer this question, over the past year we have gathered what we believe to be the largest dataset from a social bookmarking site yet analyzed by academic researchers. Our dataset represents about forty million bookmarks from the social bookmarking site del.icio.us. We contribute a characterization of posts to del.icio.us: how many bookmarks exist (about 115 million), how fast the site is growing, and how active the URLs being posted about are (quite active). We also contribute a characterization of tags used by bookmarkers. We found that certain tags tend to gravitate towards certain domains, and vice versa. We also found that tags occur in over 50 percent of the pages that they annotate, and in only 20 percent of cases do they not occur in the page text, backlink page text, or forward link page text of the pages they annotate. We conclude that social bookmarking can provide search data not currently provided by other sources, though it may currently lack the size and distribution of tags necessary to make a significant impact.
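The tag-overlap statistic mentioned in this abstract (how often a tag also appears in the text of the page it annotates) can be illustrated with a small sketch. This is not the authors' measurement code: the function name `tag_overlap_fraction`, the whitespace tokenization, and the toy pages are all assumptions made for illustration.

```python
def tag_overlap_fraction(annotated_pages):
    """Return the fraction of (page, tag) pairs whose tag appears in the page text.

    `annotated_pages` is a list of (page_text, tags) tuples. Matching is a
    simple lowercase whole-token comparison, which is an assumption here.
    """
    hits = 0
    total = 0
    for text, tags in annotated_pages:
        tokens = set(text.lower().split())
        for tag in tags:
            total += 1
            if tag.lower() in tokens:
                hits += 1
    return hits / total if total else 0.0


# Hypothetical toy data: two pages, each with two tags.
toy_pages = [
    ("intro to python programming", ["python", "tutorial"]),
    ("best pasta recipe", ["recipe", "food"]),
]
overlap = tag_overlap_fraction(toy_pages)
print(overlap)  # 0.5 (two of the four tags occur in their page's text)
```

A real analysis would also need to check backlink and forward-link page text, as the abstract describes, and would likely use a more careful tokenizer.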
IEEE Internet Computing | 2007
Paul Heymann; Georgia Koutrika; Hector Garcia-Molina
In recent years, social Web sites have become important components of the Web. With their success, however, has come a growing influx of spam. If left unchecked, spam threatens to undermine resource sharing, interactivity, and openness. This article surveys three categories of potential countermeasures - those based on detection, demotion, and prevention. Although many of these countermeasures have been proposed before for email and Web spam, the authors find that their applicability to social Web sites differs.
web search and data mining | 2009
Daniel Ramage; Paul Heymann; Christopher D. Manning; Hector Garcia-Molina
Automatically clustering web pages into semantic groups promises improved search and browsing on the web. In this paper, we demonstrate how user-generated tags from large-scale social bookmarking websites such as del.icio.us can be used as a complementary data source to page text and anchor text for improving automatic clustering of web pages. This paper explores the use of tags in 1) K-means clustering in an extended vector space model that includes tags as well as page text and 2) a novel generative clustering algorithm based on latent Dirichlet allocation that jointly models text and tags. We evaluate the models by comparing their output to an established web directory. We find that the naive inclusion of tagging data improves cluster quality versus page text alone, but a more principled inclusion can substantially improve the quality of all models with a statistically significant absolute F-score increase of 4%. The generative model outperforms K-means with another 8% F-score increase.
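As a rough illustration of the first approach in this abstract, the sketch below clusters pages with K-means over a vector that concatenates a word block and a tag block. It is one possible reading of an "extended vector space", not the paper's implementation: the separate L2 normalization of each block, the `tag_weight` parameter, the deterministic initialization, and the toy pages are all assumptions.

```python
import math
from collections import Counter


def l2_normalize(vec):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else list(vec)


def combined_vector(words, tags, word_vocab, tag_vocab, tag_weight=1.0):
    """Concatenate a normalized word-count block and a normalized tag-count block."""
    word_counts, tag_counts = Counter(words), Counter(tags)
    word_part = l2_normalize([word_counts[w] for w in word_vocab])
    tag_part = l2_normalize([tag_counts[t] for t in tag_vocab])
    return word_part + [tag_weight * x for x in tag_part]


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=10):
    """Plain K-means; the first k points seed the centroids (a simplification)."""
    centroids = [list(p) for p in points[:k]]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assign each point to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c, p=p: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment


# Hypothetical toy pages: (page words, user tags).
pages = [
    (["python", "code", "tutorial"], ["programming", "python"]),
    (["pasta", "recipe", "sauce"], ["cooking", "food"]),
    (["code", "compiler"], ["programming"]),
    (["recipe", "oven", "bake"], ["cooking"]),
]
word_vocab = sorted({w for words, _ in pages for w in words})
tag_vocab = sorted({t for _, tags in pages for t in tags})
vectors = [combined_vector(w, t, word_vocab, tag_vocab) for w, t in pages]
labels = kmeans(vectors, k=2)
print(labels)  # [0, 1, 0, 1]
```

Normalizing the two blocks separately keeps a long page text from swamping a handful of tags; setting `tag_weight` to 0 reduces the sketch to clustering on page text alone.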
web search and data mining | 2010
Paul Heymann; Andreas Paepcke; Hector Garcia-Molina
A fundamental premise of tagging systems is that regular users can organize large collections for browsing and other tasks using uncontrolled vocabularies. Until now, that premise has remained relatively unexamined. Using library data, we test the tagging approach to organizing a collection. We find that tagging systems have three major large scale organizational features: consistency, quality, and completeness. In addition to testing these features, we present results suggesting that users produce tags similar to the topics designed by experts, that paid tagging can effectively supplement tags in a tagging system, and that information integration may be possible across tagging systems.
Archive | 2006
Paul Heymann; Hector Garcia-Molina
international acm sigir conference on research and development in information retrieval | 2008
Paul Heymann; Daniel Ramage; Hector Garcia-Molina
adversarial information retrieval on the web | 2007
Georgia Koutrika; Frans Adjie Effendi; Zoltán Gyöngyi; Paul Heymann; Hector Garcia-Molina
ACM Transactions on The Web | 2008
Georgia Koutrika; Frans Adjie Effendi; Zoltán Gyöngyi; Paul Heymann; Hector Garcia-Molina
international world wide web conferences | 2011
Paul Heymann; Hector Garcia-Molina
web search and data mining | 2009
Paul Heymann; Hector Garcia-Molina