Publication


Featured research published by David M. Pennock.


International ACM SIGIR Conference on Research and Development in Information Retrieval | 2002

Methods and metrics for cold-start recommendations

Andrew I. Schein; Alexandrin Popescul; Lyle H. Ungar; David M. Pennock

We have developed a method for recommending items that combines content and collaborative data under a single probabilistic framework. We benchmark our algorithm against a naïve Bayes classifier on the cold-start problem, where we wish to recommend items that no one in the community has yet rated. We systematically explore three testing methodologies using a publicly available data set, and explain how these methods apply to specific real-world applications. We advocate heuristic recommenders when benchmarking to give competent baseline performance. We introduce a new performance metric, the CROC curve, and demonstrate empirically that the various components of our testing strategy combine to obtain deeper understanding of the performance characteristics of recommender systems. Though the emphasis of our testing is on cold-start recommending, our methods for recommending and evaluation are general.
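
A minimal sketch of a CROC-style ("customer ROC") evaluation, assuming its defining constraint is that every user receives the same number of recommendations at each operating point, in contrast to a global ROC curve that pools all user-item scores together. Data shapes and names here are illustrative assumptions, not the authors' code:

```python
# Sketch of a CROC-style curve: at step k, every user gets their top-k
# recommendations, and we accumulate hit/miss counts across all users.
# Assumes each user's relevant items appear among that user's scored items.

def croc_points(scores, relevant):
    """scores: {user: [(item, score), ...]}; relevant: {user: set(items)}.
    Returns a list of (false-positive rate, true-positive rate) points."""
    ranked = {u: [i for i, _ in sorted(s, key=lambda x: -x[1])]
              for u, s in scores.items()}
    total_pos = sum(len(relevant[u]) for u in ranked)
    total_neg = sum(len(ranked[u]) - len(relevant[u]) for u in ranked)
    max_k = max(len(r) for r in ranked.values())
    points = [(0.0, 0.0)]
    for k in range(1, max_k + 1):
        tp = sum(len(set(r[:k]) & relevant[u]) for u, r in ranked.items())
        fp = sum(len(set(r[:k]) - relevant[u]) for u, r in ranked.items())
        points.append((fp / total_neg, tp / total_pos))
    return points
```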


International World Wide Web Conference | 2002

Using web structure for classifying and describing web pages

Eric J. Glover; Kostas Tsioutsiouliklis; Steve Lawrence; David M. Pennock; Gary William Flake

The structure of the web is increasingly being used to improve organization, search, and analysis of information on the web. For example, Google uses the text in citing documents (documents that link to the target document) for search. We analyze the relative utility of document text, and the text in citing documents near the citation, for classification and description. Results show that the text in citing documents, when available, often has greater discriminative and descriptive power than the text in the target document itself. The combination of evidence from a document and citing documents can improve on either information source alone. Moreover, by ranking words and phrases in the citing documents according to expected entropy loss, we are able to accurately name clusters of web pages, even with very few positive examples. Our results confirm, quantify, and extend previous research using web structure in these areas, introducing new methods for classification and description of pages.
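
Expected entropy loss (information gain) is a standard quantity, so a compact sketch can make the ranking step concrete. The data shapes below are our assumptions, not the paper's code:

```python
import math

# Rank a word by how much knowing whether it appears in the citing-document
# text reduces uncertainty about class membership (expected entropy loss).

def entropy(p):
    return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def expected_entropy_loss(docs, labels, word):
    """docs: list of token sets; labels: parallel list of 0/1 class labels."""
    n = len(docs)
    prior = entropy(sum(labels) / n)
    has = [y for d, y in zip(docs, labels) if word in d]
    lacks = [y for d, y in zip(docs, labels) if word not in d]
    posterior = 0.0
    for subset in (has, lacks):
        if subset:
            posterior += len(subset) / n * entropy(sum(subset) / len(subset))
    return prior - posterior  # larger = more discriminative / descriptive
```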


International World Wide Web Conference | 2002

The structure of broad topics on the web

Soumen Chakrabarti; Mukul M. Joshi; Kunal Punera; David M. Pennock

The Web graph is a giant social network whose properties have been measured and modeled extensively in recent years. Most such studies concentrate on the graph structure alone, and do not consider textual properties of the nodes. Consequently, Web communities have been characterized purely in terms of graph structure and not on page content. We propose that a topic taxonomy such as Yahoo! or the Open Directory provides a useful framework for understanding the structure of content-based clusters and communities. In particular, using a topic taxonomy and an automatic classifier, we can measure the background distribution of broad topics on the Web, and analyze the capability of recent random walk algorithms to draw samples which follow such distributions. In addition, we can measure the probability that a page about one broad topic will link to another broad topic. Extending this experiment, we can measure how quickly topic context is lost while walking randomly on the Web graph. Estimates of this topic mixing distance may explain why a global PageRank is still meaningful in the context of broad queries. In general, our measurements may prove valuable in the design of community-specific crawlers and link-based ranking systems.
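
To make the "topic mixing distance" idea concrete, here is an illustrative sketch (our construction, not the paper's measurement code) that walks randomly on a topic-labeled graph and tracks how quickly walkers drift off the starting topic:

```python
import random

# Estimate topic retention: the fraction of random walkers still on a page
# of their starting topic after each step. A fast decay suggests topic
# context is lost quickly; a slow decay suggests topical locality.

def topic_retention(adj, topic, start_nodes, steps=20, walks=1000):
    """adj: {node: [out-neighbors]}; topic: {node: topic label}."""
    retained = [0] * (steps + 1)
    for _ in range(walks):
        node = random.choice(start_nodes)
        t0 = topic[node]
        for s in range(steps + 1):
            retained[s] += topic[node] == t0
            if not adj[node]:                 # dead end: restart the walk
                node = random.choice(start_nodes)
            else:
                node = random.choice(adj[node])
    return [r / walks for r in retained]
```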


Conference on Information and Knowledge Management | 2002

Inferring hierarchical descriptions

Eric J. Glover; David M. Pennock; Steve Lawrence; Robert Krovetz

We create a statistical model for inferring hierarchical term relationships about a topic, given only a small set of example web pages on the topic, without prior knowledge of any hierarchical information. The model can utilize either the full text of the pages in the cluster or the context of links to the pages. To support the model, we use ground truth data taken from the category labels in the Open Directory. We show that the model accurately separates terms in the following classes: self terms describing the cluster, parent terms describing more general concepts, and child terms describing specializations of the cluster. For example, for a set of biology pages, sample parent, self, and child terms are science, biology, and genetics respectively. We create an algorithm to predict parent, self, and child terms using the new model, and compare the predictions to the ground truth data. The algorithm accurately ranks a majority of the ground truth terms highly, and identifies additional complementary terms missing in the Open Directory.
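
The underlying intuition can be caricatured with document frequencies: self terms are common in the cluster but rare in the general collection, parent terms are common in both, and child terms are concentrated in only part of the cluster. A toy heuristic along those lines (emphatically not the paper's statistical model; the thresholds are invented for illustration):

```python
# Toy classifier for the parent/self/child intuition, NOT the paper's model.

def label_term(cluster_df, corpus_df, common=0.30, rare=0.05):
    """cluster_df / corpus_df: fraction of documents containing the term
    within the cluster and within the general collection, respectively."""
    if cluster_df >= common and corpus_df >= common:
        return "parent"   # general concept, e.g. "science"
    if cluster_df >= common and corpus_df < rare:
        return "self"     # describes the cluster itself, e.g. "biology"
    if rare <= cluster_df < common and corpus_df < rare:
        return "child"    # specialization, e.g. "genetics"
    return "other"
```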


Very Large Data Bases | 2002

REFEREE: an open framework for practical testing of recommender systems using ResearchIndex

Dan Cosley; Steve Lawrence; David M. Pennock

Automated recommendation (e.g., personalized product recommendation on an e-commerce web site) is an increasingly valuable service associated with many databases--typically online retail catalogs and web logs. Currently, a major obstacle for evaluating recommendation algorithms is the lack of any standard, public, real-world testbed appropriate for the task. In an attempt to fill this gap, we have created REFEREE, a framework for building recommender systems using ResearchIndex--a huge online digital library of computer science research papers--so that anyone in the research community can develop, deploy, and evaluate recommender systems relatively easily and quickly. ResearchIndex is in many ways ideal for evaluating recommender systems, especially so-called hybrid recommenders that combine information filtering and collaborative filtering techniques. The documents in the database are associated with a wealth of content information (author, title, abstract, full text) and collaborative information (user behaviors), as well as linkage information via the citation structure. Our framework supports more realistic evaluation metrics that assess user buy-in directly, rather than resorting to offline metrics like prediction accuracy that may have little to do with end user utility. The sheer scale of ResearchIndex (over 500,000 documents with thousands of user accesses per hour) will force algorithm designers to make real-world trade-offs that consider performance, not just accuracy. We present our own trade-off decisions in building an example hybrid recommender called PD-Live. The algorithm uses content-based similarity information to select a set of documents from which to recommend, and collaborative information to rank the documents. PD-Live performs reasonably well compared to other recommenders in ResearchIndex.
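
The two-stage strategy (content-based selection, then collaborative ranking) is easy to sketch. Everything below -- function names and data shapes -- is our assumption, not PD-Live's actual interface:

```python
# Schematic two-stage hybrid recommender: content similarity narrows a huge
# corpus to a candidate set, then collaborative signals order the shortlist.

def recommend(query_doc, docs, content_sim, coaccess_count, k=100, n=10):
    """docs: candidate ids; content_sim(a, b) -> similarity in [0, 1];
    coaccess_count(d) -> how often users who viewed query_doc also viewed d."""
    # Stage 1: content-based filtering selects k plausible candidates.
    candidates = sorted(docs, key=lambda d: content_sim(query_doc, d),
                        reverse=True)[:k]
    # Stage 2: collaborative information ranks the shortlist.
    return sorted(candidates, key=coaccess_count, reverse=True)[:n]
```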


Knowledge Discovery and Data Mining | 2001

Extracting collective probabilistic forecasts from web games

David M. Pennock; Steve Lawrence; Finn Årup Nielsen; C. Lee Giles

Game sites on the World Wide Web draw people from around the world with specialized interests, skills, and knowledge. Data from the games often reflects the players' expertise and will to win. We extract probabilistic forecasts from data obtained from three online games: the Hollywood Stock Exchange (HSX), the Foresight Exchange (FX), and the Formula One Pick Six (F1P6) competition. We find that all three yield accurate forecasts of uncertain future events. In particular, prices of so-called movie stocks on HSX are good indicators of actual box office returns. Prices of HSX securities in Oscar, Emmy, and Grammy awards correlate well with observed frequencies of winning. FX prices are reliable indicators of future developments in science and technology. Collective predictions from players in the F1 competition serve as good forecasts of true race outcomes. In some cases, forecasts induced from game data are more reliable than expert opinions. We argue that web games naturally attract well-informed and well-motivated players, and thus offer a valuable and oft-overlooked source of high-quality data with significant predictive value.
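
A standard way to check claims like "prices correlate well with observed frequencies of winning" is a calibration table: bucket prices interpreted as probabilities and compare each bucket's mean price with the empirical win rate. A minimal sketch under our own data-shape assumptions:

```python
from collections import defaultdict

# Calibration check: if prices are well-calibrated probabilities, each
# bucket's mean price should roughly match its observed win rate.

def calibration_table(prices, outcomes, buckets=10):
    """prices: floats in [0, 1]; outcomes: 1 if the event happened, else 0."""
    sums = defaultdict(lambda: [0.0, 0, 0])   # bucket -> [price sum, wins, n]
    for p, y in zip(prices, outcomes):
        b = min(int(p * buckets), buckets - 1)
        sums[b][0] += p
        sums[b][1] += y
        sums[b][2] += 1
    return {b: (s / n, w / n, n)              # (mean price, win rate, count)
            for b, (s, w, n) in sorted(sums.items())}
```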


Electronic Commerce | 2003

Information incorporation in online in-game sports betting markets

Sandip Debnath; David M. Pennock; C. Lee Giles; Steve Lawrence

We analyze data from online in-game sports betting markets (where betting is allowed continuously throughout a game), including 34 markets based on soccer (European football) games from the 2002 World Cup, and 18 basketball games from the 2002 USA National Basketball Association (NBA) championship. We show that prices on average approach the correct outcome over time, and the price dynamics in the markets are closely coupled with game events, agreeing with efficient market assumptions. We also examine qualitative distinctions between the two types of games.
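
One simple way to visualize "prices on average approach the correct outcome over time" is to average price trajectories separately over markets whose event ultimately happened and those where it did not. A sketch under our own assumptions about the data shape:

```python
# If markets incorporate information, the mean price trajectory of eventual
# winners should drift toward 1 and that of eventual losers toward 0.
# Assumes equal-length price series and at least one market of each outcome.

def mean_trajectories(markets):
    """markets: list of (price_series, outcome) pairs, outcome in {0, 1}."""
    length = len(markets[0][0])
    won = [m for m in markets if m[1] == 1]
    lost = [m for m in markets if m[1] == 0]
    avg = lambda group, t: sum(s[t] for s, _ in group) / len(group)
    return ([avg(won, t) for t in range(length)],
            [avg(lost, t) for t in range(length)])
```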


Conference on Information and Knowledge Management | 2000

Persistence of information on the web: analyzing citations contained in research articles

Steve Lawrence; Frans Coetzee; Eric J. Glover; Gary William Flake; David M. Pennock; Bob Krovetz; Finn Årup Nielsen; A. Kruger; Lee Giles

We analyze the persistence of information on the web, looking at the percentage of invalid URLs contained in academic articles within the CiteSeer (ResearchIndex) database. The number of URLs contained in the papers has increased from an average of 0.06 in 1993 to 1.6 in 1999. We found that a significant percentage of URLs are now invalid, ranging from 23% for 1999 articles, to 53% for 1994. We also found that for almost all of the invalid URLs, it was possible to locate the information (or highly related information) in an alternate location, primarily with the use of search engines. However, the ability to relocate missing information varied according to search experience and effort expended. Citation practices suggest that more information may be lost in the future unless these practices are improved. We discuss persistent URL standards and their usage, and give recommendations for citing URLs in research articles as well as for finding the new location of invalid URLs.
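
A modern re-run of the core measurement could look roughly like the sketch below (our assumptions; a serious study needs retries, rate limiting, and care with servers that reject HEAD requests or automated clients):

```python
import requests

# For each publication year, try each cited URL and report the fraction
# that no longer resolves -- a crude proxy for link rot over time.

def invalid_fraction(urls_by_year, timeout=10):
    """urls_by_year: {year: [url, ...]} -> {year: fraction of invalid URLs}."""
    results = {}
    for year, urls in sorted(urls_by_year.items()):
        invalid = 0
        for url in urls:
            try:
                r = requests.head(url, allow_redirects=True, timeout=timeout)
                if r.status_code >= 400:
                    invalid += 1
            except requests.RequestException:
                invalid += 1
        results[year] = invalid / len(urls)
    return results
```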


Technical Symposium on Computer Science Education | 1996

Home-study software: flexible, interactive, and distributed software for independent study

Christopher Connelly; Alan W. Biermann; David M. Pennock; Peter Wu



Computer Science Education | 1996

Home Study Software: Complementary Systems for Computer Science Courses

Christopher Connelly; Alan W. Biermann; David M. Pennock; Peter Wu


Collaboration


Dive into David M. Pennock's collaborations.

Top Co-Authors

Steve Lawrence, Pennsylvania State University
C. Lee Giles, Pennsylvania State University
Lyle H. Ungar, University of Pennsylvania