Publication


Featured research published by Dennis Fetterly.


Software: Practice and Experience | 2004

A large-scale study of the evolution of web pages

Dennis Fetterly; Mark S. Manasse; Marc Najork; Janet L. Wiener

How fast does the Web change? Does most of the content remain unchanged once it has been authored, or are the documents continuously updated? Do pages change a little or a lot? Is the extent of change correlated to any other property of the page? All of these questions are of interest to those who mine the Web, including all the popular search engines, but few studies have been performed to date to answer them.


Latin American Web Congress (LA-WEB) | 2003

On the evolution of clusters of near-duplicate Web pages

Dennis Fetterly; Mark S. Manasse; Marc Najork

We expand on a 1997 study of the amount and distribution of near-duplicate pages on the World Wide Web. We downloaded a set of 150 million Web pages on a weekly basis over the span of 11 weeks. We then determined which of these pages are near-duplicates of one another, and tracked how clusters of near-duplicate documents evolved over time. We found that 29.2% of all Web pages are very similar to other pages, and that 22.2% are virtually identical to other pages. We also found that clusters of near-duplicate documents are fairly stable: Two documents that are near-duplicates of one another are very likely to still be near-duplicates 10 weeks later. This result is of significant relevance to search engines: Web crawlers can be fairly confident that two pages that have been found to be near-duplicates of one another will continue to be so for the foreseeable future, and may thus decide to recrawl only one version of that page, or at least to lower the download priority of the other versions, thereby freeing up crawling resources that can be brought to bear more productively somewhere else.
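For illustration, here is a minimal sketch of the shingle-based resemblance computation that underlies this kind of near-duplicate analysis. The shingle length, hash function, and example documents below are assumptions made for the sketch, not the study's exact parameters.

```python
# A minimal, illustrative sketch of shingle-based near-duplicate detection
# (the general family of techniques used in studies like this one; the exact
# feature extraction and thresholds are assumptions, not the paper's).
import hashlib

def shingles(text, k=5):
    """Return the set of hashed k-word shingles of a document."""
    words = text.lower().split()
    return {
        hashlib.md5(" ".join(words[i:i + k]).encode()).hexdigest()
        for i in range(max(len(words) - k + 1, 1))
    }

def resemblance(a, b):
    """Jaccard similarity of two shingle sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

doc1 = "the quick brown fox jumps over the lazy dog near the river bank"
doc2 = "the quick brown fox jumps over the lazy dog near the river shore"
print(resemblance(shingles(doc1), shingles(doc2)))  # high value -> likely near-duplicates
```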


symposium on operating systems principles | 2013

Dandelion: a compiler and runtime for heterogeneous systems

Christopher J. Rossbach; Yuan Yu; Jon Currey; Jean-Philippe Martin; Dennis Fetterly

Computer systems increasingly rely on heterogeneity to achieve greater performance, scalability and energy efficiency. Because heterogeneous systems typically comprise multiple execution contexts with different programming abstractions and runtimes, programming them remains extremely challenging. Dandelion is a system designed to address this programmability challenge for data-parallel applications. Dandelion provides a unified programming model for heterogeneous systems that span diverse execution contexts including CPUs, GPUs, FPGAs, and the cloud. It adopts the .NET LINQ (Language INtegrated Query) approach, integrating data-parallel operators into general purpose programming languages such as C# and F#. It therefore provides an expressive data model and native language integration for user-defined functions, enabling programmers to write applications using standard high-level languages and development tools. Dandelion automatically and transparently distributes data-parallel portions of a program to available computing resources, including compute clusters for distributed execution and CPU and GPU cores of individual nodes for parallel execution. To enable automatic execution of .NET code on GPUs, Dandelion cross-compiles .NET code to CUDA kernels and uses the PTask runtime [85] to manage GPU execution. This paper discusses the design and implementation of Dandelion, focusing on the distributed CPU and GPU implementation. We evaluate the system using a diverse set of workloads.
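As a loose illustration of the programming model, the sketch below mimics the shape of a Dandelion-style data-parallel pipeline in Python rather than the system's actual .NET/LINQ API: a user-written kernel is applied to partitions of the input in parallel, and the partial results are merged. The function names and partitioning scheme are invented for the example.

```python
# A language-shifted toy analogue of a data-parallel pipeline: one declarative
# computation whose per-partition work could, in a system like Dandelion, be
# dispatched to CPUs, GPUs, or cluster nodes. Not Dandelion's actual API.
from concurrent.futures import ProcessPoolExecutor
from functools import reduce

def square_evens(partition):
    # The "kernel": in Dandelion this user function would be cross-compiled
    # (e.g. to CUDA) and scheduled on whichever execution context is available.
    return [x * x for x in partition if x % 2 == 0]

def run_pipeline(data, n_partitions=4):
    # Partition the input, run the kernel on each partition in parallel,
    # then merge the partial results -- the shape of a distributed query.
    chunks = [data[i::n_partitions] for i in range(n_partitions)]
    with ProcessPoolExecutor() as pool:
        partials = list(pool.map(square_evens, chunks))
    return sorted(reduce(lambda a, b: a + b, partials, []))

if __name__ == "__main__":
    print(run_pipeline(list(range(20))))
```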


International ACM SIGIR Conference on Research and Development in Information Retrieval | 2009

The impact of crawl policy on web search effectiveness

Dennis Fetterly; Nick Craswell; Vishwa Vinay

Crawl selection policy has a direct influence on Web search effectiveness, because a useful page that is not selected for crawling will also be absent from search results. Yet there has been little or no work on measuring this effect. We introduce an evaluation framework, based on relevance judgments pooled from multiple search engines, measuring the maximum potential NDCG that is achievable using a particular crawl. This allows us to evaluate different crawl policies and investigate important scenarios like selection stability over multiple iterations. We conduct two sets of crawling experiments at the scale of 1 billion and 100 million pages respectively. These show that crawl selection based on PageRank, indegree and trans-domain indegree all allow better retrieval effectiveness than a simple breadth-first crawl of the same size. PageRank is the most reliable and effective method. Trans-domain indegree can outperform PageRank, but over multiple crawl iterations it is less effective and more unstable. Finally we experiment with combinations of crawl selection methods and per-domain page limits, which yield crawls with greater potential NDCG than PageRank.
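A hedged sketch of the "maximum potential NDCG" idea: for each query, the best ranking any engine could produce from a given crawl orders the crawled, judged documents by decreasing relevance. The gain and discount functions below are common conventions rather than the paper's exact definitions.

```python
# Illustrative upper bound on NDCG achievable from a crawl, given pooled
# relevance judgments. Gain (2^rel - 1) and log2 discount are standard
# conventions, assumed here for the example.
import math

def dcg(gains):
    return sum((2 ** g - 1) / math.log2(i + 2) for i, g in enumerate(gains))

def max_potential_ndcg(judgments, crawled_urls, k=10):
    """judgments: dict url -> graded relevance; crawled_urls: set of urls in the crawl."""
    ideal = sorted(judgments.values(), reverse=True)[:k]
    achievable = sorted(
        (g for url, g in judgments.items() if url in crawled_urls), reverse=True
    )[:k]
    ideal_dcg = dcg(ideal)
    return dcg(achievable) / ideal_dcg if ideal_dcg > 0 else 0.0

judgments = {"a": 3, "b": 2, "c": 2, "d": 1, "e": 0}
print(max_potential_ndcg(judgments, crawled_urls={"a", "c", "e"}))
```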


Web Search and Data Mining | 2012

Of hammers and nails: an empirical comparison of three paradigms for processing large graphs

Marc Najork; Dennis Fetterly; Alan Halverson; Krishnaram Kenthapadi; Sreenivas Gollapudi

Many phenomena and artifacts such as road networks, social networks and the web can be modeled as large graphs and analyzed using graph algorithms. However, given the size of the underlying graphs, efficient implementation of basic operations such as connected component analysis, approximate shortest paths, and link-based ranking (e.g. PageRank) becomes challenging. This paper presents an empirical study of computations on such large graphs in three well-studied platform models, viz., a relational model, a data-parallel model, and a special-purpose in-memory model. We choose a prototypical member of each platform model and analyze the computational efficiencies and requirements for five basic graph operations used in the analysis of real-world graphs viz., PageRank, SALSA, Strongly Connected Components (SCC), Weakly Connected Components (WCC), and Approximate Shortest Paths (ASP). Further, we characterize each platform in terms of these computations using model-specific implementations of these algorithms on a large web graph. Our experiments show that there is no single platform that performs best across different classes of operations on large graphs. While relational databases are powerful and flexible tools that support a wide variety of computations, there are computations that benefit from using special-purpose storage systems and others that can exploit data-parallel platforms.
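To make one of the five operations concrete, here is a minimal in-memory PageRank iteration; the damping factor and fixed iteration count are standard textbook choices, not the paper's experimental settings.

```python
# A minimal in-memory PageRank iteration, illustrating one of the five graph
# operations compared in the paper. Parameters are standard assumptions.
def pagerank(adj, damping=0.85, iterations=50):
    """adj: dict node -> list of out-neighbours (all nodes appear as keys)."""
    nodes = list(adj)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        new_rank = {v: (1.0 - damping) / n for v in nodes}
        for v, out in adj.items():
            if not out:  # dangling node: spread its mass evenly
                for u in nodes:
                    new_rank[u] += damping * rank[v] / n
            else:
                share = damping * rank[v] / len(out)
                for u in out:
                    new_rank[u] += share
        rank = new_rank
    return rank

graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
print(pagerank(graph))
```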


Asia Information Retrieval Symposium | 2013

Duplicate News Story Detection Revisited

Omar Alonso; Dennis Fetterly; Mark S. Manasse

In this paper, we investigate near-duplicate detection, particularly looking at the detection of evolving news stories. These stories often consist primarily of syndicated information, with local replacement of headlines, captions, and the addition of locally-relevant content. By detecting near-duplicates, we can offer users only those stories with content materially different from previously-viewed versions of the story. We expand on previous work and improve the performance of near-duplicate document detection by weighting the phrases in a sliding window based on the within-document term frequency of the terms in that window and the inverse document frequency of those phrases. We experiment on a subset of a publicly available web collection that is comprised solely of documents from news web sites. News articles are particularly challenging due to the prevalence of syndicated articles, where very similar articles are run with different headlines and surrounded by different HTML markup and site templates. We evaluate these algorithmic weightings using human judgments to determine similarity. We find that our techniques outperform the state of the art with statistical significance and are more discriminating when faced with a diverse collection of documents.
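The sketch below illustrates the weighting idea in simplified form: phrases drawn from a sliding window are scored by the within-document frequency of their terms (an inverse document frequency factor over a background corpus would also multiply in), and the highest-scoring phrases serve as the document's fingerprint. Window size, scoring, and fingerprint size are assumptions for the example, not the paper's settings.

```python
# Illustrative term-frequency weighting of sliding-window phrases for
# near-duplicate fingerprints; parameters and scoring are assumptions.
from collections import Counter

def weighted_fingerprint(text, window=4, keep=8):
    words = text.lower().split()
    tf = Counter(words)
    phrases = [
        tuple(words[i:i + window]) for i in range(max(len(words) - window + 1, 1))
    ]
    # Score each phrase by the total document-level frequency of its terms.
    scored = sorted(phrases, key=lambda p: sum(tf[w] for w in p), reverse=True)
    return set(scored[:keep])

story_a = "city council approves new budget after long debate on school funding"
story_b = "after long debate on school funding the council approves new budget"
fp_a, fp_b = weighted_fingerprint(story_a), weighted_fingerprint(story_b)
print(len(fp_a & fp_b) / len(fp_a | fp_b))  # overlap as a rough similarity score
```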


Web Search and Data Mining | 2013

Robust query rewriting using anchor data

Nick Craswell; Bodo Billerbeck; Dennis Fetterly; Marc Najork

Query rewriting algorithms can be used as a form of query expansion, by combining the user's original query with automatically generated rewrites. Rewriting algorithms bring linguistic datasets to bear without the need for iterative relevance feedback, but most studies of rewriting have used proprietary datasets such as large-scale search logs. By contrast, this paper uses readily available data, particularly ClueWeb09 link text with over 1.2 billion anchor phrases, to generate rewrites. To avoid overfitting, our initial analysis is performed using Million Query Track queries, leading us to identify three algorithms which perform well. We then test the algorithms on Web and newswire data. Results show good properties in terms of robustness and early precision.
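One simple way anchor data can drive rewriting, sketched below as an illustrative heuristic rather than the specific algorithms evaluated in the paper: phrases that frequently occur as anchor text for the same target pages are treated as rewrite candidates for each other.

```python
# Toy anchor-text rewriting heuristic; the anchor pairs and ranking scheme
# are invented for the example, not drawn from the paper.
from collections import Counter, defaultdict

# (anchor phrase, target URL) pairs, e.g. extracted from link text.
anchors = [
    ("ny times", "nytimes.com"), ("new york times", "nytimes.com"),
    ("the times", "nytimes.com"), ("new york times", "nytimes.com"),
    ("bbc news", "bbc.co.uk/news"), ("bbc", "bbc.co.uk/news"),
]

targets_by_phrase = defaultdict(set)
phrases_by_target = defaultdict(Counter)
for phrase, url in anchors:
    targets_by_phrase[phrase].add(url)
    phrases_by_target[url][phrase] += 1

def rewrites(query, top_n=3):
    """Rank alternative anchor phrases that share a target URL with the query."""
    candidates = Counter()
    for url in targets_by_phrase.get(query, ()):
        for phrase, count in phrases_by_target[url].items():
            if phrase != query:
                candidates[phrase] += count
    return [phrase for phrase, _ in candidates.most_common(top_n)]

print(rewrites("ny times"))  # e.g. ['new york times', 'the times']
```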


Archive | 2011

Scaling Up Machine Learning: Large-Scale Machine Learning Using DryadLINQ

Mihai Budiu; Dennis Fetterly; Michael Isard; Frank McSherry; Yuan Yu

The main motivation behind the development of DryadLINQ was to make it easier for non-specialists to write general purpose, scalable programs that can operate on very large input datasets. In order to appeal to non-specialists we designed the programming interface to use a high level of abstraction that insulates the programmer from most of the detail and complexity of parallel and distributed execution. In order to support general-purpose computing we embedded these high-level abstractions in .NET, giving developers access to full-featured programming languages with rich type systems and proven mechanisms (such as classes and libraries) for managing complex, long-lived and geographically distributed software projects. In order to support scalability over very large data and compute clusters the DryadLINQ compiler generates code for the Dryad runtime, a well-tested and highly efficient distributed execution engine.


International World Wide Web Conference | 2008

Fourth international workshop on adversarial information retrieval on the web (AIRWeb 2008)

Carlos Castillo; Kumar Chellapilla; Dennis Fetterly

Adversarial IR in general, and search engine spam in particular, are engaging research topics with a real-world impact for Web users, advertisers and publishers. The AIRWeb workshop will bring researchers and practitioners in these areas together, to present and discuss state-of-the-art techniques as well as real-world experiences. Given the continued growth in search engine spam creation and detection efforts, we expect interest in this edition of AIRWeb to surpass that of the previous three (held jointly with WWW 2005, SIGIR 2006, and WWW 2007, respectively).


International ACM SIGIR Conference on Research and Development in Information Retrieval | 2008

Search effectiveness with a breadth-first crawl

Dennis Fetterly; Nick Craswell; Vishwa Vinay

Previous scalability experiments found that early precision improves as collection size increases. However, that was under the assumption that a collection's documents are all sampled with uniform probability from the same population. We contrast this to a large breadth-first web crawl, an important scenario in real-world Web search, where the early documents have quite different characteristics from the later documents.
