Christopher Olston
Yahoo!
Publications
Featured research published by Christopher Olston.
International Conference on Management of Data | 2008
Christopher Olston; Benjamin Reed; Utkarsh Srivastava; Ravi Kumar; Andrew Tomkins
There is a growing need for ad-hoc analysis of extremely large data sets, especially at internet companies where innovation critically depends on being able to analyze terabytes of data collected every day. Parallel database products, e.g., Teradata, offer a solution, but are usually prohibitively expensive at this scale. Besides, many of the people who analyze this data are entrenched procedural programmers, who find the declarative SQL style to be unnatural. The success of the more procedural map-reduce programming model, and its associated scalable implementations on commodity hardware, is evidence of the above. However, the map-reduce paradigm is too low-level and rigid, and leads to a great deal of custom user code that is hard to maintain and reuse. We describe a new language called Pig Latin that we have designed to fit in a sweet spot between the declarative style of SQL and the low-level, procedural style of map-reduce. The accompanying system, Pig, is fully implemented, and compiles Pig Latin into physical plans that are executed over Hadoop, an open-source map-reduce implementation. We give a few examples of how engineers at Yahoo! are using Pig to dramatically reduce the time required for the development and execution of their data analysis tasks, compared to using Hadoop directly. We also report on a novel debugging environment that comes integrated with Pig, which can lead to even higher productivity gains. Pig is an open-source Apache Incubator project and is available for general use.
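For a flavor of the style the paper describes, here is a rough Python analogue (not Pig Latin itself) of the kind of step-by-step dataflow it advocates: each statement names an intermediate result, as a Pig Latin script would, rather than nesting everything into one declarative query. The data, field names, and threshold are invented for illustration.

```python
# Hedged Python analogue of a Pig-Latin-style dataflow: filter, then group,
# then aggregate, with each intermediate result given an explicit name.
from collections import defaultdict

urls = [
    ("a.com", "sports", 0.7),
    ("b.com", "sports", 0.4),
    ("c.com", "news", 0.1),
]

# Step 1: keep only high-pagerank pages (like FILTER urls BY pagerank > 0.2).
good_urls = [(u, cat, pr) for (u, cat, pr) in urls if pr > 0.2]

# Step 2: group by category (like GROUP good_urls BY category).
groups = defaultdict(list)
for u, cat, pr in good_urls:
    groups[cat].append(pr)

# Step 3: per-group aggregate (like FOREACH ... GENERATE category, AVG(pagerank)).
output = {cat: sum(prs) / len(prs) for cat, prs in groups.items()}
print(output)  # {'sports': 0.55}
```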
International Conference on Management of Data | 2003
Christopher Olston; Jing Jiang; Jennifer Widom
We consider an environment where distributed data sources continuously stream updates to a centralized processor that monitors continuous queries over the distributed data. Significant communication overhead is incurred in the presence of rapid update streams, and we propose a new technique for reducing the overhead. Users register continuous queries with precision requirements at the central stream processor, which installs filters at remote data sources. The filters adapt to changing conditions to minimize stream rates while guaranteeing that all continuous queries still receive the updates necessary to provide answers of adequate precision at all times. Our approach enables applications to trade precision for communication overhead at a fine granularity by individually adjusting the precision constraints of continuous queries over streams in a multi-query workload. Through experiments performed on synthetic data simulations and a real network monitoring implementation, we demonstrate the effectiveness of our approach in achieving low communication overhead compared with alternate approaches.
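As a rough illustration of the filtering idea, here is a minimal Python sketch under the simplifying assumption of a single numeric stream and a fixed-width filter (the paper's filters additionally adapt their widths across queries and sources): the source suppresses any update that stays within the last reported bound, so the central processor always knows the value to within half the filter width. The class and parameter names are hypothetical.

```python
# Minimal sketch of source-side filtering for a bounded-precision cached value.
class SourceFilter:
    def __init__(self, width: float):
        self.width = width       # precision bound granted to this source
        self.last_sent = None    # last value shipped to the central processor

    def observe(self, value: float):
        """Return the update to stream to the center, or None if filtered."""
        if self.last_sent is None or abs(value - self.last_sent) > self.width / 2:
            self.last_sent = value
            return value         # value escaped its bound: must be reported
        return None              # suppressed: cached answer is still precise enough

f = SourceFilter(width=10.0)
for v in [100, 103, 104, 112, 111, 130]:
    update = f.observe(v)
    print(v, "->", "send" if update is not None else "suppress")
```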
International World Wide Web Conference | 2004
Alexandros Ntoulas; Junghoo Cho; Christopher Olston
We seek to gain improved insight into how Web search engines should cope with the evolving Web, in an attempt to provide users with the most up-to-date results possible. For this purpose we collected weekly snapshots of some 150 Web sites over the course of one year, and measured the evolution of content and link structure. Our measurements focus on aspects of potential interest to search engine designers: the evolution of link structure over time, the rate of creation of new pages and new distinct content on the Web, and the rate of change of the content of existing pages under search-centric measures of degree of change. Our findings indicate a rapid turnover rate of Web pages, i.e., high rates of birth and death, coupled with an even higher rate of turnover in the hyperlinks that connect them. For pages that persist over time we found that, perhaps surprisingly, the degree of content shift as measured using TF.IDF cosine distance does not appear to be consistently correlated with the frequency of content updating. Despite this apparent non-correlation, the rate of content shift of a given page is likely to remain consistent over time. That is, pages that change a great deal in one week will likely change by a similarly large degree in the following week. Conversely, pages that experience little change will continue to experience little change. We conclude the paper with a discussion of the potential implications of our results for the design of effective Web search engines.
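For concreteness, here is a small sketch of the kind of content-shift measure mentioned: cosine distance over term vectors of two snapshots of the same page. Real TF.IDF would additionally weight terms by inverse document frequency across the crawl; that weighting is omitted here (IDF taken as 1 for every term) to keep the example self-contained.

```python
# Cosine distance between term-frequency vectors of two page snapshots.
import math
from collections import Counter

def cosine_distance(text_a: str, text_b: str) -> float:
    a, b = Counter(text_a.split()), Counter(text_b.split())
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return 1.0 - dot / norm if norm else 1.0

week1 = "yahoo research pig latin dataflow"
week2 = "yahoo research pig latin dataflow hadoop"
print(cosine_distance(week1, week2))  # small value: little content shift
```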
Very Large Data Bases | 2009
Alan Gates; Olga Natkovich; Shubham Chopra; Pradeep Kamath; Shravan M. Narayanamurthy; Christopher Olston; Benjamin Reed; Santhosh Srinivasan; Utkarsh Srivastava
Increasingly, organizations capture, transform and analyze enormous data sets. Prominent examples include internet companies and e-science. The Map-Reduce scalable dataflow paradigm has become popular for these applications. Its simple, explicit dataflow programming model is favored by some over the traditional high-level declarative approach: SQL. On the other hand, the extreme simplicity of Map-Reduce leads to much low-level hacking to deal with the many-step, branching dataflows that arise in practice. Moreover, users must repeatedly code standard operations such as join by hand. These practices waste time, introduce bugs, harm readability, and impede optimizations. Pig is a high-level dataflow system that aims at a sweet spot between SQL and Map-Reduce. Pig offers SQL-style high-level data manipulation constructs, which can be assembled in an explicit dataflow and interleaved with custom Map- and Reduce-style functions or executables. Pig programs are compiled into sequences of Map-Reduce jobs, and executed in the Hadoop Map-Reduce environment. Both Pig and Hadoop are open-source projects administered by the Apache Software Foundation. This paper describes the challenges we faced in developing Pig, and reports performance comparisons between Pig execution and raw Map-Reduce execution.
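As a loose illustration of the translation described, here is a Python simulation of how a single group-and-aggregate step might become one Map-Reduce job, with the grouping key as the map output key and the shuffle simulated in-process. The function names and data are invented, and a real compiled plan would chain several such jobs.

```python
# In-process simulation of one Map-Reduce job for a GROUP ... / AVG step.
from collections import defaultdict

records = [("sports", 0.7), ("news", 0.1), ("sports", 0.4)]

def map_fn(record):
    category, pagerank = record
    yield category, pagerank              # key chosen by the grouping step

def reduce_fn(key, values):
    yield key, sum(values) / len(values)  # the per-group aggregate

shuffle = defaultdict(list)               # what Hadoop's shuffle phase would do
for rec in records:
    for k, v in map_fn(rec):
        shuffle[k].append(v)

for k, vs in sorted(shuffle.items()):
    print(list(reduce_fn(k, vs)))         # [('news', 0.1)] [('sports', 0.55)]
```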
ACM Transactions on Computer-Human Interaction | 2003
Christopher Olston; Ed H. Chi
The two predominant paradigms for finding information on the Web are browsing and keyword searching. While they exhibit complementary advantages, neither paradigm alone is adequate for complex information goals that lend themselves partially to browsing and partially to searching. To integrate browsing and searching smoothly into a single interface, we introduce a novel approach called ScentTrails. Based on the concept of information scent developed in the context of information foraging theory, ScentTrails highlights hyperlinks to indicate paths to search results. This interface enables users to interpolate smoothly between searching and browsing to locate content matching complex information goals effectively. In a preliminary user study, ScentTrails enabled subjects to find information more quickly than by either searching or browsing alone.
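To make the highlighting idea concrete, here is a speculative Python sketch (not the paper's algorithm): pages matching the user's query emit "scent" that propagates backward along hyperlinks with decay, and links accumulating enough scent get highlighted as paths toward results. The toy graph, decay factor, and threshold are all made up.

```python
# Speculative scent propagation over a toy hyperlink graph.
links = {"home": ["products", "about"],
         "products": ["pig", "hadoop"],
         "about": []}
matches = {"pig"}            # pages matching the keyword query
DECAY, THRESHOLD = 0.5, 0.2  # hypothetical tuning parameters

def scent(page, seen=()):
    """Scent of a page: own match plus decayed scent reachable through it."""
    if page in seen:
        return 0.0           # avoid cycles
    base = 1.0 if page in matches else 0.0
    return base + DECAY * sum(scent(t, seen + (page,)) for t in links.get(page, []))

for src, targets in links.items():
    for t in targets:
        s = scent(t)
        print(f"{src} -> {t}: {'HIGHLIGHT' if s >= THRESHOLD else 'plain'} ({s:.2f})")
```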
International Conference on Management of Data | 2001
Christopher Olston; Boon Thau Loo; Jennifer Widom
Caching approximate values instead of exact values presents an opportunity for performance gains in exchange for decreased precision. To maximize the performance improvement, cached approximations must be of appropriate precision: approximations that are too precise easily become invalid, requiring frequent refreshing, while overly imprecise approximations are likely to be useless to applications, which must then bypass the cache. We present a parameterized algorithm for adjusting the precision of cached approximations adaptively to achieve the best performance as data values, precision requirements, or workload vary. We consider interval approximations to numeric values but our ideas can be extended to other kinds of data and approximations. Our algorithm strictly generalizes previous adaptive caching algorithms for exact copies: we can set parameters to require that all approximations be exact, in which case our algorithm dynamically chooses whether or not to cache each data value. We have implemented our algorithm and tested it on synthetic and real-world data. A number of experimental results are reported, showing the effectiveness of our algorithm at maximizing performance, and also showing that in the special case of exact caching our algorithm performs as well as previous algorithms. In cases where bounded imprecision is acceptable, our algorithm easily outperforms previous algorithms for exact caching.
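A minimal sketch of the interval-caching idea follows, with a deliberately static interval width; the paper's contribution is precisely the adaptive adjustment of that width, and source-side invalidation when a value escapes its interval is also omitted. All names and numbers are illustrative.

```python
# Cache stores [lo, hi] bounds; a query is served from cache only when the
# interval is at least as precise as the query demands.
class IntervalCache:
    def __init__(self, source, width):
        self.source, self.width = source, width
        self.bounds = {}                       # key -> (lo, hi)

    def get(self, key, precision):
        lo_hi = self.bounds.get(key)
        if lo_hi and (lo_hi[1] - lo_hi[0]) <= precision:
            return sum(lo_hi) / 2              # cached approximation suffices
        v = self.source[key]                   # too imprecise: fetch exact value
        self.bounds[key] = (v - self.width / 2, v + self.width / 2)
        return v

source = {"load": 42.0}
cache = IntervalCache(source, width=4.0)
print(cache.get("load", precision=10.0))   # miss: fetch, install [40, 44]
print(cache.get("load", precision=5.0))    # answered from the cached interval
print(cache.get("load", precision=1.0))    # interval too wide: refetch
```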
International Conference on Data Engineering | 2005
Amit Manjhi; Vladislav Shkapenyuk; Kedar Dhamdhere; Christopher Olston
We consider the problem of maintaining frequency counts for items occurring frequently in the union of multiple distributed data streams. Naive methods of combining approximate frequency counts from multiple nodes tend to result in excessively large data structures that are costly to transfer among nodes. To minimize communication requirements, the degree of precision maintained by each node while counting item frequencies must be managed carefully. We introduce the concept of a precision gradient for managing precision when nodes are arranged in a hierarchical communication structure. We then study the optimization problem of how to set the precision gradient so as to minimize communication, and provide optimal solutions that minimize worst-case communication load over all possible inputs. We then introduce a variant designed to perform well in practice, with input data that does not conform to worst-case characteristics. We verify the effectiveness of our approach empirically using real-world data, and show that our methods incur substantially less communication than naive approaches while providing the same error guarantees on answers.
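To illustrate the trade-off a precision gradient manages, here is a toy Python simulation of a two-level hierarchy: each leaf forwards its count only after drifting beyond its local slack, so a larger leaf allowance means fewer messages but a staler answer at the root. All numbers are invented; the paper's optimization chooses these allowances across deeper hierarchies.

```python
# Two-level hierarchy: leaves report counts upward only after local drift.
def run(leaf_slack, leaf_streams):
    """Each leaf reports only when its unreported count exceeds leaf_slack."""
    messages, root_total = 0, 0
    for stream in leaf_streams:
        reported = actual = 0
        for inc in stream:
            actual += inc
            if actual - reported > leaf_slack:  # drifted past local slack
                reported = actual               # ship a fresh count upward
                messages += 1
        root_total += reported
    return root_total, messages

streams = [[1] * 32, [1] * 32]                  # two leaves, 32 items each
for slack in (0, 5, 9):
    total, msgs = run(slack, streams)
    print(f"leaf slack {slack}: root sees {total} of 64, {msgs} messages")
```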
Symposium on Cloud Computing | 2010
Dionysios Logothetis; Christopher Olston; Benjamin Reed; Kevin C. Webb; Ken Yocum
This work addresses the need for stateful dataflow programs that can rapidly sift through huge, evolving data sets. These data-intensive applications perform complex multi-step computations over successive generations of data inflows, such as weekly web crawls, daily image/video uploads, log files, and growing social networks. While programmers may simply re-run the entire dataflow when new data arrives, this is grossly inefficient, increasing result latency and squandering hardware resources and energy. Alternatively, programmers may use prior results to incrementally incorporate the changes. However, current large-scale data processing tools, such as Map-Reduce or Dryad, limit how programmers incorporate and use state in data-parallel programs. Straightforward approaches to incorporating state can result in custom, fragile code and disappointing performance. This work presents a generalized architecture for continuous bulk processing (CBP) that raises the level of abstraction for building incremental applications. At its core is a flexible, groupwise processing operator that takes state as an explicit input. Unifying stateful programming with a data-parallel operator affords several fundamental opportunities for minimizing the movement of data in the underlying processing system. As case studies, we show how one can use a small set of flexible dataflow primitives to perform web analytics and mine large-scale, evolving graphs in an incremental fashion. Experiments with our prototype using real-world data indicate significant data movement and running time reductions relative to current practice. For example, incrementally computing PageRank using CBP can reduce data movement by 46% and cut running time in half.
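In the spirit of the groupwise operator described, though much simplified, here is a Python sketch in which per-key state is an explicit input alongside the new batch, so only groups with fresh data are touched instead of recomputing everything from scratch. The function and key names are hypothetical.

```python
# Stateful groupwise operator: state in, updated state out, per batch.
from collections import defaultdict

def groupwise_count(state: dict, new_batch: list) -> dict:
    deltas = defaultdict(int)
    for key in new_batch:
        deltas[key] += 1
    for key, d in deltas.items():     # touch only groups with new data
        state[key] = state.get(key, 0) + d
    return state

state = {}
for batch in (["a", "b", "a"], ["b", "c"], ["a"]):
    state = groupwise_count(state, batch)
    print(state)
# {'a': 2, 'b': 1} -> {'a': 2, 'b': 2, 'c': 1} -> {'a': 3, 'b': 2, 'c': 1}
```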
International Conference on Management of Data | 2002
Christopher Olston; Jennifer Widom
In environments where exact synchronization between source data objects and cached copies is not achievable due to bandwidth or other resource constraints, stale (out-of-date) copies are permitted. It is desirable to minimize the overall divergence between source objects and cached copies by selectively refreshing modified objects. We call the online process of selecting which objects to refresh in order to minimize divergence best-effort synchronization. In most approaches to best-effort synchronization, the cache coordinates the process and selects objects to refresh. In this paper, we propose a best-effort synchronization scheduling policy that exploits cooperation between data sources and the cache. We also propose an implementation of our policy that incurs low communication overhead even in environments with very large numbers of sources. Our algorithm is adaptive to wide fluctuations in available resources and data update rates. Through experimental simulation over synthetic and real-world data, we demonstrate the effectiveness of our algorithm, and we quantify the significant decrease in divergence achievable with source cooperation.
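As a rough sketch of divergence-driven refresh scheduling, assuming sources simply volunteer their current values (the paper's cooperation protocol is more careful about keeping this reporting cheap): with bandwidth for only k refreshes per cycle, spend them on the most-divergent cached copies.

```python
# Spend a per-cycle refresh budget on the copies that have diverged the most.
import heapq

def refresh_cycle(cached: dict, source: dict, budget: int) -> None:
    divergence = {k: abs(source[k] - cached[k]) for k in cached}
    stalest = heapq.nlargest(budget, divergence, key=divergence.get)
    for k in stalest:
        cached[k] = source[k]         # bandwidth goes to the worst copies

cached = {"x": 10, "y": 50, "z": 7}
source = {"x": 11, "y": 90, "z": 7}
refresh_cycle(cached, source, budget=1)
print(cached)                         # only 'y' refreshed: {'x': 10, 'y': 90, 'z': 7}
```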
International Conference on Management of Data | 2011
Christopher Olston; Greg I. Chiou; Laukik Chitnis; Francis Liu; Yiping Han; Mattias Larsson; Andreas Neumann; Vellanki B. N. Rao; Vijayanand Sankarasubramanian; Siddharth Seth; Chao Tian; Topher ZiCornell; Xiaodan Wang
This paper describes a workflow manager developed and deployed at Yahoo! called Nova, which pushes continually-arriving data through graphs of Pig programs executing on Hadoop clusters. (Pig is a structured dataflow language and runtime for the Hadoop map-reduce system.) Nova is like data stream managers in its support for stateful incremental processing, but unlike them in that it deals with data in large batches using disk-based processing. Batched incremental processing is a good fit for a large fraction of Yahoo!'s data processing use-cases, which deal with continually-arriving data and benefit from incremental algorithms, but do not require ultra-low-latency processing.
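To suggest the control flow of batched incremental processing, here is a hypothetical driver loop in Python: arriving data blocks accumulate until a batch boundary, and the workflow graph then runs once over the new batch plus carried state. The run_workflow function is a stand-in for launching the actual graph of Pig programs on a Hadoop cluster; all names are invented.

```python
# Hypothetical Nova-style driver: accumulate blocks, then run the graph once.
def run_workflow(batch, state):
    """Stand-in for launching the graph of Pig programs over Hadoop."""
    state["records_seen"] = state.get("records_seen", 0) + len(batch)
    return state

def driver(blocks, batch_size=3):
    state, pending = {}, []
    for block in blocks:                  # continually-arriving data blocks
        pending.append(block)
        if len(pending) >= batch_size:    # batch boundary: trigger the graph
            state = run_workflow(pending, state)
            print("ran workflow:", state)
            pending = []
    # a real manager would also flush trailing data on a timer

driver(range(7))
```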