Publications


Featured research published by Benjamin Reed.


international conference on management of data | 2008

Pig Latin: a not-so-foreign language for data processing

Christopher Olston; Benjamin Reed; Utkarsh Srivastava; Ravi Kumar; Andrew Tomkins

There is a growing need for ad-hoc analysis of extremely large data sets, especially at internet companies where innovation critically depends on being able to analyze terabytes of data collected every day. Parallel database products, e.g., Teradata, offer a solution, but are usually prohibitively expensive at this scale. Besides, many of the people who analyze this data are entrenched procedural programmers, who find the declarative, SQL style to be unnatural. The success of the more procedural map-reduce programming model, and its associated scalable implementations on commodity hardware, is evidence of the above. However, the map-reduce paradigm is too low-level and rigid, and leads to a great deal of custom user code that is hard to maintain, and reuse. We describe a new language called Pig Latin that we have designed to fit in a sweet spot between the declarative style of SQL, and the low-level, procedural style of map-reduce. The accompanying system, Pig, is fully implemented, and compiles Pig Latin into physical plans that are executed over Hadoop, an open-source, map-reduce implementation. We give a few examples of how engineers at Yahoo! are using Pig to dramatically reduce the time required for the development and execution of their data analysis tasks, compared to using Hadoop directly. We also report on a novel debugging environment that comes integrated with Pig, that can lead to even higher productivity gains. Pig is an open-source, Apache-incubator project, and available for general use.
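
As a concrete taste of the "sweet spot" the paper describes, here is a minimal sketch of driving a small Pig Latin dataflow from Java through Pig's PigServer API. The input file, schema, and output path are invented for illustration; this is not an example from the paper itself.

```java
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

// Minimal sketch: a group-and-count dataflow expressed in Pig Latin,
// submitted through the PigServer API. Pig compiles these statements
// into map-reduce jobs when run against a Hadoop cluster.
public class PigExample {
    public static void main(String[] args) throws Exception {
        PigServer pig = new PigServer(ExecType.LOCAL);  // ExecType.MAPREDUCE on a cluster
        pig.registerQuery("visits = LOAD 'visits.log' AS (user:chararray, url:chararray);");
        pig.registerQuery("by_url = GROUP visits BY url;");
        pig.registerQuery("counts = FOREACH by_url GENERATE group AS url, COUNT(visits) AS n;");
        pig.store("counts", "url_counts");  // writes the result relation to disk
    }
}
```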


very large data bases | 2009

Building a high-level dataflow system on top of Map-Reduce: the Pig experience

Alan Gates; Olga Natkovich; Shubham Chopra; Pradeep Kamath; Shravan M. Narayanamurthy; Christopher Olston; Benjamin Reed; Santhosh Srinivasan; Utkarsh Srivastava

Increasingly, organizations capture, transform and analyze enormous data sets. Prominent examples include internet companies and e-science. The Map-Reduce scalable dataflow paradigm has become popular for these applications. Its simple, explicit dataflow programming model is favored by some over the traditional high-level declarative approach: SQL. On the other hand, the extreme simplicity of Map-Reduce leads to much low-level hacking to deal with the many-step, branching dataflows that arise in practice. Moreover, users must repeatedly code standard operations such as join by hand. These practices waste time, introduce bugs, harm readability, and impede optimizations. Pig is a high-level dataflow system that aims at a sweet spot between SQL and Map-Reduce. Pig offers SQL-style high-level data manipulation constructs, which can be assembled in an explicit dataflow and interleaved with custom Map- and Reduce-style functions or executables. Pig programs are compiled into sequences of Map-Reduce jobs, and executed in the Hadoop Map-Reduce environment. Both Pig and Hadoop are open-source projects administered by the Apache Software Foundation. This paper describes the challenges we faced in developing Pig, and reports performance comparisons between Pig execution and raw Map-Reduce execution.
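
The paper emphasizes interleaving SQL-style operators with custom code. The sketch below shows the canonical shape of a Pig user-defined eval function (essentially the UPPER example from the Pig documentation); a script would register the jar containing it and call it from a FOREACH ... GENERATE clause.

```java
import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// A custom eval function that Pig can interleave with its built-in
// relational operators inside a dataflow, instead of the user hand-coding
// an entire Map-Reduce job around it.
public class Upper extends EvalFunc<String> {
    @Override
    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0 || input.get(0) == null) {
            return null;  // Pig convention: propagate nulls
        }
        return ((String) input.get(0)).toUpperCase();
    }
}
```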


dependable systems and networks | 2011

Zab: High-performance broadcast for primary-backup systems

Flavio Junqueira; Benjamin Reed; Marco Serafini

Zab is a crash-recovery atomic broadcast algorithm we designed for the ZooKeeper coordination service. ZooKeeper implements a primary-backup scheme in which a primary process executes client operations and uses Zab to propagate the corresponding incremental state changes to backup processes. Due to the dependence of an incremental state change on the sequence of changes previously generated, Zab must guarantee that if it delivers a given state change, then all changes it depends upon have been delivered first. Since primaries may crash, Zab must satisfy this requirement despite crashes of primaries.
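
The delivery requirement can be made concrete with a toy sketch. This illustrates only the ordering invariant, not the Zab protocol itself; it assumes the primary stamps each change with a consecutive sequence number.

```java
import java.util.concurrent.PriorityBlockingQueue;

// Toy illustration (not Zab): a backup buffers out-of-order state changes
// and applies each one only after every change it depends on, i.e. all
// changes with smaller sequence numbers, has been applied.
class OrderedApplier {
    static class StateChange implements Comparable<StateChange> {
        final long seq; final byte[] delta;
        StateChange(long seq, byte[] delta) { this.seq = seq; this.delta = delta; }
        public int compareTo(StateChange o) { return Long.compare(seq, o.seq); }
    }

    private long nextSeq = 1;  // next change we are allowed to apply
    private final PriorityBlockingQueue<StateChange> pending = new PriorityBlockingQueue<>();

    synchronized void deliver(StateChange c) {
        pending.add(c);
        // Drain the queue only while the head is exactly the next expected change.
        while (!pending.isEmpty() && pending.peek().seq == nextSeq) {
            apply(pending.poll());
            nextSeq++;
        }
    }

    private void apply(StateChange c) { /* mutate local replica state here */ }
}
```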


symposium on cloud computing | 2010

Stateful bulk processing for incremental analytics

Dionysios Logothetis; Christopher Olston; Benjamin Reed; Kevin C. Webb; Ken Yocum

This work addresses the need for stateful dataflow programs that can rapidly sift through huge, evolving data sets. These data-intensive applications perform complex multi-step computations over successive generations of data inflows, such as weekly web crawls, daily image/video uploads, log files, and growing social networks. While programmers may simply re-run the entire dataflow when new data arrives, this is grossly inefficient, increasing result latency and squandering hardware resources and energy. Alternatively, programmers may use prior results to incrementally incorporate the changes. However, current large-scale data processing tools, such as Map-Reduce or Dryad, limit how programmers incorporate and use state in data-parallel programs. Straightforward approaches to incorporating state can result in custom, fragile code and disappointing performance. This work presents a generalized architecture for continuous bulk processing (CBP) that raises the level of abstraction for building incremental applications. At its core is a flexible, groupwise processing operator that takes state as an explicit input. Unifying stateful programming with a data-parallel operator affords several fundamental opportunities for minimizing the movement of data in the underlying processing system. As case studies, we show how one can use a small set of flexible dataflow primitives to perform web analytics and mine large-scale, evolving graphs in an incremental fashion. Experiments with our prototype using real-world data indicate significant data movement and running time reductions relative to current practice. For example, incrementally computing PageRank using CBP can reduce data movement by 46% and cut running time in half.
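
To make the core idea concrete, here is a hypothetical shape for a CBP-style groupwise operator in Java. The interface and example are invented for illustration, but they capture the key point: per-group state is an explicit input and output, so only groups touched by new data need reprocessing.

```java
import java.util.List;

// Hypothetical CBP-style operator: the runtime stores the returned state
// and feeds it back on the next epoch, rather than the program re-reading
// and re-deriving it from all historical input.
interface GroupwiseOp<K, V, S> {
    // Called once per key that received new records in this epoch.
    S process(K key, List<V> newRecords, S priorState);
}

// Example: incremental per-URL click counting across successive crawls.
class ClickCount implements GroupwiseOp<String, Long, Long> {
    public Long process(String url, List<Long> clicks, Long prior) {
        long base = (prior == null) ? 0L : prior;
        return base + clicks.size();  // extend the running count, no full re-scan
    }
}
```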


international symposium on microarchitecture | 2000

Authenticating network-attached storage

Benjamin Reed; Edward Gustav Chron; Randal C. Burns; Darrell D. E. Long

We present an architecture for network-authenticated disks that implements distributed file systems without file servers or encryption. Our system provides network clients with direct network access to remote storage.
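
As an illustration of request-level authentication without data encryption, one standard construction (not necessarily the paper's exact scheme) is for the client to tag each disk request with a keyed MAC that the drive verifies before executing it:

```java
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;
import java.nio.charset.StandardCharsets;

// Illustrative only: authenticate a storage request with a shared key.
// The data itself stays in the clear; only the request is protected
// against forgery. Key handling and request format are made up here.
class RequestAuthenticator {
    private final SecretKeySpec key;

    RequestAuthenticator(byte[] sharedSecret) {
        key = new SecretKeySpec(sharedSecret, "HmacSHA256");
    }

    byte[] tag(String request) throws Exception {
        Mac mac = Mac.getInstance("HmacSHA256");
        mac.init(key);
        return mac.doFinal(request.getBytes(StandardCharsets.UTF_8));
    }
}
```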


Proceedings of the 2nd Workshop on Large-Scale Distributed Systems and Middleware | 2008

A simple totally ordered broadcast protocol

Benjamin Reed; Flavio Junqueira

This is a short overview of a totally ordered broadcast protocol used by ZooKeeper, called Zab. It is conceptually easy to understand, is easy to implement, and gives high performance. In this paper we present the requirements ZooKeeper makes on Zab, we show how the protocol is used, and we give an overview of how the protocol works.
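
From a client's point of view, the total order surfaces through ZooKeeper's ordinary API: every write below is broadcast via Zab, so all replicas apply the writes in the same sequence. A minimal sketch using the standard ZooKeeper Java client (connection string and paths are placeholders):

```java
import org.apache.zookeeper.*;

// Each update is turned into a Zab broadcast by the ZooKeeper leader;
// the protocol guarantees all servers apply them in one total order.
public class ZkExample {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, event -> {});
        zk.create("/config", "v1".getBytes(),
                  ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        zk.setData("/config", "v2".getBytes(), -1);  // -1: match any version
        zk.close();
    }
}
```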


international performance computing and communications conference | 2001

Strong security for distributed file systems

Ethan L. Miller; Darrell D. E. Long; William E. Freeman; Benjamin Reed

We have developed a scheme to secure network-attached storage systems against many types of attacks. Our system uses strong cryptography to hide data from unauthorized users; someone gaining complete access to a disk cannot obtain any useful data from the system, and backups can be done without allowing the super-user access to unencrypted data. While denial-of-service attacks cannot be prevented, our system detects forged data. The system was developed using a raw disk, and can be integrated into common file systems. We discuss the design and security tradeoffs such a distributed file system makes. Our design guards against both remote intruders and those who gain physical access to the disk, using just enough security to thwart both types of attacks. This security can be achieved with little penalty to performance. We discuss the security operations that are necessary for each type of operation, and show that there is no longer any reason not to include strong encryption and authentication in network file systems.
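
As a rough modern analogue of the guarantees described (this is not the paper's 2001 construction), authenticated encryption of each block means raw access to the disk yields no plaintext and any forged or modified block is detected on read:

```java
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;
import java.security.SecureRandom;

// Illustrative sketch: AES-GCM seals a disk block so an intruder with the
// raw disk sees only ciphertext, and decryption fails loudly on tampering.
class BlockCrypto {
    private final SecretKey key;
    private final SecureRandom rng = new SecureRandom();

    BlockCrypto() throws Exception {
        KeyGenerator kg = KeyGenerator.getInstance("AES");
        kg.init(256);
        key = kg.generateKey();
    }

    byte[] seal(byte[] block) throws Exception {
        byte[] iv = new byte[12];
        rng.nextBytes(iv);
        Cipher c = Cipher.getInstance("AES/GCM/NoPadding");
        c.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(128, iv));
        byte[] ct = c.doFinal(block);
        byte[] out = new byte[iv.length + ct.length];
        System.arraycopy(iv, 0, out, 0, iv.length);
        System.arraycopy(ct, 0, out, iv.length, ct.length);
        return out;  // iv || ciphertext+tag
    }

    byte[] open(byte[] sealed) throws Exception {
        // Throws AEADBadTagException if the block was forged or altered.
        Cipher c = Cipher.getInstance("AES/GCM/NoPadding");
        c.init(Cipher.DECRYPT_MODE, key, new GCMParameterSpec(128, sealed, 0, 12));
        return c.doFinal(sealed, 12, sealed.length - 12);
    }
}
```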


dependable systems and networks | 2011

Lock-free transactional support for large-scale storage systems

Flavio Junqueira; Benjamin Reed; Maysam Yabandeh

In this paper, we introduce ReTSO, a reliable and efficient design for transactional support in large-scale storage systems. ReTSO uses a centralized scheme and implements snapshot isolation, a property that guarantees that the read operations of a transaction see a consistent snapshot of the stored data. The centralized scheme of ReTSO enables a lock-free commit algorithm that prevents unreleased locks of a failed transaction from blocking others. We analyze the bottlenecks in a single-server implementation of transactional logic and propose solutions for each. The experimental results show that our implementation can service up to 72K transactions per second (TPS), which is an order of magnitude larger than the maximum achieved traffic in similar data storage systems. Consequently, we do not expect ReTSO to be a bottleneck even for current large distributed storage systems.
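
A toy sketch of the core mechanism, with details invented for illustration: a centralized oracle hands out start and commit timestamps, and a transaction commits only if no row it wrote was committed by another transaction after its start timestamp. "Lock-free" here is in the transactional sense (a crashed transaction leaves no row locks to clean up), not a claim about the Java-level synchronization in this toy.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Toy snapshot-isolation oracle in the spirit of ReTSO (not its actual code).
class TimestampOracle {
    private final AtomicLong clock = new AtomicLong();
    // row id -> commit timestamp of the last transaction that wrote it
    private final Map<String, Long> lastCommit = new ConcurrentHashMap<>();

    long start() { return clock.incrementAndGet(); }

    // Returns a commit timestamp, or -1 to signal abort on a write-write conflict.
    synchronized long tryCommit(long startTs, Set<String> writeSet) {
        for (String row : writeSet) {
            Long ts = lastCommit.get(row);
            if (ts != null && ts > startTs) return -1;  // someone committed after us
        }
        long commitTs = clock.incrementAndGet();
        for (String row : writeSet) lastCommit.put(row, commitTs);
        return commitTs;
    }
}
```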


internet measurement conference | 2011

Overclocking the Yahoo! CDN for faster web page loads

Mohammad Al-Fares; Khaled Elmeleegy; Benjamin Reed; Igor Gashinsky

Fast-loading web pages are key for a positive user experience. Unfortunately, a large number of users suffer from page load times of many seconds, especially for pages with many embedded objects. Most of this time is spent fetching the page and its objects over the Internet. This paper investigates the impact of optimizations that improve the delivery of content from edge servers at the Yahoo! Content Delivery Network (CDN) to the end users. To this end, we analyze packet traces of 12.3M TCP connections originating from users across the world and terminating at the Yahoo! CDN. Using these traces, we characterize key user-connection metrics at the network, transport, and application layers. We observe high Round Trip Times (RTTs) and an inflated number of round trips per page download (RTT multipliers). Due to inefficiencies in TCP's slow start and the HTTP protocol, we found several opportunities to reduce the RTT multiplier, e.g., increasing TCP's Initial Congestion Window (ICW), using TCP Appropriate Byte Counting (ABC), and using HTTP pipelining. Using live workloads, we experimentally study the micro effects of these optimizations on network connectivity, e.g., packet loss rate. To evaluate the macro effects of these optimizations on the overall page load time, we use realistic synthetic workloads in a closed laboratory environment. We find that compounding HTTP pipelining with increasing the ICW size can reduce page load times by up to 80%. We also find that no one configuration fits all users; e.g., increasing the TCP ICW to a certain size may help some users while hurting others.
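
Of the optimizations studied, HTTP pipelining is easy to demonstrate directly: the client writes several requests on one connection before reading any response, collapsing what would otherwise be multiple round trips. A minimal, illustrative sketch (host and paths are placeholders):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

// Two GETs are sent back-to-back before any response is read, so the
// second request does not pay its own round trip to reach the server.
public class PipelineDemo {
    public static void main(String[] args) throws IOException {
        try (Socket s = new Socket("example.com", 80)) {
            String reqs =
                "GET / HTTP/1.1\r\nHost: example.com\r\n\r\n" +
                "GET / HTTP/1.1\r\nHost: example.com\r\nConnection: close\r\n\r\n";
            s.getOutputStream().write(reqs.getBytes(StandardCharsets.US_ASCII));
            BufferedReader in = new BufferedReader(
                new InputStreamReader(s.getInputStream(), StandardCharsets.US_ASCII));
            for (String line; (line = in.readLine()) != null; ) {
                if (line.startsWith("HTTP/")) System.out.println(line);  // two status lines
            }
        }
    }
}
```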


international conference on management of data | 2011

Inspector Gadget: a framework for custom monitoring and debugging of distributed dataflows

Christopher Olston; Benjamin Reed

We demonstrate a novel dataflow introspection framework called Inspector Gadget, which makes it easy to create custom monitoring and debugging add-ons to an existing dataflow engine such as Pig. The framework is motivated by a series of informal user interviews, which revealed that dataflow monitoring and debugging needs are both pressing and diverse. Of the 14 monitoring/debugging behaviors requested by users, we were able to implement 12 in Inspector Gadget, in just a few hundred lines of (Java) code each.
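
The abstract demonstrates the framework rather than documenting its API, so the sketch below is a hypothetical agent interface in the spirit of Inspector Gadget, with all names invented: an agent observes records at chosen points in the Pig dataflow and aggregates a small summary on the side, without altering the job's output.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical (invented) agent interface in the spirit of Inspector Gadget:
// the engine calls observeRecord() as records flow between operators, and
// reads back a summary when the job finishes.
interface FlowAgent {
    void observeRecord(String operatorId, Object record);
    String summary();
}

// A tiny monitoring add-on of the kind the paper describes: a record
// counter, implementable in a few dozen lines rather than a custom engine.
class RecordCounter implements FlowAgent {
    private final AtomicLong count = new AtomicLong();

    public void observeRecord(String operatorId, Object record) {
        count.incrementAndGet();
    }

    public String summary() {
        return "records seen: " + count.get();
    }
}
```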
