Is this you? Create Your Porfile

Yanpei Chen

University of California, Berkeley

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Yanpei Chen is active.

Explore More

Publication

Featured researches published by Yanpei Chen.

very large data bases | 2012

Interactive analytical processing in big data systems: a cross-industry study of MapReduce workloads

Yanpei Chen; Sara Alspaugh; Randy H. Katz

Within the past few years, organizations in diverse industries have adopted MapReduce-based systems for large-scale data processing. Along with these new users, important new workloads have emerged which feature many small, short, and increasingly interactive jobs in addition to the large, long-running batch jobs for which MapReduce was originally designed. As interactive, large-scale query processing is a strength of the RDBMS community, it is important that lessons from that field be carried over and applied where possible in this new domain. However, these new workloads have not yet been described in the literature. We fill this gap with an empirical analysis of MapReduce traces from six separate business-critical deployments inside Facebook and at Cloudera customers in e-commerce, telecommunications, media, and retail. Our key contribution is a characterization of new MapReduce workloads which are driven in part by interactive analysis, and which make heavy use of query-like programming frameworks on top of MapReduce. These workloads display diverse behaviors which invalidate prior assumptions about MapReduce such as uniform data access, regular diurnal patterns, and prevalence of large jobs. A secondary contribution is a first step towards creating a TPC-like data processing benchmark for MapReduce.

modeling, analysis, and simulation on computer and telecommunication systems | 2011

The Case for Evaluating MapReduce Performance Using Workload Suites

Yanpei Chen; Archana Ganapathi; Rean Griffith; Randy H. Katz

MapReduce systems face enormous challenges due to increasing growth, diversity, and consolidation of the data and computation involved. Provisioning, configuring, and managing large-scale MapReduce clusters require realistic, workload-specific performance insights that existing MapReduce benchmarks are ill-equipped to supply. In this paper, we build the case for going beyond benchmarks for MapReduce performance evaluations. We analyze and compare two production MapReduce traces to develop a vocabulary for describing MapReduce workloads. We show that existing benchmarks fail to capture rich workload characteristics observed in traces, and propose a framework to synthesize and execute representative workloads. We demonstrate that performance evaluations using realistic workloads gives cluster operator new ways to identify workload-specific resource bottlenecks, and workload-specific choice of MapReduce task schedulers. We expect that once available, workload suites would allow cluster operators to accomplish previously challenging tasks beyond what we can now imagine, thus serving as a useful tool to help design and manage MapReduce systems.

workshop on research on enterprise networking | 2009

Understanding TCP incast throughput collapse in datacenter networks

Yanpei Chen; Rean Griffith; Junda Liu; Randy H. Katz; Anthony D. Joseph

TCP Throughput Collapse, also known as Incast, is a pathological behavior of TCP that results in gross under-utilization of link capacity in certain many-to-one communication patterns. This phenomenon has been observed by others in distributed storage, MapReduce and web-search workloads. In this paper we focus on understanding the dynamics of Incast. We use empirical data to reason about the dynamic system of simultaneously communicating TCP entities. We propose an analytical model to account for the observed Incast symptoms, identify contributory factors, and explore the efficacy of solutions proposed by us and by others.

international conference on data engineering | 2010

Statistics-driven workload modeling for the Cloud

Archana Ganapathi; Yanpei Chen; Armando Fox; Randy H. Katz; David A. Patterson

A recent trend for data-intensive computations is to use pay-as-you-go execution environments that scale transparently to the user. However, providers of such environments must tackle the challenge of configuring their system to provide maximal performance while minimizing the cost of resources used. In this paper, we use statistical models to predict resource requirements for Cloud computing applications. Such a prediction framework can guide system design and deployment decisions such as scale, scheduling, and capacity. In addition, we present initial design of a workload generator that can be used to evaluate alternative configurations without the overhead of reproducing a real workload. This paper focuses on statistical modeling and its application to data-intensive workloads.

european conference on computer systems | 2012

Energy efficiency for large-scale MapReduce workloads with significant interactive analysis

Yanpei Chen; Sara Alspaugh; Dhruba Borthakur; Randy H. Katz

MapReduce workloads have evolved to include increasing amounts of time-sensitive, interactive data analysis; we refer to such workloads as MapReduce with Interactive Analysis (MIA). Such workloads run on large clusters, whose size and cost make energy efficiency a critical concern. Prior works on MapReduce energy efficiency have not yet considered this workload class. Increasing hardware utilization helps improve efficiency, but is challenging to achieve for MIA workloads. These concerns lead us to develop BEEMR (Berkeley Energy Efficient MapReduce), an energy efficient MapReduce workload manager motivated by empirical analysis of real-life MIA traces at Facebook. The key insight is that although MIA clusters host huge data volumes, the interactive jobs operate on a small fraction of the data, and thus can be served by a small pool of dedicated machines; the less time-sensitive jobs can run on the rest of the cluster in a batch fashion. BEEMR achieves 40-50% energy savings under tight design constraints, and represents a first step towards improving energy efficiency for an increasingly important class of datacenter workloads.

acm special interest group on data communication | 2010

To compress or not to compress - compute vs. IO tradeoffs for mapreduce energy efficiency

Yanpei Chen; Archana Ganapathi; Randy H. Katz

Compression enables us to shift resource bottlenecks between IO and CPU. In modern datacenters, where energy efficiency is a growing concern, the benefits of using compression have not been completely exploited. As MapReduce represents a common computation framework for Internet datacenters, we develop a decision algorithm that helps MapReduce users identify when and where to use compression. For some jobs, using compression gives energy savings of up to 60%. We believe our findings will provide signficant impact on improving datacenter energy efficiency.

symposium on operating systems principles | 2011

Design implications for enterprise storage systems via multi-dimensional trace analysis

Yanpei Chen; Kiran Srinivasan; Garth R. Goodson; Randy H. Katz

Enterprise storage systems are facing enormous challenges due to increasing growth and heterogeneity of the data stored. Designing future storage systems requires comprehensive insights that existing trace analysis methods are ill-equipped to supply. In this paper, we seek to provide such insights by using a new methodology that leverages an objective, multi-dimensional statistical technique to extract data access patterns from network storage system traces. We apply our method on two large-scale real-world production network storage system traces to obtain comprehensive access patterns and design insights at user, application, file, and directory levels. We derive simple, easily implementable, threshold-based design optimizations that enable efficient data placement and capacity optimization strategies for servers, consolidation policies for clients, and improved caching performance for both.

local computer networks | 2008

Energy efficient Ethernet encodings

Yanpei Chen; Tracy Xiaoxiao Wang; Randy H. Katz

The energy efficiency of network elements is becoming more prominent, with growing concern for Internet power consumption and heat dissipation in datacenters and communications closets. Previous work has looked at energy efficient wireless topologies, network nodes, routers, and protocols. In considering a fresh redesign of the Internet datacenter for energy efficiency, we believe that energy efficient encodings are worthy of study. In this work, we re-examine the choice of Ethernet encoding, develop an associated energy model, evaluate current encodings, and propose new encodings. We found that simpler encodings are more energy efficient, with power savings of around 20% for the best encoding. Our work represents a first step in re-examining the established assumptions and practices of the PHY level of the network stack with respect to energy.

international conference on big data | 2016

Data quality: Experiences and lessons from operationalizing big data

Archana Ganapathi; Yanpei Chen

Data quality issues pose a significant barrier to operationalizing big data. They pertain to the meaning of the data, the consistency of that meaning, the human interpretation of results, and the contexts in which the results are used. Data quality issues arise after organizations have moved past clear-cut technical solutions to early bottlenecks in using data. Left unaddressed, such issues can and have led to high profile missteps, and raise doubts about the data-driven world view altogether. In this paper, we share real-world case studies of tackling data quality challenges across industry verticals. We present initial ideas on how to systematically address data quality issues via technology. The success of operationalizing big data will depend on the quality of data involved, and whether such data causes uncertainty and disruptions, or delivers genuine knowledge and value.

Archive | 2010