
Publication


Featured research published by Peter Bodik.


Information Processing in Sensor Networks | 2004

Distributed regression: an efficient framework for modeling sensor network data

Carlos Guestrin; Peter Bodik; Romain Thibaux; Mark A. Paskin; Samuel Madden

We present distributed regression, an efficient and general framework for in-network modeling of sensor data. In this framework, the nodes of the sensor network collaborate to optimally fit a global function to each of their local measurements. The algorithm is based upon kernel linear regression, where the model takes the form of a weighted sum of local basis functions; this provides an expressive yet tractable class of models for sensor network data. Rather than transmitting data to one another or outside the network, nodes communicate constraints on the model parameters, drastically reducing the communication required. After the algorithm is run, each node can answer queries for its local region, or the nodes can efficiently transmit the parameters of the model to a user outside the network. We present an evaluation of the algorithm based upon data from a 48-node sensor network deployment at the Intel Research - Berkeley Lab, demonstrating that our distributed algorithm converges to the optimal solution at a fast rate and is very robust to packet losses.
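
The communication saving comes from exchanging fixed-size model statistics instead of raw readings. As a minimal sketch of that idea (our own simplification, not the paper's algorithm: a shared linear basis, two nodes, and a single exchange rather than iterative in-network message passing), each node can summarize its measurements as normal-equation terms, and summing those terms recovers the global least-squares fit:

    import numpy as np

    def local_sufficient_stats(x, y, basis):
        # Summarize raw readings as the normal-equation terms
        # A = Phi^T Phi and b = Phi^T y for the shared basis functions.
        Phi = np.column_stack([f(x) for f in basis])
        return Phi.T @ Phi, Phi.T @ y

    # Two hypothetical nodes with disjoint local measurements.
    basis = [lambda t: np.ones_like(t), lambda t: t]   # intercept + slope
    t1, y1 = np.array([0.0, 1.0, 2.0]), np.array([1.0, 1.9, 3.1])
    t2, y2 = np.array([3.0, 4.0, 5.0]), np.array([4.2, 4.8, 6.1])

    A1, b1 = local_sufficient_stats(t1, y1, basis)
    A2, b2 = local_sufficient_stats(t2, y2, basis)

    # Summing the fixed-size statistics recovers the global fit
    # without ever shipping the raw measurements off-node.
    theta = np.linalg.solve(A1 + A2, b1 + b2)
    print("global coefficients:", theta)

Each summary is a 2x2 matrix and a 2-vector regardless of how many readings a node holds, which is what makes the exchange cheap.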


Symposium on Cloud Computing | 2010

Characterizing, modeling, and generating workload spikes for stateful services

Peter Bodik; Armando Fox; Michael J. Franklin; Michael I. Jordan; David A. Patterson

Evaluating the resiliency of stateful Internet services to significant workload spikes and data hotspots requires realistic workload traces that are usually very difficult to obtain. A popular approach is to create a workload model and generate synthetic workload; however, there exists no characterization or model of stateful spikes. In this paper we analyze five workload and data spikes and find that they vary significantly in many important aspects such as steepness, magnitude, duration, and spatial locality. We propose and validate a model of stateful spikes that allows us to synthesize volume and data spikes and could thus be used by both cloud computing users and providers to stress-test their infrastructure.
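
To make the four characteristics concrete, here is a hypothetical volume-spike generator in Python; the trapezoidal shape and parameter names are our illustration, and the paper's model additionally captures spatial locality across data objects:

    import numpy as np

    def synthetic_spike(baseline, magnitude, steepness, duration, length=200):
        # Overlay a single spike on a flat baseline: ramp up at the given
        # steepness, hold near the peak for roughly `duration` steps,
        # then ramp back down.
        t = np.arange(length, dtype=float)
        start = (length - duration) / 2.0
        ramp_up = np.clip((t - start) * steepness, 0.0, 1.0)
        ramp_down = np.clip((start + duration - t) * steepness, 0.0, 1.0)
        return baseline + magnitude * np.minimum(ramp_up, ramp_down)

    # A steep, large, 60-step spike on top of 100 requests/sec.
    workload = synthetic_spike(baseline=100.0, magnitude=400.0,
                               steepness=0.2, duration=60)
    print(workload.max())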


European Conference on Computer Systems | 2012

Jockey: guaranteed job latency in data parallel clusters

Andrew D. Ferguson; Peter Bodik; Srikanth Kandula; Eric Boutin; Rodrigo Fonseca

Data processing frameworks such as MapReduce [8] and Dryad [11] are used today in business environments where customers expect guaranteed performance. To date, however, these systems are not capable of providing guarantees on job latency because scheduling policies are based on fair-sharing, and operators seek high cluster utilization through statistical multiplexing and over-subscription. With Jockey, we provide latency SLOs for data parallel jobs written in SCOPE. Jockey precomputes statistics using a simulator that captures the job's complex internal dependencies, accurately and efficiently predicting the remaining run time at different resource allocations and in different stages of the job. Our control policy monitors a job's performance and dynamically adjusts resource allocation in the shared cluster in order to maximize the job's economic utility while minimizing its impact on the rest of the cluster. In our experiments in Microsoft's production Cosmos clusters, Jockey meets the specified job latency SLOs and responds to changes in cluster conditions.
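
A simplified sketch of the control policy in Python: given a predictor standing in for Jockey's precomputed simulator statistics, the controller picks the cheapest allocation whose predicted remaining run time fits within the SLO budget. The names and the toy predictor are ours, not the production implementation:

    def pick_allocation(remaining_time, progress, deadline_slack, allocations):
        # Return the smallest allocation predicted to finish within the
        # remaining SLO budget, falling back to the maximum if none does.
        for alloc in sorted(allocations):
            if remaining_time(progress, alloc) <= deadline_slack:
                return alloc
        return max(allocations)

    # Hypothetical predictor: doubling resources halves remaining work.
    def toy_predictor(progress, alloc):
        return (1.0 - progress) * 1000.0 / alloc

    # 40% done, 30 time units of slack left -> 20 tokens suffice.
    print(pick_allocation(toy_predictor, progress=0.4,
                          deadline_slack=30.0, allocations=[5, 10, 20, 40]))

Rerunning this at each control interval yields the adaptive behavior described above: the allocation shrinks when the job is ahead of its deadline and grows when it falls behind.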


European Conference on Computer Systems | 2010

Fingerprinting the datacenter: automated classification of performance crises

Peter Bodik; Moises Goldszmidt; Armando Fox; Dawn B. Woodard; Hans Christian Andersen

Contemporary datacenters comprise hundreds or thousands of machines running applications requiring high availability and responsiveness. Although a performance crisis is easily detected by monitoring key end-to-end performance indicators (KPIs) such as response latency or request throughput, the variety of conditions that can lead to KPI degradation makes it difficult to select appropriate recovery actions. We propose and evaluate a methodology for automatic classification and identification of crises, and in particular for detecting whether a given crisis has been seen before, so that a known solution may be immediately applied. Our approach is based on a new and efficient representation of the datacenter's state called a fingerprint, constructed by statistical selection and summarization of the hundreds of performance metrics typically collected on such systems. Our evaluation uses 4 months of trouble-ticket data from a production datacenter with hundreds of machines running a 24x7 enterprise-class user-facing application. In experiments in a realistic and rigorous operational setting, our approach provides operators the information necessary to initiate recovery actions with 80% correctness in an average of 10 minutes, which is 50 minutes earlier than the deadline provided to us by the operators. To the best of our knowledge this is the first rigorous evaluation of any such approach on a large-scale production installation.
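
The following Python sketch illustrates the idea of a fingerprint and of matching it against known crises; the hot/normal/cold coding against quantile bands follows the paper's spirit, but the metric selection, thresholds, and L1 matching rule here are illustrative placeholders:

    import numpy as np

    def fingerprint(metrics):
        # Code each metric's median across machines as cold (-1),
        # normal (0), or hot (+1) relative to its historical quantile band.
        codes = []
        for values, (lo, hi) in metrics:
            m = np.median(values)
            codes.append(-1 if m < lo else (1 if m > hi else 0))
        return np.array(codes)

    def match_crisis(current, known, max_dist=1):
        # Nearest stored fingerprint by L1 distance; report a new
        # (unseen) crisis if nothing is close enough.
        name, fp = min(known, key=lambda kv: np.abs(kv[1] - current).sum())
        return name if np.abs(fp - current).sum() <= max_dist else "unseen crisis"

    known = [("overload",     np.array([1, 1, 0])),
             ("config-error", np.array([-1, 0, 1]))]
    now = fingerprint([
        (np.array([9.0, 11.0]), (2.0, 8.0)),   # latency: hot
        (np.array([9.0, 12.0]), (2.0, 8.0)),   # queue length: hot
        (np.array([4.0, 5.0]),  (2.0, 8.0)),   # CPU: normal
    ])
    print(match_crisis(now, known))   # -> overload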


ACM Special Interest Group on Data Communication | 2012

Surviving failures in bandwidth-constrained datacenters

Peter Bodik; Ishai Menache; Mosharaf Chowdhury; Pradeepkumar Mani; David A. Maltz; Ion Stoica

Datacenter networks have been designed to tolerate failures of network equipment and provide sufficient bandwidth. In practice, however, failures and maintenance of networking and power equipment often make tens to thousands of servers unavailable, and network congestion can increase service latency. Unfortunately, there exists an inherent tradeoff between achieving high fault tolerance and reducing bandwidth usage in the network core; spreading servers across fault domains improves fault tolerance, but requires additional bandwidth, while deploying servers together reduces bandwidth usage, but also decreases fault tolerance. We present a detailed analysis of a large-scale Web application and its communication patterns. Based on that, we propose and evaluate a novel optimization framework that achieves both high fault tolerance and significantly reduces bandwidth usage in the network core by exploiting the skewness in the observed communication patterns.
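
The tradeoff can be made concrete with a toy cost function over server-to-fault-domain assignments; the linear combination and the alpha knob below are our illustration, not the paper's optimization framework. With skewed traffic, chatty servers can be co-located while the assignment still spreads servers evenly across domains:

    def placement_cost(assign, traffic, alpha=1.0):
        # (i) worst-case fraction of servers lost if one fault domain
        # fails, plus (ii) core bandwidth, approximated as traffic
        # between servers placed in different domains.
        domains = {}
        for server, dom in assign.items():
            domains.setdefault(dom, []).append(server)
        worst_loss = max(len(s) for s in domains.values()) / len(assign)
        cross = sum(t for (a, b), t in traffic.items() if assign[a] != assign[b])
        return worst_loss + alpha * cross

    # Skewed traffic: servers 0 and 1 talk a lot, the rest barely at all.
    traffic = {(0, 1): 10.0, (1, 2): 0.1, (2, 3): 0.1}
    spread   = {0: "A", 1: "B", 2: "A", 3: "B"}   # naive spreading
    colocate = {0: "A", 1: "A", 2: "B", 3: "B"}   # exploit the skew
    print(placement_cost(spread, traffic))    # 10.7: high core traffic
    print(placement_cost(colocate, traffic))  # 0.6: same fault tolerance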


ACM Special Interest Group on Data Communication | 2015

Network-Aware Scheduling for Data-Parallel Jobs: Plan When You Can

Virajith Jalaparti; Peter Bodik; Ishai Menache; Sriram Rao; Konstantin Makarychev; Matthew Caesar

To reduce the impact of network congestion on big data jobs, cluster management frameworks use various heuristics to schedule compute tasks and/or network flows. Most of these schedulers consider the job input data fixed and greedily schedule the tasks and flows that are ready to run. However, a large fraction of production jobs are recurring with predictable characteristics, which allows us to plan ahead for them. Coordinating the placement of data and tasks of these jobs allows for significantly improving their network locality and freeing up bandwidth, which can be used by other jobs running on the cluster. With this intuition, we develop Corral, a scheduling framework that uses characteristics of future workloads to determine an offline schedule which (i) jointly places data and compute to achieve better data locality, and (ii) isolates jobs both spatially (by scheduling them in different parts of the cluster) and temporally, improving their performance. We implement Corral on Apache YARN, and evaluate it on a 210 machine cluster using production workloads. Compared to YARN's capacity scheduler, Corral reduces the makespan of these workloads by up to 33% and the median completion time by up to 56%, with a 20-90% reduction in data transferred across racks.
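
As a toy stand-in for Corral's offline planner, the sketch below greedily assigns recurring jobs (longest first) to the least-loaded cluster partition, placing each job's input data and tasks together; the real system solves a much richer joint placement problem:

    def plan_offline(jobs, partitions):
        # Greedy longest-job-first onto the least-loaded partition.
        # Each job's tasks AND its input data go to the same partition,
        # giving spatial isolation and data locality.
        load = {p: 0.0 for p in partitions}
        plan = {}
        for job, runtime in sorted(jobs.items(), key=lambda kv: -kv[1]):
            p = min(load, key=load.get)
            plan[job] = p
            load[p] += runtime
        return plan, load

    # Hypothetical recurring workload with predicted runtimes (minutes).
    jobs = {"hourly-etl": 50.0, "daily-report": 30.0, "ml-train": 45.0}
    plan, load = plan_offline(jobs, ["rack-group-1", "rack-group-2"])
    print(plan, load)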


ACM Special Interest Group on Data Communication | 2013

Speeding up distributed request-response workflows

Virajith Jalaparti; Peter Bodik; Srikanth Kandula; Ishai Menache; Mikhail Rybalkin; Chenyu Yan

We found that interactive services at Bing have highly variable datacenter-side processing latencies because their processing consists of many sequential stages, parallelization across 10s-1000s of servers and aggregation of responses across the network. To improve the tail latency of such services, we use a few building blocks: reissuing laggards elsewhere in the cluster, new policies to return incomplete results and speeding up laggards by giving them more resources. Combining these building blocks to reduce the overall latency is non-trivial because for the same amount of resource (e.g., number of reissues), different stages improve their latency by different amounts. We present Kwiken, a framework that takes an end-to-end view of latency improvements and costs. It decomposes the problem of minimizing latency over a general processing DAG into a manageable optimization over individual stages. Through simulations with production traces, we show sizable gains; the 99th percentile of latency improves by over 50% when just 0.1% of the responses are allowed to have partial results and by over 40% for 25% of the services when just 5% extra resources are used for reissues.
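
The reissue building block is easy to sketch: if a request has not returned within a timeout, send a duplicate and take whichever copy finishes first. The asyncio harness and latency model below are our illustration; Kwiken's contribution is allocating the reissue budget across the stages of the DAG, which this omits:

    import asyncio, random

    async def query(server):
        # Stand-in for one stage's RPC; latency has a heavy tail.
        await asyncio.sleep(random.expovariate(1.0) ** 2)
        return server

    async def with_reissue(servers, timeout):
        # Issue to the first replica; if it lags past `timeout`,
        # reissue to a second and take whichever finishes first.
        first = asyncio.ensure_future(query(servers[0]))
        done, _ = await asyncio.wait({first}, timeout=timeout)
        if done:
            return first.result()
        second = asyncio.ensure_future(query(servers[1]))
        done, pending = await asyncio.wait(
            {first, second}, return_when=asyncio.FIRST_COMPLETED)
        for task in pending:
            task.cancel()
        return done.pop().result()

    print(asyncio.run(with_reissue(["replica-a", "replica-b"], timeout=0.5)))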


Workshop on Automated Control for Datacenters and Clouds | 2009

Automatic exploration of datacenter performance regimes

Peter Bodik; Rean Griffith; Charles A. Sutton; Armando Fox; Michael I. Jordan; David A. Patterson

Horizontally scalable Internet services present an opportunity to use automatic resource allocation strategies for system management in the datacenter. In most of the previous work, a controller employs a performance model of the system to make decisions about the optimal allocation of resources. However, these models are usually trained offline or on a small-scale deployment and will not accurately capture the performance of the controlled application. To achieve accurate control of the web application, the models need to be trained directly on the production system and adapted to changes in workload and performance of the application. In this paper we propose to train the performance model using an exploration policy that quickly collects data from different performance regimes of the application. The goal of our approach for managing the exploration process is to strike a balance between not violating the performance SLAs and the need to collect sufficient data to train an accurate performance model, which requires pushing the system close to its capacity. We show that by using our exploration policy, we can train a performance model of a Web 2.0 application in less than an hour and then immediately use the model in a resource allocation controller.
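
A bare-bones illustration of SLA-aware exploration in Python: shrink the allocation step by step, recording (allocation, latency) points, and stop once latency approaches the SLA. The 0.9 safety margin and the toy latency curve are assumptions for the sketch, not the paper's policy:

    def explore(measure_latency, sla, step=5, min_servers=1, max_servers=50):
        # Collect training data from distinct performance regimes by
        # shrinking the allocation until latency approaches the SLA;
        # measure_latency(n) would drive real traffic at the live system.
        data = []
        servers = max_servers
        while servers >= min_servers:
            latency = measure_latency(servers)
            data.append((servers, latency))
            if latency > 0.9 * sla:   # too close to violating the SLA
                break                 # stop: we reached the interesting regime
            servers -= step           # push the system nearer its capacity
        return data

    # Toy latency curve: degrades sharply once under ~10 servers.
    toy = lambda n: 50.0 + 2000.0 / n
    print(explore(toy, sla=300.0))

The returned points span both the comfortable and the near-capacity regimes, which is exactly the coverage an accurate performance model needs.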


ACM Symposium on Parallel Algorithms and Architectures | 2014

Brief announcement: deadline-aware scheduling of big-data processing jobs

Peter Bodik; Ishai Menache; Joseph Naor; Jonathan Yaniv

This paper presents a novel algorithm for scheduling big data jobs on large compute clusters. In our model, each job is represented by a DAG consisting of several stages linked by precedence constraints. The resource allocation per stage is malleable, in the sense that the processing time of a stage depends on the resources allocated to it (the dependency can be arbitrary in general). The goal of the scheduler is to maximize the total value of completed jobs, where the value for each job depends on its completion time. We design an algorithm for the problem which guarantees an expected constant approximation factor when the cluster capacity is sufficiently high. To the best of our knowledge, this is the first constant-factor approximation algorithm for the problem. The algorithm is based on formulating the problem as a linear program and then rounding an optimal (fractional) solution into a feasible (integral) schedule using randomized rounding.
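
The rounding step can be illustrated in a few lines: treat the fractional LP solution as a distribution over schedule slots for each job and sample from it, discarding jobs that land in full slots. This is only the bare skeleton; the paper's algorithm and its approximation guarantee involve malleable DAG stages and completion-time values that this omits:

    import random

    def randomized_round(fractional, capacity):
        # fractional[job][slot] is the LP weight for scheduling `job`
        # in `slot`, interpreted here as a sampling probability.
        schedule, used = {}, {}
        for job, dist in fractional.items():
            slots, probs = zip(*dist.items())
            slot = random.choices(slots, weights=probs)[0]
            if used.get(slot, 0) < capacity:   # keep only feasible picks
                schedule[job] = slot
                used[slot] = used.get(slot, 0) + 1
        return schedule

    # Hypothetical fractional solution over two time slots.
    fractional = {"jobA": {0: 0.7, 1: 0.3},
                  "jobB": {0: 0.5, 1: 0.5},
                  "jobC": {1: 1.0}}
    print(randomized_round(fractional, capacity=1))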


IEEE Computer | 2017

Real-Time Video Analytics: The Killer App for Edge Computing

Ganesh Ananthanarayanan; Paramvir Bahl; Peter Bodik; Krishna Chintalapudi; Matthai Philipose; Lenin Ravindranath; Sudipta N. Sinha

Video analytics will drive a wide range of applications with great potential to impact society. A geographically distributed architecture of public clouds and edges that extend down to the cameras is the only feasible approach to meeting the strict real-time requirements of large-scale live video analytics.

Collaboration


Dive into Peter Bodik's collaborations.

Top Co-Authors


Armando Fox

University of California
