Publication


Featured research published by John Wilkes.


symposium on cloud computing | 2011

CloudScale: elastic resource scaling for multi-tenant cloud systems

Zhiming Shen; Sethuraman Subbiah; Xiaohui Gu; John Wilkes

Elastic resource scaling lets cloud systems meet application service level objectives (SLOs) with minimum resource provisioning costs. In this paper, we present CloudScale, a system that automates fine-grained elastic resource scaling for multi-tenant cloud computing infrastructures. CloudScale employs online resource demand prediction and prediction error handling to achieve adaptive resource allocation without assuming any prior knowledge about the applications running inside the cloud. CloudScale can resolve scaling conflicts between applications using migration, and integrates dynamic CPU voltage/frequency scaling to achieve energy savings with minimal effect on application SLOs. We have implemented CloudScale on top of Xen and conducted extensive experiments using a set of CPU- and memory-intensive applications (RUBiS, Hadoop, IBM System S). The results show that CloudScale can achieve significantly higher SLO conformance than other alternatives with low resource and energy cost. CloudScale is non-intrusive and lightweight, and imposes negligible overhead (< 2% CPU in Domain 0) on the virtualized computing cluster.
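
The core loop the abstract describes, predicting demand online and then padding the allocation to absorb prediction error, can be sketched as follows. This is a hypothetical illustration, not CloudScale's code; the predictor, the padding quantile, and all names are assumptions.

```python
from collections import deque

class ElasticScaler:
    """Toy illustration of prediction-driven scaling with error padding.

    A hedged sketch inspired by the CloudScale abstract, not the paper's
    actual algorithm or implementation.
    """

    def __init__(self, window=12, pad_quantile=0.9):
        self.history = deque(maxlen=window)   # recent usage samples (e.g. CPU cores)
        self.errors = deque(maxlen=window)    # recent under-prediction errors
        self.pad_quantile = pad_quantile

    def observe(self, used, predicted):
        self.history.append(used)
        self.errors.append(max(0.0, used - predicted))  # only under-predictions hurt SLOs

    def predict(self):
        # Naive predictor: exponentially weighted average of recent usage.
        if not self.history:
            return 0.0
        est, alpha = self.history[0], 0.5
        for sample in list(self.history)[1:]:
            est = alpha * sample + (1 - alpha) * est
        return est

    def next_allocation(self):
        # Pad the prediction with a high quantile of recent under-prediction errors.
        if self.errors:
            idx = int(self.pad_quantile * (len(self.errors) - 1))
            pad = sorted(self.errors)[idx]
        else:
            pad = 0.0
        return self.predict() + pad
```

A real system would also resolve scaling conflicts between co-located applications (e.g., by migration) and coordinate CPU voltage/frequency scaling, which this sketch omits.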


conference on network and service management | 2010

PRESS: PRedictive Elastic ReSource Scaling for cloud systems

Zhenhuan Gong; Xiaohui Gu; John Wilkes

Cloud systems require elastic resource allocation to minimize resource provisioning costs while meeting service level objectives (SLOs). In this paper, we present a novel PRedictive Elastic reSource Scaling (PRESS) scheme for cloud systems. PRESS unobtrusively extracts fine-grained dynamic patterns in application resource demands and adjusts their resource allocations automatically. Our approach leverages light-weight signal processing and statistical learning algorithms to achieve online predictions of dynamic application resource requirements. We have implemented the PRESS system on Xen and tested it using RUBiS and an application load trace from Google. Our experiments show that we can achieve good resource prediction accuracy with less than 5% over-estimation error and near-zero under-estimation error, and that elastic resource scaling can significantly reduce both resource waste and SLO violations.
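
A rough sketch of the kind of lightweight signal-processing prediction the abstract refers to: use an FFT to look for a dominant period in the usage signal and, if one exists, predict from past samples at the same phase, otherwise fall back to a high percentile of recent usage. The specific method, thresholds, and names below are assumptions, not PRESS's actual algorithm.

```python
import numpy as np

def predict_next_demand(usage, min_strength=3.0, fallback_window=10):
    """Hypothetical periodic-signature predictor in the spirit of PRESS.

    usage: 1-D sequence of historical resource-usage samples.
    Returns a prediction for the next sample; an illustration only.
    """
    usage = np.asarray(usage, dtype=float)
    n = len(usage)
    if n >= 4:
        spectrum = np.abs(np.fft.rfft(usage - usage.mean()))
        spectrum[0] = 0.0                               # drop the DC component
        k = int(np.argmax(spectrum))
        noise_floor = np.median(spectrum[1:]) + 1e-9
        if k > 0 and spectrum[k] > min_strength * noise_floor:
            period = max(1, round(n / k))
            same_phase = usage[n % period::period]      # past samples at the next point's phase
            if len(same_phase):
                return float(same_phase.max())          # bias against under-estimation
    # No strong periodicity detected: fall back to a high percentile of recent usage.
    return float(np.percentile(usage[-fallback_window:], 90))

# e.g. predict_next_demand([1, 5, 1, 5, 1, 5, 1, 5]) -> 1.0 (the pattern repeats with period 2)
```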


european conference on computer systems | 2013

Omega: flexible, scalable schedulers for large compute clusters

Malte Schwarzkopf; Andy Konwinski; Michael Abd-El-Malek; John Wilkes

Increasing scale and the need for rapid response to changing requirements are hard to meet with current monolithic cluster scheduler architectures. This restricts the rate at which new features can be deployed, decreases efficiency and utilization, and will eventually limit cluster growth. We present a novel approach to address these needs using parallelism, shared state, and lock-free optimistic concurrency control. We compare this approach to existing cluster scheduler designs, evaluate how much interference between schedulers occurs and how much it matters in practice, present some techniques to alleviate it, and finally discuss a use case highlighting the advantages of our approach -- all driven by real-life Google production workloads.
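
The shared-state, optimistic-concurrency pattern the abstract describes can be illustrated with a toy model: each scheduler plans against a private snapshot of cell state and then attempts a versioned commit, retrying on conflict. Omega is a Google-internal system; everything below is an assumed illustration of the pattern, not its implementation.

```python
import threading

class SharedCellState:
    """Toy shared-state store with optimistic-concurrency commits (illustrative only)."""

    def __init__(self, machines):
        self._lock = threading.Lock()          # guards only the snapshot/commit sections
        self.free = dict(machines)             # machine -> free CPU cores
        self.version = {m: 0 for m in machines}

    def snapshot(self):
        with self._lock:
            return dict(self.free), dict(self.version)

    def commit(self, placements, seen_versions):
        """placements: {machine: cores_to_claim}. Succeeds only if no machine changed."""
        with self._lock:
            if any(self.version[m] != seen_versions[m] for m in placements):
                return False                   # another scheduler won the race; caller retries
            for m, cores in placements.items():
                if self.free[m] < cores:
                    return False
                self.free[m] -= cores
                self.version[m] += 1
            return True

def schedule(cell, task_cores, max_retries=5):
    # Each scheduler plans against its own snapshot, then tries to commit.
    for _ in range(max_retries):
        free, versions = cell.snapshot()
        machine = max(free, key=free.get)      # trivial placement policy: most free cores
        if free[machine] < task_cores:
            return None
        if cell.commit({machine: task_cores}, {machine: versions[machine]}):
            return machine
    return None
```

The per-machine version check is what makes the concurrency optimistic: schedulers never hold a lock while making placement decisions, only while committing them.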


european conference on computer systems | 2015

Large-scale cluster management at Google with Borg

Abhishek Verma; Luis Pedrosa; Madhukar R. Korupolu; David Oppenheimer; Eric Tune; John Wilkes

Google's Borg system is a cluster manager that runs hundreds of thousands of jobs, from many thousands of different applications, across a number of clusters each with up to tens of thousands of machines. It achieves high utilization by combining admission control, efficient task-packing, over-commitment, and machine sharing with process-level performance isolation. It supports high-availability applications with runtime features that minimize fault-recovery time, and scheduling policies that reduce the probability of correlated failures. Borg simplifies life for its users by offering a declarative job specification language, name service integration, real-time job monitoring, and tools to analyze and simulate system behavior. We present a summary of the Borg system architecture and features, important design decisions, a quantitative analysis of some of its policy decisions, and a qualitative examination of lessons learned from a decade of operational experience with it.
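
As an illustration of what a declarative job specification might carry (replica count, per-task resources, priority) and how admission control could use it, here is a hedged Python stand-in; the field names and the admission check are assumptions and do not reproduce Borg's actual configuration language.

```python
from dataclasses import dataclass

@dataclass
class JobSpec:
    """Hypothetical stand-in for a declarative job specification (not Borg's real syntax)."""
    name: str
    replicas: int        # number of identical tasks to run
    cpu: float           # cores requested per task
    ram_mb: int          # memory requested per task
    priority: int        # higher-priority jobs can preempt lower-priority ones

hello = JobSpec(name="hello-world", replicas=10000, cpu=0.1, ram_mb=100, priority=100)

def admit(job, cell_free_cpu, cell_free_ram_mb):
    """Toy admission control: accept the job only if its aggregate request fits the cell."""
    return (job.replicas * job.cpu <= cell_free_cpu and
            job.replicas * job.ram_mb <= cell_free_ram_mb)
```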


ACM Queue | 2016

Borg, Omega, and Kubernetes

Brendan D. Burns; Brian K. Grant; David Oppenheimer; Eric Brewer; John Wilkes

Though widespread interest in software containers is a relatively recent phenomenon, at Google we have been managing Linux containers at scale for more than ten years and built three different container-management systems in that time. Each system was heavily influenced by its predecessors, even though they were developed for different reasons. This article describes the lessons we’ve learned from developing and operating them.


symposium on cloud computing | 2014

Long-term SLOs for reclaimed cloud computing resources

Marcus Carvalho; Walfredo Cirne; Francisco Vilar Brasileiro; John Wilkes

The elasticity promised by cloud computing does not come for free. Providers need to reserve resources to allow users to scale on demand, and cope with workload variations, which results in low utilization. The current response to this low utilization is to re-sell unused resources with no Service Level Objectives (SLOs) for availability. In this paper, we show how to make some of these reclaimable resources more valuable by providing strong, long-term availability SLOs for them. These SLOs are based on forecasts of how many resources will remain unused during multi-month periods, so users can do capacity planning for their long-running services. By using confidence levels for the predictions, we give service providers control over the risk of violating the availability SLOs, and allow them to trade increased risk for more resources to make available. We evaluated our approach using 45 months of workload data from 6 production clusters at Google, and show that 6--17% of the resources can be re-offered with a long-term availability of 98.9% or better. A conservative analysis shows that doing so may increase the profitability of selling reclaimed resources by 22--60%.
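
The quantile idea behind such SLOs can be shown in a few lines: to offer capacity with a given availability, pick the amount that unused capacity exceeded in that fraction of historical periods. This is a backward-looking simplification of the paper's forecasting approach; the function and its parameters are illustrative assumptions.

```python
import numpy as np

def reclaimable_offer(unused_capacity, availability=0.989):
    """Hedged sketch: how much capacity could be offered at a target availability.

    unused_capacity: historical samples of capacity left unused in each period.
    Returns the amount that was available (i.e. unused) in at least
    `availability` of the observed periods. The paper forecasts future
    multi-month periods with confidence levels; this only looks backwards.
    """
    unused = np.asarray(unused_capacity, dtype=float)
    # The offer is the (1 - availability) quantile of historical unused capacity:
    # a lower quantile means less capacity offered but less risk of violating the SLO.
    return float(np.quantile(unused, 1.0 - availability))

# Example: offer = reclaimable_offer(history, availability=0.989)
```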


measurement and modeling of computer systems | 2010

Do you know your IQ?: a research agenda for information quality in systems

Kimberly Keeton; Pankaj Mehra; John Wilkes

Information quality (IQ) is a measure of how fit information is for a purpose. Sometimes called Quality of Information (QoI) by analogy with Quality of Service (QoS), it quantifies whether the correct information is being used to make a decision or take an action. Not understanding when information is of adequate quality can lead to bad decisions and catastrophic effects, including system outages, increased costs, lost revenue -- and worse. Quantifying information quality can help improve decision making, but the ultimate goal should be to select or construct information producers that have the appropriate balance between information quality and the cost of providing it. In this paper, we provide a brief introduction to the field, argue the case for applying information quality metrics in the systems domain, and propose a research agenda to explore this space.


network operations and management symposium | 2012

Obfuscatory obscanturism: Making workload traces of commercially-sensitive systems safe to release

Charles Reiss; John Wilkes; Joseph L. Hellerstein

Cloud providers such as Google are interested in fostering research on the daunting technical challenges they face in supporting planetary-scale distributed systems, but no academic organizations have similar scale systems on which to experiment. Fortunately, good research can still be done using traces of real-life production workloads, but there are risks in releasing such data, including inadvertently disclosing confidential or proprietary information, as happened with the Netflix Prize data. This paper discusses these risks, and our approach to them, which we call systematic obfuscation. It protects proprietary and personal data while leaving it possible to answer interesting research questions. We explain and motivate some of the risks and concerns and propose how they can best be mitigated, using as an example our recent publication of a month-long trace of a production system workload on an 11k-machine cluster.
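
A generic sketch of the kinds of transformations systematic obfuscation can involve: replacing sensitive identifiers with keyed hashes and rescaling resource values so that absolute machine sizes stay private. The field names, key, and scaling below are assumptions, not the released trace's actual schema or method.

```python
import hmac, hashlib

SECRET_KEY = b"trace-release-key"   # hypothetical key, held only by the data owner

def obfuscate_name(name):
    """Replace a sensitive identifier (user, job name) with a stable opaque token."""
    return hmac.new(SECRET_KEY, name.encode(), hashlib.sha256).hexdigest()[:16]

def normalize_resource(value, max_value):
    """Rescale a measurement to [0, 1] so absolute capacities are not disclosed."""
    return round(value / max_value, 4) if max_value else 0.0

# Example record transformation (field names are illustrative, not the trace schema):
raw = {"user": "alice@example.com", "job": "frontend-service", "cpu": 3.2}
obfuscated = {
    "user": obfuscate_name(raw["user"]),
    "job": obfuscate_name(raw["job"]),
    "cpu": normalize_resource(raw["cpu"], max_value=64.0),
}
```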


international conference on cluster computing | 2014

Evaluating job packing in warehouse-scale computing

Abhishek Verma; Madhukar R. Korupolu; John Wilkes

One of the key factors in selecting a good scheduling algorithm is using an appropriate metric for comparing schedulers. But which metric should be used when evaluating schedulers for warehouse-scale (cloud) clusters, which have machines of different types and sizes, heterogeneous workloads with dependencies and constraints on task placement, and long-running services that consume a large fraction of the total resources? Traditional scheduler evaluations that focus on metrics such as queuing delay, makespan, and running time fail to capture important behaviors -- and ones that rely on workload synthesis and scaling often ignore important factors such as constraints. This paper explains some of the complexities and issues in evaluating warehouse-scale schedulers, focusing on what we find to be the single most important aspect in practice: how well they pack long-running services into a cluster. We describe and compare four metrics for evaluating the packing efficiency of schedulers, in increasing order of sophistication: aggregate utilization, hole filling, workload inflation, and cluster compaction.
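
Of the four metrics, hole filling is the simplest to illustrate: probe each machine's leftover capacity with a fixed-size slot and count how many whole slots still fit. The slot size and data layout below are assumptions for illustration only, not the paper's definition.

```python
def hole_filling_score(machines, unit=(1.0, 4.0)):
    """Hedged sketch of a hole-filling style packing metric.

    machines: list of (free_cpu_cores, free_ram_gb) per machine after placement.
    unit: the standard "slot" size used to probe leftover capacity (illustrative).
    Returns how many whole slots the leftover holes could still hold; with the
    same total leftover capacity, a scheduler that fragments it less leaves
    room for more whole slots.
    """
    unit_cpu, unit_ram = unit
    slots = 0
    for free_cpu, free_ram in machines:
        slots += int(min(free_cpu // unit_cpu, free_ram // unit_ram))
    return slots

# Example: score the leftover capacity of three machines.
print(hole_filling_score([(2.0, 8.0), (0.5, 1.0), (3.0, 16.0)]))  # -> 5
```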


international conference on autonomic computing | 2013

AGILE: Elastic Distributed Resource Scaling for Infrastructure-as-a-Service

Hiep Nguyen; Zhiming Shen; Xiaohui Gu; Sethuraman Subbiah; John Wilkes
