Publications


Featured research published by Walfredo Cirne.


Measurement and Modeling of Computer Systems | 2010

Towards characterizing cloud backend workloads: insights from Google compute clusters

Asit K. Mishra; Joseph L. Hellerstein; Walfredo Cirne; Chita R. Das

The advent of cloud computing promises highly available, efficient, and flexible computing services for applications such as web search, email, voice over IP, and web search alerts. Our experience at Google is that realizing the promises of cloud computing requires an extremely scalable backend consisting of many large compute clusters that are shared by application tasks with diverse service level requirements for throughput, latency, and jitter. These considerations impact (a) capacity planning to determine which machine resources must grow and by how much and (b) task scheduling to achieve high machine utilization and to meet service level objectives.

Both capacity planning and task scheduling require a good understanding of task resource consumption (e.g., CPU and memory usage). This in turn demands simple and accurate approaches to workload classification: determining how to form groups of tasks (workloads) with similar resource demands. One approach to workload classification is to make each task its own workload. However, this approach scales poorly since tens of thousands of tasks execute daily on Google compute clusters. Another approach to workload classification is to view all tasks as belonging to a single workload. Unfortunately, applying such a coarse-grain workload classification to the diversity of tasks running on Google compute clusters results in large variances in predicted resource consumptions.

This paper describes an approach to workload classification and its application to the Google Cloud Backend, arguably the largest cloud backend on the planet. Our methodology for workload classification consists of: (1) identifying the workload dimensions; (2) constructing task classes using an off-the-shelf algorithm such as k-means; (3) determining the break points for qualitative coordinates within the workload dimensions; and (4) merging adjacent task classes to reduce the number of workloads. We use the foregoing, especially the notion of qualitative coordinates, to glean several insights about the Google Cloud Backend: (a) the duration of task executions is bimodal in that tasks either have a short duration or a long duration; (b) most tasks have short durations; and (c) most resources are consumed by a few tasks with long duration that have large demands for CPU and memory.
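
The clustering step of this methodology is concrete enough to sketch. Below is a minimal, illustrative Python example of step (2), assuming scikit-learn; the toy task records and the column choices (duration, CPU, memory) are assumptions for illustration, not the paper's actual trace schema or workload dimensions.

```python
# Hypothetical sketch of step (2): clustering tasks by resource usage with
# off-the-shelf k-means. Data and columns are invented, not Google's trace.
import numpy as np
from sklearn.cluster import KMeans

# Toy task records: (duration_hours, cpu_cores, memory_gb)
tasks = np.array([
    [0.1, 0.2, 0.5],     # many short, small tasks...
    [0.2, 0.1, 0.3],
    [400.0, 8.0, 32.0],  # ...and a few long, resource-hungry ones
    [350.0, 6.0, 24.0],
])

# Form task classes over the workload dimensions.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(tasks)

# Steps (3)-(4) would then choose break points for qualitative coordinates
# (e.g. "short"/"long") and merge adjacent classes.
for center in km.cluster_centers_:
    print("class center (duration, cpu, mem):", center)
```

On data like this, k-means separates the short, small tasks from the few long, heavy ones, mirroring insight (c) above.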


International Conference on Cluster Computing | 2012

Characterization and Comparison of Cloud versus Grid Workloads

Sheng Di; Derrick Kondo; Walfredo Cirne

A new era of Cloud Computing has emerged, but the characteristics of Cloud load in data centers are not well understood. Yet this characterization is critical for the design of novel Cloud job and resource management systems. In this paper, we comprehensively characterize the job/task load and host load in a real-world production data center at Google Inc. We use a detailed trace of over 25 million tasks across over 12,500 hosts. We study the differences between a Google data center and other Grid/HPC systems, from the perspective of both workload (w.r.t. jobs and tasks) and host load (w.r.t. machines). In particular, we study the job length, job submission frequency, and the resource utilization of jobs in the different systems, and also investigate statistics of machines' maximum load, queue state, and relative usage levels, with different job priorities and resource attributes. We find that the Google data center exhibits finer resource allocation with respect to CPU and memory than Grid/HPC systems. Google jobs are submitted with much higher frequency and are much shorter than Grid jobs. As such, Google host load exhibits higher variance and noise.
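
One of the reported contrasts, the higher variance and noise of Google host load, can be illustrated with a simple statistic. This is a minimal sketch using invented load series, not the actual trace:

```python
# Variability of host load measured by the coefficient of variation (CV).
# A higher CV on the cloud-like series reflects the "higher variance and
# noise" finding. The load values below are invented for illustration.
import statistics

def coefficient_of_variation(load_series):
    """CV = standard deviation / mean; a dimensionless variability measure."""
    return statistics.stdev(load_series) / statistics.fmean(load_series)

cloud_host_load = [0.1, 0.9, 0.2, 0.8, 0.15, 0.95]   # bursty, short jobs
grid_host_load  = [0.6, 0.65, 0.7, 0.68, 0.62, 0.66] # long, steady jobs

print("cloud CV:", round(coefficient_of_variation(cloud_host_load), 2))
print("grid  CV:", round(coefficient_of_variation(grid_host_load), 2))
```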


IEEE International Conference on High Performance Computing, Data, and Analytics | 2012

Host load prediction in a Google compute cloud with a Bayesian model

Sheng Di; Derrick Kondo; Walfredo Cirne

Prediction of host load in Cloud systems is critical for achieving service-level agreements. However, accurate prediction of host load in Clouds is extremely challenging because it fluctuates drastically at small timescales. We design a prediction method based on a Bayes model to predict the mean load over a long-term time interval, as well as the mean load in consecutive future time intervals. We identify novel predictive features of host load that capture the expectation, predictability, trends, and patterns of host load. We also determine the most effective combinations of these features for prediction. We evaluate our method using a detailed one-month trace of a Google data center with thousands of machines. Experiments show that the Bayes method achieves high accuracy, with a mean squared error of 0.0014. Moreover, the Bayes method improves the load prediction accuracy by 5.6%-50% compared to other state-of-the-art methods based on moving averages, auto-regression, and/or noise filters.
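
As a rough illustration of the flavor of this approach (not the paper's actual model or its predictive features), the sketch below discretizes load into a few states, applies Bayes' rule with Laplace smoothing to an invented evidence feature, and predicts the posterior-mean load:

```python
# Hedged sketch of Bayes-rule load prediction: learn a prior over next-
# interval load states and a likelihood of the observed evidence, then
# predict the posterior mean. States, features, and data are illustrative.
from collections import Counter, defaultdict

STATE_LOADS = {"low": 0.2, "mid": 0.5, "high": 0.8}  # representative loads

def fit(observations):
    """observations: (evidence, next_state) pairs from a historical trace."""
    prior = Counter(state for _, state in observations)
    likelihood = defaultdict(Counter)
    for evidence, state in observations:
        likelihood[state][evidence] += 1
    return prior, likelihood

def predict_mean_load(prior, likelihood, evidence, n_evidence=3, alpha=1.0):
    """Posterior-mean load via Bayes rule, P(s|e) proportional to
    P(e|s) * P(s), with Laplace smoothing over the evidence categories."""
    total = sum(prior.values())
    scores = {}
    for state in STATE_LOADS:
        p_state = prior[state] / total
        seen = sum(likelihood[state].values())
        p_evid = (likelihood[state][evidence] + alpha) / (seen + alpha * n_evidence)
        scores[state] = p_evid * p_state
    z = sum(scores.values())
    return sum(STATE_LOADS[s] * p / z for s, p in scores.items())

observations = [("rising", "high"), ("rising", "high"), ("flat", "mid"),
                ("falling", "low"), ("flat", "mid"), ("rising", "mid")]
prior, likelihood = fit(observations)
print(predict_mean_load(prior, likelihood, "rising"))  # skews toward high load
```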


Journal of Parallel and Distributed Computing | 2014

Google hostload prediction based on Bayesian model with optimized feature combination

Sheng Di; Derrick Kondo; Walfredo Cirne

We design a novel prediction method with a Bayes model to predict a load fluctuation pattern over a long-term interval, in the context of Google data centers. We exploit a set of features that capture the expectation, trend, stability, and patterns of recent host loads. We also investigate the correlations among these features and explore the most effective combinations of features with various training periods. All of the prediction methods are evaluated using a Google trace with 10,000+ heterogeneous hosts. Experiments show that our Bayes method improves the long-term load prediction accuracy by 5.6%-50% compared to other state-of-the-art methods based on moving average, auto-regression, and/or noise filters. The mean squared error of pattern prediction with the Bayes method can be limited to approximately the range [10^-8, 10^-5]. Through a load balancing scenario, we confirm that the precision of pattern prediction in finding a set of idlest/busiest hosts from among 10,000+ hosts can be improved by about 7% on average.

Highlights: We devise an exponentially segmented pattern model for the hostload prediction. We devise a Bayes method and exploit 10 features to find the best-fit combination. We evaluate the Bayes method and 8 other well-known load prediction methods. The experiment is based on a Google trace with over 10k hosts and millions of jobs. The pattern prediction with the Bayes method has much higher precision than the others.
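
The exhaustive search over feature combinations described here is straightforward to sketch. In the following illustration, the feature names (recent_mean, trend) and the trivial stand-in predictor are invented, not the paper's ten features or its Bayes predictor; the search simply keeps the subset with the lowest held-out mean squared error:

```python
# Sketch of a feature-combination search: try every subset of candidate
# features and keep the one minimizing mean squared error on test rows.
from itertools import combinations

def mse(predict, rows):
    return sum((predict(r) - r["next_load"]) ** 2 for r in rows) / len(rows)

def best_combination(features, train_fn, train_rows, test_rows):
    best = (float("inf"), None)
    for k in range(1, len(features) + 1):
        for combo in combinations(features, k):
            predictor = train_fn(combo, train_rows)
            best = min(best, (mse(predictor, test_rows), combo))
    return best

# Trivial stand-in predictor: the average of the chosen features' values.
def train_mean(combo, rows):
    return lambda r: sum(r[f] for f in combo) / len(combo)

rows = [{"recent_mean": 0.5, "trend": 0.6, "next_load": 0.55},
        {"recent_mean": 0.8, "trend": 0.7, "next_load": 0.78},
        {"recent_mean": 0.2, "trend": 0.1, "next_load": 0.18}]
# Toy usage: same rows for train and test; a real study would hold out data.
err, combo = best_combination(["recent_mean", "trend"], train_mean, rows, rows)
print("best feature combination:", combo, "MSE:", round(err, 4))
```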


Symposium on Cloud Computing | 2014

Long-term SLOs for reclaimed cloud computing resources

Marcus Carvalho; Walfredo Cirne; Francisco Vilar Brasileiro; John Wilkes

The elasticity promised by cloud computing does not come for free. Providers need to reserve resources to allow users to scale on demand and to cope with workload variations, which results in low utilization. The current response to this low utilization is to re-sell unused resources with no Service Level Objectives (SLOs) for availability. In this paper, we show how to make some of these reclaimable resources more valuable by providing strong, long-term availability SLOs for them. These SLOs are based on forecasts of how many resources will remain unused during multi-month periods, so users can do capacity planning for their long-running services. By using confidence levels for the predictions, we give service providers control over the risk of violating the availability SLOs, and allow them to trade increased risk for more resources to make available. We evaluated our approach using 45 months of workload data from 6 production clusters at Google, and show that 6-17% of the resources can be re-offered with a long-term availability of 98.9% or better. A conservative analysis shows that doing so may increase the profitability of selling reclaimed resources by 22-60%.
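
The core idea, offering only as much reclaimed capacity as history suggests will remain unused at the chosen confidence level, can be sketched as a simple quantile forecast. This is an illustrative simplification with invented numbers, not the paper's forecasting model:

```python
# Forecast reclaimable capacity as a low quantile of historically unused
# resources, so the chance of having less than the offered amount stays
# below the chosen risk level. Trace values below are invented.

def reclaimable_capacity(unused_history, confidence):
    """Capacity offerable with ~`confidence` probability of being available:
    the (1 - confidence) quantile of observed unused capacity."""
    ordered = sorted(unused_history)
    index = int((1.0 - confidence) * (len(ordered) - 1))
    return ordered[index]

unused = [120, 95, 130, 110, 80, 140, 100, 90, 125, 105]  # cores unused/day
print(reclaimable_capacity(unused, 0.989))  # conservative offer -> 80 cores
print(reclaimable_capacity(unused, 0.80))   # riskier, larger offer -> 90
```

Lowering the confidence frees more capacity to sell, which is exactly the risk-for-resources trade the abstract describes.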


Job Scheduling Strategies for Parallel Processing | 2012

Web-Scale Job Scheduling

Walfredo Cirne; Eitan Frachtenberg

Web datacenters and clusters can be larger than the world’s largest supercomputers, and run workloads that are at least as heterogeneous and complex as their high-performance computing counterparts. And yet little is known about the unique job scheduling challenges of these environments. This article aims to ameliorate this situation. It discusses the challenges of running web infrastructure and describes several techniques to address them. It also presents some of the problems that remain open in the field.


IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing | 2015

An Availability-on-Demand Mechanism for Datacenters

Siqi Shen; Alexandru Iosup; Assaf Israel; Walfredo Cirne; Danny Raz; Dick H. J. Epema

Datacenters are at the core of a wide variety of daily ICT utilities, ranging from scientific computing to online gaming. Due to the scale of today's datacenters, the failure of computing resources is a common occurrence that may disrupt the availability of ICT services, leading to revenue loss. Although many high availability (HA) techniques have been proposed to mask resource failures, datacenter users -- who rent datacenter resources and use them to provide ICT utilities to a global population -- still have limited management options for dynamically selecting and configuring HA techniques. In this work, we propose Availability-on-Demand (AoD), a mechanism consisting of an API that allows datacenter users to specify availability requirements which can dynamically change, and an availability-aware scheduler that dynamically manages computing resources based on user-specified requirements. The mechanism operates at the level of individual service instances, thus enabling fine-grained control of availability, for example during sudden requirement changes and periodic operations. Through realistic, trace-based simulations, we show that the AoD mechanism can achieve high availability at low cost. The AoD approach consumes about the same CPU hours as, but achieves higher availability than, approaches that apply HA techniques randomly. Moreover, compared to an ideal approach with perfect predictions about failures, it consumes 13% to 31% more CPU hours but achieves similar availability for critical parts of applications.
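
To make the shape of such a mechanism concrete, here is a hypothetical sketch of an AoD-style interface. The names (AvailabilityRequirement, set_requirement, ha_technique) and the three-tier policy are invented for illustration and are not the paper's actual API:

```python
# Hypothetical AoD-style interface: per-instance availability requirements
# that can change at runtime, consulted by an availability-aware scheduler
# to pick an HA technique. Names and thresholds are invented.
from dataclasses import dataclass

@dataclass
class AvailabilityRequirement:
    instance_id: str
    target_availability: float  # e.g. 0.999 for a critical component

class AvailabilityAwareScheduler:
    def __init__(self):
        self.requirements = {}

    def set_requirement(self, req: AvailabilityRequirement):
        """Requirements may change dynamically, e.g. before peak hours."""
        self.requirements[req.instance_id] = req

    def ha_technique(self, instance_id: str) -> str:
        req = self.requirements.get(instance_id)
        if req is None or req.target_availability < 0.99:
            return "none"           # cheap: just restart on failure
        if req.target_availability < 0.999:
            return "checkpointing"  # medium cost
        return "hot-standby"        # costly: extra replica kept running

sched = AvailabilityAwareScheduler()
sched.set_requirement(AvailabilityRequirement("frontend-1", 0.9995))
print(sched.ha_technique("frontend-1"))  # -> hot-standby
```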


Journal of Internet Services and Applications | 2011

Perspectives on cloud computing: interviews with five leading scientists from the cloud community

Gordon S. Blair; Fabio Kon; Walfredo Cirne; Dejan S. Milojicic; Raghu Ramakrishnan; Daniel A. Reed; Dilma Da Silva

Cloud computing is currently one of the major topics in distributed systems, with large numbers of papers being written on the topic, major players in the industry releasing a range of software platforms offering novel Internet-based services and, most importantly, evidence of real impact on end-user communities in terms of approaches to provisioning software services. Cloud computing, though, is at a formative stage, and the hype surrounding the area makes it difficult to see the true contribution and impact of the topic.

Cloud computing is a central topic for the Journal of Internet Services and Applications (JISA), and indeed the most downloaded paper from the first year of JISA is concerned with the state of the art and research challenges related to cloud computing. The Editors-in-Chief, Fabio Kon and Gordon Blair, therefore felt it was timely to seek clarification on the key issues around cloud computing and hence invited five leading scientists from industrial organizations central to cloud computing to answer a series of questions on the topic.


Job Scheduling Strategies for Parallel Processing | 2015

Open Issues in Cloud Resource Management

Narayan Desai; Walfredo Cirne

Cloud computing seeks to provide a global computing utility service to a broad base of users. Resource management and scheduling are prime challenges in building such a service. In this paper, we first provide an overview of clouds from a resource management and scheduling perspective. We then discuss key challenges in these areas, describing prime opportunities for new research.


Archive | 2010

System and Method of Active Risk Management to Reduce Job De-scheduling Probability in Computer Clusters

Geeta Chaudhry; Walfredo Cirne; Scott David Johnson
