Publication


Featured research published by Joseph L. Hellerstein.


Integrated Network Management | 2001

Using control theory to achieve service level objectives in performance management

Sujay Parekh; Neha Gandhi; Joseph L. Hellerstein; Dawn M. Tilbury; T. S. Jayram; Joseph Phillip Bigus

A widely used approach to achieving service level objectives for a software system (e.g., an email server) is to add a controller that manipulates the target system's tuning parameters. We describe a methodology for designing such controllers for software systems that builds on classical control theory. The classical approach proceeds in two steps: system identification and controller design. In system identification, we construct mathematical models of the target system. Traditionally, this has been based on a first-principles approach, using detailed knowledge of the target system. Such models can be complex and difficult to build, validate, use, and maintain. In our methodology, a statistical (ARMA) model is fit to historical measurements of the target being controlled. These models are easier to obtain and use, and they allow us to apply control-theoretic design techniques to a larger class of systems. When applied to a Lotus Notes groupware server, we obtain model fits with R² no lower than 75% and as high as 98%. In controller design, an analysis of the models leads to a controller that will achieve the service level objectives. We report on an analysis of a closed-loop system using an integral control law with Lotus Notes as the target. The objective is to maintain a reference queue length. Using root-locus analysis from control theory, we are able to predict the occurrence (or absence) of controller-induced oscillations in the system's response. Such oscillations are undesirable since they increase variability, thereby resulting in a failure to meet the service level objective. We implement this controller for a real Lotus Notes system and observe a remarkable correspondence between the behavior of the real system and the predictions of the analysis. This indicates that the control-theoretic analysis is sufficient to select controller parameters that meet the desired goals, reducing the need for simulations.
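
The following is a minimal sketch, not the authors' code, of the two-step methodology: fit a low-order statistical model to historical measurements, then check the closed-loop poles of an integral controller. The data, parameter values, and the first-order ARX simplification are illustrative assumptions.

import numpy as np

def fit_first_order_arx(y, u):
    """Least-squares fit of y(k+1) = a*y(k) + b*u(k) from measured series."""
    X = np.column_stack([y[:-1], u[:-1]])
    a, b = np.linalg.lstsq(X, y[1:], rcond=None)[0]
    return a, b

def closed_loop_poles(a, b, ki):
    """Poles of the integral-control loop u(k+1) = u(k) + ki*(r - y(k));
    magnitudes below 1 mean a stable loop, complex poles predict oscillation."""
    # State [y(k), u(k)] with r = 0:  y(k+1) = a*y(k) + b*u(k),  u(k+1) = -ki*y(k) + u(k)
    A_cl = np.array([[a, b], [-ki, 1.0]])
    return np.linalg.eigvals(A_cl)

# Hypothetical usage with synthetic measurements of a queue length (y)
# driven by a tuning parameter such as a maximum-users setting (u).
rng = np.random.default_rng(0)
u = rng.uniform(10, 50, size=200)
y = np.zeros(200)
for k in range(199):
    y[k + 1] = 0.8 * y[k] + 0.5 * u[k] + rng.normal(scale=0.5)

a, b = fit_first_order_arx(y, u)
for ki in (0.05, 0.5):
    print(f"Ki={ki}: poles={np.round(closed_loop_poles(a, b, ki), 3)}")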


Measurement and Modeling of Computer Systems | 2010

Towards characterizing cloud backend workloads: insights from Google compute clusters

Asit K. Mishra; Joseph L. Hellerstein; Walfredo Cirne; Chita R. Das

The advent of cloud computing promises highly available, efficient, and flexible computing services for applications such as web search, email, voice over IP, and web search alerts. Our experience at Google is that realizing the promises of cloud computing requires an extremely scalable backend consisting of many large compute clusters that are shared by application tasks with diverse service level requirements for throughput, latency, and jitter. These considerations impact (a) capacity planning to determine which machine resources must grow and by how much and (b) task scheduling to achieve high machine utilization and to meet service level objectives. Both capacity planning and task scheduling require a good understanding of task resource consumption (e.g., CPU and memory usage). This in turn demands simple and accurate approaches to workload classification: determining how to form groups of tasks (workloads) with similar resource demands. One approach to workload classification is to make each task its own workload. However, this approach scales poorly since tens of thousands of tasks execute daily on Google compute clusters. Another approach to workload classification is to view all tasks as belonging to a single workload. Unfortunately, applying such a coarse-grain workload classification to the diversity of tasks running on Google compute clusters results in large variances in predicted resource consumption. This paper describes an approach to workload classification and its application to the Google Cloud Backend, arguably the largest cloud backend on the planet. Our methodology for workload classification consists of: (1) identifying the workload dimensions; (2) constructing task classes using an off-the-shelf algorithm such as k-means; (3) determining the break points for qualitative coordinates within the workload dimensions; and (4) merging adjacent task classes to reduce the number of workloads. We use the foregoing, especially the notion of qualitative coordinates, to glean several insights about the Google Cloud Backend: (a) the duration of task executions is bimodal in that tasks either have a short duration or a long duration; (b) most tasks have short durations; and (c) most resources are consumed by a few tasks with long duration that have large demands for CPU and memory.
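
Below is a minimal sketch of steps (1) through (3) of the methodology using synthetic task data; the dimension names, cluster count, and median break points are illustrative assumptions, not the paper's actual settings.

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
# Hypothetical per-task measurements: duration (hours), mean CPU (cores),
# mean memory (GB). Real inputs would come from cluster accounting logs.
short_tasks = rng.exponential([0.2, 0.5, 1.0], size=(900, 3))
long_tasks = np.abs(rng.normal([24.0, 4.0, 16.0], [4.0, 1.0, 4.0], size=(100, 3)))
tasks = np.vstack([short_tasks, long_tasks])

# Step 2: off-the-shelf clustering on the chosen workload dimensions.
km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(np.log1p(tasks))

# Step 3 (sketch): assign qualitative coordinates (small/large per dimension)
# by comparing cluster centers against break points such as the median.
breaks = np.median(np.log1p(tasks), axis=0)
for i, center in enumerate(km.cluster_centers_):
    labels = ["large" if c > b else "small" for c, b in zip(center, breaks)]
    share = np.mean(km.labels_ == i)
    print(f"class {i}: duration={labels[0]}, cpu={labels[1]}, "
          f"mem={labels[2]}, share={share:.0%}")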


International Conference on Data Engineering | 2001

Mining partially periodic event patterns with unknown periods

Sheng Ma; Joseph L. Hellerstein

Periodic behavior is common in real-world applications. However, in many cases, periodicities are partial in that they are present only intermittently. We study such intermittent patterns, which we refer to as p-patterns. The formulation of p-patterns takes into account imprecise time information (e.g., due to unsynchronized clocks in distributed environments), noisy data (e.g., due to extraneous events), and shifts in phase and/or periods. We structure mining for p-patterns as two sub-tasks: (1) finding the periods of p-patterns and (2) mining temporal associations. For (2), a level-wise algorithm is used. For (1), we develop a novel approach based on a chi-squared test, and study its performance in the presence of noise. Further, we develop two algorithms for mining p-patterns based on the order in which the aforementioned sub-tasks are performed: the period-first algorithm and the association-first algorithm. Our results show that the association-first algorithm has a higher tolerance to noise, while the period-first algorithm is more computationally efficient and provides flexibility in the specification of support levels. In addition, we apply the period-first algorithm to mining data collected from two production computer networks, a process that led to several actionable insights.
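
A minimal sketch of the period-finding idea, simplified relative to the paper's algorithm: for each candidate period p, count inter-arrival times that fall within p +/- delta and compare that count, via a chi-squared statistic, with what random event timing would produce. The tolerance, threshold, and synthetic data are assumptions for illustration.

import numpy as np

def candidate_periods(timestamps, delta=0.5, threshold=3.84):
    """Return (period, score) pairs whose chi-squared score exceeds `threshold`
    (3.84 is the 95th percentile of chi-squared with 1 degree of freedom)."""
    t = np.sort(np.asarray(timestamps, dtype=float))
    gaps = np.diff(t)
    n, span = len(gaps), t[-1] - t[0]
    mean_gap = span / n
    periods = []
    for p in np.arange(delta, gaps.max() + delta, delta):
        observed = np.sum(np.abs(gaps - p) <= delta)
        # Under random timing, gaps are roughly exponential with mean span/n;
        # expected number of gaps falling in [p - delta, p + delta]:
        prob = np.exp(-max(p - delta, 0) / mean_gap) - np.exp(-(p + delta) / mean_gap)
        expected = max(n * prob, 1e-9)
        chi2 = (observed - expected) ** 2 / expected
        if observed > expected and chi2 > threshold:
            periods.append((p, chi2))
    return periods

# Hypothetical usage: a noisy 10-second period with extraneous events mixed in.
rng = np.random.default_rng(2)
periodic = np.cumsum(rng.normal(10.0, 0.3, size=100))
noise = rng.uniform(0, periodic[-1], size=30)
print(candidate_periods(np.concatenate([periodic, noise]), delta=0.5)[:5])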


Network Operations and Management Symposium | 2004

The CHAMPS system: change management with planning and scheduling

Alexander Keller; Joseph L. Hellerstein; Joel L. Wolf; Kun-Lung Wu; Vijaya Krishnan

Change management is a process by which IT systems are modified to accommodate considerations such as software fixes, hardware upgrades, and performance enhancements. This paper discusses the CHAMPS system, a prototype under development at IBM Research for Change Management with Planning and Scheduling. The CHAMPS system is able to achieve a very high degree of parallelism for a set of tasks by exploiting detailed factual knowledge about the structure of a distributed system from dependency information at runtime. In contrast, today's systems expect an administrator to provide such insights, which is often not the case. Furthermore, the optimization techniques we employ allow the CHAMPS system to produce a very high-quality solution to a mathematically intractable problem in time that scales well with the problem size. We have implemented the CHAMPS system and have applied it in a TPC-W environment that implements an online bookstore application.
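
The following is an illustrative sketch, not the CHAMPS planner, of how dependency information enables parallelism: tasks with no ordering constraint between them are grouped into the same "wave" and can run concurrently. The change-plan task names are made up.

from collections import defaultdict, deque

def parallel_waves(dependencies):
    """dependencies: dict mapping task -> list of tasks it depends on.
    Returns a list of waves; tasks within a wave have no mutual dependencies."""
    indegree = {t: len(deps) for t, deps in dependencies.items()}
    dependents = defaultdict(list)
    for task, deps in dependencies.items():
        for d in deps:
            dependents[d].append(task)
    ready = deque(t for t, deg in indegree.items() if deg == 0)
    waves = []
    while ready:
        wave = list(ready)
        ready.clear()
        waves.append(wave)
        for task in wave:
            for nxt in dependents[task]:
                indegree[nxt] -= 1
                if indegree[nxt] == 0:
                    ready.append(nxt)
    return waves

# Hypothetical change plan for an online-store upgrade:
plan = {
    "stop_web_tier": [],
    "stop_app_tier": [],
    "upgrade_db_schema": ["stop_app_tier"],
    "upgrade_app_binaries": ["stop_app_tier"],
    "start_app_tier": ["upgrade_db_schema", "upgrade_app_binaries"],
    "start_web_tier": ["stop_web_tier", "start_app_tier"],
}
for i, wave in enumerate(parallel_waves(plan)):
    print(f"wave {i}: {wave}")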


American Control Conference | 2002

MIMO control of an Apache web server: modeling and controller design

Neha Gandhi; Dawn M. Tilbury; Yixin Diao; Joseph L. Hellerstein; Sujay Parekh

This paper considers the efficacy of feedback control in improving the performance of computing systems. Computing systems typically have many competing performance goals which are affected by several external variables. A feedback control strategy is desirable because well-established techniques exist to handle these performance trade-offs and external disturbances. In order to employ such a strategy, decisions need to be made about the inputs, outputs, sample time, model type, and performance measures. This paper describes this process, which is often nebulous for computing systems, in the context of an Apache web server. A linear multi-input multi-output model of the system is identified experimentally and used to design several feedback controllers. Experimental results are presented showing the problems associated with a pure pole-placement design and the effectiveness of LQ-control-based techniques. The paper concludes with a discussion of future work.
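
Below is a minimal sketch of LQ controller design for a 2-input, 2-output linear model of the form x(k+1) = A x(k) + B u(k); the numerical matrices and weights are illustrative assumptions, not the model identified in the paper.

import numpy as np
from scipy.linalg import solve_discrete_are

A = np.array([[0.54, 0.11],
              [0.03, 0.98]])        # assumed dynamics matrix
B = np.array([[ 0.0074, -0.0021],
              [-0.0006,  0.0012]])  # assumed input matrix

# LQ weights trade off regulation error (Q) against actuation effort (R).
Q = np.diag([1.0, 1.0])
R = np.diag([0.1, 0.1])

P = solve_discrete_are(A, B, Q, R)
K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)   # u(k) = -K x(k)

print("state-feedback gain K =\n", np.round(K, 3))
print("closed-loop pole magnitudes:",
      np.round(np.abs(np.linalg.eigvals(A - B @ K)), 3))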


Symposium on Cloud Computing | 2011

Modeling and synthesizing task placement constraints in Google compute clusters

Bikash Sharma; Victor Chudnovsky; Joseph L. Hellerstein; Rasekh Rifaat; Chita R. Das

Evaluating the performance of large compute clusters requires benchmarks with representative workloads. At Google, performance benchmarks are used to obtain performance metrics such as task scheduling delays and machine resource utilizations to assess changes in application codes, machine configurations, and scheduling algorithms. Existing approaches to workload characterization for high performance computing and grids focus on task resource requirements for CPU, memory, disk, I/O, network, etc. Such resource requirements address how much resource is consumed by a task. However, in addition to resource requirements, Google workloads commonly include task placement constraints that determine which machine resources are consumed by tasks. Task placement constraints arise because of task dependencies such as those related to hardware architecture and kernel version. This paper develops methodologies for incorporating task placement constraints and machine properties into performance benchmarks of large compute clusters. Our studies of Google compute clusters show that constraints increase average task scheduling delays by a factor of 2 to 6, which often results in tens of minutes of additional task wait time. To understand why, we extend the concept of resource utilization to include constraints by introducing a new metric, the Utilization Multiplier (UM). UM is the ratio of the resource utilization seen by tasks with a constraint to the average utilization of the resource. UM provides a simple model of the performance impact of constraints in that task scheduling delays increase with UM. Last, we describe how to synthesize representative task constraints and machine properties, and how to incorporate this synthesis into existing performance benchmarks. Using synthetic task constraints and machine properties generated by our methodology, we accurately reproduce performance metrics for benchmarks of Google compute clusters with a discrepancy of only 13% in task scheduling delay and 5% in resource utilization.
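
A minimal sketch of the Utilization Multiplier as defined above: the utilization of a resource on the machines that satisfy a constraint, divided by the average utilization of that resource across all machines. The machine records and the kernel-version constraint are hypothetical.

def utilization_multiplier(machines, constraint, resource):
    """machines: list of dicts with 'properties', '<resource>_used', and
    '<resource>_capacity'. constraint: (property, value) a task requires."""
    def util(ms):
        used = sum(m[f"{resource}_used"] for m in ms)
        cap = sum(m[f"{resource}_capacity"] for m in ms)
        return used / cap
    prop, value = constraint
    eligible = [m for m in machines if m["properties"].get(prop) == value]
    return util(eligible) / util(machines)

# Hypothetical cluster: few machines have the newer kernel, and they are busy.
machines = [
    {"properties": {"kernel": "new"}, "cpu_used": 90, "cpu_capacity": 100},
    {"properties": {"kernel": "old"}, "cpu_used": 30, "cpu_capacity": 100},
    {"properties": {"kernel": "old"}, "cpu_used": 40, "cpu_capacity": 100},
    {"properties": {"kernel": "old"}, "cpu_used": 20, "cpu_capacity": 100},
]
um = utilization_multiplier(machines, ("kernel", "new"), "cpu")
print(f"UM = {um:.2f}")   # UM > 1 predicts longer scheduling delays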


International Conference on Autonomic Computing | 2012

Dynamic energy-aware capacity provisioning for cloud computing environments

Qi Zhang; Mohamed Faten Zhani; Shuo Zhang; Quanyan Zhu; Raouf Boutaba; Joseph L. Hellerstein

Data centers have recently gained significant popularity as a cost-effective platform for hosting large-scale service applications. While large data centers enjoy economies of scale by amortizing initial capital investment over a large number of machines, they also incur tremendous energy cost in terms of power distribution and cooling. An effective approach for saving energy in data centers is to adjust the data center capacity dynamically by turning off unused machines. However, this dynamic capacity provisioning problem is known to be challenging, as it requires a careful understanding of the resource demand characteristics as well as consideration of various cost factors, including task scheduling delay, machine reconfiguration cost, and electricity price fluctuation. In this paper, we provide a control-theoretic solution to the dynamic capacity provisioning problem that minimizes the total energy cost while meeting the performance objective in terms of task scheduling delay. Specifically, we model this problem as a constrained discrete-time optimal control problem, and use Model Predictive Control (MPC) to find the optimal control policy. Through extensive analysis and simulation using real workload traces from Google's compute clusters, we show that our proposed framework can achieve significant reduction in energy cost, while maintaining an acceptable average scheduling delay for individual tasks.
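
Below is a deliberately simplified sketch of the receding-horizon idea behind MPC-based capacity provisioning: at each step, search over short sequences of machine-count adjustments, score them against a demand and price forecast, apply only the first adjustment, then repeat. The cost model, forecasts, and parameter values are hypothetical, and the brute-force search stands in for the constrained optimal control formulation used in the paper.

import itertools

def mpc_step(current_machines, demand_forecast, price_forecast,
             deltas=(-20, -10, 0, 10, 20),
             energy_per_machine=1.0, switch_cost=2.0, delay_penalty=50.0,
             tasks_per_machine=10):
    best_cost, best_first = float("inf"), 0
    horizon = len(demand_forecast)
    for seq in itertools.product(deltas, repeat=horizon):
        m, cost = current_machines, 0.0
        for delta, demand, price in zip(seq, demand_forecast, price_forecast):
            m = max(m + delta, 0)
            cost += energy_per_machine * price * m        # electricity
            cost += switch_cost * abs(delta)              # reconfiguration
            overflow = max(demand - m * tasks_per_machine, 0)
            cost += delay_penalty * overflow              # scheduling delay
        if cost < best_cost:
            best_cost, best_first = cost, seq[0]
    return current_machines + best_first

# Hypothetical usage: rising demand and a dip in the electricity price.
machines = 100
demand = [900, 1100, 1400, 1300, 1000]
price = [1.0, 1.0, 0.8, 0.9, 1.1]
for t in range(3):
    machines = mpc_step(machines, demand[t:t + 3], price[t:t + 3])
    print(f"t={t}: run {machines} machines for demand {demand[t]}")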


IBM Systems Journal | 2002

Predictive algorithms in the management of computer systems

Ricardo Vilalta; Chidanand Apte; Joseph L. Hellerstein; Sheng Ma; Sholom M. Weiss

Predictive algorithms play a crucial role in systems management by alerting the user to potential failures. We report on three case studies dealing with the prediction of failures in computer systems: (1) long-term prediction of performance variables (e.g., disk utilization), (2) short-term prediction of abnormal behavior (e.g., threshold violations), and (3) short-term prediction of system events (e.g., router failure). Empirical results show that predictive algorithms can be successfully employed in the estimation of performance variables and the prediction of critical events.
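
The following is a minimal sketch of case (1), long-term prediction of a performance variable: fit a trend plus weekly seasonality to daily disk utilization and report when the extrapolation first crosses a threshold. The data, the linear-trend model, and the 90% threshold are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(3)
days = np.arange(120)
disk_util = 40 + 0.25 * days + 3 * np.sin(2 * np.pi * days / 7) \
            + rng.normal(scale=1.5, size=days.size)   # percent used

# Design matrix: intercept, linear trend, weekly sine/cosine terms.
X = np.column_stack([np.ones_like(days), days,
                     np.sin(2 * np.pi * days / 7),
                     np.cos(2 * np.pi * days / 7)])
coef, *_ = np.linalg.lstsq(X, disk_util, rcond=None)

future = np.arange(120, 480)
Xf = np.column_stack([np.ones_like(future), future,
                      np.sin(2 * np.pi * future / 7),
                      np.cos(2 * np.pi * future / 7)])
forecast = Xf @ coef
crossing = future[forecast >= 90]
print("disk predicted to exceed 90% on day",
      int(crossing[0]) if crossing.size else "beyond horizon")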


IEEE Journal on Selected Areas in Communications | 2005

A control theory foundation for self-managing computing systems

Yixin Diao; Joseph L. Hellerstein; Sujay Parekh; Rean Griffith; Gail E. Kaiser; Dan B. Phung

The high cost of operating large computing installations has motivated a broad interest in reducing the need for human intervention by making systems self-managing. This paper explores the extent to which control theory can provide an architectural and analytic foundation for building self-managing systems. Control theory provides a rich set of methodologies for building automated self-diagnosis and self-repairing systems with properties such as stability, short settling times, and accurate regulation. However, there are challenges in applying control theory to computing systems, such as developing effective resource models, handling sensor delays, and addressing lead times in effector actions. We propose a deployable testbed for autonomic computing (DTAC) that we believe will reduce the barriers to addressing research problems in applying control theory to computing systems. The initial DTAC architecture is described along with several problems that it can be used to investigate.


Computer Networks | 2001

A statistical approach to predictive detection

Joseph L. Hellerstein; Fan Zhang; Perwez Shahabuddin

Service providers typically define quality of service problems using threshold tests, such as “Are HTTP operations greater than 12 per second on server XYZ?” Herein, we estimate the probability of threshold violations for specific times in the future. We model the threshold metric (e.g., HTTP operations per second) at two levels: (1) non-stationary behavior (as is done in workload forecasting for capacity planning) and (2) stationary, time-serial dependencies. Our approach is assessed using simulation experiments and measurements of a production Web server. For both assessments, the probabilities of threshold violations produced by our approach lie well within two standard deviations of the measured fraction of threshold violations.
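
A minimal sketch of the two-level idea under simplifying Gaussian assumptions: model the hour-of-day profile (the non-stationary part), fit an AR(1) to the residuals (the time-serial dependence), and compute the probability that the metric exceeds a threshold some hours ahead. The synthetic series and the AR(1) choice are illustrative, not the paper's exact models.

import numpy as np
from math import erf, sqrt

def violation_probability(series, hours, threshold, horizon):
    series, hours = np.asarray(series, float), np.asarray(hours)
    # Level 1: non-stationary hour-of-day means.
    profile = np.array([series[hours == h].mean() for h in range(24)])
    resid = series - profile[hours]
    # Level 2: AR(1) fit to residuals, r(t+1) = phi * r(t) + eps.
    phi = np.dot(resid[:-1], resid[1:]) / np.dot(resid[:-1], resid[:-1])
    sigma2 = np.var(resid[1:] - phi * resid[:-1])
    # Gaussian predictive distribution of the metric `horizon` steps ahead.
    mean = (phi ** horizon) * resid[-1] + profile[(hours[-1] + horizon) % 24]
    var = sigma2 * sum(phi ** (2 * i) for i in range(horizon))
    z = (threshold - mean) / sqrt(var)
    return 0.5 * (1 - erf(z / sqrt(2)))   # P(value > threshold)

# Hypothetical usage with a synthetic HTTP-operations-per-second series.
rng = np.random.default_rng(4)
hours = np.arange(24 * 30) % 24
base = 6 + 4 * np.sin(2 * np.pi * hours / 24)
noise = np.zeros(base.size)
for t in range(1, base.size):
    noise[t] = 0.7 * noise[t - 1] + rng.normal(scale=1.0)
series = base + noise
print("P(ops > 12 in 3 hours) =",
      round(violation_probability(series, hours, 12.0, 3), 3))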
