
Publication


Featured research published by Daniel Nurmi.


European Conference on Parallel Processing | 2005

Modeling machine availability in enterprise and wide-area distributed computing environments

Daniel Nurmi; John Brevik; Richard Wolski

In this paper, we consider the problem of modeling machine availability in enterprise-area and wide-area distributed computing settings. Using availability data gathered from three different environments, we assess the suitability of four potential statistical distributions for each data set: exponential, Pareto, Weibull, and hyperexponential. In each case, we use software we have developed to determine the necessary parameters automatically from each data collection. To gauge suitability, we present both graphical and statistical evaluations of the accuracy with which each distribution fits each data set. For all three data sets, we find that a hyperexponential model fits slightly more accurately than a Weibull, but that both are substantially better choices than either an exponential or a Pareto. These results indicate that either a hyperexponential or a Weibull model effectively represents machine availability in enterprise and Internet computing environments.
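The model-selection step the abstract describes can be sketched with standard tools. The example below is illustrative only (it is not the authors' software): it fits two of the candidate distributions to synthetic availability durations by maximum likelihood and compares them by total log-likelihood.

```python
import numpy as np
from scipy import stats

# Synthetic machine-availability durations (hours); real inputs would be
# uptime traces gathered from the monitored environments.
rng = np.random.default_rng(0)
durations = rng.weibull(0.6, size=1000) * 20.0

# Maximum-likelihood fits with the location parameter fixed at zero.
weib_params = stats.weibull_min.fit(durations, floc=0)
expo_params = stats.expon.fit(durations, floc=0)

# Goodness of fit via total log-likelihood: higher is better. Because the
# exponential is the shape=1 special case of the Weibull, the Weibull fit
# can never score worse, and on heavy-tailed data it scores clearly better.
ll_weib = stats.weibull_min.logpdf(durations, *weib_params).sum()
ll_expo = stats.expon.logpdf(durations, *expo_params).sum()
```

A fuller comparison, as in the paper, would also fit Pareto and hyperexponential models and add graphical checks (e.g. Q-Q plots) alongside the likelihood scores.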


ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming | 2006

Predicting bounds on queuing delay for batch-scheduled parallel machines

John Brevik; Daniel Nurmi; Richard Wolski

Most space-sharing parallel computers presently operated by high-performance computing centers use batch-queuing systems to manage processor allocation. In many cases, users wishing to use these batch-queued resources have accounts at multiple sites and have the option of choosing at which site or sites to submit a parallel job. In such a situation, the amount of time a user's job will wait in any one batch queue can significantly impact the overall time a user waits from job submission to job completion. In this work, we explore a new method for providing end-users with predictions of the bounds on the queuing delay individual jobs will experience. We evaluate this method using batch scheduler logs for distributed-memory parallel machines that cover a 9-year period at 7 large HPC centers. Our results show that it is possible to predict delay bounds reliably for jobs in different queues, and for jobs requesting different ranges of processor counts. Using this information, scientific application developers can intelligently decide where to submit their parallel codes in order to minimize overall turnaround time.
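One nonparametric way to get such a bound, sketched below on synthetic data (this is an illustration, not the authors' implementation): the number of historical waits that fall below the true q-quantile is Binomial(n, q), so picking an order statistic from the binomial CDF yields an upper bound on the q-quantile at a stated confidence level.

```python
import numpy as np
from scipy import stats

# Hypothetical historical queue wait times (seconds) for one queue;
# real inputs would come from batch scheduler logs.
rng = np.random.default_rng(1)
waits = np.sort(rng.lognormal(7.0, 1.5, size=500))

def quantile_upper_bound(sorted_waits, q=0.95, conf=0.95):
    """Nonparametric upper confidence bound on the q-quantile.

    The count of samples below the true q-quantile is Binomial(n, q),
    so the order statistic at index m = binom.ppf(conf, n, q) upper-
    bounds the q-quantile with confidence at least `conf`.
    """
    n = len(sorted_waits)
    m = int(stats.binom.ppf(conf, n, q))
    return sorted_waits[min(m, n - 1)]

bound = quantile_upper_bound(waits)
```

A scheduler-selection tool would compute such a bound per queue and per processor-count range, then submit to the site with the smallest predicted bound.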


Conference on High Performance Computing (Supercomputing) | 2006

Evaluation of a workflow scheduler using integrated performance modelling and batch queue wait time prediction

Daniel Nurmi; Anirban Mandal; John Brevik; Chuck Koelbel; Richard Wolski; Ken Kennedy

Large-scale distributed systems offer computational power at unprecedented levels. In the past, HPC users typically had access to relatively few individual supercomputers and, in general, would assign a one-to-one mapping of applications to machines. Modern HPC users have simultaneous access to a large number of individual machines and are beginning to make use of all of them for single-application execution cycles. One method that application developers have devised in order to take advantage of such systems is to organize an entire application execution cycle as a workflow. The scheduling of such workflows has been the topic of a great deal of research in the past few years, but although very sophisticated algorithms have been devised, a very specific aspect of these distributed systems, namely that most supercomputing resources employ batch-queue scheduling software, has been omitted from consideration, presumably because it is difficult to model accurately. In this work, we augment an existing workflow scheduler through the introduction of methods that make accurate predictions of both the performance of the application on specific hardware and the amount of time individual workflow tasks will spend waiting in batch queues. Our results show that although a workflow scheduler alone may choose correct task placement based on data locality or network connectivity, this benefit is often compromised by the fact that most jobs submitted to current systems must wait in overcommitted batch queues for a significant portion of time. However, incorporating the enhancements we describe improves workflow execution time in settings where batch queues impose significant delays on constituent workflow tasks.


Grid Computing | 2006

Fault-aware scheduling for Bag-of-Tasks applications on Desktop Grids

Cosimo Anglano; John Brevik; Massimo Canonico; Daniel Nurmi; Richard Wolski

Desktop grids have proved to be a suitable platform for the execution of bag-of-tasks applications but, being characterized by high resource volatility, require scheduling techniques able to deal effectively with resource failures and/or unplanned periods of unavailability. In this paper we present a set of fault-aware scheduling policies that, rather than just tolerating faults as traditional fault-tolerant schedulers do, exploit information concerning resource availability to improve application performance. The performance of these strategies has been compared via simulation with that attained by traditional fault-tolerant schedulers. Our results, obtained by considering a set of realistic scenarios modeled after real desktop grids, show that our approach results in better application performance and resource utilization.


International Symposium on Mixed and Augmented Reality | 2003

ARWin - a desktop augmented reality Window Manager

S. Di Verdi; Daniel Nurmi; Tobias Höllerer

We present ARWin, a single-user 3D augmented reality desktop. We explain our design considerations and system architecture, and discuss a variety of applications and interaction techniques designed to take advantage of this new platform.


International Parallel and Distributed Processing Symposium | 2008

Enabling personal clusters on demand for batch resources using commodity software

Yang-Suk Kee; Carl Kesselman; Daniel Nurmi; Richard Wolski

Providing QoS (quality of service) in batch resources against the uncertainty of resource availability due to the space-sharing nature of scheduling policies is a critical capability required for high-performance computing. This paper introduces a technique called the personal cluster, which reserves a partition of batch resources on user demand in a best-effort manner. A personal cluster provides a private cluster dedicated to the user during a user-specified time period by installing a user-level resource manager on the resource partition. This technique not only enables cost-effective resource utilization and efficient task management but also provides the user a uniform interface to heterogeneous resources regardless of the local resource management software. A prototype implementation using a PBS batch resource manager and the Globus Toolkit based on Web services shows that the overhead of instantiating a medium-sized personal cluster is small: approximately 1 minute for a personal cluster with 32 processors.


IEEE International Symposium on Workload Characterization | 2006

Predicting Bounds on Queuing Delay in Space-shared Computing Environments

John Brevik; Daniel Nurmi; Richard Wolski

Most space-sharing resources presently operated by high performance computing centers employ some sort of batch queueing system to manage resource allocation to multiple users. In this work, we explore a new method for providing end-users with predictions of the bounds on the queuing delay individual jobs will experience when waiting to be scheduled to a machine partition. We evaluate this method using scheduler logs that cover a 10-year period from 10 large HPC systems. Our results show that it is possible to predict delay bounds with specified confidence levels for jobs in different queues, and for jobs requesting different ranges of processor counts.


International Conference on Cluster Computing | 2005

Minimizing the Network Overhead of Checkpointing in Cycle-harvesting Cluster Environments

Daniel Nurmi; John Brevik; Richard Wolski

Cycle-harvesting systems such as Condor have been developed to make desktop machines in a local area (which are often similar to clusters in hardware configuration) available as a compute platform. To provide a dual-use capability, opportunistic jobs harvesting cycles from the desktop must be checkpointed before the desktop resources are reclaimed by their owners and the job is evacuated. In this paper, we investigate a new system for computing efficient checkpoint schedules in cycle-harvesting environments. Our system records the historical availability of each resource and fits a statistical model to the observations. Because checkpointing must often traverse the network (i.e., the desktop hosts do not provide sufficient persistent storage for checkpoints), we combine this model with predictions of network performance to the storage site to compute a checkpoint schedule. When an application is initiated on a particular resource, the system uses the computed distribution to parameterize a Markov state-transition model for the application's execution, evaluates the expected time and network overhead as a function of the checkpoint interval, and numerically optimizes with respect to time. We report on the performance and implementation of this system using the Condor cycle-harvesting environment at the University of Wisconsin. We also evaluate the efficiencies we achieve for a variety of network overheads using trace-based simulation. Finally, we validate our simulations against the observed performance with Condor. Our results indicate that while the choice of model distribution has a relatively small but positive effect on time efficiency, it has a substantial impact on network utilization.
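For intuition about the checkpoint-interval optimization described above: the paper numerically optimizes a Markov model parameterized by a fitted availability distribution, but for the simpler special case of exponentially distributed failures there is a well-known first-order closed form (Young's approximation). The sketch below shows that closed form only; it is not the paper's method, and the example values are illustrative.

```python
import math

def optimal_checkpoint_interval(mttf_s, checkpoint_cost_s):
    """Young's first-order approximation of the optimal checkpoint
    interval, valid when failures are exponentially distributed with
    mean time to failure `mttf_s` and each checkpoint costs
    `checkpoint_cost_s` (both in seconds): sqrt(2 * C * MTTF).
    """
    return math.sqrt(2.0 * checkpoint_cost_s * mttf_s)

# Illustrative values: a 6-hour MTTF and a 60 s checkpoint cost give an
# interval of roughly 27 minutes.
interval = optimal_checkpoint_interval(6 * 3600, 60)
```

A model using a Weibull or hyperexponential availability distribution, as the paper's results favor, generally yields a different (and non-constant-hazard) schedule than this constant-interval rule.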


Journal of Parallel and Distributed Computing | 2011

Deadline-sensitive workflow orchestration without explicit resource control

Lavanya Ramakrishnan; Jeffrey S. Chase; Dennis Gannon; Daniel Nurmi; Richard Wolski

Deadline-sensitive workflows require careful coordination of user constraints with resource availability. Current distributed resource access models provide varying degrees of resource control: from limited or none in grid batch systems to explicit in cloud systems. Additionally, applications experience variability due to competing user loads, performance variations, failures, etc. These variations impact the quality of service (QoS) and go unaccounted for in planning strategies. In this paper we propose the Workflow ORchestrator for Distributed Systems (WORDS), an architecture based on a least-common-denominator resource model that abstracts away the differences and captures the QoS properties provided by grid and cloud systems. We investigate algorithms for effective orchestration (i.e., resource procurement and task mapping) of deadline-sensitive workflows atop the resource abstraction provided by WORDS. Our evaluation compares orchestration methodologies over TeraGrid and Amazon EC2 systems. Experimental results show that WORDS makes effective orchestration possible at reasonable cost on batch-queue grid and cloud systems, with or without explicit resource control.


ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming | 2008

Probabilistic advanced reservations for batch-scheduled parallel machines

Daniel Nurmi; Richard Wolski; John Brevik

In high-performance computing (HPC) settings, in which multiprocessor machines are shared among users with potentially competing resource demands, processors are allocated to user workload using space sharing. Typically, users interact with a given machine by submitting their jobs to a centralized batch scheduler that implements a site-specific, and often partially hidden, policy designed to maximize machine utilization while providing tolerable turnaround times. In practice, while most HPC systems experience good utilization levels, the amount of time individual jobs spend waiting to begin execution has been shown to be highly variable and difficult to predict, leading to user confusion and/or frustration. One method that has been proposed for dealing with this uncertainty is to allow users who are willing to plan ahead to make "advanced reservations" for processor resources. To date, however, few if any HPC centers provide an advanced reservation capability to their general user populations, for fear (supported by previous research) that diminished machine utilization will occur if and when advanced reservations are introduced. In this work, we describe VARQ, a new method for job scheduling that provides users with probabilistic "virtual" advanced reservations using only existing best-effort batch schedulers and policies. VARQ functions as an overlay, submitting jobs that are indistinguishable from the normal workload serviced by a scheduler. We describe the statistical methods we use to implement VARQ, detail an empirical evaluation of its effectiveness in a number of HPC settings, and explore the potential future impact of VARQ should it become widely used. Without requiring HPC sites to support advanced reservations, we find that VARQ can implement a reservation capability probabilistically and that the effects of this probabilistic approach are unlikely to negatively affect resource utilization.
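The core of the virtual-reservation idea, i.e. submitting a best-effort job early enough that it is probably running by the desired start time, can be sketched as follows. The function name and the use of a plain empirical quantile are illustrative assumptions; VARQ's actual statistical machinery for deriving wait-time bounds is more involved.

```python
import numpy as np

def varq_submit_time(reservation_start, wait_samples, success_prob=0.95):
    """Choose when to submit a best-effort job so that, with probability
    about `success_prob`, it has left the queue by `reservation_start`
    (a Unix timestamp).

    Simplifying assumption: the lead time is the `success_prob` empirical
    quantile of historical queue waits (seconds) for the target queue.
    """
    lead = float(np.quantile(wait_samples, success_prob))
    return reservation_start - lead
```

An overlay built on this idea would also monitor the submitted job and resubmit or adjust if the predicted wait turns out to be too optimistic.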

Collaboration


Dive into Daniel Nurmi's collaborations.

Top Co-Authors

Richard Wolski, University of California
John Brevik, California State University
Anirban Mandal, University of North Carolina at Chapel Hill
Chandra Krintz, University of California
Lamia Youseff, University of California