Anirban Mandal
University of North Carolina at Chapel Hill
Publications
Featured research published by Anirban Mandal.
cluster computing and the grid | 2005
Jim Blythe; Sonal Jain; Ewa Deelman; Yolanda Gil; Karan Vahi; Anirban Mandal; Ken Kennedy
Grid applications require allocating a large number of heterogeneous tasks to distributed resources. A good allocation is critical for efficient execution. However, many existing grid toolkits use matchmaking strategies that do not consider overall efficiency for the set of tasks to be run. We identify two families of resource allocation algorithms: task-based algorithms, which greedily allocate tasks to resources, and workflow-based algorithms, which search for an efficient allocation for the entire workflow. We compare the behavior of workflow-based and task-based algorithms using simulations of workflows drawn from a real application, with varying ratios of computation cost to data transfer cost. We observe that workflow-based approaches have the potential to work better for data-intensive applications even when estimates about future tasks are inaccurate.
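The contrast between the two families can be illustrated with a minimal sketch. The resource model and cost function below are invented for the example and are not taken from the paper; the exhaustive search in `workflow_based` is only feasible for tiny inputs and stands in for the paper's more practical whole-workflow search:

```python
from itertools import product

def task_based(tasks, resources, cost):
    """Greedy: assign each task to the resource that finishes it earliest,
    considering only the tasks already placed."""
    finish = {r: 0.0 for r in resources}
    plan = {}
    for t in tasks:
        best = min(resources, key=lambda r: finish[r] + cost(t, r))
        plan[t] = best
        finish[best] += cost(t, best)
    return plan, max(finish.values())

def workflow_based(tasks, resources, cost):
    """Whole-workflow search: try every assignment of the full task set and
    keep the one with the smallest overall makespan."""
    best_plan, best_span = None, float("inf")
    for combo in product(resources, repeat=len(tasks)):
        finish = {r: 0.0 for r in resources}
        for t, r in zip(tasks, combo):
            finish[r] += cost(t, r)
        span = max(finish.values())
        if span < best_span:
            best_plan, best_span = dict(zip(tasks, combo)), span
    return best_plan, best_span
```

With two tasks and two resources where the greedy choice for the first task crowds the second one, the whole-workflow search finds a strictly better makespan.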
International Journal of Parallel Programming | 2005
Fran Berman; Henri Casanova; Andrew A. Chien; Keith D. Cooper; Holly Dail; Anshuman Dasgupta; W. Deng; Jack J. Dongarra; Lennart Johnsson; Ken Kennedy; Charles Koelbel; Bo Liu; Xin Liu; Anirban Mandal; Gabriel Marin; Mark Mazina; John M. Mellor-Crummey; Celso L. Mendes; A. Olugbile; Jignesh M. Patel; Daniel A. Reed; Zhiao Shi; Otto Sievert; Huaxia Xia; A. YarKhan
The goal of the Grid Application Development Software (GrADS) Project is to provide programming tools and an execution environment to ease program development for the Grid. This paper presents recent extensions to the GrADS software framework: a new approach to scheduling workflow computations, applied to a 3-D image reconstruction application; a simple stop/migrate/restart approach to rescheduling Grid applications, applied to a QR factorization benchmark; and a process-swapping approach to rescheduling, applied to an N-body simulation. Experiments validating these methods were carried out on both the GrADS MacroGrid (a small but functional Grid) and the MicroGrid (a controlled emulation of the Grid).
high performance distributed computing | 2005
Anirban Mandal; Ken Kennedy; Charles Koelbel; Gabriel Marin; John M. Mellor-Crummey; Bo Liu; S. Lennart Johnsson
In this work, we describe new strategies for scheduling and executing workflow applications on grid resources using the GrADS [Ken Kennedy et al., 2002] infrastructure. Workflow scheduling is based on heuristic scheduling strategies that use application component performance models. The workflow is executed using a novel strategy to bind and launch the application onto heterogeneous resources. We apply these strategies in the context of executing EMAN, a bio-imaging workflow application, on the grid. Our experiments show that performance-model-based, in-advance heuristic workflow scheduling yields makespans 1.5 to 2.2 times better than other existing scheduling strategies. This strategy also achieves optimal load balance across the different grid sites for this application.
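One widely used in-advance heuristic of this kind is min-min scheduling, which repeatedly commits the task/resource pair with the globally smallest estimated completion time. The sketch below is an illustrative approximation using an assumed `perf_model(task, resource)` estimate; it is not the paper's actual implementation:

```python
def min_min(ready_tasks, resources, perf_model):
    """Min-min heuristic: at each step, pick the (task, resource) pair with
    the smallest estimated completion time, commit it, and repeat."""
    avail = {r: 0.0 for r in resources}   # earliest free time per resource
    schedule = []
    pending = set(ready_tasks)
    while pending:
        t, r, ect = min(
            ((t, r, avail[r] + perf_model(t, r))
             for t in pending for r in resources),
            key=lambda x: x[2])
        schedule.append((t, r))
        avail[r] = ect
        pending.remove(t)
    return schedule, max(avail.values())
```

The performance model supplies per-component runtime estimates; the heuristic then balances load by spreading tasks across resources as completion times accumulate.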
cluster computing and the grid | 2008
Gopi Kandaswamy; Anirban Mandal; Daniel A. Reed
In this paper, we describe the design and implementation of two mechanisms for fault tolerance and recovery for complex scientific workflows on computational grids. We present our algorithms for over-provisioning and migration, which are our primary strategies for fault tolerance. We consider application performance models, resource reliability models, network latency and bandwidth, and batch-queue wait times on compute resources when determining the correct fault-tolerance strategy. Our goal is to balance reliability and performance in the presence of soft real-time constraints, such as deadlines and expected success probabilities, and to do so in a way that is transparent to scientists. We have evaluated our strategies by developing a Fault-Tolerance and Recovery (FTR) service and deploying it as a part of the Linked Environments for Atmospheric Discovery (LEAD) production infrastructure. Results from real usage scenarios in LEAD show that the failure rate of individual steps in workflows decreases from about 30% to 5% by using our fault-tolerance strategies.
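The trade-off between the two strategies can be sketched with a toy decision model. The probabilities, overheads, and the `choose_strategy` helper below are assumptions for illustration, not the FTR service's actual logic: migration retries on another resource after a failure (cheap in resources, slow under failure), while over-provisioning runs replicas in parallel (expensive, but more likely to meet a deadline):

```python
def expected_time_migration(p, runtime, overhead):
    """Expected completion time if we rerun on another resource after a
    failure (modeling a single retry): succeed first try with probability p,
    otherwise pay the run, the migration overhead, and a second run."""
    return p * runtime + (1 - p) * (runtime + overhead + runtime)

def success_over_provision(p, replicas):
    """Probability that at least one of `replicas` independent copies succeeds."""
    return 1 - (1 - p) ** replicas

def choose_strategy(p, runtime, overhead, deadline):
    """Prefer migration when its expected time fits the soft deadline (it is
    cheaper in resources); otherwise fall back to over-provisioning."""
    if expected_time_migration(p, runtime, overhead) <= deadline:
        return "migration"
    return "over-provisioning"
```

In practice the inputs would come from the resource reliability models, performance models, and queue-wait estimates the abstract lists.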
international conference on future internet technologies | 2011
Yufeng Xin; Ilia Baldine; Anirban Mandal; Chris Heermann; Jeffrey S. Chase; Aydan R. Yumerefendi
Embedding virtual topologies in physical network infrastructure has been an area of active research for the future Internet and network testbeds. Virtual network embedding is also useful for linking virtual compute clusters allocated from cloud providers. Using advanced networking technologies to interconnect distributed cloud sites is a promising way to provision on-demand large-scale virtualized networked systems for production and experimental purposes. In this paper, we study the virtual topology embedding problem in a networked cloud environment, in which a number of cloud provider sites are connected by multi-domain wide-area networks that support virtual networking technology. A user submits a request for a virtual topology, and the system plans a low-cost embedding and orchestrates requests to multiple cloud providers and network transit providers to instantiate the virtual topology according to the plan. We describe an efficient heuristic algorithm design and a prototype implementation within a GENI control framework candidate called ORCA.
conference on high performance computing (supercomputing) | 2006
Daniel Nurmi; Anirban Mandal; John Brevik; Chuck Koelbel; Richard Wolski; Ken Kennedy
Large-scale distributed systems offer computational power at unprecedented levels. In the past, HPC users typically had access to relatively few individual supercomputers and, in general, would assign a one-to-one mapping of applications to machines. Modern HPC users have simultaneous access to a large number of individual machines and are beginning to make use of all of them for single-application execution cycles. One method that application developers have devised in order to take advantage of such systems is to organize an entire application execution cycle as a workflow. The scheduling of such workflows has been the topic of a great deal of research in the past few years, and although very sophisticated algorithms have been devised, a specific aspect of these distributed systems, namely that most supercomputing resources employ batch-queue scheduling software, has been omitted from consideration, presumably because it is difficult to model accurately. In this work, we augment an existing workflow scheduler through the introduction of methods which make accurate predictions of both the performance of the application on specific hardware and the amount of time individual workflow tasks would spend waiting in batch queues. Our results show that although a workflow scheduler alone may choose correct task placement based on data locality or network connectivity, this benefit is often compromised by the fact that most jobs submitted to current systems must wait in overcommitted batch queues for a significant portion of time. However, incorporating the enhancements we describe improves workflow execution time in settings where batch queues impose significant delays on constituent workflow tasks.
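The core idea, adding the predicted batch-queue wait to the predicted execution time before choosing a resource, can be sketched as follows. The resource names, numbers, and helper functions are hypothetical, not the scheduler's actual interface:

```python
def estimated_completion(runtime, queue_wait):
    """Queue-aware estimate: predicted batch-queue wait plus predicted
    execution time on that resource."""
    return queue_wait + runtime

def pick_resource(runtimes, queue_waits):
    """Choose the resource minimizing wait + run, rather than run time alone."""
    return min(runtimes,
               key=lambda r: estimated_completion(runtimes[r], queue_waits[r]))
```

This captures the abstract's observation: a resource that looks best on locality or raw speed can lose to a slower one whose queue is short.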
testbeds and research infrastructures for the development of networks and communities | 2012
Ilia Baldine; Yufeng Xin; Anirban Mandal; Paul Ruth; Chris Heerman; Jeffrey S. Chase
NSF’s GENI program seeks to enable experiments that run within virtual network topologies built-to-order from testbed infrastructure offered by multiple providers (domains). GENI is often viewed as a network testbed integration effort, but behind it is an ambitious vision for multi-domain infrastructure-as-a-service (IaaS). This paper presents ExoGENI, a new GENI testbed that links GENI to two advances in virtual infrastructure services outside of GENI: open cloud computing (OpenStack) and dynamic circuit fabrics. ExoGENI orchestrates a federation of independent cloud sites and circuit providers through their native IaaS interfaces, and links them to other GENI tools and resources.
cluster computing and the grid | 2009
Yang Zhang; Anirban Mandal; Charles Koelbel; Keith D. Cooper
Complex scientific workflows are now increasingly executed on computational grids. In addition to the challenges of managing and scheduling these workflows, reliability challenges arise because of the unreliable nature of large-scale grid infrastructure. Fault-tolerance mechanisms like over-provisioning and checkpoint-recovery are used in current grid application management systems to address these reliability challenges. In this work, we propose new approaches that combine these fault-tolerance techniques with existing workflow scheduling algorithms. We present a study on the effectiveness of the combined approaches by analyzing their impact on the reliability of workflow execution, workflow performance, and resource usage under different reliability models, failure prediction accuracies, and workflow application types.
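One way a scheduler can reason about checkpoint-recovery when placing tasks is with a first-order model of runtime under periodic checkpointing. The formula below is a standard back-of-the-envelope approximation (pay the checkpoint cost each interval; a failure rolls back half an interval of work on average), not the paper's model:

```python
def expected_runtime(work, ckpt_interval, ckpt_cost, failure_rate):
    """First-order estimate of wall time for `work` units of computation with
    checkpoints every `ckpt_interval` units, each costing `ckpt_cost`, on a
    resource with `failure_rate` failures per unit of work."""
    n_intervals = work / ckpt_interval
    time_with_ckpts = n_intervals * (ckpt_interval + ckpt_cost)
    expected_rework = failure_rate * work * (ckpt_interval / 2)
    return time_with_ckpts + expected_rework
```

A combined scheduler could feed such estimates, one per candidate resource with its own failure rate, into an ordinary makespan-minimizing placement.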
international symposium on performance analysis of systems and software | 2010
Anirban Mandal; Rob Fowler; Allan Porterfield
Multi-core computers are ubiquitous and multi-socket versions dominate as nodes in compute clusters. Given the high level of parallelism inherent in processor chips, the ability of memory systems to serve a large number of concurrent memory access operations is becoming a critical performance problem. The most common model of memory performance uses just two numbers, peak bandwidth and typical access latency. We introduce concurrency as an explicit parameter of the measurement and modeling processes to characterize more accurately the complexity of memory behavior of multi-socket, multi-core systems. We present a detailed experimental multi-socket, multi-core memory study based on the PCHASE benchmark, which can vary memory loads by controlling the number of concurrent memory references per thread. The make-up and structure of the memory have a major impact on achievable bandwidth. Three discrete bottlenecks were observed at different levels of the hardware architecture: limits on the number of references outstanding per core; limits to the memory requests serviced by a single memory controller; and limits on the global memory concurrency. We use these results to build a memory performance model that ties concurrency, latency and bandwidth together to create a more accurate model of overall performance. We show that current commodity memory sub-systems cannot handle the load offered by high-end processor chips.
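The concurrency-aware model described here is in the spirit of Little's law: sustained bandwidth is bounded by the number of memory requests in flight divided by their round-trip latency, and saturates at whichever hardware limit binds first. The sketch below is illustrative; the cache-line size, latencies, and peak figures are assumed values, not measurements from the paper:

```python
def model_bandwidth(concurrency, latency_ns, line_bytes=64):
    """Little's-law style estimate: with `concurrency` cache-line requests
    outstanding and `latency_ns` round-trip latency, throughput is
    concurrency * line_bytes / latency_ns (bytes/ns, i.e. GB/s)."""
    return concurrency * line_bytes / latency_ns

def achievable_bandwidth(concurrency, latency_ns, peak_gbs, line_bytes=64):
    """Achieved bandwidth is the concurrency-limited estimate, capped at the
    peak the memory controllers can service."""
    return min(model_bandwidth(concurrency, latency_ns, line_bytes), peak_gbs)
```

Under such a model, a core limited to a handful of outstanding misses cannot reach peak bandwidth regardless of the rated peak, which is the qualitative conclusion the abstract draws about commodity memory subsystems.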
International Journal of High Performance Computing Applications | 2017
Ewa Deelman; Christopher D. Carothers; Anirban Mandal; Brian Tierney; Jeffrey S. Vetter; Ilya Baldin; Claris Castillo; Gideon Juve; Dariusz Król; V. E. Lynch; Benjamin Mayer; Jeremy S. Meredith; Thomas Proffen; Paul Ruth; Rafael Ferreira da Silva
Computational science is well established as the third pillar of scientific discovery and is on par with experimentation and theory. However, as we move closer toward the ability to execute exascale calculations and process the ensuing extreme-scale amounts of data produced by both experiments and computations alike, the complexity of managing the compute and data analysis tasks has grown beyond the capabilities of domain scientists. Thus, workflow management systems are absolutely necessary to ensure current and future scientific discoveries. A key research question for these workflow management systems concerns the performance optimization of complex calculation and data analysis tasks. The central contribution of this article is a description of the PANORAMA approach for modeling and diagnosing the run-time performance of complex scientific workflows. This approach integrates extreme-scale systems testbed experimentation, structured analytical modeling, and parallel systems simulation into a comprehensive workflow framework called Pegasus for understanding and improving the overall performance of complex scientific workflows.