Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Marc Gamell is active.

Publication


Featured research published by Marc Gamell.


IEEE International Symposium on Parallel & Distributed Processing, Workshops and PhD Forum | 2011

Energy-Aware Application-Centric VM Allocation for HPC Workloads

Hariharasudhan Viswanathan; Eun Kyung Lee; Ivan Rodero; Dario Pompili; Manish Parashar; Marc Gamell

Virtualized data centers and clouds are being increasingly considered for traditional High-Performance Computing (HPC) workloads that have typically targeted Grids and conventional HPC platforms. However, maximizing energy efficiency, cost-effectiveness, and utilization of data center resources while ensuring performance and other Quality of Service (QoS) guarantees for HPC applications requires careful consideration of important and extremely challenging tradeoffs. An innovative application-centric energy-aware strategy for Virtual Machine (VM) allocation is presented. The proposed strategy ensures high resource utilization and energy efficiency through VM consolidation while satisfying application QoS. While existing VM allocation solutions aim to satisfy applications' resource requirements along a single dimension (CPU utilization), the proposed approach is more generic, as it employs knowledge obtained through application profiling along multiple dimensions. The results of our evaluation show that the proposed VM allocation strategy enables significant reduction either in energy consumption or in execution time, depending on the optimization goals.
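
As a rough illustration of what multi-dimensional, profile-driven allocation can look like, the following minimal Python sketch packs VMs onto hosts with a first-fit-decreasing heuristic over (CPU, memory, network) demand vectors. It is a hypothetical toy, not the paper's allocator; the demand values and the fits/consolidate helpers are invented for the example.

    # Hypothetical sketch of multi-dimensional, profile-driven VM
    # consolidation. Demands and capacities are normalized utilization
    # vectors over (CPU, memory, network); this is a first-fit-decreasing
    # heuristic, not the paper's actual allocation strategy.

    def fits(host_load, demand, capacity=1.0):
        """A VM fits if no dimension would exceed the host's capacity."""
        return all(h + d <= capacity for h, d in zip(host_load, demand))

    def consolidate(vm_demands):
        """Pack VMs onto as few hosts as possible; returns per-host loads."""
        hosts = []  # each entry is an aggregate (cpu, mem, net) load
        # Place the most demanding VMs first (sort by largest dimension).
        for demand in sorted(vm_demands, key=max, reverse=True):
            for i, load in enumerate(hosts):
                if fits(load, demand):
                    hosts[i] = tuple(h + d for h, d in zip(load, demand))
                    break
            else:
                hosts.append(tuple(demand))  # open a new host
        return hosts

    # Profiled demands for four VMs along (CPU, memory, network):
    vms = [(0.6, 0.2, 0.1), (0.3, 0.5, 0.2), (0.4, 0.3, 0.6), (0.2, 0.2, 0.1)]
    print(len(consolidate(vms)), "hosts after consolidation")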


Grid Computing | 2012

Energy-Efficient Thermal-Aware Autonomic Management of Virtualized HPC Cloud Infrastructure

Ivan Rodero; Hariharasudhan Viswanathan; Eun Kyung Lee; Marc Gamell; Dario Pompili; Manish Parashar

Virtualized datacenters and clouds are being increasingly considered for traditional High-Performance Computing (HPC) workloads that have typically targeted Grids and conventional HPC platforms. However, maximizing energy efficiency and utilization of datacenter resources, and minimizing undesired thermal behavior while ensuring application performance and other Quality of Service (QoS) guarantees for HPC applications, requires careful consideration of important and extremely challenging tradeoffs. Virtual Machine (VM) migration is one of the most common techniques used to alleviate thermal anomalies (i.e., hotspots) in cloud datacenter servers as it reduces load and, hence, the server utilization. In this article, the benefits of other techniques such as voltage scaling and pinning (traditionally used for reducing energy consumption), applied to thermal management as alternatives to VM migration, are studied in detail. As no single technique is the most efficient to meet temperature/performance optimization goals in all situations, an autonomic approach that performs energy-efficient thermal management while ensuring the QoS delivered to the users is proposed. To address the problem of VM allocation that arises during VM migrations, an innovative application-centric energy-aware strategy for VM allocation is proposed. The proposed strategy ensures high resource utilization and energy efficiency through VM consolidation while satisfying application QoS by exploiting knowledge obtained through application profiling along multiple dimensions (CPU, memory, and network bandwidth utilization). To support our arguments, we present the results obtained from an experimental evaluation on real hardware using HPC workloads under different scenarios.
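
A rule-based selector gives the flavor of such an autonomic controller: cheap, server-local actuators first, migration last. The thresholds, action names, and the choose_action helper below are illustrative assumptions, not values or interfaces from the paper.

    # Hypothetical rule-based selector in the spirit of the autonomic
    # approach described above: prefer inexpensive, server-local actuators
    # (pinning, voltage scaling) and fall back to VM migration only for
    # severe hotspots or when QoS slack is exhausted.

    def choose_action(temp_c, qos_slack):
        """temp_c: server temperature; qos_slack: fraction of the
        performance budget still unused (0.0 = none left)."""
        if temp_c < 70:
            return "none"                # no thermal anomaly
        if temp_c < 80 and qos_slack > 0.2:
            return "pin-vcpus"           # repack vCPUs onto cooler cores
        if temp_c < 90 and qos_slack > 0.1:
            return "scale-voltage"       # DVFS: trade speed for heat
        return "migrate-vm"              # last resort: move load away

    for temp, slack in [(65, 0.5), (75, 0.3), (85, 0.15), (95, 0.0)]:
        print(temp, "C, slack", slack, "->", choose_action(temp, slack))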


IEEE International Conference on High Performance Computing, Data, and Analytics | 2014

Exploring automatic, online failure recovery for scientific applications at extreme scales

Marc Gamell; Daniel S. Katz; Hemanth Kolla; Jacqueline H. Chen; Scott Klasky; Manish Parashar

Application resilience is a key challenge that must be addressed in order to realize the exascale vision. Process/node failures, an important class of failures, are typically handled today by terminating the job and restarting it from the last stored checkpoint. This approach is not expected to scale to exascale. In this paper we present Fenix, a framework for enabling recovery from process/node/blade/cabinet failures for MPI-based parallel applications in an online (i.e., without disrupting the job) and transparent manner. Fenix provides mechanisms for transparently capturing failures, re-spawning new processes, fixing failed communicators, restoring application state, and returning execution control back to the application. To enable automatic data recovery, Fenix relies on application-driven, diskless, implicitly coordinated checkpointing. Using the S3D combustion simulation running on the Titan Cray XK7 production system at ORNL, we experimentally demonstrate Fenix's ability to tolerate high failure rates (e.g., more than one per minute) with low overhead while sustaining performance.
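
The control flow Fenix automates can be pictured with a toy, single-process Python loop: keep a diskless (in-memory) checkpoint, detect an injected "failure", roll back online, and resume without restarting the job. Everything here, including the failure-injection rate, is a hypothetical sketch and not the Fenix API.

    # Toy, single-process illustration of the recovery cycle described
    # above: an in-memory (diskless) checkpoint, failure detection, online
    # rollback, and resumption from the last good iteration instead of a
    # full job restart. Failure injection and all names are hypothetical.

    import random

    random.seed(1)

    state, checkpoint = 0, (0, 0)             # (iteration, state) in memory

    it = 0
    while it < 20:
        if random.random() < 0.15:             # injected process "failure"
            it, state = checkpoint             # online rollback; job survives
            continue
        state += it                            # one step of "computation"
        it += 1
        if it % 5 == 0:
            checkpoint = (it, state)           # coordinated checkpoint

    print("final state:", state)               # same result as a failure-free run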


IEEE International Conference on High Performance Computing, Data, and Analytics | 2013

Exploring power behaviors and trade-offs of in-situ data analytics

Marc Gamell; Ivan Rodero; Manish Parashar; Janine C. Bennett; Hemanth Kolla; Jacqueline H. Chen; Peer-Timo Bremer; Aaditya G. Landge; Attila Gyulassy; Patrick S. McCormick; Scott Pakin; Valerio Pascucci; Scott Klasky

As scientific applications target exascale, challenges related to data and energy are becoming dominating concerns. For example, coupled simulation workflows are increasingly adopting in-situ data processing and analysis techniques to address costs and overheads due to data movement and I/O. However, it is also critical to understand these overheads and associated trade-offs from an energy perspective. The goal of this paper is to explore data-related energy/performance trade-offs for end-to-end simulation workflows running at scale on current high-end computing systems. Specifically, this paper presents: (1) an analysis of the data-related behaviors of a combustion simulation workflow with an in-situ data analytics pipeline, running on the Titan system at ORNL; (2) a power model based on system power and data exchange patterns, which is empirically validated; and (3) the use of the model to characterize the energy behavior of the workflow and to explore energy/performance trade-offs on current as well as emerging systems.
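
A minimal sketch of this kind of power model: workflow energy as node power integrated over phase durations plus a per-byte term for data exchange. The coefficients and the workflow_energy helper are illustrative placeholders, not the paper's fitted model.

    # Minimal sketch of a phase-based workflow power model: node power
    # integrated over each phase's duration, plus a per-byte cost for data
    # exchange. The coefficients below are illustrative placeholders.

    def workflow_energy(phases, joules_per_gb=0.5e3):
        """phases: list of (node_watts, nodes, seconds, gb_moved)."""
        total = 0.0
        for watts, nodes, secs, gb in phases:
            total += watts * nodes * secs      # compute/analysis energy
            total += joules_per_gb * gb        # data-movement energy
        return total

    phases = [
        (250.0, 1024, 60.0, 0.0),    # simulation step
        (200.0, 1024, 10.0, 512.0),  # in-situ analysis + data exchange
    ]
    print(f"{workflow_energy(phases) / 3.6e6:.1f} kWh per workflow step")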


Grid Computing | 2010

Towards energy-efficient reactive thermal management in instrumented datacenters

Ivan Rodero; Eun Kyung Lee; Dario Pompili; Manish Parashar; Marc Gamell; Renato J. O. Figueiredo

Virtual Machine (VM) migration is one of the most common techniques used to alleviate thermal anomalies (i.e., hotspots) in cloud datacenter servers by reducing the load and, therefore, decreasing the server utilization. However, other techniques, such as voltage scaling, can also be applied to reduce the temperature of the servers in datacenters. Because no single technique is the most efficient to meet temperature/performance optimization goals in all situations, we work towards an autonomic approach that performs energy-efficient thermal management while ensuring the Quality of Service (QoS) delivered to the users. In this paper, we explore ways to take actions to reduce energy consumption at the server side before performing costly migrations of VMs. Specifically, we focus on exploiting VM Monitor (VMM) configurations, such as pinning techniques in Xen platforms, which are complementary to other techniques at the physical server layer such as using low power modes. To support the arguments of our approach, we present the results obtained from an experimental evaluation on real hardware using High Performance Computing (HPC) workloads under different scenarios.
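
The reactive policy can be sketched as an escalation loop: exhaust cheap server-side knobs (vCPU pinning, low-power modes) and re-check the temperature before paying for a migration. The per-action cooling effects and thresholds below are made-up numbers, not measurements from the paper.

    # Illustrative escalation loop for the reactive policy sketched above:
    # apply inexpensive server-side actions first and re-check the thermal
    # sensor before resorting to a costly VM migration. The cooling effect
    # assumed for each action is invented for the example.

    ACTIONS = [("pin-vcpus", 4.0), ("low-power-mode", 7.0)]  # est. delta-T

    def react(temp_c, threshold=75.0):
        plan = []
        for name, cooling in ACTIONS:
            if temp_c <= threshold:
                break
            plan.append(name)
            temp_c -= cooling          # assumed effect of the actuator
        if temp_c > threshold:
            plan.append("migrate-vm")  # costly fallback
        return plan

    print(react(84.0))   # ['pin-vcpus', 'low-power-mode']
    print(react(92.0))   # ['pin-vcpus', 'low-power-mode', 'migrate-vm']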


IEEE International Conference on High Performance Computing, Data, and Analytics | 2015

Practical scalable consensus for pseudo-synchronous distributed systems

Thomas Herault; Aurelien Bouteiller; George Bosilca; Marc Gamell; Keita Teranishi; Manish Parashar; Jack J. Dongarra

The ability to consistently handle faults in a distributed environment requires, among a small set of basic routines, an agreement algorithm allowing surviving entities to reach a consensual decision among a bounded set of volatile resources. This paper presents an algorithm that implements an Early Returning Agreement (ERA) in pseudo-synchronous systems, which optimistically allows a process to resume its activity while guaranteeing strong progress. We prove the correctness of our ERA algorithm and expose its logarithmic behavior, which is an extremely desirable property for any algorithm that targets future exascale platforms. We detail a practical implementation of this consensus algorithm in the context of an MPI library, and evaluate both its efficiency and scalability through a set of benchmarks and two fault-tolerant scientific applications.
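
Why tree-structured agreement scales logarithmically can be seen with a plain reduce-then-broadcast simulation: each tree level is one message round, so n processes decide in O(log n) rounds. This sketch deliberately ignores the fault tolerance and early-returning machinery that make ERA interesting; it only illustrates the logarithmic shape.

    # Plain reduce-then-broadcast over a binary tree, to illustrate the
    # logarithmic round count of tree-based agreement. This is NOT the
    # fault-tolerant ERA protocol itself.

    import math

    def tree_agree(votes):
        """Combine local votes up a binary tree, then broadcast the result."""
        rounds = 0
        level = list(votes)
        while len(level) > 1:                              # reduction phase
            pairs = [min(a, b) for a, b in zip(level[::2], level[1::2])]
            if len(level) % 2:                             # odd node carries over
                pairs.append(level[-1])
            level = pairs
            rounds += 1
        rounds *= 2                                        # broadcast mirrors it
        return level[0], rounds

    votes = [1] * 1023 + [0]          # a single process votes "abort" (0)
    decision, rounds = tree_agree(votes)
    print("decision:", decision, "rounds:", rounds,
          "~ 2*log2(n) =", 2 * math.ceil(math.log2(len(votes))))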


IEEE International Conference on High Performance Computing, Data, and Analytics | 2015

Local recovery and failure masking for stencil-based applications at extreme scales

Marc Gamell; Keita Teranishi; Michael A. Heroux; Jackson R. Mayo; Hemanth Kolla; Jacqueline H. Chen; Manish Parashar

Application resilience is a key challenge that has to be addressed to realize the exascale vision. Online recovery, even when it involves all processes, can dramatically reduce the overhead of failures as compared to the more traditional approach where the job is terminated and restarted from the last checkpoint. In this paper we explore how local recovery can be used for certain classes of applications to further reduce overheads due to resilience. Specifically we develop programming support and scalable runtime mechanisms to enable online and transparent local recovery for stencil-based parallel applications on current leadership class systems. We also show how multiple independent failures can be masked to effectively reduce the impact on the total time to solution. We integrate these mechanisms with the S3D combustion simulation, and experimentally demonstrate (using the Titan Cray XK7 system at ORNL) the ability to tolerate high failure rates (i.e., node failures every 5 seconds) with low overhead while sustaining performance, at scales up to 262,144 cores.
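
The local-recovery idea can be illustrated with a toy 1D stencil in Python: each rank checkpoints its own subdomain in memory and logs the ghost cells it receives, so a failed rank can replay exactly the messages it consumed and rebuild its lost state while the other ranks never roll back. This is a conceptual sketch under simplified assumptions, not the paper's runtime.

    # Toy 1D stencil run with per-rank in-memory checkpoints and ghost-cell
    # logging. When one rank "fails", it alone restores its checkpoint and
    # replays the logged ghost values, reconstructing its state exactly.

    RANKS, WIDTH, STEPS, CKPT_EVERY = 4, 6, 8, 3

    sub = [[float(r)] * WIDTH for r in range(RANKS)]       # per-rank subdomain
    ckpt = {r: (0, list(sub[r])) for r in range(RANKS)}    # (step, state)
    ghost_log = {r: [] for r in range(RANKS)}              # received ghosts

    def ghosts(r):
        """Boundary values a rank receives from its neighbors."""
        left = sub[r - 1][-1] if r > 0 else sub[r][0]
        right = sub[r + 1][0] if r < RANKS - 1 else sub[r][-1]
        return left, right

    def apply_step(r, left, right):
        """One Jacobi-style smoothing step on rank r's subdomain."""
        row = [left] + sub[r] + [right]
        sub[r] = [(row[i - 1] + row[i] + row[i + 1]) / 3
                  for i in range(1, WIDTH + 1)]

    for t in range(STEPS):
        for r in range(RANKS):                  # exchange boundaries first,
            ghost_log[r].append(ghosts(r))      # logging what was received
        for r in range(RANKS):                  # then update every rank
            apply_step(r, *ghost_log[r][t])
        if (t + 1) % CKPT_EVERY == 0:           # local in-memory checkpoints
            ckpt = {r: (t + 1, list(sub[r])) for r in range(RANKS)}

    final = list(sub[2])                        # true final state of rank 2
    t0, saved = ckpt[2]                         # rank 2 "fails": restore its
    sub[2] = list(saved)                        # own checkpoint only...
    for t in range(t0, STEPS):                  # ...and replay logged ghosts
        apply_step(2, *ghost_log[2][t])
    print("exact local recovery:", sub[2] == final)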


IEEE International Conference on High Performance Computing, Data, and Analytics | 2013

Exploring energy and performance behaviors of data-intensive scientific workflows on systems with deep memory hierarchies

Marc Gamell; Ivan Rodero; Manish Parashar; Stephen W. Poole

The increasing gap between the rate at which large scale scientific simulations generate data and the corresponding storage speeds and capacities is leading to more complex system architectures with deep memory hierarchies. Advances in non-volatile memory (NVRAM) technology have made it an attractive candidate as intermediate storage in this memory hierarchy to address the latency and performance gap between main memory and disk storage. As a result, it is important to understand and model its energy/performance behavior from an application perspective as well as how it can be effectively used for staging data within an application workflow. In this paper, we target a NVRAM-based deep memory hierarchy and explore its potential for supporting in-situ/in-transit data analytics pipelines that are part of application workflows. Specifically, we model the memory hierarchy and experimentally explore energy/performance behaviors of different data management strategies and data exchange patterns, as well as the tradeoffs associated with data placement, data movement and data processing.
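
One way to picture the placement trade-off is a small cost model over tiers: given rough bandwidth and energy-per-byte figures for DRAM, NVRAM, and disk, stage the data in the cheapest tier that still meets a time budget. All numbers and the place helper below are illustrative assumptions, not measured values.

    # Toy tier-selection model for a deep memory hierarchy: choose the
    # staging tier that minimizes energy while meeting a time budget.
    # Bandwidth and energy-per-byte figures are illustrative placeholders.

    TIERS = {                   # name: (bandwidth GB/s, energy nJ/byte)
        "DRAM":  (25.0, 0.5),
        "NVRAM": (5.0,  0.1),
        "disk":  (0.5,  0.05),
    }

    def place(gb, time_budget_s):
        """Pick the tier minimizing staging energy within the budget."""
        best = None
        for name, (bw, nj_per_byte) in TIERS.items():
            t = gb / bw                          # staging time in seconds
            joules = gb * 1e9 * nj_per_byte * 1e-9
            if t <= time_budget_s and (best is None or joules < best[1]):
                best = (name, joules, t)
        return best

    print(place(gb=64.0, time_budget_s=20.0))    # NVRAM wins here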


High Performance Distributed Computing | 2015

Exploring Failure Recovery for Stencil-based Applications at Extreme Scales

Marc Gamell; Keita Teranishi; Michael A. Heroux; Jackson R. Mayo; Hemanth Kolla; Jacqueline H. Chen; Manish Parashar

Application resilience is a key challenge that must be addressed in order to realize the exascale vision. Previous work has shown that online recovery, even when done in a global manner (i.e., involving all processes), can dramatically reduce the overhead of failures when compared to the more traditional approach of terminating the job and restarting it from the last stored checkpoint. In this paper we suggest going one step further, and explore how local recovery can be used for certain classes of applications to reduce the overheads due to failures. Specifically we study the feasibility of local recovery for stencil-based parallel applications and we show how multiple independent failures can be masked to effectively reduce the impact on the total time to solution.
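
A back-of-the-envelope model shows why masking matters: under global recovery every failure stalls all ranks for its full rework time, so costs add up, whereas independent failures recovered locally in disjoint neighborhoods proceed concurrently, so their costs overlap. The numbers below are purely illustrative.

    # Illustrative arithmetic for failure masking: global recovery pays for
    # each failure in sequence, while independent local recoveries overlap.

    failures = [12.0, 9.0, 11.0]   # rework seconds per independent failure

    global_overhead = sum(failures)         # everyone waits for each one
    local_overhead = max(failures)          # disjoint recoveries overlap

    print(f"global recovery: {global_overhead:.0f}s of lost time")
    print(f"local recovery (masked): {local_overhead:.0f}s of lost time")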


High Performance Distributed Computing | 2012

Exploring cross-layer power management for PGAS applications on the SCC platform

Marc Gamell; Ivan Rodero; Manish Parashar; Rajeev Muralidhar

High-performance parallel computing architectures are increasingly based on multi-core processors. While current commercially available processors offer 8 and 16 cores, technological and power constraints are limiting the performance growth of the cores and are resulting in architectures with much higher core counts, such as the experimental many-core Intel Single-chip Cloud Computer (SCC) platform. These trends are presenting new sets of challenges to HPC applications, including programming complexity and the need for extreme energy efficiency. In this paper, we first investigate the power behavior of scientific Partitioned Global Address Space (PGAS) application kernels on the SCC platform, and explore opportunities and challenges for power management within the PGAS framework. Results obtained via empirical evaluation of Unified Parallel C (UPC) applications on the SCC platform under different constraints show that, for specific operations, the potential for energy savings in PGAS is large, and that power/performance trade-offs can be effectively managed using a cross-layer approach. We investigate cross-layer power management using PGAS language extensions and runtime mechanisms that manipulate power/performance trade-offs. Specifically, we present the design, implementation, and evaluation of such a middleware for application-aware cross-layer power management of UPC applications on the SCC platform. Finally, based on our observations, we provide a set of insights that can be used to support similar power management for PGAS applications on other many-core platforms.
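
A cross-layer policy of this general kind can be sketched as follows: the application or runtime labels phases, and a power manager runs communication-bound phases at a lower frequency, where the lost core speed costs little. The frequency/power table and the run helper are invented for the example and do not reflect SCC measurements.

    # Illustrative cross-layer power policy: phases labeled by the
    # application (or a UPC-like runtime) are executed at a frequency level
    # chosen by a power manager. Numbers are made up for the example.

    FREQ = {"high": (1.0, 100.0), "low": (0.6, 55.0)}  # rel. speed, watts

    def run(phases):
        """phases: (kind, seconds-at-high-frequency); returns (time, J)."""
        time = energy = 0.0
        for kind, secs in phases:
            level = "low" if kind == "comm" else "high"
            speed, watts = FREQ[level]
            # Communication phases are bandwidth-bound, so slowing the
            # cores barely stretches them; compute scales with frequency.
            t = secs if kind == "comm" else secs / speed
            time += t
            energy += watts * t
        return time, energy

    app = [("compute", 6.0), ("comm", 4.0), ("compute", 6.0)]
    print("time %.1fs, energy %.0f J" % run(app))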

Collaboration


Dive into Marc Gamell's collaborations.

Top Co-Authors

Keita Teranishi (Sandia National Laboratories)
Hemanth Kolla (Sandia National Laboratories)
Jacqueline H. Chen (Sandia National Laboratories)
Michael A. Heroux (Sandia National Laboratories)
Jackson R. Mayo (Sandia National Laboratories)