Guillaume Aupy
École normale supérieure de Lyon
Publication
Featured research published by Guillaume Aupy.
ieee international conference on high performance computing, data, and analytics | 2012
Guillaume Aupy; Anne Benoit; Yves Robert
We consider a task graph mapped on a set of homogeneous processors. We aim at minimizing the energy consumption while enforcing two constraints: a prescribed bound on the execution time (or makespan), and a reliability threshold. Dynamic voltage and frequency scaling (DVFS) is an approach frequently used to reduce the energy consumption of a schedule, but slowing down the execution of a task to save energy decreases the reliability of the execution. In this work, to improve the reliability of a schedule while reducing the energy consumption, we allow for the re-execution of some tasks. We assess the complexity of the tri-criteria scheduling problem (makespan, reliability, energy) of deciding which tasks to re-execute, and at which speed each execution of a task should be done, with two different speed models: either processors can have arbitrary speeds (CONTINUOUS model), or a processor can run at a finite number of different speeds and change its speed during a computation (VDD-HOPPING model). We propose several novel tri-criteria scheduling heuristics under the continuous speed model, and we evaluate them through a set of simulations. The two best heuristics turn out to be very efficient and complementary.
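The trade-off above can be made concrete with a minimal sketch. The power and reliability models below (cubic dynamic power, and an exponential failure law whose rate grows as the processor slows down) are common assumptions in the DVFS literature, not formulas taken from this paper; all parameter values are hypothetical.

```python
import math

# Hypothetical model parameters (for illustration only).
LAMBDA0 = 1e-5   # nominal failure rate at maximum speed
D = 3.0          # sensitivity of the failure rate to slowdown
S_MIN, S_MAX = 0.5, 1.0

def exec_time(w, s):
    """Time to run w units of work at speed s."""
    return w / s

def energy(w, s):
    """Dynamic energy: power s**3 integrated over time w/s gives w * s**2."""
    return w * s ** 2

def reliability(w, s):
    """Probability that executing w units of work at speed s succeeds:
    the failure rate increases exponentially as speed decreases."""
    lam = LAMBDA0 * 10 ** (D * (S_MAX - s) / (S_MAX - S_MIN))
    return math.exp(-lam * w / s)

def reexec_reliability(w, s1, s2):
    """Success probability when a failed first run at s1 is re-executed at s2."""
    r1, r2 = reliability(w, s1), reliability(w, s2)
    return r1 + (1 - r1) * r2

def expected_energy(w, s1, s2):
    """Expected energy: the first run is always paid, the second only on failure."""
    return energy(w, s1) + (1 - reliability(w, s1)) * energy(w, s2)
```

Under such a model, re-execution lets the first run use a low, energy-efficient speed while the (rarely needed) second run restores the reliability threshold, which is the lever the paper's heuristics exploit.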
pacific rim international symposium on dependable computing | 2013
Guillaume Aupy; Anne Benoit; Thomas Herault; Yves Robert; Frédéric Vivien; Dounia Zaidouni
In this paper, we revisit traditional checkpointing and rollback recovery strategies, with a focus on silent data corruption errors. Unlike fail-stop failures, such latent errors cannot be detected immediately, and a mechanism to detect them must be provided. We consider two models: (i) errors are detected after some delay following a probability distribution (typically, an Exponential distribution), (ii) errors are detected through some verification mechanism. In both cases, we compute the optimal period in order to minimize the waste, i.e., the fraction of time where nodes do not perform useful computations. In practice, only a fixed number of checkpoints can be kept in memory, and the first model may lead to an irrecoverable failure. In this case, we compute the minimum period required for an acceptable risk. For the second model, there is no risk of irrecoverable failure, owing to the verification mechanism, but the corresponding overhead is included in the waste. Finally, both models are instantiated using realistic scenarios and application/architecture parameters.
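The baseline that this analysis refines is the classical first-order optimization of the checkpointing period. A minimal sketch of that baseline (the well-known Young/Daly approximation, not this paper's silent-error formulas):

```python
import math

def waste(T, C, mu):
    """First-order waste of periodic checkpointing: the fraction of time
    lost to taking checkpoints (C/T) plus the fraction lost to re-executing
    work after a failure (on average half a period, hence T/(2*mu))."""
    return C / T + T / (2 * mu)

def optimal_period(C, mu):
    """Period minimizing the first-order waste: T* = sqrt(2*C*mu)."""
    return math.sqrt(2 * C * mu)
```

Silent errors change this picture: because detection is delayed, the last checkpoint may itself be corrupted, which is why the paper must keep several checkpoints in memory and account for the risk of irrecoverable failure.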
international parallel and distributed processing symposium | 2015
Ana Gainaru; Guillaume Aupy; Anne Benoit; Franck Cappello; Yves Robert; Marc Snir
A significant percentage of the computing capacity of large-scale platforms is wasted because of interference incurred by multiple applications that access a shared parallel file system concurrently. One solution to handling I/O bursts on large-scale HPC systems is to absorb them at an intermediate storage layer consisting of burst buffers. However, our analysis of Argonne's Mira system shows that burst buffers cannot prevent congestion at all times. Consequently, I/O performance is dramatically degraded, showing in some cases a decrease in I/O throughput of 67%. In this paper, we analyze the effects of interference on application I/O bandwidth and propose several scheduling techniques to mitigate congestion. We show through extensive experiments that our global I/O scheduler is able to reduce the effects of congestion, even on systems where burst buffers are used, and can increase the overall system throughput by up to 56%. We also show that it outperforms current Mira I/O schedulers.
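A toy example illustrates why globally ordering I/O bursts can help even when the raw bandwidth is fixed. The numbers below are hypothetical (not from the Mira traces), and the model ignores the interference penalties that make concurrent access even worse in practice:

```python
def concurrent_finish_times(volumes, bandwidth):
    """Equal-sized bursts flushed concurrently with fair bandwidth sharing:
    every application sees bandwidth/n, so all finish at the same time."""
    n = len(volumes)
    assert len(set(volumes)) == 1, "exact only for equal volumes"
    return [volumes[0] * n / bandwidth] * n

def exclusive_finish_times(volumes, bandwidth):
    """Bursts scheduled one after the other, each at full bandwidth."""
    times, t = [], 0.0
    for v in volumes:
        t += v / bandwidth
        times.append(t)
    return times
```

With two 100-unit bursts over a 10-unit/s file system, fair sharing finishes both at t=20, while exclusive scheduling finishes them at t=10 and t=20: the same makespan, but a lower average completion time, and in a real parallel file system the avoided interference widens the gap further.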
Journal of Parallel and Distributed Computing | 2014
Guillaume Aupy; Yves Robert; Frédéric Vivien; Dounia Zaidouni
This paper deals with the impact of fault prediction techniques on checkpointing strategies. We extend the classical first-order analysis of Young and Daly in the presence of a fault prediction system, characterized by its recall and its precision. In this framework, we provide optimal algorithms to decide whether and when to take predictions into account, and we derive the optimal value of the checkpointing period. These results allow us to analytically assess the key parameters that impact the performance of fault predictors at very large scale.
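The two parameters that characterize the predictor are standard and easy to state in code; a minimal sketch computing them from event counts (the counts themselves are made up for illustration):

```python
def predictor_quality(predicted_faults, unpredicted_faults, false_alarms):
    """Recall: fraction of actual faults that the predictor announced.
       Precision: fraction of predictions that correspond to a real fault."""
    actual_faults = predicted_faults + unpredicted_faults
    total_predictions = predicted_faults + false_alarms
    recall = predicted_faults / actual_faults
    precision = predicted_faults / total_predictions
    return recall, precision
```

For instance, a predictor that announces 80 of 100 actual faults while raising 40 false alarms has recall 0.8 and precision 2/3; the paper's analysis determines, from exactly these two quantities, whether acting on predictions reduces the overall waste.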
arXiv: Distributed, Parallel, and Cluster Computing | 2013
Guillaume Aupy; Anne Benoit; Thomas Herault; Yves Robert; Jack J. Dongarra
This short paper deals with parallel scientific applications using non-blocking and periodic coordinated checkpointing to enforce resilience. We provide a model and detailed formulas for total execution time and consumed energy. We characterize the optimal period for both objectives, and we assess the range of time/energy trade-offs to be made by instantiating the model with a set of realistic scenarios for Exascale systems. We give a particular emphasis to I/O transfers, because the relative cost of communication is expected to dramatically increase, both in terms of latency and consumed energy, for future Exascale platforms.
pacific rim international symposium on dependable computing | 2013
Guillaume Aupy; Yves Robert; Frédéric Vivien; Dounia Zaidouni
This paper deals with the impact of fault prediction techniques on checkpointing strategies. We consider fault-prediction systems that do not provide exact prediction dates, but instead time intervals during which faults are predicted to strike. These intervals dramatically complicate the analysis of the checkpointing strategies. We propose a new approach based upon two periodic modes, a regular mode outside prediction windows, and a proactive mode inside prediction windows, whenever the size of these windows is large enough. We are able to compute the best period for any size of the prediction windows, thereby deriving the scheduling strategy that minimizes platform waste. In addition, the results of the analytical study are nicely corroborated by a comprehensive set of simulations, which demonstrate the validity of the model and the accuracy of the approach.
arXiv: Data Structures and Algorithms | 2013
Guillaume Aupy; Anne Benoit; Rami G. Melhem; Paul Renaud-Goud; Yves Robert
In this paper, we aim at minimizing the energy consumption when executing a divisible workload under a bound on the total execution time, while resilience is provided through checkpointing. We discuss several variants of this multi-criteria problem. Given the workload, we need to decide how many chunks to use, what are the sizes of these chunks, and at which speed each chunk is executed. Furthermore, since a failure may occur during the execution of a chunk, we also need to decide at which speed a chunk should be re-executed in the event of a failure. The goal is to minimize the expectation of the total energy consumption, while enforcing a deadline on the execution time, that should be met either in expectation (soft deadline), or in the worst case (hard deadline). For each problem instance, we propose either an exact solution, or a function that can be optimized numerically.
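The decision space described above (number of chunks, execution speed, re-execution speed, soft deadline) can be explored numerically with a brute-force sketch. The energy and failure models below are common simplifying assumptions (dynamic energy w*s**2, exponential failures), not the paper's exact formulation, and checkpoint costs are omitted for brevity:

```python
import math

LAM = 1e-4   # hypothetical failure rate per unit of work at unit speed

def chunk_expected_energy(w, s, sigma):
    """One chunk of size w run at speed s, re-executed at sigma on failure."""
    p_fail = 1.0 - math.exp(-LAM * w / s)
    return w * s ** 2 + p_fail * w * sigma ** 2

def chunk_expected_time(w, s, sigma):
    """Expected time: first execution plus re-execution on failure."""
    p_fail = 1.0 - math.exp(-LAM * w / s)
    return w / s + p_fail * w / sigma

def best_equal_chunking(W, deadline, speeds, max_chunks=20):
    """Scan chunk counts and (first-run, re-run) speed pairs; keep the
    cheapest configuration whose expected makespan meets the soft deadline."""
    best = None
    for n in range(1, max_chunks + 1):
        w = W / n
        for s in speeds:
            for sigma in speeds:
                t = n * chunk_expected_time(w, s, sigma)
                e = n * chunk_expected_energy(w, s, sigma)
                if t <= deadline and (best is None or e < best[0]):
                    best = (e, n, s, sigma)
    return best
```

With a loose deadline, such a search settles on the lowest feasible speed and many small chunks, since smaller chunks shrink the expected re-execution cost; the paper replaces this enumeration with exact solutions or one-dimensional numerical optimization.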
theory and practice of algorithms in computer systems | 2011
Leon Atkins; Guillaume Aupy; Daniel G. Cole; Kirk Pruhs
We consider the speed scaling problem where the quality of service objective is deadline feasibility and the power objective is temperature. In the case of batched jobs, we give a simple algorithm to compute the optimal schedule. For general instances, we give a new online algorithm, and obtain an upper bound on the competitive ratio of this algorithm that is an order of magnitude better than the best previously known upper bound for this problem.
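The temperature objective is usually modeled by Newton's law of cooling driven by the power dissipated at the current speed. A minimal simulator under that assumption (the dynamics dT/dt = a*s**alpha - b*T and all constants are generic illustrations, not this paper's exact model):

```python
def simulate_temperature(schedule, dt=0.01, a=1.0, b=0.5, alpha=3.0, T0=0.0):
    """Integrate Newton's law of cooling, dT/dt = a*s(t)**alpha - b*T,
    with a forward-Euler step over a schedule given as (speed, duration)
    pairs; return the peak temperature reached."""
    T, peak = T0, T0
    for s, duration in schedule:
        for _ in range(int(duration / dt)):
            T += dt * (a * s ** alpha - b * T)
            peak = max(peak, T)
    return peak
```

For example, spreading 10 units of work over 20 seconds at constant speed 0.5 reaches a much lower peak temperature than a burst at speed 1 followed by idling, which is the intuition behind temperature-aware (rather than energy-aware) speed scaling.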
european conference on parallel processing | 2013
Guillaume Aupy; Mathieu Faverge; Yves Robert; Jakub Kurzak; Piotr Luszczek; Jack J. Dongarra
This article introduces a new systolic algorithm for QR factorization, and its implementation on a supercomputing cluster of multicore nodes. The algorithm targets a virtual 3D-array and requires only local communications. The implementation of the algorithm uses threads at the node level, and MPI for inter-node communications. The complexity of the implementation is addressed with the PaRSEC software, which takes as input a parametrized dependence graph, which is derived from the algorithm, and only requires the user to decide, at the high level, the allocation of tasks to nodes. We show that the new algorithm exhibits competitive performance with state-of-the-art QR routines on a supercomputer called Kraken, which shows that high-level programming environments, such as PaRSEC, provide a viable alternative to enhance the production of quality software on complex and hierarchical architectures.
international parallel and distributed processing symposium | 2015
Guillaume Aupy; Anne Benoit; Henri Casanova; Yves Robert
We study the scheduling of computational workflows on compute resources that experience exponentially distributed failures. When a failure occurs, rollback and recovery is used to resume the execution from the last checkpointed state. The scheduling problem is to minimize the expected execution time by deciding in which order to execute the tasks in the workflow and whether or not to checkpoint a task after it completes. We give a polynomial-time algorithm for fork graphs and show that the problem is NP-complete with join graphs. Our main result is a polynomial-time algorithm to compute the execution time of a workflow with a specified set of checkpointed tasks. Using this algorithm as a basis, we propose efficient heuristics for solving the scheduling problem. We evaluate these heuristics for representative workflow configurations.
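The core checkpoint-or-not decision can be illustrated for two sequential tasks using the classical expected-completion-time formula for exponential failures. This is a simplified sketch (failures during the checkpoint itself are ignored, and the parameter values are hypothetical), not the paper's workflow algorithm:

```python
import math

def expected_time(w, lam, R=0.0):
    """Expected time to complete w units of work under Poisson failures of
    rate lam, restarting from the last checkpoint after each failure with
    recovery cost R (classical formula for exponentially distributed failures)."""
    return math.exp(lam * R) * (math.exp(lam * w) - 1.0) / lam

def checkpoint_pays_off(w1, w2, lam, C, R=0.0):
    """Should we checkpoint between two sequential tasks?  Without the
    checkpoint, a failure during the second task re-executes both; with it,
    we pay the checkpoint cost C but each task only re-executes itself."""
    with_ckpt = expected_time(w1, lam, R) + C + expected_time(w2, lam, R)
    without = expected_time(w1 + w2, lam, R)
    return with_ckpt < without
```

Because the expected time grows exponentially with the amount of uncheckpointed work, checkpointing wins for long tasks or high failure rates, and loses when the checkpoint cost dominates; extending this reasoning from a chain to general graphs is what makes the problem hard on join graphs.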
Collaboration
French Institute for Research in Computer Science and Automation