Guillaume Aupy
École normale supérieure de Lyon
Publication
Featured research published by Guillaume Aupy.
ieee international conference on high performance computing, data, and analytics | 2012
Guillaume Aupy; Anne Benoit; Yves Robert
We consider a task graph mapped on a set of homogeneous processors. We aim at minimizing the energy consumption while enforcing two constraints: a prescribed bound on the execution time (or makespan), and a reliability threshold. Dynamic voltage and frequency scaling (DVFS) is an approach frequently used to reduce the energy consumption of a schedule, but slowing down the execution of a task to save energy decreases the reliability of the execution. In this work, to improve the reliability of a schedule while reducing the energy consumption, we allow for the re-execution of some tasks. We assess the complexity of the tri-criteria scheduling problem (makespan, reliability, energy) of deciding which tasks to re-execute, and at which speed each execution of a task should be done, with two different speed models: either processors can have arbitrary speeds (CONTINUOUS model), or a processor can run at a finite number of different speeds and change its speed during a computation (VDD-HOPPING model). We propose several novel tri-criteria scheduling heuristics under the continuous speed model, and we evaluate them through a set of simulations. The two best heuristics turn out to be very efficient and complementary.
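The trade-off above can be made concrete with a minimal sketch. The power and reliability models below (cubic dynamic power, and an exponential failure law whose rate grows as the processor slows down) are common assumptions in the DVFS literature, not formulas taken from this paper; all parameter values are hypothetical.

```python
import math

# Hypothetical model parameters (for illustration only).
LAMBDA0 = 1e-5   # nominal failure rate at maximum speed
D = 3.0          # sensitivity of the failure rate to slowdown
S_MIN, S_MAX = 0.5, 1.0

def exec_time(w, s):
    """Time to run w units of work at speed s."""
    return w / s

def energy(w, s):
    """Dynamic energy: power s**3 integrated over time w/s gives w * s**2."""
    return w * s ** 2

def reliability(w, s):
    """Probability that executing w units of work at speed s succeeds:
    the failure rate increases exponentially as speed decreases."""
    lam = LAMBDA0 * 10 ** (D * (S_MAX - s) / (S_MAX - S_MIN))
    return math.exp(-lam * w / s)

def reexec_reliability(w, s1, s2):
    """Success probability when a failed first run at s1 is re-executed at s2."""
    r1, r2 = reliability(w, s1), reliability(w, s2)
    return r1 + (1 - r1) * r2

def expected_energy(w, s1, s2):
    """Expected energy: the first run is always paid, the second only on failure."""
    return energy(w, s1) + (1 - reliability(w, s1)) * energy(w, s2)
```

Under such a model, re-execution lets the first run use a low, energy-efficient speed while the (rarely needed) second run restores the reliability threshold, which is the lever the paper's heuristics exploit.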
pacific rim international symposium on dependable computing | 2013
Guillaume Aupy; Anne Benoit; Thomas Herault; Yves Robert; Frédéric Vivien; Dounia Zaidouni
In this paper, we revisit traditional checkpointing and rollback recovery strategies, with a focus on silent data corruption errors. Unlike fail-stop failures, such latent errors cannot be detected immediately, and a mechanism to detect them must be provided. We consider two models: (i) errors are detected after some delay following a probability distribution (typically, an Exponential distribution), (ii) errors are detected through some verification mechanism. In both cases, we compute the optimal period in order to minimize the waste, i.e., the fraction of time where nodes do not perform useful computations. In practice, only a fixed number of checkpoints can be kept in memory, and the first model may lead to an irrecoverable failure. In this case, we compute the minimum period required for an acceptable risk. For the second model, there is no risk of irrecoverable failure, owing to the verification mechanism, but the corresponding overhead is included in the waste. Finally, both models are instantiated using realistic scenarios and application/architecture parameters.
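The baseline that this analysis refines is the classical first-order optimization of the checkpointing period. A minimal sketch of that baseline (the well-known Young/Daly approximation, not this paper's silent-error formulas):

```python
import math

def waste(T, C, mu):
    """First-order waste of periodic checkpointing: the fraction of time
    lost to taking checkpoints (C/T) plus the fraction lost to re-executing
    work after a failure (on average half a period, hence T/(2*mu))."""
    return C / T + T / (2 * mu)

def optimal_period(C, mu):
    """Period minimizing the first-order waste: T* = sqrt(2*C*mu)."""
    return math.sqrt(2 * C * mu)
```

Silent errors change this picture: because detection is delayed, the last checkpoint may itself be corrupted, which is why the paper must keep several checkpoints in memory and account for the risk of irrecoverable failure.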
international parallel and distributed processing symposium | 2015
Ana Gainaru; Guillaume Aupy; Anne Benoit; Franck Cappello; Yves Robert; Marc Snir
A significant percentage of the computing capacity of large-scale platforms is wasted because of interference incurred by multiple applications that access a shared parallel file system concurrently. One solution to handling I/O bursts on large-scale HPC systems is to absorb them at an intermediate storage layer consisting of burst buffers. However, our analysis of Argonne's Mira system shows that burst buffers cannot prevent congestion at all times. Consequently, I/O performance is dramatically degraded, showing in some cases a decrease in I/O throughput of 67%. In this paper, we analyze the effects of interference on application I/O bandwidth and propose several scheduling techniques to mitigate congestion. We show through extensive experiments that our global I/O scheduler is able to reduce the effects of congestion, even on systems where burst buffers are used, and can increase the overall system throughput by up to 56%. We also show that it outperforms current Mira I/O schedulers.
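A toy example illustrates why globally ordering I/O bursts can help even when the raw bandwidth is fixed. The numbers below are hypothetical (not from the Mira traces), and the model ignores the interference penalties that make concurrent access even worse in practice:

```python
def concurrent_finish_times(volumes, bandwidth):
    """Equal-sized bursts flushed concurrently with fair bandwidth sharing:
    every application sees bandwidth/n, so all finish at the same time."""
    n = len(volumes)
    assert len(set(volumes)) == 1, "exact only for equal volumes"
    return [volumes[0] * n / bandwidth] * n

def exclusive_finish_times(volumes, bandwidth):
    """Bursts scheduled one after the other, each at full bandwidth."""
    times, t = [], 0.0
    for v in volumes:
        t += v / bandwidth
        times.append(t)
    return times
```

With two 100-unit bursts over a 10-unit/s file system, fair sharing finishes both at t=20, while exclusive scheduling finishes them at t=10 and t=20: the same makespan, but a lower average completion time, and in a real parallel file system the avoided interference widens the gap further.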
Journal of Parallel and Distributed Computing | 2014
Guillaume Aupy; Yves Robert; Frédéric Vivien; Dounia Zaidouni
This paper deals with the impact of fault prediction techniques on checkpointing strategies. We extend the classical first-order analysis of Young and Daly in the presence of a fault prediction system, characterized by its recall and its precision. In this framework, we provide optimal algorithms to decide whether and when to take predictions into account, and we derive the optimal value of the checkpointing period. These results allow us to analytically assess the key parameters that impact the performance of fault predictors at very large scale.
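The two parameters that characterize the predictor are standard and easy to state in code; a minimal sketch computing them from event counts (the counts themselves are made up for illustration):

```python
def predictor_quality(predicted_faults, unpredicted_faults, false_alarms):
    """Recall: fraction of actual faults that the predictor announced.
       Precision: fraction of predictions that correspond to a real fault."""
    actual_faults = predicted_faults + unpredicted_faults
    total_predictions = predicted_faults + false_alarms
    recall = predicted_faults / actual_faults
    precision = predicted_faults / total_predictions
    return recall, precision
```

For instance, a predictor that announces 80 of 100 actual faults while raising 40 false alarms has recall 0.8 and precision 2/3; the paper's analysis determines, from exactly these two quantities, whether acting on predictions reduces the overall waste.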
arXiv: Distributed, Parallel, and Cluster Computing | 2013
Guillaume Aupy; Anne Benoit; Thomas Herault; Yves Robert; Jack J. Dongarra
This short paper deals with parallel scientific applications using non-blocking and periodic coordinated checkpointing to enforce resilience. We provide a model and detailed formulas for total execution time and consumed energy. We characterize the optimal period for both objectives, and we assess the range of time/energy trade-offs to be made by instantiating the model with a set of realistic scenarios for Exascale systems. We give a particular emphasis to I/O transfers, because the relative cost of communication is expected to dramatically increase, both in terms of latency and consumed energy, for future Exascale platforms.
pacific rim international symposium on dependable computing | 2013
Guillaume Aupy; Yves Robert; Frédéric Vivien; Dounia Zaidouni
This paper deals with the impact of fault prediction techniques on checkpointing strategies. We consider fault-prediction systems that do not provide exact prediction dates, but instead time intervals during which faults are predicted to strike. These intervals dramatically complicate the analysis of the checkpointing strategies. We propose a new approach based upon two periodic modes, a regular mode outside prediction windows, and a proactive mode inside prediction windows, whenever the size of these windows is large enough. We are able to compute the best period for any size of the prediction windows, thereby deriving the scheduling strategy that minimizes platform waste. In addition, the results of the analytical study are nicely corroborated by a comprehensive set of simulations, which demonstrate the validity of the model and the accuracy of the approach.
arXiv: Data Structures and Algorithms | 2013
Guillaume Aupy; Anne Benoit; Rami G. Melhem; Paul Renaud-Goud; Yves Robert
In this paper, we aim at minimizing the energy consumption when executing a divisible workload under a bound on the total execution time, while resilience is provided through checkpointing. We discuss several variants of this multi-criteria problem. Given the workload, we need to decide how many chunks to use, what are the sizes of these chunks, and at which speed each chunk is executed. Furthermore, since a failure may occur during the execution of a chunk, we also need to decide at which speed a chunk should be re-executed in the event of a failure. The goal is to minimize the expectation of the total energy consumption, while enforcing a deadline on the execution time, that should be met either in expectation (soft deadline), or in the worst case (hard deadline). For each problem instance, we propose either an exact solution, or a function that can be optimized numerically.
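The decision space described above (number of chunks, execution speed, re-execution speed, soft deadline) can be explored numerically with a brute-force sketch. The energy and failure models below are common simplifying assumptions (dynamic energy w*s**2, exponential failures), not the paper's exact formulation, and checkpoint costs are omitted for brevity:

```python
import math

LAM = 1e-4   # hypothetical failure rate per unit of work at unit speed

def chunk_expected_energy(w, s, sigma):
    """One chunk of size w run at speed s, re-executed at sigma on failure."""
    p_fail = 1.0 - math.exp(-LAM * w / s)
    return w * s ** 2 + p_fail * w * sigma ** 2

def chunk_expected_time(w, s, sigma):
    """Expected time: first execution plus re-execution on failure."""
    p_fail = 1.0 - math.exp(-LAM * w / s)
    return w / s + p_fail * w / sigma

def best_equal_chunking(W, deadline, speeds, max_chunks=20):
    """Scan chunk counts and (first-run, re-run) speed pairs; keep the
    cheapest configuration whose expected makespan meets the soft deadline."""
    best = None
    for n in range(1, max_chunks + 1):
        w = W / n
        for s in speeds:
            for sigma in speeds:
                t = n * chunk_expected_time(w, s, sigma)
                e = n * chunk_expected_energy(w, s, sigma)
                if t <= deadline and (best is None or e < best[0]):
                    best = (e, n, s, sigma)
    return best
```

With a loose deadline, such a search settles on the lowest feasible speed and many small chunks, since smaller chunks shrink the expected re-execution cost; the paper replaces this enumeration with exact solutions or one-dimensional numerical optimization.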
theory and practice of algorithms in computer systems | 2011
Leon Atkins; Guillaume Aupy; Daniel G. Cole; Kirk Pruhs
We consider the speed scaling problem where the quality of service objective is deadline feasibility and the power objective is temperature. In the case of batched jobs, we give a simple algorithm to compute the optimal schedule. For general instances, we give a new online algorithm, and obtain an upper bound on the competitive ratio of this algorithm that is an order of magnitude better than the best previously known upper bound for this problem.
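The temperature objective is usually modeled by Newton's law of cooling driven by the power dissipated at the current speed. A minimal simulator under that assumption (the dynamics dT/dt = a*s**alpha - b*T and all constants are generic illustrations, not this paper's exact model):

```python
def simulate_temperature(schedule, dt=0.01, a=1.0, b=0.5, alpha=3.0, T0=0.0):
    """Integrate Newton's law of cooling, dT/dt = a*s(t)**alpha - b*T,
    with a forward-Euler step over a schedule given as (speed, duration)
    pairs; return the peak temperature reached."""
    T, peak = T0, T0
    for s, duration in schedule:
        for _ in range(int(duration / dt)):
            T += dt * (a * s ** alpha - b * T)
            peak = max(peak, T)
    return peak
```

For example, spreading 10 units of work over 20 seconds at constant speed 0.5 reaches a much lower peak temperature than a burst at speed 1 followed by idling, which is the intuition behind temperature-aware (rather than energy-aware) speed scaling.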
european conference on parallel processing | 2013
Guillaume Aupy; Mathieu Faverge; Yves Robert; Jakub Kurzak; Piotr Luszczek; Jack J. Dongarra
This article introduces a new systolic algorithm for QR factorization, and its implementation on a supercomputing cluster of multicore nodes. The algorithm targets a virtual 3D-array and requires only local communications. The implementation of the algorithm uses threads at the node level, and MPI for inter-node communications. The complexity of the implementation is addressed with the PaRSEC software, which takes as input a parametrized dependence graph, which is derived from the algorithm, and only requires the user to decide, at the high level, the allocation of tasks to nodes. We show that the new algorithm exhibits competitive performance with state-of-the-art QR routines on a supercomputer called Kraken, which shows that high-level programming environments, such as PaRSEC, provide a viable alternative to enhance the production of quality software on complex and hierarchical architectures.
international parallel and distributed processing symposium | 2015
Guillaume Aupy; Anne Benoit; Henri Casanova; Yves Robert
We study the scheduling of computational workflows on compute resources that experience exponentially distributed failures. When a failure occurs, rollback and recovery is used to resume the execution from the last checkpointed state. The scheduling problem is to minimize the expected execution time by deciding in which order to execute the tasks in the workflow and whether or not to checkpoint a task after it completes. We give a polynomial-time algorithm for fork graphs and show that the problem is NP-complete with join graphs. Our main result is a polynomial-time algorithm to compute the execution time of a workflow with a specified set of checkpointed tasks. Using this algorithm as a basis, we propose efficient heuristics for solving the scheduling problem. We evaluate these heuristics for representative workflow configurations.
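The core checkpoint-or-not decision can be illustrated for two sequential tasks using the classical expected-completion-time formula for exponential failures. This is a simplified sketch (failures during the checkpoint itself are ignored, and the parameter values are hypothetical), not the paper's workflow algorithm:

```python
import math

def expected_time(w, lam, R=0.0):
    """Expected time to complete w units of work under Poisson failures of
    rate lam, restarting from the last checkpoint after each failure with
    recovery cost R (classical formula for exponentially distributed failures)."""
    return math.exp(lam * R) * (math.exp(lam * w) - 1.0) / lam

def checkpoint_pays_off(w1, w2, lam, C, R=0.0):
    """Should we checkpoint between two sequential tasks?  Without the
    checkpoint, a failure during the second task re-executes both; with it,
    we pay the checkpoint cost C but each task only re-executes itself."""
    with_ckpt = expected_time(w1, lam, R) + C + expected_time(w2, lam, R)
    without = expected_time(w1 + w2, lam, R)
    return with_ckpt < without
```

Because the expected time grows exponentially with the amount of uncheckpointed work, checkpointing wins for long tasks or high failure rates, and loses when the checkpoint cost dominates; extending this reasoning from a chain to general graphs is what makes the problem hard on join graphs.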
Collaboration
French Institute for Research in Computer Science and Automation