Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Philippe Olivier Alexandre Navaux is active.

Publication


Featured research published by Philippe Olivier Alexandre Navaux.


high-performance computer architecture | 2015

Understanding GPU errors on large-scale HPC systems and the implications for system design and operation

Devesh Tiwari; Saurabh Gupta; James H. Rogers; Don Maxwell; Paolo Rech; Sudharshan S. Vazhkudai; Daniel Oliveira; Dave Londo; Nathan DeBardeleben; Philippe Olivier Alexandre Navaux; Luigi Carro; Arthur S. Bland

Increases in graphics hardware performance and improvements in programmability have enabled GPUs to evolve from graphics-specific accelerators into general-purpose computing devices. Titan, the world's second fastest supercomputer for open science in 2014, consists of more than 18,000 GPUs that scientists from various domains such as astrophysics, fusion, climate, and combustion routinely use to run large-scale simulations. Unfortunately, while the performance efficiency of GPUs is well understood, their resilience characteristics in a large-scale computing system have not been fully evaluated. We present a detailed study to provide a thorough understanding of GPU errors on a large-scale GPU-enabled system. Our data was collected from the Titan supercomputer at the Oak Ridge Leadership Computing Facility and a GPU cluster at Los Alamos National Laboratory. We also present results from our extensive neutron-beam tests, conducted at the Los Alamos Neutron Science Center (LANSCE) and at ISIS (Rutherford Appleton Laboratory, UK), to measure the resilience of different generations of GPUs. We present several findings from our field data and neutron-beam experiments, and discuss the implications of our results for future GPU architects, current and future HPC computing facilities, and researchers focusing on GPU resilience.
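
Beam experiments of the kind described above are typically summarized by a device cross section (errors per unit fluence) and a FIT rate at a reference neutron flux. A minimal sketch of that arithmetic, with all input numbers hypothetical rather than taken from the paper:

```python
# Hedged sketch: deriving a cross section and FIT rate from beam-test
# counts. All numeric inputs are hypothetical illustration values.

def cross_section(errors, fluence):
    """Cross section (cm^2) = observed errors / particle fluence (n/cm^2)."""
    return errors / fluence

def fit_rate(sigma, flux_per_hour=13.0):
    """Failures In Time: expected failures per 1e9 device-hours at a
    reference neutron flux (n/cm^2/h); ~13 n/cm^2/h is a commonly used
    sea-level flux in resilience studies."""
    return sigma * flux_per_hour * 1e9

sigma = cross_section(errors=120, fluence=1.5e11)  # 8e-10 cm^2
print(sigma)
print(fit_rate(sigma))  # expected failures per 1e9 device-hours
```

The accelerated beam flux is many orders of magnitude above the natural flux, which is what makes it possible to observe statistically meaningful error counts in hours instead of years.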


ieee international conference on cloud computing technology and science | 2012

High Performance Computing in the cloud: Deployment, performance and cost efficiency

Eduardo Roloff; Matthias Diener; Alexandre Carissimi; Philippe Olivier Alexandre Navaux

High-Performance Computing (HPC) in the cloud has reached the mainstream and is currently a hot topic in the research community and the industry. The attractiveness of the cloud for HPC is the capability to run large applications on powerful, scalable hardware without needing to actually own or maintain this hardware. In this paper, we conduct a detailed comparison of HPC applications running on three cloud providers: Amazon EC2, Microsoft Azure, and Rackspace. We analyze three important characteristics of HPC, namely deployment facilities, performance, and cost efficiency, and compare them to a cluster of machines. For the experiments, we used the well-known NAS Parallel Benchmarks as an example of general scientific HPC applications to examine computational and communication performance. Our results show that HPC applications can run efficiently on the cloud. However, care must be taken when choosing the provider, as the differences between them are large. The best cloud provider depends on the type and behavior of the application, as well as the intended usage scenario. Furthermore, our results show that HPC in the cloud can have a higher performance and cost efficiency than a traditional cluster, by up to 27% and 41%, respectively.
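
Cost efficiency in this setting is essentially performance per dollar spent. A minimal sketch of such a comparison, with hypothetical provider names, prices, and performance numbers (not the paper's measurements):

```python
# Hedged sketch: comparing cost efficiency (performance per dollar) across
# execution platforms. All numbers are hypothetical illustration values.

def cost_efficiency(perf_mops, price_per_hour, runtime_hours):
    """Performance achieved per dollar spent on the run."""
    return perf_mops / (price_per_hour * runtime_hours)

providers = {
    "cluster": cost_efficiency(perf_mops=1000, price_per_hour=2.0, runtime_hours=1.0),
    "cloud_a": cost_efficiency(perf_mops=900,  price_per_hour=1.2, runtime_hours=1.1),
    "cloud_b": cost_efficiency(perf_mops=800,  price_per_hour=1.5, runtime_hours=1.3),
}
best = max(providers, key=providers.get)
print(best, providers[best])  # the slightly slower cloud wins on cost
```

As in the paper's conclusion, the raw-performance winner and the cost-efficiency winner need not coincide; here the hypothetical cloud_a is slower than the cluster but cheaper per unit of work.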


international symposium on computers and communications | 2009

Multi-core aware process mapping and its impact on communication overhead of parallel applications

Eduardo Rocha Rodrigues; Felipe Lopes Madruga; Philippe Olivier Alexandre Navaux; Jairo Panetta

We propose an approach to reduce the execution time of applications with a steady communication pattern on clusters of multi-core processors by leveraging the asymmetry of core communication speeds. In addition to the well-known fact that communication link speeds on a fixed cluster vary with processor selection, we consider one effect of multi-core processor chips: link speeds vary with core selection within a single processor chip. The approach requires measuring link speeds among cluster cores as well as communication volumes and computational loads of the selected application processes. This data is fed into the Dual Recursive Bipartitioning method to obtain close-to-optimal application process placement on cluster cores. We apply this approach to a real-world application, achieving a noticeable execution-time reduction without even recompiling the source code.
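
The core idea of recursive bipartitioning is to split the process set in two while keeping heavy communicators together, and recurse down to single processes that are then laid out on the core hierarchy. The following is a much-simplified greedy sketch of that scheme (the actual Dual Recursive Bipartitioning method, as implemented in SCOTCH, also bipartitions the target architecture); the communication matrix is hypothetical:

```python
# Hedged sketch of recursive bipartitioning for process-to-core mapping.
# comm[i][j] is the communication volume between processes i and j.

def bipartition(procs, comm):
    """Greedy split into two halves, keeping heavy communicators together."""
    half = len(procs) // 2
    group_a = [procs[0]]
    rest = set(procs[1:])
    while len(group_a) < half:
        # add the process that communicates most with the current group
        nxt = max(rest, key=lambda p: sum(comm[p][q] for q in group_a))
        group_a.append(nxt)
        rest.remove(nxt)
    return group_a, sorted(rest)

def map_processes(procs, comm):
    """Recursively bipartition until each group holds one process."""
    if len(procs) == 1:
        return procs
    a, b = bipartition(procs, comm)
    return map_processes(a, comm) + map_processes(b, comm)

# 4 processes: pairs (0,1) and (2,3) communicate heavily.
comm = [[0, 9, 1, 1],
        [9, 0, 1, 1],
        [1, 1, 0, 9],
        [1, 1, 9, 0]]
placement = map_processes([0, 1, 2, 3], comm)  # core i runs placement[i]
print(placement)
```

With cores numbered so that neighbors share faster links (e.g. cores of the same chip are adjacent), the heavy pairs end up on adjacent cores.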


high performance computing and communications | 2010

Evaluating Thread Placement Based on Memory Access Patterns for Multi-core Processors

Matthias Diener; Felipe Lopes Madruga; Eduardo Rocha Rodrigues; Marco Antonio Zanata Alves; Jörg Schneider; Philippe Olivier Alexandre Navaux; Hans-Ulrich Heiss

Process placement is a technique widely used on parallel machines with heterogeneous interconnects to reduce the overall communication time. For instance, two processes which communicate frequently are mapped close to each other. Finding the optimal mapping between threads and cores in a shared-memory environment (for example, OpenMP and Pthreads) is an even more complex task due to implicit communication. In this work, we examine data sharing patterns between threads in different workloads and use those patterns in a similar way as messages are used to map processes in cluster computers. We evaluated our technique on a state-of-the-art multicore processor and achieved moderate improvements in the common case and considerable improvements in some cases, reducing execution time by up to 45%.
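
In shared memory the "messages" are implicit, so the analogue of a communication matrix is a sharing matrix: how many memory regions each pair of threads touches in common. A minimal sketch of that idea, with hypothetical per-thread access sets standing in for real memory traces:

```python
# Hedged sketch: estimating a thread sharing matrix from per-thread sets
# of accessed memory pages, as a proxy for implicit communication.

def sharing_matrix(accesses):
    """accesses[t] = set of pages touched by thread t. Entry (i, j) counts
    pages touched by both threads, i.e. likely data sharing."""
    n = len(accesses)
    return [[len(accesses[i] & accesses[j]) if i != j else 0
             for j in range(n)] for i in range(n)]

accesses = [
    {1, 2, 3, 4},   # thread 0
    {3, 4, 5, 6},   # thread 1: shares pages 3 and 4 with thread 0
    {7, 8},         # thread 2
    {8, 9},         # thread 3: shares page 8 with thread 2
]
m = sharing_matrix(accesses)
print(m[0][1], m[2][3], m[0][2])
```

This matrix can then be fed to the same placement machinery used for explicit message volumes on clusters, which is the parallel the paper draws.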


dependable systems and networks | 2014

Impact of GPUs Parallelism Management on Safety-Critical and HPC Applications Reliability

Paolo Rech; Laércio Lima Pilla; Philippe Olivier Alexandre Navaux; Luigi Carro

Graphics Processing Units (GPUs) offer high computational power but require high scheduling strain to manage parallel processes, which increases the GPU cross section. The results of extensive neutron radiation experiments performed on NVIDIA GPUs confirm this hypothesis. Reducing the application's Degree Of Parallelism (DOP) reduces the scheduling strain, but it also modifies the GPU parallelism management, including memory latency, the number of registers per thread, and processor occupancy, all of which influence the sensitivity of the parallel application. An analysis of the overall GPU radiation sensitivity as a function of the code's DOP is provided, and the most reliable configuration is experimentally detected. Finally, modifying the parallelism management affects not only the GPU cross section but also the code execution time and, thus, the exposure to radiation required to complete the computation. The Mean Workload and Mean Executions Between Failures metrics are introduced to evaluate the workload or the number of executions computed correctly by the GPU on a realistic application.
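
The point of the Mean Workload Between Failures (MWBF) style of metric is to weigh failures against the useful work completed, so a slower but sturdier configuration can beat a faster but more sensitive one. A sketch of that arithmetic with hypothetical numbers (not the paper's measurements):

```python
# Hedged sketch of workload-aware reliability metrics. All numbers are
# hypothetical illustration values.

def mwbf(workload_per_run, runs, failures):
    """Mean workload completed correctly between two failures."""
    return workload_per_run * runs / failures

def mebf(runs, failures):
    """Mean executions between failures."""
    return runs / failures

# Low-DOP configuration: less work per run, but fewer observed failures.
print(mwbf(workload_per_run=50, runs=1000, failures=4))
# High-DOP configuration: more work per run, but more sensitive.
print(mwbf(workload_per_run=80, runs=1000, failures=10))
```

In this hypothetical example the low-DOP configuration completes more correct work between failures despite its lower throughput, which is exactly the kind of trade-off the metric is meant to expose.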


ieee international symposium on parallel & distributed processing, workshops and phd forum | 2011

Using Memory Access Traces to Map Threads and Data on Hierarchical Multi-core Platforms

Eduardo Henrique Molina da Cruz; Marco Antonio Zanata Alves; Alexandre Carissimi; Philippe Olivier Alexandre Navaux; Christiane Pousa Ribeiro; Jean-François Méhaut

In parallel programs, the tasks of a given application must cooperate in order to accomplish the required computation. However, the communication time between the tasks may differ depending on which cores they are executing on and how the memory hierarchy and interconnection are used. The problem is even more important in multi-core machines with NUMA characteristics, since remote accesses impose high overhead, making them more sensitive to thread and data mapping. In this context, process mapping is a technique that provides performance gains by improving the use of resources such as interconnections, main memory and cache memory. The problem of detecting the best mapping is considered NP-Hard. Furthermore, in shared memory environments, there is the additional difficulty of finding the communication pattern, which is implicit and occurs through memory accesses. This work aims to provide a method of static mapping for NUMA architectures which does not require any prior knowledge of the application. Different metrics were adopted, and a heuristic method based on the Edmonds matching algorithm was used to obtain the mapping. In order to evaluate our proposal, we use the NAS Parallel Benchmarks (NPB) and two modern multi-core NUMA machines. Results show performance gains of up to 75% compared to the native scheduler and memory allocator of the operating system.
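
A matching-based mapper pairs up the threads that communicate most and places each pair on cores that share a cache. The paper builds on Edmonds' maximum matching algorithm; the sketch below substitutes a much simpler greedy matching purely for illustration, with a hypothetical weight table:

```python
# Hedged sketch: pairing threads by communication weight, as a stand-in
# for the Edmonds-matching-based heuristic used in the paper.

def greedy_matching(weights):
    """weights[(i, j)] = communication between threads i and j (i < j).
    Repeatedly match the heaviest pair whose endpoints are still free."""
    matched, pairs = set(), []
    for (i, j), w in sorted(weights.items(), key=lambda kv: -kv[1]):
        if i not in matched and j not in matched:
            pairs.append((i, j))
            matched |= {i, j}
    return pairs

weights = {(0, 1): 10, (0, 2): 3, (0, 3): 1,
           (1, 2): 2, (1, 3): 4, (2, 3): 8}
print(greedy_matching(weights))  # pairs to place on cache-sharing cores
```

Unlike this greedy version, Edmonds' algorithm guarantees a maximum-weight matching, which matters when the weight structure is less clear-cut than in this toy example.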


symposium on computer architecture and high performance computing | 2010

A Comparative Analysis of Load Balancing Algorithms Applied to a Weather Forecast Model

Eduardo Rocha Rodrigues; Philippe Olivier Alexandre Navaux; Jairo Panetta; Alvaro Luiz Fazenda; Celso L. Mendes; Laxmikant V. Kalé

Among the many reasons for load imbalance in weather forecasting models, the dynamic imbalance caused by localized variations on the state of the atmosphere is the hardest one to handle. As an example, active thunderstorms may substantially increase load at a certain time step with respect to previous time steps in an unpredictable manner – after all, tracking storms is one of the reasons for running a weather forecasting model. In this paper, we present a comparative analysis of different load balancing algorithms to deal with this kind of load imbalance. We analyze the impact of these strategies on computation and communication and the effects caused by the frequency at which the load balancer is invoked on execution time. This is done without any code modification, employing the concept of processor virtualization, which basically means that the domain is over-decomposed and the unit of rebalance is a sub-domain. With this approach, we were able to reduce the execution time of a full, real-world weather model.
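
With processor virtualization, the balancer's job reduces to reassigning migratable sub-domains by their measured loads. The sketch below uses a classic greedy heaviest-first (LPT) heuristic as a stand-in for the strategies compared in the paper; the loads are hypothetical:

```python
# Hedged sketch of rebalancing over-decomposed sub-domains. Loads are
# hypothetical; the LPT heuristic stands in for the paper's strategies.

def rebalance(subdomain_loads, n_procs):
    """Assign each sub-domain to the currently least-loaded processor,
    heaviest sub-domains first (LPT heuristic)."""
    proc_load = [0.0] * n_procs
    assignment = {}
    for sd, load in sorted(subdomain_loads.items(), key=lambda kv: -kv[1]):
        target = min(range(n_procs), key=lambda p: proc_load[p])
        assignment[sd] = target
        proc_load[target] += load
    return assignment, proc_load

# 8 sub-domains on 2 processors; a "thunderstorm" makes sub-domain 0 heavy.
loads = {0: 9.0, 1: 2.0, 2: 2.0, 3: 2.0, 4: 1.0, 5: 1.0, 6: 1.0, 7: 1.0}
assignment, proc_load = rebalance(loads, n_procs=2)
print(proc_load)  # per-processor load after rebalancing
```

Because the atmosphere evolves, this reassignment has to be repeated; the paper's observation is that how often the balancer is invoked is itself a tuning knob, since each invocation costs measurement and migration time.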


international conference on parallel processing | 2012

A Hierarchical Approach for Load Balancing on Parallel Multi-core Systems

Laércio Lima Pilla; Christiane Pousa Ribeiro; Daniel Cordeiro; Chao Mei; Abhinav Bhatele; Philippe Olivier Alexandre Navaux; François Broquedis; Jean-François Méhaut; Laxmikant V. Kalé

Multi-core compute nodes with non-uniform memory access (NUMA) are now a common architecture in the assembly of large-scale parallel machines. On these machines, in addition to the network communication costs, the memory access costs within a compute node are also asymmetric. Ignoring this can lead to an increase in the data movement costs. Therefore, to fully exploit the potential of these nodes and reduce data access costs, it becomes crucial to have a complete view of the machine topology (i.e. the compute node topology and the interconnection network among the nodes). Furthermore, the parallel application behavior has an important role in determining how to utilize the machine efficiently. In this paper, we propose a hierarchical load balancing approach to improve the performance of applications on parallel multi-core systems. We introduce NucoLB, a topology-aware load balancer that focuses on redistributing work while reducing communication costs among and within compute nodes. NucoLB takes the asymmetric memory access costs present on NUMA multi-core compute nodes, the interconnection network overheads, and the application communication patterns into account in its balancing decisions. We have implemented NucoLB using the Charm++ parallel runtime system and evaluated its performance. Results show that our load balancer improves performance up to 20% when compared to state-of-the-art load balancers on three different NUMA parallel machines.
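
A topology-aware balancer of this kind needs a cost model in which the same message volume costs more the farther apart the communicating tasks sit. A minimal sketch of such a model, with hypothetical distances and volumes (not NucoLB's actual cost function):

```python
# Hedged sketch of a topology-aware communication cost model: volume
# weighted by the distance between the cores running each task pair.
# Distance and volume values are hypothetical.

def comm_cost(volume, placement, distance):
    """Sum over task pairs of (volume) * (distance between their cores)."""
    total = 0
    for (a, b), v in volume.items():
        total += v * distance[placement[a]][placement[b]]
    return total

# distance[i][j]: 1 within a NUMA node, 2 across NUMA nodes, 5 across the network
distance = [[1, 2, 5, 5],
            [2, 1, 5, 5],
            [5, 5, 1, 2],
            [5, 5, 2, 1]]
volume = {(0, 1): 10, (2, 3): 10, (0, 2): 1}
near = {0: 0, 1: 1, 2: 2, 3: 3}   # heavy pairs stay within a node
far  = {0: 0, 1: 2, 2: 1, 3: 3}   # heavy pairs split across the network
print(comm_cost(volume, near, distance), comm_cost(volume, far, distance))
```

The balancer's job is then to pick, among load-balanced placements, one with low total cost under this hierarchy-aware distance, rather than treating all inter-core communication as equally expensive.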


Future Generation Computer Systems | 2010

Triva: Interactive 3D visualization for performance analysis of parallel applications

Lucas Mello Schnorr; Guillaume Huard; Philippe Olivier Alexandre Navaux

The successful execution of parallel applications in grid infrastructures depends directly on a performance analysis that takes into account the grid characteristics, such as the network topology and resources location. This paper presents Triva, a software analysis tool that implements a novel technique to visualize the behavior of parallel applications. The proposed technique explores 3D graphics to show the application behavior together with a description of the resources, highlighting communication patterns, the network topology and a visual representation of a logical organization of the resources. We have used a real grid infrastructure to execute and trace applications composed of thousands of processes.


international conference on parallel architectures and compilation techniques | 2014

kMAF: automatic kernel-level management of thread and data affinity

Matthias Diener; Eduardo Henrique Molina da Cruz; Philippe Olivier Alexandre Navaux; Anselm Busse; Hans-Ulrich Heiß

One of the main challenges for parallel architectures is the increasing complexity of the memory hierarchy, which consists of several levels of private and shared caches, as well as interconnections between separate memories in NUMA machines. To make full use of this hierarchy, it is necessary to improve the locality of memory accesses by reducing accesses to remote caches and memories, and using local ones instead. Two techniques can be used to increase the memory access locality: executing threads and processes that access shared data close to each other in the memory hierarchy (thread affinity), and placing the memory pages they access on the NUMA node they are executing on (data affinity). Most related work in this area focuses on either thread or data affinity, but not both, which limits the improvements. Other mechanisms require expensive operations, such as memory access traces or binary analysis, require changes to hardware or work only on specific parallel APIs. In this paper, we introduce kMAF, a mechanism that automatically manages thread and data affinity on the kernel level. The memory access behavior of the running application is determined during its execution by analyzing its page faults. This information is used by kMAF to migrate threads and memory pages, such that the overall memory access locality is optimized. Extensive evaluation with 27 benchmarks from 4 benchmark suites shows substantial performance improvements, with results close to an oracle mechanism. Execution time was reduced by up to 35.7% (13.8% on average), while energy efficiency was improved by up to 34.6% (9.3% on average).
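
The kernel-level trick is that page faults already reveal which thread touches which page. A user-space sketch of the page-placement half of that idea, with hypothetical fault samples and a hypothetical thread-to-node pinning (the real kMAF operates inside the kernel on actual fault streams):

```python
# Hedged sketch: inferring NUMA page placement from page-fault samples.
# Fault samples and the thread-to-node mapping are hypothetical.

from collections import Counter, defaultdict

def page_homes(fault_samples, thread_node):
    """fault_samples: iterable of (thread, page) fault events. Returns
    page -> NUMA node from which most faults on that page originated,
    i.e. where the page should be placed for data affinity."""
    per_page = defaultdict(Counter)
    for thread, page in fault_samples:
        per_page[page][thread_node[thread]] += 1
    return {page: nodes.most_common(1)[0][0] for page, nodes in per_page.items()}

thread_node = {0: 0, 1: 0, 2: 1, 3: 1}           # threads pinned per node
faults = [(0, 0xA), (1, 0xA), (2, 0xA),          # page 0xA: mostly node 0
          (2, 0xB), (3, 0xB), (0, 0xB), (3, 0xB)]  # page 0xB: mostly node 1
print(page_homes(faults, thread_node))
```

The same per-thread fault tallies double as a sharing estimate for the thread-affinity half: threads that fault on the same pages are candidates to run close together, which is why handling both affinities from one data source is natural here.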

Collaboration


Dive into Philippe Olivier Alexandre Navaux's collaborations.

Top Co-Authors

Matthias Diener (Universidade Federal do Rio Grande do Sul)
Eduardo Henrique Molina da Cruz (Universidade Federal do Rio Grande do Sul)
Marco Antonio Zanata Alves (Universidade Federal do Rio Grande do Sul)
Paolo Rech (Universidade Federal do Rio Grande do Sul)
Luigi Carro (Universidade Federal do Rio Grande do Sul)
Lucas Mello Schnorr (Universidade Federal do Rio Grande do Sul)
Nicolas Maillard (Universidade Federal do Rio Grande do Sul)
Jairo Panetta (National Institute for Space Research)
Tiarajú Asmuz Diverio (Universidade Federal do Rio Grande do Sul)