Christiane Pousa Ribeiro
University of Grenoble
Publication
Featured research published by Christiane Pousa Ribeiro.
symposium on computer architecture and high performance computing | 2009
Christiane Pousa Ribeiro; Jean-François Méhaut; Alexandre Carissimi; Márcio Castro; Luiz Gustavo Fernandes
Currently, parallel platforms based on large-scale hierarchical shared-memory multiprocessors with Non-Uniform Memory Access (NUMA) are becoming a trend in scientific High Performance Computing (HPC). Due to their memory access constraints, these platforms require very careful data distribution. Many solutions have been proposed to resolve this issue. However, most of these solutions did not include optimizations for numerical scientific data (array data structures) or address portability issues. Besides, these solutions provide a restricted set of memory policies to deal with data placement. In this paper, we describe a user-level interface named Memory Affinity interface (MAi), which allows memory affinity control on Linux-based cache-coherent NUMA (ccNUMA) platforms. Its main goals are fine data control, flexibility and portability. The performance of MAi is evaluated on three ccNUMA platforms using numerical scientific HPC applications, the NAS Parallel Benchmarks and a Geophysics application. The results show important gains (up to 31%) when compared to the default Linux solution.
ieee international symposium on parallel & distributed processing, workshops and phd forum | 2011
Eduardo Henrique Molina da Cruz; Marco Antonio Zanata Alves; Alexandre Carissimi; Philippe Olivier Alexandre Navaux; Christiane Pousa Ribeiro; Jean-François Méhaut
In parallel programs, the tasks of a given application must cooperate in order to accomplish the required computation. However, the communication time between the tasks may differ depending on which cores they execute on and how the memory hierarchy and interconnection are used. The problem is even more important in multi-core machines with NUMA characteristics, since remote accesses impose high overhead, making these machines more sensitive to thread and data mapping. In this context, process mapping is a technique that provides performance gains by improving the use of resources such as interconnections, main memory and cache memory. The problem of detecting the best mapping is considered NP-Hard. Furthermore, in shared memory environments, there is an additional difficulty of finding the communication pattern, which is implicit and occurs through memory accesses. This work aims to provide a method for static mapping for NUMA architectures which does not require any prior knowledge of the application. Different metrics were adopted and a heuristic method based on the Edmonds matching algorithm was used to obtain the mapping. In order to evaluate our proposal, we use the NAS Parallel Benchmarks (NPB) and two modern multi-core NUMA machines. Results show performance gains of up to 75% compared to the native scheduler and memory allocator of the operating system.
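To make the idea of matching-based mapping concrete, the following is a minimal sketch, not the method from the paper. It assumes a hypothetical symmetric communication matrix `comm` (amount of data shared between each pair of threads) and pairs threads so that the communication kept inside each pair, for example on two cores sharing a cache, is maximized. For simplicity it uses an exhaustive search, which is only practical for a handful of threads, as a stand-in for the Edmonds matching algorithm mentioned in the abstract. The names `best_pairing` and `comm` are illustrative assumptions.

```python
def best_pairing(comm):
    """Return (total, pairs) maximizing the communication kept inside pairs.

    comm is a symmetric matrix: comm[i][j] is the (hypothetical) amount of
    communication between threads i and j. Exhaustive search, exponential
    in the number of threads; a stand-in for a real matching algorithm.
    """
    n = len(comm)
    if n % 2 != 0:
        raise ValueError("an even number of threads is required")

    def solve(remaining):
        # remaining is a tuple of thread ids that still need a partner
        if not remaining:
            return 0, []
        first, rest = remaining[0], remaining[1:]
        best = (-1, [])
        for i, partner in enumerate(rest):
            sub_total, sub_pairs = solve(rest[:i] + rest[i + 1:])
            total = comm[first][partner] + sub_total
            if total > best[0]:
                best = (total, [(first, partner)] + sub_pairs)
        return best

    return solve(tuple(range(n)))


# Hypothetical communication matrix for four threads.
comm = [
    [0, 10, 1, 2],
    [10, 0, 3, 1],
    [1, 3, 0, 8],
    [2, 1, 8, 0],
]
total, pairs = best_pairing(comm)
print(total, pairs)
```

In this example, threads 0 and 1 communicate heavily, as do threads 2 and 3, so the search places each of those pairs together. A hierarchical mapping could repeat the same step on groups of pairs for each further level of the memory hierarchy.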
international conference on parallel processing | 2012
Laércio Lima Pilla; Christiane Pousa Ribeiro; Daniel Cordeiro; Chao Mei; Abhinav Bhatele; Philippe Olivier Alexandre Navaux; François Broquedis; Jean-François Méhaut; Laxmikant V. Kalé
Multi-core compute nodes with non-uniform memory access (NUMA) are now a common architecture in the assembly of large-scale parallel machines. On these machines, in addition to the network communication costs, the memory access costs within a compute node are also asymmetric. Ignoring this can lead to an increase in the data movement costs. Therefore, to fully exploit the potential of these nodes and reduce data access costs, it becomes crucial to have a complete view of the machine topology (i.e. the compute node topology and the interconnection network among the nodes). Furthermore, the parallel application behavior has an important role in determining how to utilize the machine efficiently. In this paper, we propose a hierarchical load balancing approach to improve the performance of applications on parallel multi-core systems. We introduce NucoLB, a topology-aware load balancer that focuses on redistributing work while reducing communication costs among and within compute nodes. NucoLB takes the asymmetric memory access costs present on NUMA multi-core compute nodes, the interconnection network overheads, and the application communication patterns into account in its balancing decisions. We have implemented NucoLB using the Charm++ parallel runtime system and evaluated its performance. Results show that our load balancer improves performance up to 20% when compared to state-of-the-art load balancers on three different NUMA parallel machines.
ieee international conference on high performance computing, data, and analytics | 2011
Márcio Castro; Luís Fabrício Wanderley Góes; Christiane Pousa Ribeiro; Murray Cole; Marcelo Cintra; Jean-François Méhaut
Thread mapping has been extensively used as a technique to efficiently exploit memory hierarchy on modern chip-multiprocessors. It places threads on cores in order to amortize memory latency and/or to reduce memory contention. However, efficient thread mapping relies upon matching application behavior with system characteristics. In particular, Software Transactional Memory (STM) applications introduce another dimension due to their runtime system support. Existing STM systems implement several conflict detection and resolution mechanisms, which leads STM applications to behave differently for each combination of these mechanisms. In this paper we propose a machine learning-based approach to automatically infer a suitable thread mapping strategy for transactional memory applications. First, we profile several STM applications from the STAMP benchmark suite considering application, STM system and platform features to build a set of input instances. Then, this data feeds a machine learning algorithm, which produces a decision tree able to predict the most suitable thread mapping strategy for new unobserved instances. Results show that our approach improves performance up to 18.46% compared to the worst case and up to 6.37% over the Linux default thread mapping strategy.
Future Generation Computer Systems | 2014
Laércio Lima Pilla; Christiane Pousa Ribeiro; Pierre Coucheney; François Broquedis; Bruno Gaujal; Philippe Olivier Alexandre Navaux; Jean-François Méhaut
In this paper, we present a topology-aware load balancing algorithm for parallel multi-core machines and its proof of asymptotic convergence to an optimal solution. The algorithm, named HwTopoLB, aims to improve the application performance by reducing core idleness and communication delays. HwTopoLB was designed taking into account the properties of current parallel systems composed of multi-core compute nodes, namely their network interconnection, and their complex and hierarchical core topology. The latter comprises multiple levels of cache, and a memory subsystem with NUMA design. These systems provide high processing power at the expense of asymmetric communication costs, which can hamper the performance of parallel applications depending on their communication patterns if ignored. Our load balancing algorithm models asymmetries in terms of latencies and bandwidths, representing the distances and communication costs among hardware components. We have implemented HwTopoLB using the Charm++ Parallel Runtime System and evaluated its performance with two different benchmarks and one application. Our experimental results with HwTopoLB exhibit scalability over clustered multi-core compute nodes, and average performance improvements of 23% over execution without load balancers and 19% over the existing load balancing strategies on different multi-core systems.
We propose a topology-aware load balancing algorithm for multi-core machines. The algorithm is demonstrated to converge asymptotically to the optimal solution. We model distances among hardware components in terms of latency and bandwidth. Topology-aware load balancing algorithm shows scalable performance over 256 cores.
ieee international conference on high performance computing data and analytics | 2010
Christiane Pousa Ribeiro; Márcio Castro; Jean-François Méhaut; Alexandre Carissimi
In numerical scientific High Performance Computing (HPC), Non-Uniform Memory Access (NUMA) platforms are now commonplace. On such platforms, memory affinity management remains an important concern in order to overcome the memory wall problem. Prior solutions have presented some drawbacks such as machine dependency and a limited set of memory policies. This paper introduces Minas, a framework which provides either explicit or automatic memory affinity management with architecture abstraction for ccNUMAs. We evaluate our solution on two ccNUMA platforms using two geophysics parallel applications. The results show some performance improvements in comparison with other solutions available for Linux.
international conference on parallel and distributed systems | 2012
Laércio Lima Pilla; Philippe Olivier Alexandre Navaux; Christiane Pousa Ribeiro; Pierre Coucheney; François Broquedis; Bruno Gaujal; Jean-François Méhaut
Current multi-core machines feature a complex and hierarchical core topology, multiple levels of cache and a memory subsystem with NUMA design. Although this design provides high processing power to parallel machines, it comes at the cost of asymmetric memory access latencies. Depending on the parallel application communication patterns, this asymmetry may reduce the overall performance of the system. Therefore, to achieve scalable performance in this environment, it becomes crucial to exploit the machine architecture while taking into account the application communication patterns. In this paper, we introduce a topology-aware load balancing algorithm named HWTOPOLB. It combines the machine topology characteristics with the communication patterns of the application to equalize the application load on the available cores while reducing latencies. We also present the proof that the algorithm is asymptotically optimal (Theorem 1). We have implemented our load balancing algorithm using the CHARM++ Parallel System and analyzed its performance using three different benchmarks. Our experimental results show that HWTOPOLB can achieve average performance improvements of 24% when compared to existing load balancing strategies on three different multi-core machines.
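The trade-off between equalizing load and reducing communication latency can be illustrated with a small greedy sketch. This is not the HWTOPOLB algorithm itself, whose details are not given here; it is a generic illustration under stated assumptions. The inputs `task_loads`, `comm` (messages exchanged between tasks), `latency` (cost between cores) and the `weight` parameter are all hypothetical.

```python
def greedy_balance(task_loads, comm, latency, weight=1.0):
    """Place tasks, heaviest first, on the core with the lowest estimated cost.

    Estimated cost of a core = its load after receiving the task
    + weight * (communication with already placed tasks, scaled by the
    latency between the candidate core and the cores of those tasks).
    """
    n_cores = len(latency)
    core_load = [0.0] * n_cores
    placement = {}  # task id -> core id

    for task in sorted(range(len(task_loads)), key=lambda t: -task_loads[t]):
        def cost(core):
            comm_cost = sum(
                comm[task][other] * latency[core][placement[other]]
                for other in placement
            )
            return core_load[core] + task_loads[task] + weight * comm_cost

        core = min(range(n_cores), key=cost)
        placement[task] = core
        core_load[core] += task_loads[task]

    return placement, core_load


# Two cores; accessing the other core costs 1, staying local costs 0.
latency = [[0, 1], [1, 0]]
loads = [4, 4, 2, 2]

# Without communication, the load is simply equalized.
no_comm = [[0] * 4 for _ in range(4)]
_, balanced = greedy_balance(loads, no_comm, latency)

# With heavy communication between tasks 0 and 1, they are co-located.
heavy = [[0, 5, 0, 0], [5, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
placement, skewed = greedy_balance(loads, heavy, latency)
print(balanced, placement, skewed)
```

With no communication the two cores end up equally loaded. Once tasks 0 and 1 exchange many messages, the sketch accepts some load imbalance to keep them on the same core, which is the kind of decision a topology-aware balancer has to weigh.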
ieee international symposium on parallel distributed processing workshops and phd forum | 2010
Christiane Pousa Ribeiro; Jean-François Méhaut; Alexandre Carissimi
Nowadays, on Multi-core Multiprocessors with Hierarchical Memory (Non-Uniform Memory Access (NUMA) characteristics), the number of cores accessing memory banks is considerably high. Such accesses produce stress on the memory banks, generating load-balancing issues, memory contention and remote accesses. In this context, how to manage memory accesses in an efficient fashion remains an important concern. To reduce memory access costs, developers have to manage data placement in their applications, ensuring memory affinity. The problem is: how to guarantee memory affinity for different applications and NUMA platforms while ensuring efficiency, portability, minimal or no source code changes (transparency) and fine control of memory access patterns? In this thesis, our research has led to the proposal of Minas: an efficient and portable memory affinity management framework for NUMA platforms. Minas provides both explicit and automatic memory affinity management with good performance, architecture abstraction, minimal or no application source code modifications and fine control. We have evaluated its efficiency and portability by performing experiments with numerical scientific HPC applications on NUMA platforms. The results have been compared with other solutions to manage memory affinity.
parallel computing | 2008
Fabrice Dupros; Christiane Pousa Ribeiro; Alexandre Carissimi; Jean-François Méhaut
Simulation of large-scale seismic wave propagation is an important tool in seismology for efficient strong motion analysis and risk mitigation. Being particularly CPU-consuming, this three-dimensional problem makes use of parallel computing to improve the performance and the accuracy of the simulations. The trend in parallel computing is to increase the number of cores available at the shared-memory level with possible non-uniform cost of memory accesses. We therefore need to consider new approaches more suitable to such parallel systems. In this paper, we first report on the impact of memory affinity on the parallel performance of seismic simulations. We introduce a methodology combining efficient thread scheduling and careful data placement to overcome the limitations coming from both the parallel algorithm and the memory hierarchy. The MAi (Memory Affinity interface) is used to smoothly adapt the memory policy to the underlying architecture. We evaluate our methodology on computing nodes with different NUMA characteristics. A maximum gain of 53% is reported in comparison with a classical OpenMP implementation.
parallel computing | 2009
Fabrice Dupros; Christiane Pousa Ribeiro; Alexandre Carissimi; Jean-François Méhaut
Collaboration
Philippe Olivier Alexandre Navaux
Universidade Federal do Rio Grande do Sul
Carlos Augusto Paiva da Silva Martins
Pontifícia Universidade Católica de Minas Gerais
Eduardo Henrique Molina da Cruz
Universidade Federal do Rio Grande do Sul