Christiane Pousa Ribeiro

Publication

Featured research published by Christiane Pousa Ribeiro.

symposium on computer architecture and high performance computing | 2009

Memory Affinity for Hierarchical Shared Memory Multiprocessors

Christiane Pousa Ribeiro; Jean-François Méhaut; Alexandre Carissimi; Márcio Castro; Luiz Gustavo Fernandes

Parallel platforms based on large-scale hierarchical shared memory multiprocessors with Non-Uniform Memory Access (NUMA) are becoming a trend in scientific High Performance Computing (HPC). Due to their memory access constraints, these platforms require very careful data distribution. Many solutions have been proposed to resolve this issue. However, most of them do not address optimizations for numerical scientific data (array data structures) or portability. Moreover, they provide only a restricted set of memory policies for data placement. In this paper, we describe a user-level interface named Memory Affinity interface (MAi), which allows memory affinity control on Linux-based cache-coherent NUMA (ccNUMA) platforms. Its main goals are fine-grained data control, flexibility and portability. The performance of MAi is evaluated on three ccNUMA platforms using numerical scientific HPC applications, the NAS Parallel Benchmarks and a Geophysics application. The results show important gains (up to 31%) when compared to the Linux default solution.
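As a rough illustration of what a memory policy for array data decides, the sketch below maps the pages of an array to NUMA nodes under two generic placement schemes. It is a toy model, not MAi's interface: the function name `place_pages` and the policy names `block` and `cyclic` are assumptions made for this example.

```python
def place_pages(num_pages, num_nodes, policy):
    """Return the NUMA node chosen for each page under a toy policy.

    Hypothetical helper for illustration only; it models the decision,
    it does not move any memory.
    """
    if policy == "block":
        # Contiguous blocks: each node holds one slice of the array.
        block = -(-num_pages // num_nodes)  # ceiling division
        return [page // block for page in range(num_pages)]
    if policy == "cyclic":
        # Round-robin: consecutive pages land on consecutive nodes.
        return [page % num_nodes for page in range(num_pages)]
    raise ValueError(f"unknown policy: {policy}")


if __name__ == "__main__":
    print(place_pages(8, 4, "block"))
    print(place_pages(8, 4, "cyclic"))
```

A block placement tends to suit threads that each work on their own slice of the array, while a cyclic placement spreads accesses across memory banks when the access pattern is irregular.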

ieee international symposium on parallel & distributed processing, workshops and phd forum | 2011

Using Memory Access Traces to Map Threads and Data on Hierarchical Multi-core Platforms

Eduardo Henrique Molina da Cruz; Marco Antonio Zanata Alves; Alexandre Carissimi; Philippe Olivier Alexandre Navaux; Christiane Pousa Ribeiro; Jean-François Méhaut

In parallel programs, the tasks of a given application must cooperate in order to accomplish the required computation. However, the communication time between tasks may differ depending on the cores on which they execute and on how the memory hierarchy and interconnection are used. The problem is even more important in multi-core machines with NUMA characteristics, since remote accesses impose high overhead, making these machines more sensitive to thread and data mapping. In this context, process mapping is a technique that provides performance gains by improving the use of resources such as interconnections, main memory and cache memory. The problem of detecting the best mapping is considered NP-Hard. Furthermore, in shared memory environments, there is the additional difficulty of finding the communication pattern, which is implicit and occurs through memory accesses. This work aims to provide a method for static mapping on NUMA architectures that does not require any prior knowledge of the application. Different metrics were adopted, and a heuristic method based on the Edmonds matching algorithm was used to obtain the mapping. To evaluate our proposal, we use the NAS Parallel Benchmarks (NPB) and two modern multi-core NUMA machines. Results show performance gains of up to 75% compared to the native scheduler and memory allocator of the operating system.
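The idea of recovering an implicit communication pattern from memory accesses can be sketched as follows. The example counts, for every pair of threads, how many memory blocks both touched, then pairs up the threads that share the most. The pairing step here is a simple greedy stand-in, not the Edmonds matching algorithm used in the paper, and the trace format and function names are assumptions made for this illustration.

```python
from collections import defaultdict
from itertools import combinations


def communication_matrix(trace, num_threads):
    """Count, for each pair of threads, the memory blocks both accessed.

    `trace` is an iterable of (thread_id, block_id) records (assumed format).
    """
    accessed_by = defaultdict(set)
    for thread, block in trace:
        accessed_by[block].add(thread)
    comm = [[0] * num_threads for _ in range(num_threads)]
    for threads in accessed_by.values():
        for a, b in combinations(sorted(threads), 2):
            comm[a][b] += 1
            comm[b][a] += 1
    return comm


def greedy_pairs(comm):
    """Pair threads greedily by decreasing shared-block count.

    A simplification: a maximum weight matching could find better pairs.
    """
    n = len(comm)
    candidates = sorted(
        ((comm[a][b], a, b) for a, b in combinations(range(n), 2)),
        key=lambda item: (-item[0], item[1], item[2]),
    )
    paired, pairs = set(), []
    for _, a, b in candidates:
        if a not in paired and b not in paired:
            pairs.append((a, b))
            paired.update((a, b))
    return pairs


if __name__ == "__main__":
    trace = [(0, "x"), (1, "x"), (0, "y"), (1, "y"), (2, "z"), (3, "z"), (1, "z")]
    comm = communication_matrix(trace, 4)
    print(greedy_pairs(comm))
```

Each resulting pair is a candidate for placement on cores that share a cache or a NUMA node, so that their shared data stays close to both threads.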

international conference on parallel processing | 2012

A Hierarchical Approach for Load Balancing on Parallel Multi-core Systems

Laércio Lima Pilla; Christiane Pousa Ribeiro; Daniel Cordeiro; Chao Mei; Abhinav Bhatele; Philippe Olivier Alexandre Navaux; François Broquedis; Jean-François Méhaut; Laxmikant V. Kalé

Multi-core compute nodes with non-uniform memory access (NUMA) are now a common architecture in the assembly of large-scale parallel machines. On these machines, in addition to the network communication costs, the memory access costs within a compute node are also asymmetric. Ignoring this asymmetry can increase data movement costs. Therefore, to fully exploit the potential of these nodes and reduce data access costs, it becomes crucial to have a complete view of the machine topology (i.e. the compute node topology and the interconnection network among the nodes). Furthermore, the parallel application behavior has an important role in determining how to utilize the machine efficiently. In this paper, we propose a hierarchical load balancing approach to improve the performance of applications on parallel multi-core systems. We introduce NucoLB, a topology-aware load balancer that focuses on redistributing work while reducing communication costs among and within compute nodes. NucoLB takes the asymmetric memory access costs present on NUMA multi-core compute nodes, the interconnection network overheads, and the application communication patterns into account in its balancing decisions. We have implemented NucoLB using the Charm++ parallel runtime system and evaluated its performance. Results show that our load balancer improves performance by up to 20% when compared to state-of-the-art load balancers on three different NUMA parallel machines.
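A balancing decision that weighs both load and communication distance might look like the sketch below. It is a generic illustration under stated assumptions, not NucoLB's actual cost function or the Charm++ API: the function `pick_core`, the weight `alpha`, and the distance table are all hypothetical.

```python
def pick_core(task_load, task_comm, core_loads, placement, distance, alpha=1.0):
    """Choose the core minimizing load plus distance-weighted communication.

    task_comm maps a peer task to the volume exchanged with it;
    placement maps already-placed tasks to their cores;
    distance[a][b] is a relative access cost between cores a and b;
    alpha (hypothetical knob) weights communication against load.
    """
    def cost(core):
        comm = sum(
            volume * distance[core][placement[peer]]
            for peer, volume in task_comm.items()
            if peer in placement
        )
        return core_loads[core] + task_load + alpha * comm

    return min(range(len(core_loads)), key=cost)


if __name__ == "__main__":
    # Two NUMA nodes with two cores each: 0 = same core, 1 = same node, 3 = remote node.
    distance = [
        [0, 1, 3, 3],
        [1, 0, 3, 3],
        [3, 3, 0, 1],
        [3, 3, 1, 0],
    ]
    core_loads = [5, 2, 1, 1]
    placement = {"a": 0}  # task "a" already runs on core 0
    print(pick_core(1, {"a": 2}, core_loads, placement, distance))
    print(pick_core(1, {"a": 2}, core_loads, placement, distance, alpha=0.0))
```

With communication taken into account, the task lands on core 1, next to its peer on the same NUMA node. With `alpha=0.0` the choice is driven by load alone and the task goes to the lightly loaded core 2 on the remote node.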

ieee international conference on high performance computing, data, and analytics | 2011

A machine learning-based approach for thread mapping on transactional memory applications

Márcio Castro; Luís Fabrício Wanderley Góes; Christiane Pousa Ribeiro; Murray Cole; Marcelo Cintra; Jean-François Méhaut

Thread mapping has been extensively used as a technique to efficiently exploit the memory hierarchy on modern chip-multiprocessors. It places threads on cores in order to amortize memory latency and/or to reduce memory contention. However, efficient thread mapping relies upon matching application behavior with system characteristics. In particular, Software Transactional Memory (STM) applications introduce another dimension due to their runtime system support. Existing STM systems implement several conflict detection and resolution mechanisms, which leads STM applications to behave differently for each combination of these mechanisms. In this paper we propose a machine learning-based approach to automatically infer a suitable thread mapping strategy for transactional memory applications. First, we profile several STM applications from the STAMP benchmark suite considering application, STM system and platform features to build a set of input instances. Then, this data feeds a machine learning algorithm, which produces a decision tree able to predict the most suitable thread mapping strategy for new unobserved instances. Results show that our approach improves performance by up to 18.46% compared to the worst case and by up to 6.37% over the Linux default thread mapping strategy.

Future Generation Computer Systems | 2014

A topology-aware load balancing algorithm for clustered hierarchical multi-core machines

Laércio Lima Pilla; Christiane Pousa Ribeiro; Pierre Coucheney; François Broquedis; Bruno Gaujal; Philippe Olivier Alexandre Navaux; Jean-François Méhaut

In this paper, we present a topology-aware load balancing algorithm for parallel multi-core machines and its proof of asymptotic convergence to an optimal solution. The algorithm, named HwTopoLB, aims to improve the application performance by reducing core idleness and communication delays. HwTopoLB was designed taking into account the properties of current parallel systems composed of multi-core compute nodes, namely their network interconnection, and their complex and hierarchical core topology. The latter comprises multiple levels of cache, and a memory subsystem with NUMA design. These systems provide high processing power at the expense of asymmetric communication costs, which can hamper the performance of parallel applications depending on their communication patterns if ignored. Our load balancing algorithm models asymmetries in terms of latencies and bandwidths, representing the distances and communication costs among hardware components. We have implemented HwTopoLB using the Charm++ Parallel Runtime System and evaluated its performance with two different benchmarks and one application. Our experimental results with HwTopoLB exhibit scalability over clustered multi-core compute nodes, and average performance improvements of 23% over execution without load balancers and 19% over the existing load balancing strategies on different multi-core systems.

Highlights: We propose a topology-aware load balancing algorithm for multi-core machines. The algorithm is demonstrated to converge asymptotically to the optimal solution. We model distances among hardware components in terms of latency and bandwidth. Topology-aware load balancing algorithm shows scalable performance over 256 cores.
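Modeling distance in terms of latency and bandwidth can be illustrated with the common linear cost model, in which moving a message costs a fixed latency plus its size divided by the bandwidth. The sketch below is an assumption-laden illustration, not HwTopoLB's model: the level names and all numeric values in `LINKS` are made up for the example.

```python
# Hypothetical (latency in seconds, bandwidth in bytes/second) per topology level.
LINKS = {
    "shared_cache": (1e-8, 50e9),
    "same_numa_node": (1e-7, 20e9),
    "remote_numa_node": (3e-7, 10e9),
    "network": (2e-6, 5e9),
}


def transfer_time(num_bytes, level):
    """Estimate the time to move num_bytes across one topology level."""
    latency, bandwidth = LINKS[level]
    return latency + num_bytes / bandwidth


def communication_cost(messages):
    """Sum the estimated times of (num_bytes, level) messages."""
    return sum(transfer_time(size, level) for size, level in messages)


if __name__ == "__main__":
    for level in LINKS:
        print(level, transfer_time(4096, level))
```

Under such a model, small messages are dominated by latency and large ones by bandwidth, which is why both terms matter when comparing candidate placements.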

ieee international conference on high performance computing data and analytics | 2010

Improving memory affinity of geophysics applications on NUMA platforms using minas

Christiane Pousa Ribeiro; Márcio Castro; Jean-François Méhaut; Alexandre Carissimi

In numerical scientific High Performance Computing (HPC), Non-Uniform Memory Access (NUMA) platforms are now commonplace. On such platforms, memory affinity management remains an important concern in order to overcome the memory wall problem. Prior solutions have presented some drawbacks, such as machine dependency and a limited set of memory policies. This paper introduces Minas, a framework which provides either explicit or automatic memory affinity management with architecture abstraction for ccNUMAs. We evaluate our solution on two ccNUMA platforms using two geophysics parallel applications. The results show some performance improvements in comparison with other solutions available for Linux.

international conference on parallel and distributed systems | 2012

Asymptotically Optimal Load Balancing for Hierarchical Multi-Core Systems

Laércio Lima Pilla; Philippe Olivier Alexandre Navaux; Christiane Pousa Ribeiro; Pierre Coucheney; François Broquedis; Bruno Gaujal; Jean-François Méhaut

Current multi-core machines feature a complex and hierarchical core topology, multiple levels of cache and a memory subsystem with a NUMA design. Although this design provides high processing power to parallel machines, it comes at the cost of asymmetric memory access latencies. Depending on the parallel application communication patterns, this asymmetry may reduce the overall performance of the system. Therefore, to achieve scalable performance in this environment, it becomes crucial to exploit the machine architecture while taking into account the application communication patterns. In this paper, we introduce a topology-aware load balancing algorithm named HWTOPOLB. It combines the machine topology characteristics with the communication patterns of the application to equalize the application load on the available cores while reducing latencies. We also present the proof that the algorithm is asymptotically optimal (Theorem 1). We have implemented our load balancing algorithm using the CHARM++ Parallel System and analyzed its performance using three different benchmarks. Our experimental results show that HWTOPOLB can achieve average performance improvements of 24% when compared to existing load balancing strategies on three different multi-core machines.

ieee international symposium on parallel distributed processing workshops and phd forum | 2010

Memory affinity management for numerical scientific applications over Multi-core Multiprocessors with Hierarchical Memory

Christiane Pousa Ribeiro; Jean-François Méhaut; Alexandre Carissimi

On Multi-core Multiprocessors with Hierarchical Memory, which have Non-Uniform Memory Access (NUMA) characteristics, the number of cores accessing memory banks is considerably high. Such accesses put stress on the memory banks, generating load-balancing issues, memory contention and remote accesses. In this context, how to manage memory accesses efficiently remains an important concern. To reduce memory access costs, developers have to manage data placement in their applications to ensure memory affinity. The problem is: how to guarantee memory affinity for different applications and NUMA platforms while ensuring efficiency, portability, minimal or no source code changes (transparency) and fine control of memory access patterns? In this thesis, our research has led to the proposal of Minas: an efficient and portable memory affinity management framework for NUMA platforms. Minas provides both explicit and automatic memory affinity management with good performance, architecture abstraction, minimal or no application source code modifications and fine control. We have evaluated its efficiency and portability by performing experiments with numerical scientific HPC applications on NUMA platforms. The results have been compared with other solutions to manage memory affinity.

parallel computing | 2008

Parallel Seismic Wave Propagation on NUMA architectures

Fabrice Dupros; Christiane Pousa Ribeiro; Alexandre Carissimi; Jean-François Méhaut

Simulation of large-scale seismic wave propagation is an important tool in seismology for efficient strong motion analysis and risk mitigation. Being particularly CPU-consuming, this three-dimensional problem makes use of parallel computing to improve the performance and the accuracy of the simulations. The trend in parallel computing is to increase the number of cores available at the shared-memory level, with possibly non-uniform cost of memory accesses. We therefore need to consider new approaches more suitable to such parallel systems. In this paper, we first report on the impact of memory affinity on the parallel performance of seismic simulations. We introduce a methodology combining efficient thread scheduling and careful data placement to overcome the limitations coming from both the parallel algorithm and the memory hierarchy. The MAi (Memory Affinity interface) is used to smoothly adapt the memory policy to the underlying architecture. We evaluate our methodology on computing nodes with different NUMA characteristics. A maximum gain of 53% is reported in comparison with a classical OpenMP implementation.

parallel computing | 2009