Eduardo Henrique Molina da Cruz

Publication

Featured research published by Eduardo Henrique Molina da Cruz.

IEEE International Symposium on Parallel & Distributed Processing, Workshops and PhD Forum | 2011

Using Memory Access Traces to Map Threads and Data on Hierarchical Multi-core Platforms

Eduardo Henrique Molina da Cruz; Marco Antonio Zanata Alves; Alexandre Carissimi; Philippe Olivier Alexandre Navaux; Christiane Pousa Ribeiro; Jean-François Méhaut

In parallel programs, the tasks of a given application must cooperate in order to accomplish the required computation. However, the communication time between the tasks may differ depending on which cores they execute on and how the memory hierarchy and interconnection are used. The problem is even more important in multi-core machines with NUMA characteristics, since remote accesses impose high overhead, making them more sensitive to thread and data mapping. In this context, process mapping is a technique that provides performance gains by improving the use of resources such as interconnections, main memory and cache memory. The problem of detecting the best mapping is considered NP-Hard. Furthermore, in shared memory environments, there is an additional difficulty of finding the communication pattern, which is implicit and occurs through memory accesses. This work aims to provide a method for static mapping for NUMA architectures which does not require any prior knowledge of the application. Different metrics were adopted and a heuristic method based on the Edmonds matching algorithm was used to obtain the mapping. In order to evaluate our proposal, we use the NAS Parallel Benchmarks (NPB) and two modern multi-core NUMA machines. Results show performance gains of up to 75% compared to the native scheduler and memory allocator of the operating system.
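As a rough illustration of how an implicit communication pattern might be recovered from a memory access trace, the sketch below counts, for every pair of threads, the memory blocks that both of them touched. The trace format, the 4096-byte block size and the function name `communication_matrix` are assumptions made for this example; the abstract does not specify the metrics the paper actually uses.

```python
from collections import defaultdict
from itertools import combinations

PAGE_SIZE = 4096  # assumed block granularity, not taken from the paper


def communication_matrix(trace, num_threads, block_size=PAGE_SIZE):
    """Count, for each thread pair, the memory blocks both threads accessed."""
    # Map each block number to the set of threads that accessed it.
    sharers = defaultdict(set)
    for thread_id, address in trace:
        sharers[address // block_size].add(thread_id)

    matrix = [[0] * num_threads for _ in range(num_threads)]
    for threads in sharers.values():
        # Every pair of threads sharing this block gets one unit of communication.
        for a, b in combinations(sorted(threads), 2):
            matrix[a][b] += 1
            matrix[b][a] += 1
    return matrix


# Hypothetical trace of (thread id, address) pairs.
trace = [(0, 0x1000), (1, 0x1008), (2, 0x5000), (3, 0x5010), (0, 0x9000)]
matrix = communication_matrix(trace, num_threads=4)
```

In this toy trace, threads 0 and 1 share one block and threads 2 and 3 share another, so a mapping step would try to place each of those pairs on cores that share a cache.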

International Conference on Parallel Architectures and Compilation Techniques | 2014

kMAF: automatic kernel-level management of thread and data affinity

Matthias Diener; Eduardo Henrique Molina da Cruz; Philippe Olivier Alexandre Navaux; Anselm Busse; Hans-Ulrich Heiß

One of the main challenges for parallel architectures is the increasing complexity of the memory hierarchy, which consists of several levels of private and shared caches, as well as interconnections between separate memories in NUMA machines. To make full use of this hierarchy, it is necessary to improve the locality of memory accesses by reducing accesses to remote caches and memories, and using local ones instead. Two techniques can be used to increase the memory access locality: executing threads and processes that access shared data close to each other in the memory hierarchy (thread affinity), and placing the memory pages they access on the NUMA node they are executing on (data affinity). Most related work in this area focuses on either thread or data affinity, but not both, which limits the improvements. Other mechanisms require expensive operations, such as memory access traces or binary analysis, require changes to hardware, or work only on specific parallel APIs. In this paper, we introduce kMAF, a mechanism that automatically manages thread and data affinity at the kernel level. The memory access behavior of the running application is determined during its execution by analyzing its page faults. This information is used by kMAF to migrate threads and memory pages, such that the overall memory access locality is optimized. Extensive evaluation with 27 benchmarks from 4 benchmark suites shows substantial performance improvements, with results close to an oracle mechanism. Execution time was reduced by up to 35.7% (13.8% on average), while energy efficiency was improved by up to 34.6% (9.3% on average).

International Parallel and Distributed Processing Symposium | 2012

Using the Translation Lookaside Buffer to Map Threads in Parallel Applications Based on Shared Memory

Eduardo Henrique Molina da Cruz; Matthias Diener; Philippe Olivier Alexandre Navaux

The communication latency between the cores in multiprocessor architectures differs depending on the memory hierarchy and the interconnections. With the increase of the number of cores per chip and the number of threads per core, this difference between the communication latencies is increasing. Therefore, it is important to map the threads of parallel applications taking into account the communication between them. In parallel applications based on the shared memory paradigm, the communication is implicit and occurs through accesses to shared variables. For this reason, it is difficult to detect the communication pattern between the threads. Traditional approaches use simulation to monitor the memory accesses performed by the application, requiring modifications to the source code and drastically increasing the overhead. In this paper, we introduce a new lightweight mechanism to detect the communication pattern of threads using the Translation Lookaside Buffer (TLB). Our mechanism relies entirely on hardware features, which makes the thread mapping transparent to the programmer and allows it to be performed dynamically by the operating system. Moreover, no time-consuming task, such as simulation, is required. We evaluated our mechanism with the NAS Parallel Benchmarks (NPB) and achieved an accurate representation of the communication patterns. Using the detected communication patterns, we generated thread mappings using a heuristic method based on the Edmonds graph matching algorithm. Running the applications with these mappings resulted in performance improvements of up to 15.3%, reducing the number of cache misses by up to 31.1%.
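To make the matching step more concrete, the sketch below pairs threads so that the total communication inside the pairs is maximal. It is only a stand-in: it uses exhaustive search, which is exponential in the number of threads, whereas the Edmonds algorithm mentioned in the abstract solves the same matching problem in polynomial time. The function name `best_pairing` and the example matrix are hypothetical.

```python
def best_pairing(matrix):
    """Return (total weight, pairs) of the maximum-weight perfect matching.

    Exhaustive search used as a simple stand-in for the Edmonds matching
    algorithm; only practical for a small number of threads.
    """
    if len(matrix) % 2 != 0:
        raise ValueError("a perfect matching needs an even number of threads")

    def search(remaining):
        if not remaining:
            return 0, []
        first, rest = remaining[0], remaining[1:]
        best_weight, best_pairs = -1, []
        # Try every possible partner for the first free thread.
        for partner in rest:
            others = [t for t in rest if t != partner]
            weight, pairs = search(others)
            weight += matrix[first][partner]
            if weight > best_weight:
                best_weight, best_pairs = weight, [(first, partner)] + pairs
        return best_weight, best_pairs

    return search(list(range(len(matrix))))


# Hypothetical symmetric communication matrix for four threads.
comm = [
    [0, 9, 1, 2],
    [9, 0, 3, 1],
    [1, 3, 0, 8],
    [2, 1, 8, 0],
]
total, pairs = best_pairing(comm)
```

Each resulting pair would then be placed on cores that share a cache level; repeating the matching on the pairs themselves gives groups for the next level of the hierarchy.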

International Parallel and Distributed Processing Symposium | 2013

Communication-Based Mapping Using Shared Pages

Matthias Diener; Eduardo Henrique Molina da Cruz; Philippe Olivier Alexandre Navaux

In current shared memory architectures, the complexity of the cache and memory hierarchies is increasing. Therefore, it is becoming more important to analyze the communication behavior of parallel applications when mapping threads to cores, to improve performance and energy efficiency. However, communication is implicit in most programming models for shared memory, which makes it difficult to detect the communication pattern between the threads in an accurate and low-overhead way. We propose a new mechanism to detect the communication pattern of shared memory applications by monitoring page table accesses. Combining this mechanism with a dynamic migration algorithm allows mapping to be performed dynamically by the operating system. We implemented our mechanism in the Linux kernel and performed experiments with applications from the NAS Parallel Benchmarks. Results show a reduction of up to 16.7% of the execution time and 63% of the cache misses, compared to the original scheduler of the operating system. Furthermore, we decrease total processor and DRAM energy consumption by up to 14.7% and 28.5%, respectively.

Parallel, Distributed and Network-Based Processing | 2015

An Efficient Algorithm for Communication-Based Task Mapping

Eduardo Henrique Molina da Cruz; Matthias Diener; Laércio Lima Pilla; Philippe Olivier Alexandre Navaux

The communication between tasks of a parallel application is an important characteristic to consider when mapping tasks to computing cores due to possible differences in communication performance. Within a machine, performance differences are introduced by the memory hierarchy, in which cache memories can be shared by groups of cores and intra-chip interconnections are faster than inter-chip interconnections. In cluster and grid systems, the network imposes an additional communication latency. By mapping tasks that communicate to cores nearby on the memory hierarchy, or to the same nodes in clusters or grids, the communication of parallel applications is optimized, leading to increased performance and energy efficiency. In the task mapping context, one of the most important aspects to be considered is the mapping algorithm, as it determines the improvements that can be achieved. Since the problem of finding the best mapping is NP-Hard, heuristics must be employed to find an approximate solution in feasible time. In this paper, we present Eager Map, a new algorithm to perform communication-based mapping that is based on a greedy grouping strategy applied hierarchically. Experimental evaluation indicates that our algorithm runs 10 times faster than the state of the art and yields higher performance improvements. Due to its low execution time and high stability, Eager Map is also suitable for online task mapping, where tasks are migrated during execution.
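The idea of a greedy grouping strategy applied hierarchically can be sketched as follows. This is a simplified illustration written for this listing, not the Eager Map algorithm as published: the seeding rule, the tie-breaking and the names `greedy_groups`, `merge_matrix` and `hierarchical_map` are all assumptions. At each level, items are grouped greedily by communication volume, and the groups become the items of the next level.

```python
def greedy_groups(matrix, group_size):
    """Greedily partition items into groups of up to group_size members."""
    unassigned = set(range(len(matrix)))
    groups = []
    while unassigned:
        # Seed with the lowest-numbered free item to keep results deterministic.
        group = [min(unassigned)]
        unassigned.remove(group[0])
        while len(group) < group_size and unassigned:
            # Add the free item that communicates most with the current group.
            best = max(sorted(unassigned),
                       key=lambda t: sum(matrix[t][g] for g in group))
            group.append(best)
            unassigned.remove(best)
        groups.append(group)
    return groups


def merge_matrix(matrix, groups):
    """Build the group-to-group communication matrix for the next level."""
    n = len(groups)
    merged = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                merged[i][j] = sum(matrix[a][b]
                                   for a in groups[i] for b in groups[j])
    return merged


def hierarchical_map(matrix, arities):
    """Group tasks level by level; arities lists the group size per level."""
    members = [[t] for t in range(len(matrix))]
    for arity in arities:
        groups = greedy_groups(matrix, arity)
        members = [[t for g in group for t in members[g]] for group in groups]
        matrix = merge_matrix(matrix, groups)
    return members


# Hypothetical workload: tasks with the same pair id communicate heavily (10),
# tasks whose pair ids fall in the same half communicate moderately (3).
pair_of = [0, 2, 1, 3, 2, 0, 3, 1]


def weight(a, b):
    if a == b:
        return 0
    if pair_of[a] == pair_of[b]:
        return 10
    return 3 if pair_of[a] // 2 == pair_of[b] // 2 else 1


comm = [[weight(a, b) for b in range(8)] for a in range(8)]
placement = hierarchical_map(comm, arities=[2, 2])
```

With `arities=[2, 2]`, the first level could stand for two cores sharing a cache and the second for two caches on one chip. On this input the sketch groups tasks as `[[0, 5, 2, 7], [1, 4, 3, 6]]`, recovering both the heavy pairs and the moderately communicating halves.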

Performance Evaluation | 2015

Characterizing Communication and Page Usage of Parallel Applications for Thread and Data Mapping

Matthias Diener; Eduardo Henrique Molina da Cruz; Laércio Lima Pilla; Fabrice Dupros; Philippe Olivier Alexandre Navaux

The parallelism in shared-memory systems has increased significantly with the advent and evolution of multicore processors. Current systems include several multicore and multithreaded processors with Non-Uniform Memory Access (NUMA) characteristics. These architectures require the adoption of two strategies for the efficient execution of parallel applications: (i) threads sharing data should be placed in such a way in the memory hierarchy that they execute on shared caches; and (ii) a thread should have the data that it accesses placed on the NUMA node where it is executing. We refer to these techniques as thread and data mapping, respectively. Both strategies require knowledge of the application’s memory access behavior to identify the communication between threads and processes as well as their usage of memory pages. In this paper, we introduce a profiling method to establish the suitability of parallel applications for improved mappings that take the memory hierarchy into account, based on a mathematical description of their memory access behaviors. Experiments with a large set of parallel workloads that are based on a variety of parallel APIs (MPI, OpenMP, Pthreads, and MPI+OpenMP) show that most applications can benefit from improved mappings. We provide a mechanism to compute optimized thread and data mappings. Experimental results with this mechanism showed performance improvements of up to 54% (20% on average), as well as reductions of the energy consumption of up to 37% (11% on average), compared to the default mapping by the operating system. Furthermore, our results show that thread and data mapping have to be performed jointly in order to achieve optimal improvements.

Journal of Parallel and Distributed Computing | 2014

Dynamic thread mapping of shared memory applications by exploiting cache coherence protocols

Eduardo Henrique Molina da Cruz; Matthias Diener; Marco Antonio Zanata Alves; Philippe Olivier Alexandre Navaux

In current computer architectures, the communication performance between threads varies depending on the memory hierarchy. This performance difference must be considered when mapping parallel applications to processor cores. In parallel applications based on the shared memory paradigm, the communication is difficult to detect because it is implicit. Furthermore, dynamic mapping introduces several challenges, since it needs to find a suitable mapping and migrate the threads with a low overhead during the execution of the application. We propose a mechanism to detect the communication pattern of shared memory applications by monitoring cache coherence protocols. We also propose heuristics that, combined with our communication detection mechanism, allow the mapping to be performed dynamically by the operating system. Experiments with the NAS Parallel Benchmarks showed a reduction of up to 13.9% of the execution time, 30.5% of the cache misses and 39.4% of the number of invalidation messages.

2010 11th Symposium on Computing Systems | 2010

Process Mapping Based on Memory Access Traces

Eduardo Henrique Molina da Cruz; Marco Antonio Zanata Alves; Philippe Olivier Alexandre Navaux

Process mapping is a technique widely used in parallel machines to provide performance gains by improving the use of resources such as interconnections and the cache memory hierarchy. The problem of finding the best mapping is considered NP-Hard and, in shared memory environments, there is the additional difficulty of finding the communication pattern, which is implicit and occurs through memory accesses. In this context, this work aims to improve the performance of parallel applications that use shared memory. To that end, a method for analyzing the shared memory was developed, which identifies the mapping without requiring any previous knowledge of the application behavior. Applications from the NAS Parallel Benchmarks (NPB) were used in these experiments, showing performance gains of up to 42% compared to the native scheduler of the operating system.

Parallel, Distributed and Network-Based Processing | 2015

Locality vs. Balance: Exploring Data Mapping Policies on NUMA Systems

Matthias Diener; Eduardo Henrique Molina da Cruz; Philippe Olivier Alexandre Navaux

In parallel architectures that have a Non-Uniform Memory Access (NUMA) behavior, the mapping of memory pages to NUMA nodes influences the performance of parallel applications. In order to improve traditional data mapping policies, two basic strategies can be employed: optimizing locality or balance of memory accesses. In a locality-based policy, memory pages are mapped to nodes that access the page the most. In a balance-based policy, memory pages are mapped such that the number of memory accesses resolved by each memory controller is similar. In this paper, we perform an in-depth exploration of these data mapping policies on the performance of parallel applications. We introduce metrics that describe their memory access behavior and evaluate their suitability for data mapping. We also present new mapping policies that focus on locality, balance or both. These policies were evaluated on three different NUMA architectures with applications from the NAS-OMP and PARSEC benchmark suites. Results show that the performance improvements of each policy depend on the characteristics of the applications and machines. Choosing the wrong policy can actually hurt the performance compared to the default first-touch mapping. Compared to traditional mapping policies and to policies that only focus on either locality or balance, taking into account both locality and balance results in the highest improvements. Furthermore, it avoids the performance reduction caused by the wrong data mapping.
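The difference between the two basic strategies can be shown with a small sketch. It assumes that per-page access counts for each NUMA node are already known, and it uses a simple greedy rule for the balance policy; the function names and the example counts are hypothetical, and the policies evaluated in the paper may be defined differently.

```python
def locality_mapping(page_accesses):
    """Map each page to the node that accesses it the most."""
    return {page: max(range(len(counts)), key=counts.__getitem__)
            for page, counts in page_accesses.items()}


def balance_mapping(page_accesses, num_nodes):
    """Map pages so that each node serves a similar number of accesses.

    Greedy rule (an assumption of this sketch): handle the most accessed
    pages first and always pick the node with the lowest load so far.
    """
    load = [0] * num_nodes
    mapping = {}
    ordered = sorted(page_accesses.items(), key=lambda item: -sum(item[1]))
    for page, counts in ordered:
        node = min(range(num_nodes), key=load.__getitem__)
        mapping[page] = node
        load[node] += sum(counts)
    return mapping


def node_load(page_accesses, mapping, num_nodes):
    """Total accesses resolved by each node's memory controller."""
    load = [0] * num_nodes
    for page, node in mapping.items():
        load[node] += sum(page_accesses[page])
    return load


# Hypothetical access counts per page: [from node 0, from node 1].
accesses = {"p0": [90, 10], "p1": [80, 20], "p2": [5, 15], "p3": [70, 30]}

by_locality = locality_mapping(accesses)
by_balance = balance_mapping(accesses, num_nodes=2)
locality_load = node_load(accesses, by_locality, 2)
balance_load = node_load(accesses, by_balance, 2)
```

Here the locality policy places three of the four pages on node 0, so that node's memory controller serves 300 accesses against 20 for node 1. The balance policy evens this out to 200 and 120, but it moves page `p1` away from the node that accesses it most, which is the trade-off the abstract describes.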

Symposium on Computer Architecture and High Performance Computing | 2014

Optimizing Memory Locality Using a Locality-Aware Page Table

Eduardo Henrique Molina da Cruz; Matthias Diener; Marco Antonio Zanata Alves; Laércio Lima Pilla; Philippe Olivier Alexandre Navaux

One of the main challenges for modern parallel shared-memory architectures is access to main memory. In current systems, the performance and energy efficiency of memory accesses depend on their locality: accesses to remote caches and NUMA nodes are more expensive than accesses to local ones. Increasing the locality requires knowledge about how the threads of a parallel application access memory pages. With this information, pages can be migrated to the NUMA nodes that access them (data mapping), and threads that access the same pages can be migrated to the same node to improve locality even further (thread mapping). In this paper, we propose LAPT, a mechanism to store the memory access pattern of parallel applications in the page table, which is updated by the hardware during TLB misses. This information is used by the operating system to perform an optimized thread and data mapping during the execution of the parallel application. In contrast to previous work, LAPT does not require any previous information about the behavior of the applications, or changes to the application or runtime libraries. Extensive experiments with the NAS Parallel Benchmarks (NPB) and PARSEC showed performance and energy efficiency improvements of up to 19.2% and 15.7%, respectively (6.7% and 5.3% on average).
