Marco Antonio Zanata Alves

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Marco Antonio Zanata Alves is active.

Explore More

Publication

Featured researches published by Marco Antonio Zanata Alves.

high performance computing and communications | 2010

Evaluating Thread Placement Based on Memory Access Patterns for Multi-core Processors

Matthias Diener; Felipe Lopes Madruga; Eduardo Rocha Rodrigues; Marco Antonio Zanata Alves; Jörg Schneider; Philippe Olivier Alexandre Navaux; Hans-Ulrich Heiss

Process placement is a technique widely used on parallel machines with heterogeneous interconnects to reduce the overall communication time. For instance, two processes which communicate frequently are mapped close to each other. Finding the optimal mapping between threads and cores in a shared-memory environment (for example, OpenMP and Pthreads) is an even more complex task due to implicit communication. In this work, we examine data sharing patterns between threads in different workloads and use those patterns in a similar way as messages are used to map processes in cluster computers. We evaluated our technique on a state-of-the-art multicore processor and achieved moderate improvements in the common case and considerable improvements in some cases, reducing execution time by up to 45%.

ieee international symposium on parallel & distributed processing, workshops and phd forum | 2011

Using Memory Access Traces to Map Threads and Data on Hierarchical Multi-core Platforms

Eduardo Henrique Molina da Cruz; Marco Antonio Zanata Alves; Alexandre Carissimi; Philippe Olivier Alexandre Navaux; Christiane Pousa Ribeiro; Jean-François Méhaut

In parallel programs, the tasks of a given application must cooperate in order to accomplish the required computation. However, the communication time between the tasks may be different depending on which core they are executing and how the memory hierarchy and interconnection are used. The problem is even more important in multi-core machines with NUMA characteristics, since the remote access imposes high overhead, making them more sensitive to thread and data mapping. In this context, process mapping is a technique that provides performance gains by improving the use of resources such as interconnections, main memory and cache memory. The problem of detecting the best mapping is considered NP-Hard. Furthermore, in shared memory environments, there is an additional difficulty of finding the communication pattern, which is implicit and occurs through memory accesses. This work aims to provide a method for static mapping for NUMA architectures which does not require any prior knowledge of the application. Different metrics were adopted and an heuristic method based on the Edmonds matching algorithm was used to obtain the mapping. In order to evaluate our proposal, we use the NAS Parallel Benchmarks (NPB) and two modern multi-core NUMA machines. Results show performance gains of up to 75% compared to the native scheduler and memory allocator of the operating system.

Journal of Parallel and Distributed Computing | 2014

Dynamic thread mapping of shared memory applications by exploiting cache coherence protocols

Eduardo Henrique Molina da Cruz; Matthias Diener; Marco Antonio Zanata Alves; Philippe Olivier Alexandre Navaux

In current computer architectures, the communication performance between threads varies depending on the memory hierarchy. This performance difference must be considered when mapping parallel applications to processor cores. In parallel applications based on the shared memory paradigm, the communication is difficult to detect because it is implicit. Furthermore, dynamic mapping introduces several challenges, since it needs to find a suitable mapping and migrate the threads with a low overhead during the execution of the application. We propose a mechanism to detect the communication pattern of shared memory applications by monitoring cache coherence protocols. We also propose heuristics that, combined with our communication detection mechanism, allow the mapping to be performed dynamically by the operating system. Experiments with the NAS Parallel Benchmarks showed a reduction of up to 13.9% of the execution time, 30.5% of the cache misses and 39.4% of the number of invalidation messages.

parallel, distributed and network-based processing | 2010

Impact of Parallel Workloads on NoC Architecture Design

Henrique C. Freitas; Lucas Mello Schnorr; Marco Antonio Zanata Alves; Philippe Olivier Alexandre Navaux

Due to the multi-core processors, the importance of parallel workloads has increased considerably. However, many-core chips demand new interconnection strategies, since traditional crossbars or buses, common for current multi-core processors, have problems related to wires and scalability. For this reason, Networks-on-Chip (NoCs) have been developed in order to support the performance and parallelism focused on several workloads. Although a Network-on-Chip is a good option, most designs consist of a large number of routers. These routers are responsible for forwarding packets, and consequently, for supporting message-passing workloads. In this context, the NoC performance is a problem. Therefore, the main goal of this paper is to evaluate the impact of well-known parallel workloads on NoC architecture design. In order to achieve high performance, the results point out to parallel workloads with small packets and cluster-based NoCs with circuit switching and adaptable topologies.

high performance computing and communications | 2015

SiNUCA: A Validated Micro-Architecture Simulator

Marco Antonio Zanata Alves; Carlos Villavieja; Matthias Diener; Francis Birck Moreira; Philippe Olivier Alexandre Navaux

In order to observe and understand the architectural behavior of applications and evaluate new techniques, computer architects often use simulation tools. Several cycle-accurate simulators have been proposed to simulate the operation of the processor on the micro-architectural level. However, an important step before adopting a simulator is its validation, in order to determine how accurate the simulator is compared to a real machine. This validation step is often neglected with the argument that only the industry possesses the implementation details of the architectural components. The lack of publicly available micro-benchmarks that are capable of providing insights about the processor implementation is another barrier. In this paper, we present the validation of a new cycle-accurate, trace-driven simulator, SiNUCA. To perform the validation, we introduce a new set of micro-benchmarks to evaluate the performance of architectural components. SiNUCA provides a controlled environment to simulate the micro-architecture inside the cores, the cache memory sub-system with multi-banked caches, a NoC interconnection and a detailed memory controller. Using our micro-benchmarks, we present a simulation validation comparing the performance of real Core 2 Duo and Sandy-Bridge processors, achieving an average performance error of less than 9%.

symposium on computer architecture and high performance computing | 2012

Energy Savings via Dead Sub-Block Prediction

Marco Antonio Zanata Alves; Khubaib; Eiman Ebrahimi; Veynu Narasiman; Carlos Villavieja; Philippe Olivier Alexandre Navaux; Yale N. Patt

Cache memories have traditionally been designed to exploit spatial locality by fetching entire cache lines from memory upon a miss. However, recent studies have shown that often the number of sub-blocks within a line that are actually used is low. Furthermore, those sub-blocks that are used are accessed only a few times before becoming dead (i.e., never accessed again). This results in considerable energy waste since (1) data not needed by the processor is brought into the cache, and (2) data is kept alive in the cache longer than necessary. We propose the Dead Sub-Block Predictor (DSBP) to predict which sub-blocks of a cache line will be actually used and how many times it will be used in order to bring into the cache only those sub-blocks that are necessary, and power them off after they are touched the predicted number of times. We also use DSBP to identify dead lines (i.e., all sub-blocks off) and augment the existing replacement policy by prioritizing dead lines for eviction. Our results show a 24% energy reduction for the whole cache hierarchy when averaged over the SPEC2000, SPEC2006 and NAS-NPB benchmarks.

2010 11th Symposium on Computing Systems | 2010

Process Mapping Based on Memory Access Traces

Eduardo Henrique Molina da Cruz; Marco Antonio Zanata Alves; Philippe Olivier Alexandre Navaux

Process mapping is a technique widely used in parallel machines to provide performance gains by improving the use of resources such as interconnections and cache memory hierarchy. The problem to find the best mapping is considered NP-Hard and, in shared memory environments, there is the additional difficulty to find the communication pattern, which is implicit and occurs through memory accesses. In this context, this work aims to improve the performance of parallel applications that use shared memory. For that, it was developed a method for analysis of the shared memory which identifies the mapping without requiring any previous knowledge of the application behavior. Applications from the NAS Parallel Benchmarks (NPB) were used in these experiments, showing performance gains of up to 42% compared to the native scheduler of the operating system

symposium on computer architecture and high performance computing | 2014

Optimizing Memory Locality Using a Locality-Aware Page Table

Eduardo Henrique Molina da Cruz; Matthias Diener; Marco Antonio Zanata Alves; Laércio Lima Pilla; Philippe Olivier Alexandre Navaux

One of the main challenges for modern parallel shared-memory architectures are accesses to main memory. In current systems, the performance and energy efficiency of memory accesses depend on their locality: accesses to remote caches and NUMA nodes are more expensive than accesses to local ones. Increasing the locality requires knowledge about how the threads of a parallel application access memory pages. With this information, pages can be migrated to the NUMA nodes that access them (data mapping), as well as threads that access the same pages can be migrated to the same node such that locality can be improved even further (thread mapping). In this paper, we propose LAPT, a mechanism to store the memory access pattern of parallel applications in the page table, which is updated by the hardware during TLB misses. This information is used by the operating system to perform an optimized thread and data mapping during the execution of the parallel application. In contrast to previous work, LAPT does not require any previous information about the behavior of the applications, or changes to the application or runtime libraries. Extensive experiments with the NAS Parallel Benchmarks (NPB) and PARSEC showed performance and energy efficiency improvements of up to 19.2% and 15.7%, respectively, (6.7% and 5.3% on average).

design, automation, and test in europe | 2016

Large vector extensions inside the HMC

Marco Antonio Zanata Alves; Matthias Diener; Paulo C. Santos; Luigi Carro

One of the main challenges for embedded systems is the transfer of data between memory and processor. In this context, Hybrid Memory Cubes (HMCs) can provide substantial energy and bandwidth improvements compared to traditional memory organizations, while also allowing the execution of simple atomic instructions in the memory. However, the complex memory hierarchy still remains a bottleneck, especially for applications with a low reuse of data, limiting the usable parallelism of the HMC vaults and banks. In this paper, we introduce the HIVE architecture, which allows performing common vector operations directly inside the HMC, avoiding contention on the interconnections as well as cache pollution. Our mechanism achieves substantial speedups of up to 17.3× (9.4× on average) compared to a baseline system that performs vector operations in a 8-core processor. We show that the simple instructions provided by HMC actually hurt performance for streaming applications.

networks on chips | 2009

Performance Evaluation of NoC Architectures for Parallel Workloads

Henrique C. Freitas; Marco Antonio Zanata Alves; Lucas Mello Schnorr; Philippe Olivier Alexandre Navaux

Network-on-Chip is the state-of-the-art approach to interconnect many processing cores in the next generation of general-purpose processors. In this context, the problem is to choose NoC architectures capable of achieving high performance for parallel programs. Therefore, the main goal of this paper is to evaluate the performance of three NoC architectures using well-known parallel workloads.

Explore More