Paulo C. Santos
Universidade Federal do Rio Grande do Sul
Publications
Featured research published by Paulo C. Santos.
design, automation, and test in europe | 2016
Marco Antonio Zanata Alves; Matthias Diener; Paulo C. Santos; Luigi Carro
One of the main challenges for embedded systems is the transfer of data between memory and processor. In this context, Hybrid Memory Cubes (HMCs) can provide substantial energy and bandwidth improvements compared to traditional memory organizations, while also allowing the execution of simple atomic instructions in the memory. However, the complex memory hierarchy still remains a bottleneck, especially for applications with low data reuse, limiting the usable parallelism of the HMC vaults and banks. In this paper, we introduce the HIVE architecture, which allows common vector operations to be performed directly inside the HMC, avoiding contention on the interconnections as well as cache pollution. Our mechanism achieves substantial speedups of up to 17.3× (9.4× on average) compared to a baseline system that performs vector operations on an 8-core processor. We also show that the simple instructions provided by the HMC actually hurt performance for streaming applications.
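As a rough illustration of the offload model this abstract describes — with a purely hypothetical `hive_vadd` interface, not the paper's actual ISA — the sketch below contrasts a host-side vector add with one issued against memory-resident data:

```python
import numpy as np

def host_vadd(a, b):
    # Baseline path: both operands cross the processor-memory link and
    # pollute the caches, even though streaming data is never reused.
    return a + b

def hive_vadd(mem, dst, src1, src2, n):
    # Hypothetical near-data path: only a small command descriptor crosses
    # the link; the vector operation executes inside the HMC vaults.
    mem[dst:dst + n] = mem[src1:src1 + n] + mem[src2:src2 + n]

mem = np.zeros(3 * 4096, dtype=np.float32)   # stand-in for HMC-resident data
mem[0:4096] = 1.0
mem[4096:8192] = 2.0
hive_vadd(mem, dst=8192, src1=0, src2=4096, n=4096)
assert mem[8192] == 3.0
```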
compilers, architecture, and synthesis for embedded systems | 2015
Marco Antonio Zanata Alves; Paulo C. Santos; Francis B. Moreira; Matthias Diener; Luigi Carro
Despite the ability of modern processors to execute a variety of algorithms efficiently through instructions based on registers of ever-increasing width, some applications show poor performance due to the limited interconnection bandwidth between main memory and processing units. Near-data processing has started to gain acceptance as an accelerator approach due to technology constraints and the high costs associated with data transfer. However, previous approaches to near-data computing either do not provide general-purpose processing or require large amounts of logic, and they do not fully use the potential of the DRAM devices; these issues have limited their wide adoption. In this paper, we present the Memory Vector Extensions (MVX), which implement vector instructions directly inside the DRAM devices, thereby avoiding data movement between memory and processing units while requiring less logic than previous approaches. MVX achieves up to a 211× performance increase for application kernels with high spatial locality and low temporal locality. Compared to an embedded processor with 8 cores and 2 memory channels that supports AVX-512 instructions, MVX performs 24× faster on average for three well-known algorithms.
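A minimal sketch of the contrast the abstract draws, assuming a hypothetical `mvx_saxpy` command (the function names and the 16-lane AVX-512 chunking are illustrative, not the paper's implementation):

```python
import numpy as np

AVX512_LANES = 16  # one 512-bit register holds 16 single-precision floats

def simd_saxpy(a, x, y):
    # Host-side SAXPY: every 16-element chunk must travel over the memory
    # channel into registers and back out again.
    out = np.empty_like(x)
    for i in range(0, len(x), AVX512_LANES):
        out[i:i + AVX512_LANES] = a * x[i:i + AVX512_LANES] + y[i:i + AVX512_LANES]
    return out

def mvx_saxpy(dram, a, x_at, y_at, out_at, n):
    # Hypothetical MVX instruction: one descriptor is sent to the device
    # and the operand rows never leave the DRAM.
    dram[out_at:out_at + n] = a * dram[x_at:x_at + n] + dram[y_at:y_at + n]

x = np.ones(64, dtype=np.float32)
y = np.full(64, 3.0, dtype=np.float32)
assert (simd_saxpy(2.0, x, y) == 5.0).all()
```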
design, automation, and test in europe | 2017
Paulo C. Santos; Geraldo F. Oliveira; Diego G. Tome; Marco Antonio Zanata Alves; Eduardo Cunha de Almeida; Luigi Carro
Nowadays, applications that predominantly perform lookups over large databases are becoming more popular, with column-stores as the database system architecture of choice. For these applications, Hybrid Memory Cubes (HMCs) can provide bandwidth of up to 320 GB/s and represent the best choice for keeping up with the throughput of these ever-growing databases. However, even with high memory bandwidth and processing power available, data movements through the memory hierarchy consume an unnecessary amount of time and energy, preventing peak performance. In order to accelerate database operations and reduce the energy consumption of the system, this paper presents the Reconfigurable Vector Unit (RVU), which enables massive and adaptive in-memory processing, extending the native HMC instructions and increasing their effectiveness. RVU enables the programmer to reconfigure it as one large vector unit or as multiple small vector units to better fit the application's needs during different computation phases. Due to its adaptability, RVU achieves a performance increase of 27× on average and reduces DRAM energy consumption by 29% when compared to an x86 processor with 16 cores. Compared with the state-of-the-art mechanism capable of performing fixed-size large vector operations inside the HMC, RVU performs up to 12% better and improves energy consumption by 53%.
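The reconfiguration idea can be caricatured as below; the 256-lane pool, the class name, and the database-scan example are assumptions for illustration, not RVU's real parameters:

```python
import numpy as np

class RVU:
    """Toy model of the reconfigurable pool; all sizes are assumptions."""
    TOTAL_LANES = 256  # hypothetical pool of 32-bit lanes in the logic layer

    def __init__(self, units):
        assert self.TOTAL_LANES % units == 0
        self.units = units
        self.lanes_per_unit = self.TOTAL_LANES // units

    def scan_gt(self, column, threshold):
        # Database selection: each configured unit filters its own slice,
        # all units operating in parallel inside the cube.
        chunks = np.array_split(column, self.units)
        return np.concatenate([c > threshold for c in chunks])

wide = RVU(units=1)     # one 256-lane unit for a long sequential scan
narrow = RVU(units=8)   # eight 32-lane units for independent lookups
column = np.random.randint(0, 100, size=4096)
selected = wide.scan_gt(column, 50)
```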
parallel, distributed and network-based processing | 2016
Paulo C. Santos; Marco Antonio Zanata Alves; Matthias Diener; Luigi Carro; Philippe Olivier Alexandre Navaux
One of the main challenges for computer architects is how to hide the high average memory access latency from the processor. In this context, Hybrid Memory Cubes (HMCs) can provide substantial energy and bandwidth improvements compared to traditional memory organizations. However, it is not clear how this reduced average memory access latency impacts the LLC. For applications with high cache miss ratios, the latency of searching for data inside the cache negatively impacts performance, and the weight of this overhead depends on the memory access latency. In this paper, we present an evaluation of the importance of the L3 cache in a high-performance processor using HMC, also exploring the chip-area trade-off between cache size and the number of processor cores. We show that the high bandwidth provided by HMC memories can eliminate the need for L3 caches, removing hardware and making room for more processing power. Our evaluations show that, across a wide range of parallel applications, performance increases by 37% and the EDP improves by 12% while maintaining the original chip area, when compared to DDR3 memories.
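As a worked check of how those two figures fit together under the standard definition EDP = energy × delay (the arithmetic is ours; only the 37% and 12% come from the abstract):

```python
speedup = 1.37            # 37% higher performance (from the abstract)
edp_ratio = 0.88          # 12% lower energy-delay product

delay_ratio = 1 / speedup               # relative runtime  ~= 0.73
energy_ratio = edp_ratio / delay_ratio  # ~= 1.21

# The extra cores may consume ~21% more total energy, yet the shorter
# runtime still leaves the EDP 12% better than the baseline.
print(f"delay:  {delay_ratio:.2f}x of baseline")
print(f"energy: {energy_ratio:.2f}x of baseline")
```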
applied reconfigurable computing | 2017
Geraldo F. Oliveira; Paulo C. Santos; Marco Antonio Zanata Alves; Luigi Carro
Neural network simulation has emerged as a methodology for solving computational problems by mirroring biological behavior. However, to achieve consistent simulation results, large sets of workloads need to be evaluated. In this work, we present a neural in-memory simulator capable of executing deep learning applications inside 3D-stacked memories. By reducing data movement and including a simple accelerator layer near memory, our system is able to outperform traditional multi-core devices while reducing overall system energy consumption.
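A toy sketch of why inference maps naturally onto near-memory vector units — the layer shape and the `fc_layer_near_memory` helper are hypothetical illustrations, not the simulator's code:

```python
import numpy as np

def fc_layer_near_memory(weights, bias, activations):
    # Each output neuron is an independent dot product, so weight rows can
    # be spread across vaults and accumulated without visiting the host.
    return np.maximum(weights @ activations + bias, 0.0)  # ReLU

W = np.random.randn(128, 256).astype(np.float32)  # weights stay in the cube
b = np.zeros(128, dtype=np.float32)
x = np.random.randn(256).astype(np.float32)       # incoming activations
y = fc_layer_near_memory(W, b, x)
```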
Proceedings of the 2015 International Symposium on Memory Systems | 2015
Marco Antonio Zanata Alves; Paulo C. Santos; Matthias Diener; Luigi Carro
In order to overcome the low memory bandwidth and the high energy costs associated with data transfer between the processor and the main memory, proposals for near-data computing have started to gain acceptance in systems ranging from embedded architectures to high-performance computing. Most previous approaches propose application-specific hardware or require a large amount of logic. Moreover, most proposals require algorithm changes and do not make use of the full parallelism available in the DRAM devices. These issues limit the adoption and performance of near-data computing. In this paper, we propose to implement vector instructions directly inside the DRAM devices, which we call the Memory Vector Extensions (MVX). This balanced approach reduces data movement between the DRAM and the processor while requiring a small amount of hardware to achieve good performance. Compared to the vector operations present in current processors, our proposal enables performance gains of up to 97× and reduces full-system energy consumption by up to 70×.
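The bank-level parallelism argument can be sketched as follows; the bank count and row size are generic DRAM-style assumptions, not values from the paper:

```python
import numpy as np

BANKS = 8            # assumed banks per device
ROW_FLOATS = 2048    # assumed 8 KiB row buffer / 4-byte elements

def banked_vadd(banks_a, banks_b):
    # One add per bank; all banks can operate in the same time window,
    # parallelism a register-width-bound host SIMD unit cannot reach.
    return [a + b for a, b in zip(banks_a, banks_b)]

a = [np.ones(ROW_FLOATS, dtype=np.float32) for _ in range(BANKS)]
b = [np.full(ROW_FLOATS, 2.0, dtype=np.float32) for _ in range(BANKS)]
out = banked_vadd(a, b)
assert all((o == 3.0).all() for o in out)
```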
computing frontiers | 2018
Joao Paulo C. de Lima; Paulo C. Santos; Marco Antonio Zanata Alves; Antonio Carlos Schneider Beck; Luigi Carro
Scaling existing architectures to large-scale data-intensive applications is limited by energy and performance losses caused by off-chip memory communication and data movement in the cache hierarchy. Processing-in-Memory (PIM) has recently been revisited to address the memory and power walls, mainly due to the maturity of 3D-stacking manufacturing technology and the increasing demand for bandwidth and parallel access in emerging data-centric applications. Recent studies have proposed a wide variety of processing mechanisms for the logic layer of 3D-stacked memories, not to mention the already available 3D-stacked DRAMs, such as Micron's Hybrid Memory Cube (HMC). Nevertheless, few studies compare PIM accelerators to each other or indicate the trade-offs between power, area, and performance. In this paper, we review different state-of-the-art 3D-stacked in-memory accelerators and analyze them under the area and power constraints imposed by the critical embedded nature of PIM. Aiming toward massively parallel PIM designs, we take the simplest design found in this survey and explore its architectural design space to meet the constraints imposed by the HMC. Our results show that the most straightforward approach can provide the highest performance while consuming the least area and power, making it the most suitable design found in this survey for an energy-efficient in-memory accelerator, whether targeting High-Performance Computing or Embedded Systems. For instance, the best design point indicates that a performance density of 320 GBps/mm² and a performance efficiency of 0.6 GBps/mW can be achieved in the best scenario, that is, when a massively parallel application reaches the peak bandwidth.
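A back-of-the-envelope reading of that design point (our arithmetic, derived only from the three figures quoted above):

```python
peak_bw = 320.0     # GB/s, HMC peak bandwidth quoted in the abstract
density = 320.0     # GBps/mm^2, reported performance density
efficiency = 0.6    # GBps/mW, reported performance efficiency

area_mm2 = peak_bw / density       # ~1.0 mm^2 of logic-layer area
power_mw = peak_bw / efficiency    # ~533 mW power budget
print(f"implied area:  {area_mm2:.1f} mm^2")
print(f"implied power: {power_mw:.0f} mW")
```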
international embedded systems symposium | 2015
Paulo C. Santos; Marco Antonio Zanata Alves; Luigi Carro
The evolution of main memories, from SDR to the current DDR, has delivered multiple technological breakthroughs, yet still falls far short of processor requirements. With the advent of the Hybrid Memory Cube (HMC), the promise of high bandwidth with lower energy consumption and smaller area may provide better efficiency than traditional DDR modules, which is especially attractive for embedded systems. In this paper, we perform a comprehensive performance comparison between HMC and DDR memories to understand the capabilities and limitations of both. Simulation results running the SPEC-CPU2006 and SPEC-OMP2001 benchmarks show that applications with low memory pressure behave similarly with HMC or DDR. We make the new observation that HMC performs better than DDR especially for applications with high memory pressure and low spatial data locality. However, for applications with streaming behavior, common in the embedded-systems domain, our experiments show that current HMC row-buffer specifications do not take advantage of the spatial locality present in those applications.
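A toy model of the row-buffer observation — the 8 KiB and 256 B row sizes are illustrative assumptions, not the exact specifications compared in the paper:

```python
def row_activations(bytes_streamed, row_bytes):
    # A perfectly sequential stream touches each row's bytes exactly once,
    # so it pays one row activation per row it crosses.
    return bytes_streamed // row_bytes

STREAM = 1 << 20                        # 1 MiB sequential sweep
print(row_activations(STREAM, 8192))    # DDR3-like 8 KiB row:  128 opens
print(row_activations(STREAM, 256))     # HMC-like 256 B row:  4096 opens
```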
reconfigurable computing and fpgas | 2012
Paulo C. Santos; Gabriel L. Nazar; Luigi Carro; Fakhar Anjam; Stephan Wong
The increasing power density found in newer manufacturing technologies dictates that it is no longer possible for the whole chip to operate at full capacity all the time. Adaptable systems must be devised that dynamically throttle their power consumption while maintaining the high performance expected by users. Furthermore, adapting processing and memory capacities leads to variable requirements on the communication infrastructure. Thus, in order to find the best solutions in the available design space, adaptability should be applied concordantly along three system axes: processing, memory, and communication. In this work, we present the case for an architecture able to dynamically adapt its performance in all such layers, focusing on an adaptable Network-on-Chip able to dynamically meet the requirements of a reconfigurable processor.
design, automation, and test in europe | 2018
Paulo C. Santos; Geraldo F. Oliveira; João Paulo Silva Lima; Marco Antonio Zanata Alves; Luigi Carro; Antonio Carlos Schneider Beck