Arthur Francisco Lorenzon
Universidade Federal do Rio Grande do Sul
Publication
Featured research published by Arthur Francisco Lorenzon.
ACM Journal on Emerging Technologies in Computing Systems | 2017
Anderson Luiz Sartor; Arthur Francisco Lorenzon; Luigi Carro; Fernanda Lima Kastensmidt; Stephan Wong; Antonio Carlos Schneider Beck
Because of technology scaling, the soft error rate has been increasing in digital circuits, which affects system reliability. Therefore, modern processors, including VLIW architectures, must have means to mitigate such effects to guarantee reliable computing. In this scenario, our work proposes three low-overhead fault tolerance approaches based on instruction duplication with zero-latency detection, which use a rollback mechanism to correct soft errors in the pipelanes of a configurable VLIW processor. The first uses idle issue slots within a period of time to execute extra instructions, considering distinct application phases. The second works at a finer grain, adaptively exploiting idle functional units at run-time. However, some applications present high instruction-level parallelism (ILP), which reduces the ability to provide fault tolerance: fewer functional units will be idle, decreasing the number of potential duplicated instructions. The third approach addresses this issue by dynamically reducing ILP according to a configurable threshold, increasing fault tolerance at the cost of performance. While the first two approaches achieve significant fault coverage with minimal area and power overhead for applications with low ILP, the third improves fault tolerance with low performance degradation. All approaches are evaluated considering area, performance, power dissipation, and error coverage.
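A software analogue may help clarify the duplicate-compare-rollback idea. The C sketch below is only an illustration, not the paper's hardware mechanism: in the actual design, duplication happens in idle VLIW issue slots and rollback is a pipeline feature; every name here is a hypothetical stand-in.

```c
#include <stdio.h>

/* Minimal software analogue of duplication with rollback:
 * execute the operation twice ("duplicated instruction"),
 * compare the results ("zero-latency detection"), and
 * re-execute from the checkpointed inputs on mismatch ("rollback"). */
static int alu_op(int a, int b) { return a + b; }

int execute_with_duplication(int a, int b) {
    for (;;) {
        int primary = alu_op(a, b);   /* original instruction   */
        int shadow  = alu_op(a, b);   /* duplicated instruction */
        if (primary == shadow)        /* detection: compare     */
            return primary;           /* commit                 */
        /* mismatch: a soft error hit one copy; retry from the inputs */
    }
}

int main(void) {
    printf("%d\n", execute_with_duplication(2, 3)); /* prints 5 */
    return 0;
}
```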
IEEE Computer Society Annual Symposium on VLSI | 2015
Anderson Luiz Sartor; Arthur Francisco Lorenzon; Luigi Carro; Fernanda Lima Kastensmidt; Stephan Wong; Antonio Carlos Schneider Beck
Because of technology scaling, the soft error rate has been increasing in digital circuits, which in turn affects system reliability. Therefore, modern processors, including VLIW architectures, must have means to mitigate such effects to guarantee reliable computation. In this scenario, our work proposes two new low-overhead fault tolerance approaches, with zero-latency detection, that correct soft errors in the pipelines of a configurable VLIW processor. Each approach has a distinct way to detect errors, but both use the same rollback mechanism. The first relies on redundant hardware, with specialized duplicated pipelines. The second uses idle issue slots to execute duplicated instructions, by first identifying phases within an application. Our implementation does not require changes to the binary code and has negligible performance losses. Full pipeline duplication incurs 50% area overhead and 35% extra power dissipation, whereas using idle resources requires only 7% extra area. We compared our approach to related work and demonstrate that it is more efficient when area, performance, power dissipation, and error coverage are considered together.
Journal of Parallel and Distributed Computing | 2016
Arthur Francisco Lorenzon; Márcia Cristina Cera; Antonio Carlos Schneider Beck
Thread-level parallelism (TLP) is being widely exploited in embedded and general-purpose multicore processors (GPPs) to increase performance. However, parallelizing an application involves extra executed instructions and accesses to the shared memory for communication and synchronization. The overhead of accessing the shared memory, which is very costly in terms of delay and energy because it is at the bottom of the hierarchy, varies with the communication model and the level of data exchange/synchronization of the application. On top of that, multicore processors are implemented with different architectures, organizations, and memory subsystems. In this complex scenario, we evaluate 14 parallel benchmarks implemented with 4 different parallel programming interfaces (PPIs), with distinct communication rates and TLP, running on five representative multicore processors targeted at general-purpose and embedded systems. We show that while the former present the best performance and the latter are the most energy efficient, no single option offers the best result for both. We also demonstrate that in applications with low levels of communication, what matters is the communication model, not a specific PPI. On the other hand, applications with high communication demands have a huge search space that can be explored. For those, Pthreads is the most efficient PPI for Intel processors, while OpenMP is the best for ARM ones. MPI is the worst choice in almost any scenario, and gets very inefficient as the TLP increases. We also evaluate the energy-delay^x product (ED^x P), weighting performance against energy by varying the value of x. In a representative case where energy is the most important, three different processors can be the best alternative for different values of x. Finally, we explore how static power influences total energy consumption, showing that its increase benefits ARM multiprocessors, with the opposite effect for Intel ones.
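For reference, the ED^x P metric used above is the standard energy-delay^x product; the abstract does not spell out the formula, so this is its usual textbook definition:

```latex
% E is total energy, D is execution time (delay).
% x = 0 weights energy only, x = 1 is the classic EDP,
% and larger x shifts the weight toward performance.
ED^{x}P = E \cdot D^{x}
```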
IEEE Computer Society Annual Symposium on VLSI | 2015
Arthur Francisco Lorenzon; Anderson Luiz Sartor; Márcia Cristina Cera; Antonio Carlos Schneider Beck
Thread-level parallelism (TLP) exploitation for embedded systems has been a challenge for software developers: while it is necessary to take advantage of the availability of multiple cores, it is also mandatory to consume less energy. To speed up the development process and make it as transparent as possible, software designers use parallel programming interfaces (PPIs). However, as will be shown in this paper, each one implements different ways to exchange data, influencing performance, energy consumption, and energy-delay product (EDP), which vary across different embedded processors. By evaluating four PPIs and three multicore processors, we demonstrate that it is possible to save up to 62% in energy consumption and achieve up to 88% of EDP improvement by just switching the PPI, and that the efficiency (i.e., the best possible use of the available resources) decreases as the number of threads increases in almost all cases, but at distinct rates.
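The paper's benchmarks and PPI set are not reproduced here; the C sketch below merely illustrates why switching the PPI changes data-exchange behavior, using the same reduction written in two common PPIs (OpenMP's implicit reduction versus explicit shared memory with Pthreads).

```c
#include <omp.h>
#include <pthread.h>
#include <stdio.h>

#define N 1000000
#define NTHREADS 4
static double data[N];

/* OpenMP: data exchange is implicit in the reduction clause. */
static double sum_openmp(void) {
    double s = 0.0;
    #pragma omp parallel for num_threads(NTHREADS) reduction(+:s)
    for (int i = 0; i < N; i++)
        s += data[i];
    return s;
}

/* Pthreads: the programmer manages shared memory explicitly. */
static double partial[NTHREADS];
static void *worker(void *arg) {
    long id = (long)arg;
    double s = 0.0;
    for (long i = id; i < N; i += NTHREADS)
        s += data[i];
    partial[id] = s;              /* explicit write to shared memory */
    return NULL;
}
static double sum_pthreads(void) {
    pthread_t t[NTHREADS];
    double s = 0.0;
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, worker, (void *)i);
    for (int i = 0; i < NTHREADS; i++) {
        pthread_join(t[i], NULL); /* synchronization point */
        s += partial[i];
    }
    return s;
}

int main(void) {
    for (int i = 0; i < N; i++) data[i] = 1.0;
    printf("%.0f %.0f\n", sum_openmp(), sum_pthreads());
    return 0;
}
```

Compile with `-fopenmp -lpthread`. The OpenMP version hides the partial-sum buffer and the join behind the runtime, which is exactly the kind of implementation difference that makes energy behavior PPI-dependent.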
VI Brazilian Symposium on Computing Systems Engineering (SBESC) | 2016
Guilherme Grunewald Magalhaes; Anderson Luis Sartor; Arthur Francisco Lorenzon; Philippe Olivier Alexandre Navaux; Antonio Carlos Schneider Beck
Considering that multithreaded applications may be implemented using several programming languages and paradigms, in this work we show how they influence performance, energy consumption, and energy-delay product (EDP). For that, we evaluate a subset of the NAS Parallel Benchmarks, implemented in both a procedural language (C) and object-oriented ones (C++ and Java). We also investigate the overhead of virtual machines (VMs) and the improvement that the Just-In-Time (JIT) compiler may provide. We show that the procedural language has better scalability than the object-oriented ones, i.e., the improvements in performance, EDP, and energy savings are better in C than in C++ and Java as the number of threads increases; and that C can be up to 76 times faster than Java, even with the JIT mechanism enabled. We also demonstrate that the Java JIT effectiveness may vary according to the benchmark (from 1.16 to 23.97 times in performance and from 1.19 to 19.85 times in energy consumption compared to the VM without JIT); and that when it reaches good optimization levels, it can be up to 23% faster, consume 42% less energy, and have an EDP 58% lower than C++.
Reconfigurable Computing and FPGAs | 2015
Anthony Brandon; Joost Hoozemans; Jeroen van Straten; Arthur Francisco Lorenzon; Anderson Luiz Sartor; Antonio Carlos Schneider Beck; Stephan Wong
Very Long Instruction Word (VLIW) processors are commonplace in embedded systems due to their inherent low-power consumption, as the instruction scheduling is performed by the compiler instead of by the sophisticated and power-hungry hardware instruction schedulers used in their RISC counterparts. This is achieved by maximizing resource utilization while only targeting a certain application domain. However, when the inherent application ILP (instruction-level parallelism) is low, resources are under-utilized/wasted, and the encoding of NOPs results in large code sizes and, consequently, additional pressure on the memory subsystem to store these NOPs. To address the resource-utilization issue, we previously proposed a dynamic VLIW processor design that can merge unused resources to form additional cores to execute more threads; the formation of cores can result in issue widths of 2, 4, and 8. Without sacrificing the possibility of code interruption and resumption, we also proposed a generic binary scheme that allows a single binary to be executed on these different issue-width cores. However, the code size issue remains, as the generic binary scheme even slightly increases the number of NOPs further. Therefore, in this paper, we propose to apply a well-known stop-bit code compression technique to the generic binaries that, most importantly, maintains their code compatibility characteristic, allowing them to be executed on different cores. In addition, we present the hardware designs to support this technique in our dynamic core. For prototyping purposes, we implemented our design on a Xilinx Virtex-6 FPGA device and executed 14 embedded benchmarks. For comparison, we selected a non-dynamic/static VLIW core that incorporates a similar stop-bit technique for its code compression. We demonstrate, while maintaining code compatibility on top of a flexible dynamic VLIW processor, that the code size can be significantly reduced (by up to 80%), resulting in energy savings, and that the performance can be increased (by up to a factor of three). Finally, our experimental results show that we can use smaller caches (2 to 4 times smaller), which will further help in decreasing energy consumption.
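The abstract does not give the actual encoding, so the C sketch below assumes a simplified format (32-bit syllables whose MSB acts as the stop bit, NOP encoded as zero) just to illustrate how stop-bit compression removes the trailing NOPs that a fixed-width VLIW bundle would otherwise store.

```c
#include <stdint.h>
#include <stdio.h>

#define ISSUE_WIDTH 8
#define STOP_BIT    0x80000000u  /* assumed: MSB marks a bundle's last syllable */
#define NOP         0x00000000u  /* assumed NOP encoding */

/* Expand one stop-bit-compressed bundle into a fixed-width bundle.
 * Compressed code stores only the real operations; the stop bit on
 * the last syllable stands in for the trailing NOPs. Returns the
 * number of syllables actually consumed from memory. */
size_t expand_bundle(const uint32_t *code, uint32_t out[ISSUE_WIDTH]) {
    size_t n = 0;
    while (n < ISSUE_WIDTH) {
        out[n] = code[n] & ~STOP_BIT;        /* strip the stop bit */
        if (code[n] & STOP_BIT) { n++; break; }
        n++;
    }
    for (size_t i = n; i < ISSUE_WIDTH; i++)
        out[i] = NOP;                         /* re-materialize NOPs */
    return n;
}

int main(void) {
    /* A 3-operation bundle; the third syllable carries the stop bit. */
    const uint32_t code[] = { 0x1, 0x2, 0x3 | STOP_BIT };
    uint32_t bundle[ISSUE_WIDTH];
    size_t used = expand_bundle(code, bundle);
    printf("stored %zu syllables instead of %d\n", used, ISSUE_WIDTH);
    return 0;
}
```

Here an 8-wide bundle with three useful operations occupies 3 syllables instead of 8, which is the mechanism behind the code-size reductions reported above.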
International Symposium on Circuits and Systems | 2015
Arthur Francisco Lorenzon; Márcia Cristina Cera; Antonio Carlos Schneider Beck
Energy consumption in multicore embedded systems has become a constant concern. Thread-level parallelism exploitation may reduce energy consumption because, by shortening execution time, it saves static power consumption of the processor. However, as will be shown in this paper, the influence of static power on the energy consumption and energy-delay product (EDP) depends on how significant it is in the processor. By evaluating different levels of static power with respect to the total power consumption in two embedded processors (ARM and Atom), we demonstrate that if the right value of static power consumption is tuned during design and manufacturing, it is possible to save up to 35% in energy consumption and achieve up to 20% of improvement in EDP efficiency (i.e., the best possible use of the available resources). We also show that the more communication the parallel application has, the lower the impact of the processor's static power on the total energy consumption.
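A simple energy decomposition makes the static-power argument concrete. This is the textbook model, not a formula taken from the paper:

```latex
% Total energy over an execution of length T:
% the static term scales with T, so anything that shortens
% execution (e.g., TLP exploitation) cuts static energy; the
% larger the static share of total power, the larger the saving.
E_{total} = P_{static}\, T + E_{dynamic}
```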
Computer Software and Applications Conference | 2015
Anderson Luiz Sartor; Arthur Francisco Lorenzon; Antonio Carlos Schneider Beck
Embedded systems are becoming increasingly complex and, due to their tight energy requirements, all the available resources must be used in the best possible way. However, Android, the most used software platform for embedded systems, features a virtual machine to run applications. Even though it ensures flexibility, so the application can execute on different underlying architectures without the need for recompilation, it burdens the system with an extra software layer. Considering this scenario, through the development of an extension of the Android QEMU emulator and a specific benchmark set, this work evaluates the significance of the virtual machine by comparing applications written in Java and in native language. We show that, given a fixed energy budget, a different number of applications can be executed depending on the way they were implemented. We also demonstrate that this difference varies according to the processor, by executing the applications on all officially supported Android architectures (Intel x86, ARM, and MIPS). Therefore, even though the virtual machine provides total transparency to the software developer, he/she must be aware of it and of the underlying target microarchitecture at early design stages so as to build a low-energy application.
Design, Automation, and Test in Europe | 2017
Arthur Francisco Lorenzon; Jeckson Dellagostin Souza; Antonio Carlos Schneider Beck
Efficiently exploiting thread-level parallelism on new multicore systems has been challenging for software developers. While blindly increasing the number of threads may lead to performance gains, it can also result in a disproportionate increase in energy consumption. For this reason, rightly choosing the number of threads is essential to reach the best compromise between the two. However, such a task is extremely difficult: besides the huge number of variables involved, many of them change according to different aspects of the system at hand and can only be determined at run-time. To address this complex scenario, we propose LAANT, a novel library to automatically find the optimal number of threads for OpenMP applications by dynamically considering their characteristics, input set, and the processor architecture. By executing nine well-known benchmarks on three real multicore processors, LAANT improves the EDP (energy-delay product) by up to 61% compared to the standard OpenMP execution, and by 44% when OpenMP's dynamic adjustment of the number of threads is activated.
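LAANT's internals are not described in the abstract, so the C sketch below only illustrates the general idea of searching for the thread count that minimizes EDP at run-time. The energy reading and the power model are hypothetical placeholders; a real implementation would read hardware energy counters and prune the search.

```c
#include <omp.h>
#include <stdio.h>

/* Placeholder: real systems read energy counters (e.g., RAPL);
 * here energy is faked as an assumed power model times time. */
static double read_energy(double elapsed, int threads) {
    return (10.0 + 2.0 * threads) * elapsed;
}

static void kernel(int n) {            /* the parallel region to tune */
    double s = 0.0;
    #pragma omp parallel for reduction(+:s)
    for (int i = 0; i < n; i++)
        s += (double)i * i;
    if (s < 0) printf("%f", s);        /* keep the work observable */
}

/* Exhaustively try 1..max threads and keep the EDP minimizer. */
int best_thread_count(int max_threads) {
    int best = 1;
    double best_edp = -1.0;
    for (int t = 1; t <= max_threads; t++) {
        omp_set_num_threads(t);
        double t0 = omp_get_wtime();
        kernel(1 << 22);
        double d = omp_get_wtime() - t0;
        double edp = read_energy(d, t) * d;  /* EDP = energy * delay */
        if (best_edp < 0 || edp < best_edp) { best_edp = edp; best = t; }
    }
    return best;
}

int main(void) {
    printf("best thread count: %d\n", best_thread_count(omp_get_max_threads()));
    return 0;
}
```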
Computer Software and Applications Conference | 2015
Arthur Francisco Lorenzon; Anderson Luiz Sartor; Márcia Cristina Cera; Antonio Carlos Schneider Beck
Thread-Level Parallelism (TLP) exploitation for embedded systems has been a challenge for software developers: while it is necessary to take advantage of the availability of multiple cores, it is also mandatory to consume less energy. To speed up the development process and make it as transparent as possible, software designers use Parallel Programming Interfaces (PPIs). However, as will be shown in this paper, each PPI implements different ways to exchange data using shared memory regions, influencing performance, energy consumption, and Energy-Delay Product (EDP), which vary across different embedded processors. By evaluating four PPIs and three multicore processors (ARM A8, A9, and Intel Atom), we demonstrate that by simply switching the PPI it is possible to save up to 59% in energy consumption and achieve up to 85% of EDP improvement in the most significant case. We also show that the efficiency (i.e., the best possible use of the available resources) decreases as the number of threads increases in almost all cases, but at distinct rates.
Collaboration
Dive into Arthur Francisco Lorenzon's collaborations.
Philippe Olivier Alexandre Navaux
Universidade Federal do Rio Grande do Sul