
Publication


Featured research published by Daniel Oliveira.


High-Performance Computer Architecture | 2015

Understanding GPU errors on large-scale HPC systems and the implications for system design and operation

Devesh Tiwari; Saurabh Gupta; James H. Rogers; Don Maxwell; Paolo Rech; Sudharshan S. Vazhkudai; Daniel Oliveira; Dave Londo; Nathan DeBardeleben; Philippe Olivier Alexandre Navaux; Luigi Carro; Arthur S. Bland

Increases in graphics hardware performance and improvements in programmability have enabled GPUs to evolve from graphics-specific accelerators into general-purpose computing devices. Titan, the world's second-fastest supercomputer for open science in 2014, consists of more than 18,000 GPUs that scientists from various domains, such as astrophysics, fusion, climate, and combustion, routinely use to run large-scale simulations. Unfortunately, while the performance efficiency of GPUs is well understood, their resilience characteristics in a large-scale computing system have not been fully evaluated. We present a detailed study to provide a thorough understanding of GPU errors on a large-scale GPU-enabled system. Our data was collected from the Titan supercomputer at the Oak Ridge Leadership Computing Facility and a GPU cluster at Los Alamos National Laboratory. We also present results from our extensive neutron-beam tests, conducted at the Los Alamos Neutron Science Center (LANSCE) and at ISIS (Rutherford Appleton Laboratory, UK), to measure the resilience of different generations of GPUs. We present several findings from our field data and neutron-beam experiments, and discuss the implications of our results for future GPU architects, current and future HPC computing facilities, and researchers focusing on GPU resilience.


IEEE Transactions on Computers | 2016

Evaluation and Mitigation of Radiation-Induced Soft Errors in Graphics Processing Units

Daniel Oliveira; Laércio Lima Pilla; Thiago Santini; Paolo Rech

Graphics processing units (GPUs) are increasingly attractive for both safety-critical and High-Performance Computing applications. GPU reliability is a primary concern for the automotive and aerospace markets and is becoming an issue for supercomputers as well: the high number of devices in large data centers makes the probability of having at least one corrupted device very high. In this paper, we aim at giving novel insights on GPU reliability by evaluating the neutron sensitivity of modern GPUs' memory structures, highlighting pattern dependence and multiple-error occurrences. Additionally, a wide set of parallel codes is exposed to controlled neutron beams to measure GPUs' operational error rates. From experimental data and algorithm analysis we derive general insights on the reliability of parallel algorithms and programming approaches. Finally, Error-Correcting Code, Algorithm-Based Fault Tolerance, and Duplication With Comparison hardening strategies are presented and evaluated on GPUs through radiation experiments. We present and compare both the reliability improvement and the overhead imposed by the selected hardening solutions.
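
As an illustration of the Algorithm-Based Fault Tolerance strategy evaluated above, the sketch below shows the classic checksum scheme for matrix multiplication (a minimal toy example, not the authors' implementation): the inputs are extended with checksum rows and columns that the multiplication preserves, so a mismatch found after the computation flags a silent data corruption.

```cuda
// Toy Algorithm-Based Fault Tolerance (ABFT) sketch for C = A * B:
// A carries an extra checksum row, B an extra checksum column, and the
// multiplication preserves both, so any mismatch found afterwards
// reveals a silent data corruption in C.
#include <cstdio>
#include <cmath>

const int N = 3;  // toy size; real implementations tile this on the GPU

int main()
{
    double A[N + 1][N], B[N][N + 1], C[N + 1][N + 1] = {};

    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j) {
            A[i][j] = i + j + 1;            // sample data
            B[i][j] = (i == j) ? 2.0 : 1.0;
        }
    for (int j = 0; j < N; ++j) {           // column checksums of A
        A[N][j] = 0;
        for (int i = 0; i < N; ++i) A[N][j] += A[i][j];
    }
    for (int i = 0; i < N; ++i) {           // row checksums of B
        B[i][N] = 0;
        for (int j = 0; j < N; ++j) B[i][N] += B[i][j];
    }
    for (int i = 0; i <= N; ++i)            // checksum-extended product
        for (int j = 0; j <= N; ++j)
            for (int k = 0; k < N; ++k)
                C[i][j] += A[i][k] * B[k][j];

    // verification: row N of C must equal the column sums of C
    for (int j = 0; j < N; ++j) {
        double sum = 0;
        for (int i = 0; i < N; ++i) sum += C[i][j];
        if (fabs(sum - C[N][j]) > 1e-9)
            printf("SDC detected in column %d\n", j);
    }
    printf("ABFT check completed\n");
    return 0;
}
```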


IEEE Transactions on Nuclear Science | 2014

Modern GPUs Radiation Sensitivity Evaluation and Mitigation Through Duplication With Comparison

Daniel Oliveira; Paolo Rech; Heather Quinn; Thomas D. Fairbanks; Laura Monroe; Sarah Michalak; Christine M. Anderson-Cook; Philippe Olivier Alexandre Navaux; Luigi Carro

Graphics processing units (GPUs) are increasingly common in both safety-critical and high-performance computing (HPC) applications. Some current supercomputers are composed of thousands of GPUs, so the probability of device corruption becomes very high. Moreover, GPUs' parallel capabilities are very attractive for the automotive and aerospace markets, where reliability is a serious concern. In this paper, the neutron sensitivity of modern GPU caches and internal resources is experimentally evaluated. Various Duplication With Comparison strategies to reduce GPU radiation sensitivity are then presented and validated through radiation experiments. Threads should be carefully duplicated to avoid undesired errors on shared resources and to avoid exacerbating errors in critical resources such as the scheduler.
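
The careful duplication the paper argues for can be applied at several granularities; the sketch below shows a minimal intra-thread variant (hypothetical kernel and names, not the paper's code), in which each thread computes its result twice and compares the copies before committing the output.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical Duplication With Comparison (DWC) kernel sketch: each
// thread computes y[i] = a * x[i] twice and compares the copies; a
// mismatch means a transient fault hit one of them. The volatile load
// keeps the compiler from merging the two computations into one.
__global__ void scale_dwc(const float *x, float *y, float a, int n,
                          int *fault_detected)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        const volatile float *xv = x;  // force a second, independent load
        float r1 = a * x[i];           // primary copy
        float r2 = a * xv[i];          // redundant copy
        if (r1 != r2)
            atomicExch(fault_detected, 1);
        y[i] = r1;
    }
}

int main()
{
    const int n = 1 << 20;
    float *x, *y;
    int *fault_detected;
    cudaMallocManaged(&x, n * sizeof(float));
    cudaMallocManaged(&y, n * sizeof(float));
    cudaMallocManaged(&fault_detected, sizeof(int));
    for (int i = 0; i < n; ++i) x[i] = 1.0f;
    *fault_detected = 0;

    scale_dwc<<<(n + 255) / 256, 256>>>(x, y, 2.0f, n, fault_detected);
    cudaDeviceSynchronize();
    printf("fault detected: %s\n", *fault_detected ? "yes" : "no");

    cudaFree(x); cudaFree(y); cudaFree(fault_detected);
    return 0;
}
```

Keeping both copies inside the same thread avoids doubling the number of scheduled threads, which matters because, as the abstract notes, careless duplication can exacerbate errors in shared resources such as the scheduler.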


Defect and Fault Tolerance in VLSI and Nanotechnology Systems | 2014

GPGPUs ECC efficiency and efficacy

Daniel Oliveira; Paolo Rech; Laércio Lima Pilla; Philippe Olivier Alexandre Navaux; Luigi Carro

In this paper we assess and discuss the efficiency and overhead of the Error-Correcting Code (ECC) mechanism available on modern GPGPUs, which are increasingly used for both High Performance Computing and safety-critical applications. Both the resilience to radiation-induced silent data corruption and the resilience to functional interruption are addressed experimentally and analytically. The experimental analysis demonstrates that the ECC significantly reduces the occurrence of silent data corruption but may not be sufficient to guarantee high reliability. Moreover, the ECC increases the GPGPU functional interruption rate. Finally, the ECC's performance and reliability are compared to Algorithm-Based Fault Tolerance and Duplication With Comparison strategies.
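
For context, the ECC counters such a study relies on can be read at runtime through NVIDIA's NVML library. A minimal sketch (error handling omitted for brevity; link with -lnvidia-ml) that queries the volatile, i.e. since-last-reboot, corrected and uncorrected counts of GPU 0:

```cuda
// Sketch: read the volatile corrected/uncorrected ECC error counters
// of GPU 0 through NVML. Error handling is omitted for brevity.
#include <cstdio>
#include <nvml.h>

int main()
{
    nvmlDevice_t dev;
    unsigned long long corrected = 0, uncorrected = 0;

    nvmlInit();
    nvmlDeviceGetHandleByIndex(0, &dev);
    nvmlDeviceGetTotalEccErrors(dev, NVML_MEMORY_ERROR_TYPE_CORRECTED,
                                NVML_VOLATILE_ECC, &corrected);
    nvmlDeviceGetTotalEccErrors(dev, NVML_MEMORY_ERROR_TYPE_UNCORRECTED,
                                NVML_VOLATILE_ECC, &uncorrected);
    printf("ECC corrected: %llu, uncorrected: %llu\n",
           corrected, uncorrected);
    nvmlShutdown();
    return 0;
}
```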


Microelectronics Reliability | 2018

Energy-Delay-FIT Product to compare processors and algorithm implementations

Vinicius Fratin; Daniel Oliveira; Philippe Olivier Alexandre Navaux; Luigi Carro; Paolo Rech

We propose an extension to the Energy-Delay Product (EDP) metric to compare different processors considering not only their energy consumption and execution time but also their reliability. The Energy-Delay-FIT Product (EDFP) allows a pragmatic evaluation of the most suitable device to run an application. We consider three representative benchmarks and apply EDFP to compare Intel Xeon Phi co-processors, NVIDIA K40 Graphics Processing Units (GPUs), and AMD Kaveri Accelerated Processing Units (APUs). Our results show that HPC processors have higher power consumption and are more prone to corruption than APUs. However, the overall trade-off is attenuated by the HPC processors' efficiency, which makes them the most suitable candidates for the great majority of the considered applications. Additionally, we use EDFP to compare optimized and naive implementations of three benchmarks executed on NVIDIA GPUs. Our results show that the naive implementation generally has a better EDFP only for small input sizes, while the optimized implementations are more efficient and reliable once the GPU resources are saturated.
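
Assuming the straightforward product form suggested by the name, EDFP = energy x delay x FIT (the paper's exact formulation is not reproduced here), the metric can be computed as in the sketch below; the device numbers are purely illustrative, not measurements from the paper. Lower is better, exactly as with the classic EDP.

```cuda
// Hypothetical EDFP sketch, assuming EDFP = energy [J] * delay [s] *
// FIT (failures per 10^9 device-hours). Numbers are illustrative only.
#include <cstdio>

struct Device {
    const char *name;
    double energy_j;   // energy to run the benchmark once
    double delay_s;    // execution time of one run
    double fit;        // measured failure-in-time rate
};

int main()
{
    Device devs[] = {
        {"gpu",       90.0, 0.50, 250.0},
        {"xeon-phi", 120.0, 0.70, 180.0},
        {"apu",       35.0, 2.10,  60.0},
    };
    for (const Device &d : devs) {
        double edp  = d.energy_j * d.delay_s;   // classic EDP
        double edfp = edp * d.fit;              // reliability-aware EDFP
        printf("%-8s EDP=%8.2f  EDFP=%10.2f\n", d.name, edp, edfp);
    }
    return 0;
}
```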


Computing Frontiers | 2017

CAROL-FI: an Efficient Fault-Injection Tool for Vulnerability Evaluation of Modern HPC Parallel Accelerators

Daniel Oliveira; Vinicius Fratin; Philippe Olivier Alexandre Navaux; Israel Koren; Paolo Rech

Transient faults are a major problem for large-scale HPC systems, and the mitigation of adverse fault effects needs to be highly efficient as we approach exascale. We developed a fault-injection tool (CAROL-FI) to identify the potential sources of adverse fault effects. With a deeper understanding of such effects, we provide useful insights for designing efficient mitigation techniques, such as selective hardening of critical portions of the code. We performed a fault-injection campaign, injecting more than 67,000 faults into an Intel Xeon Phi executing six representative HPC programs. We show that selective hardening can be successfully applied to DGEMM and Hotspot, while LavaMD and NW may require complete code hardening.
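
A minimal sketch of the single-bit-flip fault model such campaigns emulate (hypothetical code: CAROL-FI itself injects faults into a running process rather than modifying its source): one random bit of a live variable is flipped mid-computation and the final output is compared against a fault-free golden run.

```cuda
// Toy single-bit-flip injection: corrupt a live variable mid-loop and
// classify the outcome against a golden run (SDC vs. masked).
#include <cstdio>
#include <cstdlib>
#include <ctime>

// flip one random bit of an arbitrary object
void inject_bit_flip(void *target, size_t size_bytes)
{
    size_t bit = (size_t)rand() % (size_bytes * 8);
    unsigned char *bytes = (unsigned char *)target;
    bytes[bit / 8] ^= (unsigned char)(1u << (bit % 8));
}

int main()
{
    srand((unsigned)time(NULL));

    double golden = 0.0;                  // fault-free reference run
    for (int i = 1; i <= 1000; ++i) golden += 1.0 / i;

    double acc = 0.0;                     // faulty run
    for (int i = 1; i <= 1000; ++i) {
        acc += 1.0 / i;
        if (i == 500)                     // injection point
            inject_bit_flip(&acc, sizeof acc);
    }

    if (acc != golden)
        printf("SDC: got %.15f, expected %.15f\n", acc, golden);
    else
        printf("fault was masked\n");
    return 0;
}
```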


IEEE Transactions on Nuclear Science | 2017

Experimental and Analytical Analysis of Sorting Algorithms Error Criticality for HPC and Large Servers Applications

Caio Lunardi; Heather Quinn; Laura Monroe; Daniel Oliveira; Philippe Olivier Alexandre Navaux; Paolo Rech

In this paper, we investigate neutron-induced errors in three implementations of sorting algorithms (QuickSort, MergeSort, and RadixSort) executed on modern graphics processing units designed for high-performance computing and large-server applications. We measure the radiation-induced error rate of the sorting algorithms taking advantage of the neutron beam available at the Los Alamos Neutron Science Center facility. We also analyze output error criticality by identifying specific output error patterns. We found that radiation can cause wrong elements to appear in the sorted array and misalign values, as well as cause application crashes or system hangs. This paper presents results showing that the criticality of the radiation-induced output error pattern depends on the application. Additionally, an extensive fault-injection campaign has been performed, allowing for a better understanding of the observed phenomena. We take advantage of SASSIFI, the SASS-assembly instrumentation-based fault injector developed by NVIDIA, which can inject faults into all the user-accessible architectural state. Comparing fault-injection results with radiation experiment data shows that not all the output errors observed under radiation can be replicated in fault injection. However, fault injection is useful in identifying possible root causes of the output errors observed in radiation testing. Finally, we take advantage of our experimental and analytical study to design efficient, experimentally tuned hardening strategies. We detect the error patterns that are critical to the final application and find the most efficient way to detect them. With an overhead as low as 16% of the execution time, we are able to reduce the output error rate of sort by about one order of magnitude.
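
As an example of the kind of low-overhead, pattern-specific detector such a study motivates (a hypothetical sketch, not the paper's hardening strategy): a single O(n) pass checks that the output is monotonically ordered, catching misordered elements at a small fraction of the O(n log n) sort cost.

```cuda
// Cheap post-sort detector: one linear pass flags any order violation
// in the (possibly radiation-corrupted) output array.
#include <cstdio>

bool is_sorted_asc(const int *a, int n)
{
    for (int i = 1; i < n; ++i)
        if (a[i - 1] > a[i]) return false;  // order violated: flag error
    return true;
}

int main()
{
    int out[] = {1, 2, 3, 99, 5, 6};        // simulated corrupted output
    printf("sort output %s\n",
           is_sorted_asc(out, 6) ? "looks valid" : "is corrupted");
    return 0;
}
```

An order check alone misses corrupted values that happen to preserve monotonicity, so in practice it would be complemented, for instance, by a checksum over the elements.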


Archive | 2016

Soft-Error Effects on Graphics Processing Units

Paolo Rech; Daniel Oliveira; Philippe Olivier Alexandre Navaux; Luigi Carro

Graphics Processing Units (GPUs) evolved from graphics-specific devices to general-purpose computing accelerators that scientists use to run large-scale simulations. Additionally, GPUs are very attractive for safety-critical applications that extensively use signal or image processing.


Proceedings of the 5th Workshop on Fault Tolerance for HPC at eXtreme Scale | 2015

The Path to Exascale: Code Optimizations and Hardening Solutions Reliability

Daniel Oliveira; Laércio Lima Pilla; Caio Lunardi; Luigi Carro; Philippe Olivier Alexandre Navaux; Paolo Rech

Graphics Processing Units are nowadays the most common general-purpose computing accelerators employed in High Performance Computing (HPC) systems. The performance and energy efficiency of such devices enable extremely powerful HPC systems to be built. However, as the machine scale increases, the reliability problem increases as well, with failures on an exascale system expected to occur every few hours. We present data obtained at the Los Alamos Neutron Science Center and measure how algorithm optimizations and hardening strategies impact the silent data corruption and crash sensitivity of modern GPUs. We also extend our reliability analysis by evaluating the Mean Executions Between Failures and Mean Workload Between Failures of the different algorithm implementations. Moreover, we push the trade-off between reliability and performance further by applying hardening strategies to the optimized codes. We show that common strategies, such as ECC and checkpoint-rollback, can be no match for strategies like Algorithm-Based Fault Tolerance and even Duplication with Comparison.
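
Under the usual definitions, the two metrics mentioned above reduce to simple ratios over an experiment: Mean Executions Between Failures (MEBF) is the number of completed executions per observed failure, and Mean Workload Between Failures (MWBF) scales it by the useful data processed per run. A toy computation (illustrative numbers, not the paper's measurements):

```cuda
// Hypothetical MEBF/MWBF sketch, assuming the common ratio definitions.
#include <cstdio>

int main()
{
    double executions = 100000.0;  // runs observed during the beam test
    double failures   = 12.0;      // SDCs + crashes observed
    double workload   = 256.0;     // MB of useful data per run

    double mebf = executions / failures;  // executions between failures
    double mwbf = mebf * workload;        // data processed between failures
    printf("MEBF = %.0f executions, MWBF = %.0f MB\n", mebf, mwbf);
    return 0;
}
```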


High-Performance Computer Architecture | 2017

Radiation-Induced Error Criticality in Modern HPC Parallel Accelerators

Daniel Oliveira; Laércio Lima Pilla; Mauricio Hanzich; Vinicius Fratin; Fernando Fernandes; Caio B. Lunardi; José María Cela; Philippe Olivier Alexandre Navaux; Luigi Carro; Paolo Rech

Collaboration


Dive into Daniel Oliveira's collaboration.

Top Co-Authors

Paolo Rech (Universidade Federal do Rio Grande do Sul)
Philippe Olivier Alexandre Navaux (Universidade Federal do Rio Grande do Sul)
Luigi Carro (Universidade Federal do Rio Grande do Sul)
Caio Lunardi (Universidade Federal do Rio Grande do Sul)
Vinicius Fratin (Universidade Federal do Rio Grande do Sul)
Heather Quinn (Los Alamos National Laboratory)
Edson Luiz Padoin (Universidade Federal do Rio Grande do Sul)
Pedro Velho (Universidade Federal do Rio Grande do Sul)
Israel Koren (University of Massachusetts Amherst)