Paolo Rech
Universidade Federal do Rio Grande do Sul
Publications
Featured research published by Paolo Rech.
High-Performance Computer Architecture | 2015
Devesh Tiwari; Saurabh Gupta; James H. Rogers; Don Maxwell; Paolo Rech; Sudharshan S. Vazhkudai; Daniel Oliveira; Dave Londo; Nathan DeBardeleben; Philippe Olivier Alexandre Navaux; Luigi Carro; Arthur S. Bland
Increases in graphics hardware performance and improvements in programmability have enabled GPUs to evolve from graphics-specific accelerators into general-purpose computing devices. Titan, the world's second fastest supercomputer for open science in 2014, consists of more than 18,000 GPUs that scientists from domains such as astrophysics, fusion, climate, and combustion use routinely to run large-scale simulations. Unfortunately, while the performance efficiency of GPUs is well understood, their resilience characteristics in a large-scale computing system have not been fully evaluated. We present a detailed study to provide a thorough understanding of GPU errors on a large-scale GPU-enabled system. Our data was collected from the Titan supercomputer at the Oak Ridge Leadership Computing Facility and a GPU cluster at Los Alamos National Laboratory. We also present results from our extensive neutron-beam tests, conducted at the Los Alamos Neutron Science Center (LANSCE) and at ISIS (Rutherford Appleton Laboratory, UK), to measure the resilience of different generations of GPUs. We present several findings from our field data and neutron-beam experiments, and discuss the implications of our results for future GPU architects, current and future HPC facilities, and researchers focusing on GPU resilience.
IEEE Transactions on Nuclear Science | 2013
Paolo Rech; C. Aguiar; Christopher Frost; Luigi Carro
Neutron radiation experiments on matrix multiplication running on graphics processing units (GPUs) show that multiple errors are detected at the output in more than 50% of the cases. In the presence of multiple errors, the available hardening strategies may become ineffective or inefficient. By analyzing radiation-induced error distributions, we developed an optimized and experimentally tuned software-based hardening strategy for GPUs. With fault-injection simulations, we compare the performance and error-correction capabilities of the proposed technique with the available ones.
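The tuned hardening strategy itself is not reproduced in the abstract; the classic software-only approach such work builds on is algorithm-based fault tolerance (ABFT) with row/column checksums, sketched below. The function name and NumPy implementation are illustrative, not taken from the paper.

```python
import numpy as np

def abft_matmul(A, B):
    """Checksum-protected matrix multiply (Huang-Abraham ABFT sketch)."""
    n = A.shape[0]
    # Augment A with an extra row of column sums and B with an extra
    # column of row sums; the checksums propagate through the product.
    Ac = np.vstack([A, A.sum(axis=0)])
    Br = np.hstack([B, B.sum(axis=1, keepdims=True)])
    C = Ac @ Br
    data = C[:n, :n]
    # An error in one element breaks the corresponding row and column
    # checksums, which both detects and locates it.
    ok = (np.allclose(C[n, :n], data.sum(axis=0)) and
          np.allclose(C[:n, n], data.sum(axis=1)))
    return data, ok
```

A single corrupted element is located by the intersection of the failing row and column checksums and can be corrected in place; multiple simultaneous errors, as observed in the beam data, are exactly where such schemes start to lose coverage.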
Dependable Systems and Networks | 2014
Paolo Rech; Laércio Lima Pilla; Philippe Olivier Alexandre Navaux; Luigi Carro
Graphics Processing Units (GPUs) offer high computational power but require significant scheduling strain to manage parallel processes, which increases the GPU cross section. The results of extensive neutron radiation experiments performed on NVIDIA GPUs confirm this hypothesis. Reducing the application Degree Of Parallelism (DOP) reduces the scheduling strain but also modifies the GPU parallelism management, including memory latency, the number of registers per thread, and processor occupancy, all of which influence the sensitivity of the parallel application. An analysis of the dependence of overall GPU radiation sensitivity on the code DOP is provided, and the most reliable configuration is experimentally identified. Finally, modifying the parallelism management affects not only the GPU cross section but also the code execution time and, thus, the exposure to radiation required to complete computation. The Mean Workload Between Failures and Mean Executions Between Failures metrics are introduced to evaluate the workload or the number of executions computed correctly by the GPU for a realistic application.
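Under the common assumption that failures arrive at a rate of cross section times particle flux, the two metrics can be computed as below. Function names and units are illustrative; the paper's exact definitions may differ.

```python
def mean_executions_between_failures(cross_section_cm2, flux_n_cm2_s, exec_time_s):
    # Expected failures per second = cross section * flux; one run takes
    # exec_time_s seconds, so failures per run = rate * exec_time_s.
    failures_per_run = cross_section_cm2 * flux_n_cm2_s * exec_time_s
    return 1.0 / failures_per_run

def mean_workload_between_failures(workload_per_run, mebf):
    # Useful work completed, on average, between two failures.
    return workload_per_run * mebf
```

These metrics capture the trade-off the abstract describes: a DOP setting that lowers the cross section but doubles the execution time can still lower the number of executions completed between failures.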
Design, Automation, and Test in Europe | 2014
L. Bautista Gomez; Franck Cappello; Luigi Carro; Nathan DeBardeleben; Bo Fang; Sudhanva Gurumurthi; Karthik Pattabiraman; Paolo Rech; M. Sonza Reorda
GPGPUs are used increasingly in several domains, from gaming to different kinds of computationally intensive applications. In many applications GPGPU reliability is becoming a serious issue, and several research activities are focusing on its evaluation. This paper offers an overview of some major results in the area. First, it shows and analyzes the results of some experiments assessing GPGPU reliability in HPC datacenters. Second, it provides some recent results derived from radiation experiments about the reliability of GPGPUs. Third, it describes the characteristics of an advanced fault-injection environment, allowing effective evaluation of the resiliency of applications running on GPGPUs.
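The fault-injection environment is only named in the abstract; the core mechanism of such tools is emulating a single-event upset by flipping one bit of a live value, as in this illustrative sketch (not the tool described in the paper):

```python
import struct

def flip_bit(value, bit):
    # Reinterpret a 64-bit float as an integer, flip one bit, and
    # reinterpret it back -- a software stand-in for a particle strike.
    raw = struct.unpack('<Q', struct.pack('<d', value))[0]
    return struct.unpack('<d', struct.pack('<Q', raw ^ (1 << bit)))[0]
```

An injector then reruns the application with the corrupted value and classifies the outcome, typically as masked, silent data corruption, or crash/hang.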
Radiation Effects Data Workshop | 2012
Paolo Rech; Caroline Aguiar; Ronaldo Rodrigues Ferreira; M. Silvestri; A. Griffoni; C. Frost; Luigi Carro
This paper presents and analyzes the results of neutron experiments on 40 nm Graphics Processing Units. We measure the cross sections of the internal memory resources and define a new threads cross section to characterize the sensitivity of the computing units to radiation. We experimentally evaluate the error rate of a matrix multiplication application and build an analytical model to predict algorithms' neutron-induced failure rates.
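The cross section referred to here is the standard beam-test estimate: observed errors divided by the particle fluence delivered to the device. A minimal sketch, with names and the resource-scaling assumption chosen for illustration rather than taken from the paper:

```python
def cross_section_cm2(observed_errors, fluence_n_per_cm2):
    # sigma = errors / fluence, in cm^2 per device (or per thread/bit,
    # depending on how the errors are attributed).
    return observed_errors / fluence_n_per_cm2

def predicted_errors(sigma_cm2, fluence_n_per_cm2, resource_count=1):
    # Analytical-model style prediction: expected errors scale with the
    # per-resource cross section, the fluence, and the resources exposed.
    return sigma_cm2 * fluence_n_per_cm2 * resource_count
```

A per-thread cross section, as defined in the paper, lets the expected application failure rate be predicted from how many threads a given algorithm keeps exposed.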
International On-Line Testing Symposium | 2012
Paolo Rech; Caroline Aguiar; Ronaldo Rodrigues Ferreira; Christopher Frost; Luigi Carro
This paper reports and analyzes the results of neutron radiation testing campaigns on a modern commercial-off-the-shelf Graphics Processing Unit (GPU). A set of guidelines for accelerated radiation experiments on GPUs is presented, emphasizing the care necessary to ease testing and obtain meaningful data. Radiation test results are presented and discussed, highlighting the neutron sensitivities of the different GPU memory and logic resources in terms of the Failure In Time (FIT) rate due to neutrons at sea level.
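Scaling a beam-measured cross section to a sea-level FIT rate uses a reference terrestrial neutron flux. A sketch assuming the commonly cited JEDEC JESD89A value of roughly 13 n/(cm²·h) above 10 MeV at sea level in New York City; the paper's exact flux assumption is not stated in the abstract:

```python
NYC_FLUX_N_CM2_H = 13.0  # assumed reference sea-level neutron flux (>10 MeV)

def fit_rate(cross_section_cm2, flux_n_cm2_h=NYC_FLUX_N_CM2_H):
    # FIT = expected failures per 10^9 device-hours of operation.
    return cross_section_cm2 * flux_n_cm2_h * 1e9
```

This is why accelerated testing works: a few hours under a beam delivering ~10^6 times the natural flux stand in for centuries of field exposure.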
IEEE Transactions on Nuclear Science | 2012
Alessio Griffoni; Jeroen van Duivenbode; Dimitri Linten; Eddy Simoen; Paolo Rech; Luigi Dilillo; F. Wrobel; Patrick Verbist; G. Groeseneken
50 MeV and 80 MeV neutron-induced failure is investigated for several types of power devices (super-junction, IGBT and SiC) from different vendors. A strong dependence on the device type and orientation is observed.
IEEE Transactions on Computers | 2016
Daniel Oliveira; Laércio Lima Pilla; Thiago Santini; Paolo Rech
Graphics processing units (GPUs) are increasingly attractive for both safety-critical and High-Performance Computing applications. GPU reliability is a primary concern for the automotive and aerospace markets and is becoming an issue for supercomputers as well. In fact, the high number of devices in large data centers makes the probability that at least one device is corrupted very high. In this paper, we aim to give novel insights on GPU reliability by evaluating the neutron sensitivity of modern GPU memory structures, highlighting pattern dependence and multiple-error occurrences. Additionally, a wide set of parallel codes is exposed to controlled neutron beams to measure GPUs' operational error rates. From experimental data and algorithm analysis we derive general insights on the reliability of parallel algorithms and programming approaches. Finally, error-correcting codes, algorithm-based fault tolerance, and duplication-with-comparison hardening strategies are presented and evaluated on GPUs through radiation experiments. We present and compare both the reliability improvement and the imposed overhead of the selected hardening solutions.
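Of the three hardening strategies named, duplication with comparison is the simplest to sketch in software. This is a generic illustration, not the paper's GPU implementation:

```python
def duplicate_with_comparison(kernel, *args):
    # Run the computation twice and compare: a transient fault that hits
    # only one of the two executions is detected (though not corrected).
    first = kernel(*args)
    second = kernel(*args)
    if first != second:
        raise RuntimeError("DWC mismatch: possible silent data corruption")
    return first
```

The three strategies trade generality for cost differently: ECC protects memory transparently, ABFT adds cheap algorithm-specific checksums, and DWC pays roughly 2x execution time for error detection on arbitrary code.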
IEEE Transactions on Nuclear Science | 2015
Heather Quinn; William H. Robinson; Paolo Rech; M. A. Aguirre; Arno Barnard; Marco Desogus; Luis Entrena; Mario García-Valderas; Steven M. Guertin; David R. Kaeli; Fernanda Lima Kastensmidt; Bradley T. Kiddie; Antonio Sanchez-Clemente; Matteo Sonza Reorda; Luca Sterpone; Michael Wirthlin
Performance benchmarks have been used over the years to compare different systems. These benchmarks can be useful for researchers trying to determine how changes to the technology, architecture, or compiler affect a system's performance. No such standard exists for systems deployed into high-radiation environments, making it difficult to assess whether changes in the fabrication process, circuitry, architecture, or software affect reliability or radiation sensitivity. In this paper, we propose a benchmark suite for high-reliability systems that is designed for field-programmable gate arrays and microprocessors. We describe the development process and report neutron test data for the hardware and software benchmarks.
IEEE Transactions on Nuclear Science | 2014
Laércio Lima Pilla; Paolo Rech; Francesco Silvestri; C. Frost; Philippe Olivier Alexandre Navaux; M. Sonza Reorda; Luigi Carro
In this paper we assess the neutron sensitivity of Graphics Processing Units (GPUs) when executing a Fast Fourier Transform (FFT) algorithm, and propose specific software-based hardening strategies to reduce its failure rate. Our research is motivated by experimental results with an unhardened FFT showing that, when failures occur, the majority produce multiple errors in the output, caused by data dependencies. In addition, the built-in error-correcting code (ECC) imposed a large overhead and proved insufficient to provide high reliability. Experimental results with the hardened algorithm show a failure rate two orders of magnitude lower than the original algorithm (one order of magnitude lower than with ECC), with an overhead 64% smaller than that of ECC.
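One lightweight software check applicable to an FFT, shown here only for illustration (the paper's specific hardening strategy is not reproduced in the abstract), is a Parseval's-theorem energy comparison:

```python
import numpy as np

def fft_with_energy_check(x):
    # Parseval's theorem: for NumPy's unnormalized FFT,
    # sum(|X|^2) / N must equal sum(|x|^2). A corruption that changes
    # any output magnitude breaks this equality at negligible cost.
    X = np.fft.fft(x)
    ok = np.isclose(np.sum(np.abs(X) ** 2) / len(x),
                    np.sum(np.abs(x) ** 2))
    return X, ok
```

Such algorithm-level invariants are the appeal of software hardening for FFTs: one scalar comparison guards the whole transform, instead of ECC protecting every memory word individually.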
Collaboration
Philippe Olivier Alexandre Navaux
Universidade Federal do Rio Grande do Sul