Publication


Featured research published by Paolo Rech.


high-performance computer architecture | 2015

Understanding GPU errors on large-scale HPC systems and the implications for system design and operation

Devesh Tiwari; Saurabh Gupta; James H. Rogers; Don Maxwell; Paolo Rech; Sudharshan S. Vazhkudai; Daniel Oliveira; Dave Londo; Nathan DeBardeleben; Philippe Olivier Alexandre Navaux; Luigi Carro; Arthur S. Bland

Increases in graphics hardware performance and improvements in programmability have enabled GPUs to evolve from a graphics-specific accelerator to a general-purpose computing device. Titan, the world's second-fastest supercomputer for open science in 2014, consists of more than 18,000 GPUs that scientists from various domains such as astrophysics, fusion, climate, and combustion use routinely to run large-scale simulations. Unfortunately, while the performance efficiency of GPUs is well understood, their resilience characteristics in a large-scale computing system have not been fully evaluated. We present a detailed study to provide a thorough understanding of GPU errors on a large-scale GPU-enabled system. Our data was collected from the Titan supercomputer at the Oak Ridge Leadership Computing Facility and a GPU cluster at the Los Alamos National Laboratory. We also present results from our extensive neutron-beam tests, conducted at Los Alamos Neutron Science Center (LANSCE) and at ISIS (Rutherford Appleton Laboratory, UK), to measure the resilience of different generations of GPUs. We present several findings from our field data and neutron-beam experiments, and discuss the implications of our results for future GPU architects, current and future HPC computing facilities, and researchers focusing on GPU resilience.


IEEE Transactions on Nuclear Science | 2013

An Efficient and Experimentally Tuned Software-Based Hardening Strategy for Matrix Multiplication on GPUs

Paolo Rech; C. Aguiar; Christopher Frost; Luigi Carro

Neutron radiation experiment results on matrix multiplication on graphic processing units (GPUs) show that multiple errors are detected at the output in more than 50% of the cases. In the presence of multiple errors, the available hardening strategies may become ineffective or inefficient. Analyzing radiation-induced error distributions, we developed an optimized and experimentally tuned software-based hardening strategy for GPUs. With fault-injection simulations, we compare the performance and correcting capabilities of the proposed technique with the available ones.


dependable systems and networks | 2014

Impact of GPUs Parallelism Management on Safety-Critical and HPC Applications Reliability

Paolo Rech; Laércio Lima Pilla; Philippe Olivier Alexandre Navaux; Luigi Carro

Graphics Processing Units (GPUs) offer high computational power but require high scheduling strain to manage parallel processes, which increases the GPU cross section. The results of extensive neutron radiation experiments performed on NVIDIA GPUs confirm this hypothesis. Reducing the application Degree Of Parallelism (DOP) reduces the scheduling strain but also modifies the GPU parallelism management, including memory latency, the number of registers per thread, and processor occupancy, all of which influence the sensitivity of the parallel application. An analysis of how the overall GPU radiation sensitivity depends on the code DOP is provided, and the most reliable configuration is identified experimentally. Finally, modifying the parallelism management affects not only the GPU cross section but also the code execution time and, thus, the exposure to radiation required to complete the computation. The Mean Workload Between Failures and Mean Executions Between Failures metrics are introduced to evaluate the workload or the number of executions computed correctly by the GPU on a realistic application.
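
The abstract names the two metrics without giving formulas. As a minimal sketch of the trade-off it describes, the code below assumes that Mean Executions Between Failures is the mean time between failures divided by the execution time, and that Mean Workload Between Failures scales this by the workload of one execution. These definitions, the function names, the flux value, and the example numbers are all assumptions for illustration, not taken from the paper.

```python
# Illustrative sketch; the formulas are assumed, not quoted from the paper.

def mtbf_hours(sigma_cm2: float, flux: float) -> float:
    """Mean time between failures in hours, for a cross section in cm^2
    exposed to a neutron flux in n/(cm^2 * h)."""
    return 1.0 / (sigma_cm2 * flux)

def mebf(sigma_cm2: float, flux: float, exec_time_s: float) -> float:
    """Assumed definition: executions completed per failure."""
    return mtbf_hours(sigma_cm2, flux) * 3600.0 / exec_time_s

def mwbf(sigma_cm2: float, flux: float, exec_time_s: float,
         workload_per_exec: float) -> float:
    """Assumed definition: workload completed per failure."""
    return workload_per_exec * mebf(sigma_cm2, flux, exec_time_s)

FLUX = 13.0  # n/(cm^2 * h), assumed sea-level reference flux

# Hypothetical configurations: high DOP is more sensitive but faster.
high_dop = mebf(sigma_cm2=2e-9, flux=FLUX, exec_time_s=1.0)
low_dop = mebf(sigma_cm2=1e-9, flux=FLUX, exec_time_s=3.0)
print(high_dop > low_dop)
```

With these made-up numbers, the configuration with the larger cross section still completes more executions between failures because it is exposed for a shorter time, which is the kind of effect the metrics are meant to expose.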


design, automation, and test in europe | 2014

GPGPUs: How to combine high computational power with high reliability

L. Bautista Gomez; Franck Cappello; Luigi Carro; Nathan DeBardeleben; Bo Fang; Sudhanva Gurumurthi; Karthik Pattabiraman; Paolo Rech; M. Sonza Reorda

GPGPUs are used increasingly in several domains, from gaming to different kinds of computationally intensive applications. In many applications GPGPU reliability is becoming a serious issue, and several research activities are focusing on its evaluation. This paper offers an overview of some major results in the area. First, it shows and analyzes the results of some experiments assessing GPGPU reliability in HPC datacenters. Second, it provides some recent results derived from radiation experiments about the reliability of GPGPUs. Third, it describes the characteristics of an advanced fault-injection environment, allowing effective evaluation of the resiliency of applications running on GPGPUs.


radiation effects data workshop | 2012

Neutron-Induced Soft Errors in Graphic Processing Units

Paolo Rech; Caroline Aguiar; Ronaldo Rodrigues Ferreira; M. Silvestri; A. Griffoni; C. Frost; Luigi Carro

This paper presents and analyzes the results of neutron experiments on 40 nm Graphic Processing Units. We measure the cross sections of the internal memory resources and define a new thread cross section to characterize the sensitivity of the computing units to radiation. We experimentally evaluate the error rate of a matrix multiplication application and build an analytical model to predict neutron-induced failures of algorithms.


international on-line testing symposium | 2012

Neutron radiation test of graphic processing units

Paolo Rech; Caroline Aguiar; Ronaldo Rodrigues Ferreira; Christopher Frost; Luigi Carro

This paper reports and analyzes the results of neutron radiation testing campaigns on a modern commercial-off-the-shelf Graphic Processing Unit (GPU). A set of guidelines for accelerated radiation experiments on GPUs is presented, emphasizing the shrewdness necessary to ease the test and gain meaningful data. Radiation test results are presented and discussed, highlighting the neutron sensitivities of the different GPU memory and logic resources in terms of Failure In Time (FIT) due to neutrons at sea level.
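
As a rough illustration of how a FIT figure of the kind mentioned in this abstract is usually derived from beam data, the sketch below divides the observed error count by the neutron fluence to obtain a cross section, then scales it by a sea-level flux. The function names, the example counts, and the flux of 13 neutrons/(cm^2 h) (a commonly quoted sea-level reference value) are assumptions for illustration, not values from the paper.

```python
# Hypothetical helper names; the numbers are illustrative, not from the paper.
SEA_LEVEL_FLUX = 13.0  # neutrons / (cm^2 * h), assumed reference flux

def cross_section(errors: int, fluence: float) -> float:
    """Cross section in cm^2: observed errors per unit fluence (n/cm^2)."""
    if fluence <= 0:
        raise ValueError("fluence must be positive")
    return errors / fluence

def fit_rate(sigma_cm2: float, flux: float = SEA_LEVEL_FLUX) -> float:
    """Failures In Time: expected failures per 10^9 device-hours."""
    return sigma_cm2 * flux * 1e9

# Example: 100 errors observed after a fluence of 1e11 n/cm^2.
sigma = cross_section(errors=100, fluence=1e11)
print(f"cross section = {sigma:.2e} cm^2, FIT = {fit_rate(sigma):.1f}")
```

Accelerated beam tests deliver in hours a fluence that would take many years to accumulate at sea level, which is why the cross section is measured first and only then converted to a field rate.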


IEEE Transactions on Nuclear Science | 2012

Neutron-Induced Failure in Silicon IGBTs, Silicon Super-Junction and SiC MOSFETs

Alessio Griffoni; Jeroen van Duivenbode; Dimitri Linten; Eddy Simoen; Paolo Rech; Luigi Dilillo; F. Wrobel; Patrick Verbist; G. Groeseneken

50 MeV and 80 MeV neutron-induced failure is investigated for several types of power devices (super-junction, IGBT and SiC) from different vendors. A strong dependence on the device type and orientation is observed.


IEEE Transactions on Computers | 2016

Evaluation and Mitigation of Radiation-Induced Soft Errors in Graphics Processing Units

Daniel Oliveira; Laércio Lima Pilla; Thiago Santini; Paolo Rech

Graphics processing units (GPUs) are increasingly attractive for both safety-critical and High-Performance Computing applications. GPU reliability is a primary concern for both the automotive and aerospace markets and is also becoming an issue for supercomputers. In fact, the high number of devices in large data centers makes the probability of at least one device being corrupted very high. In this paper, we aim to give novel insights into GPU reliability by evaluating the neutron sensitivity of modern GPU memory structures, highlighting pattern dependence and multiple-error occurrences. Additionally, a wide set of parallel codes is exposed to controlled neutron beams to measure GPUs' operative error rates. From experimental data and algorithm analysis we derive general insights into the reliability of parallel algorithms and programming approaches. Finally, error-correcting code, algorithm-based fault tolerance, and duplication with comparison hardening strategies are presented and evaluated on GPUs through radiation experiments. We present and compare both the reliability improvement and the imposed overhead of the selected hardening solutions.
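
To make the term algorithm-based fault tolerance concrete, the sketch below shows the classic row/column checksum scheme for matrix multiplication in plain Python. It is a generic textbook illustration and not the authors' GPU implementation; all function names are hypothetical, and it can locate and correct only a single corrupted output element.

```python
# Generic checksum-based ABFT sketch; names are hypothetical.

def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def abft_matmul(a, b):
    """Multiply with a column-checksum row on a and a row-checksum column on b.
    The result carries checksums in its last row and last column."""
    a_c = a + [[sum(col) for col in zip(*a)]]
    b_r = [row + [sum(row)] for row in b]
    return matmul(a_c, b_r)

def find_mismatches(c_full, tol=1e-9):
    """Return indices of data rows and columns whose checksum does not match."""
    n, m = len(c_full) - 1, len(c_full[0]) - 1
    rows = [i for i in range(n)
            if abs(sum(c_full[i][:m]) - c_full[i][m]) > tol]
    cols = [j for j in range(m)
            if abs(sum(c_full[i][j] for i in range(n)) - c_full[n][j]) > tol]
    return rows, cols

def correct_single_error(c_full):
    """Fix one corrupted data element in place; return True if the result
    is consistent afterwards, False if the damage is not a single element."""
    rows, cols = find_mismatches(c_full)
    if not rows and not cols:
        return True
    if len(rows) == 1 and len(cols) == 1:
        i, j = rows[0], cols[0]
        m = len(c_full[0]) - 1
        others = sum(c_full[i][k] for k in range(m) if k != j)
        c_full[i][j] = c_full[i][m] - others
        return True
    return False

c = abft_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
c[0][1] = 99                      # simulate a corrupted output element
print(correct_single_error(c), c[0][1])
```

When several output elements are corrupted at once, more than one row or column checksum fails and this simple scheme can only detect the problem, not repair it.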


IEEE Transactions on Nuclear Science | 2015

Using Benchmarks for Radiation Testing of Microprocessors and FPGAs

Heather Quinn; William H. Robinson; Paolo Rech; M. A. Aguirre; Arno Barnard; Marco Desogus; Luis Entrena; Mario García-Valderas; Steven M. Guertin; David R. Kaeli; Fernanda Lima Kastensmidt; Bradley T. Kiddie; Antonio Sanchez-Clemente; Matteo Sonza Reorda; Luca Sterpone; Michael Wirthlin

Performance benchmarks have been used over the years to compare different systems. These benchmarks can be useful for researchers trying to determine how changes to the technology, architecture, or compiler affect the system's performance. No such standard exists for systems deployed into high radiation environments, making it difficult to assess whether changes in the fabrication process, circuitry, architecture, or software affect reliability or radiation sensitivity. In this paper, we propose a benchmark suite for high-reliability systems that is designed for field-programmable gate arrays and microprocessors. We describe the development process and report neutron test data for the hardware and software benchmarks.


IEEE Transactions on Nuclear Science | 2014

Software-Based Hardening Strategies for Neutron Sensitive FFT Algorithms on GPUs

Laércio Lima Pilla; Paolo Rech; Francesco Silvestri; C. Frost; Philippe Olivier Alexandre Navaux; M. Sonza Reorda; Luigi Carro

In this paper we assess the neutron sensitivity of Graphics Processing Units (GPUs) when executing a Fast Fourier Transform (FFT) algorithm, and propose specific software-based hardening strategies to reduce its failure rate. Our research is motivated by experimental results with an unhardened FFT that demonstrate a majority of multiple errors in the output in the case of failures, which are caused by data dependencies. In addition, the use of the built-in error-correction code (ECC) showed a large overhead, and proved to be insufficient to provide high reliability. Experimental results with the hardened algorithm show a two orders of magnitude failure rate improvement over the original algorithm (one order of magnitude over ECC) and an overhead 64% smaller than ECC.

Collaboration


Dive into Paolo Rech's collaborations.

Top Co-Authors

Luigi Carro

Universidade Federal do Rio Grande do Sul

Fernanda Lima Kastensmidt

Universidade Federal do Rio Grande do Sul

Philippe Olivier Alexandre Navaux

Universidade Federal do Rio Grande do Sul

Christopher Frost

Rutherford Appleton Laboratory

Daniel Oliveira

Universidade Federal do Rio Grande do Sul

Lucas A. Tambara

Universidade Federal do Rio Grande do Sul

Luigi Dilillo

University of Montpellier

Gabriel L. Nazar

Universidade Federal do Rio Grande do Sul

F. Wrobel

University of Montpellier

Heather Quinn

Los Alamos National Laboratory
