Is this you? Create Your Porfile

Roberto Gioiosa

Pacific Northwest National Laboratory

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Roberto Gioiosa is active.

Explore More

Publication

Featured researches published by Roberto Gioiosa.

ieee international symposium on workload characterization | 2013

Quantifying the energy cost of data movement in scientific applications

Gokcen Kestor; Roberto Gioiosa; Darren J. Kerbyson; Adolfy Hoisie

In the exascale era, the energy cost of moving data across the memory hierarchy is expected to be two orders of magnitude higher than the cost of performing a double-precision floating point operation. Despite its importance, the energy cost of data movement in scientific applications has not be quantitatively evaluated even for current systems.

international conference on parallel architectures and compilation techniques | 2012

Making data prefetch smarter: adaptive prefetching on POWER7

Víctor Jiménez; Roberto Gioiosa; Francisco J. Cazorla; Alper Buyuktosunoglu; Pradip Bose; Francis Patrick O'Connell

Hardware data prefetch engines are integral parts of many general purpose server-class microprocessors in the field today. Some prefetch engines allow the user to change some of their parameters. The prefetcher, however, is usually enabled in a default configuration during system bring-up and dynamic reconfiguration of the prefetch engine is not an autonomic feature of current machines. Conceptually, however, it is easy to infer that commonly used prefetch algorithms, when applied in a fixed mode will not help performance in many cases. In fact, they may actually degrade performance due to useless bus bandwidth consumption and cache pollution. In this paper, we present an adaptive prefetch scheme that dynamically modifies the prefetch settings in order to adapt to the workload requirements. We implement and evaluate adaptive prefetching in the context of an existing, commercial processor, namely the IBM POWER7. Our adaptive prefetch mechanism improves performance with respect to the default prefetch setting up to 2.7X and 30% for single-threaded and multiprogrammed workloads, respectively.

ieee international conference on high performance computing data and analytics | 2015

Understanding the propagation of transient errors in HPC applications

Rizwan A. Ashraf; Roberto Gioiosa; Gokcen Kestor; Ronald F. DeMara; Chen-Yong Cher; Pradip Bose

Resiliency of exascale systems has quickly become an important concern for the scientific community. Despite its importance, still much remains to be determined regarding how faults disseminate or at what rate do they impact HPC applications. The understanding of where and how fast faults propagate could lead to more efficient implementation of application-driven error detection and recovery. In this work, we propose a fault propagation framework to analyze how faults propagate in MPI applications and to understand their vulnerability to faults. We employ a combination of compiler-level code transformation and instrumentation, along with a runtime checker. Using the information provided by our framework, we employ machine learning technique to derive application fault propagation models that can be used to estimate the number of corrupted memory locations at runtime.

international symposium on memory management | 2017

RTHMS: a tool for data placement on hybrid memory system

Ivy Bo Peng; Roberto Gioiosa; Gokcen Kestor; Pietro Cicotti; Erwin Laure; Stefano Markidis

Traditional scientific and emerging data analytics applications require fast, power-efficient, large, and persistent memories. Combining all these characteristics within a single memory technology is expensive and hence future supercomputers will feature different memory technologies side-by-side. However, it is a complex task to program hybrid-memory systems and to identify the best object-to-memory mapping. We envision that programmers will probably resort to use default configurations that only require minimal interventions on the application code or system settings. In this work, we argue that intelligent, fine-grained data placement can achieve higher performance than default setups. We present an algorithm for data placement on hybrid-memory systems. Our algorithm is based on a set of single-object allocation rules and global data placement decisions. We also present RTHMS, a tool that implements our algorithm and provides recommendations about the object-to-memory mapping. Our experiments on a hybrid memory system, an Intel Knights Landing processor with DRAM and HBM, show that RTHMS is able to achieve higher performance than the default configuration. We believe that RTHMS will be a valuable tool for programmers working on complex hybrid-memory systems.

international conference on conceptual structures | 2016

A Performance Characterization of Streaming Computing on Supercomputers.

Stefano Markidis; Ivy Bo Peng; Roman Iakymchuk; Erwin Laure; Gokcen Kestor; Roberto Gioiosa

Streaming computing models allow for on-the-y processing of large data sets. With the increased demand for processing large amount of data in a reasonable period of time, streaming models are more ...

international conference on parallel processing | 2017

Preparing HPC Applications for the Exascale Era: A Decoupling Strategy

Ivy Bo Peng; Roberto Gioiosa; Gokcen Kestor; Erwin Laure; Stefano Markidis

Production-quality parallel applications are often a mixture of diverse operations, such as computation- and communication-intensive, regular and irregular, tightly coupled and loosely linked operations. In conventional construction of parallel applications, each process performs all the operations, which might result inefficient and seriously limit scalability, especially at large scale. We propose a decoupling strategy to improve the scalability of applications running on large-scale systems. Our strategy separates application operations onto groups of processes and enables a dataflow processing paradigm among the groups. This mechanism is effective in reducing the impact of load imbalance and increases the parallel efficiency by pipelining multiple operations. We provide a proof-of-concept implementation using MPI, the de-facto programming system on current supercomputers. We demonstrate the effectiveness of this strategy by decoupling the reduce, particle communication, halo exchange and I/O operations in a set of scientific and data-analytics applications. A performance evaluation on 8,192 processes of a Cray XC40 supercomputer shows that the proposed approach can achieve up to 4x performance improvement.

ieee acm international symposium cluster cloud and grid computing | 2017

Extending Message Passing Interface Windows to Storage

Sergio Rivas-Gomez; Stefano Markidis; Ivy Bo Peng; Erwin Laure; Gokcen Kestor; Roberto Gioiosa

This paper presents an extension to MPI supporting the one-sided communication model and window allocations in storage. Our design transparently integrates with the current MPI implementations, enabling applications to target MPI windows in storage, memory or both simultaneously, without major modifications. Initial performance results demonstrate that the presented MPI window extension could potentially be helpful for a wide-range of use-cases and with low-overhead.

symposium on computer architecture and high performance computing | 2014

Cross-Layer Self-Adaptive/Self-Aware System Software for Exascale Systems

Roberto Gioiosa; Gokcen Kestor; Darren J. Kerbyson; Adolfy Hoisie

The extreme level of parallelism coupled with the limited available power budget expected in the exascale era brings unprecedented challenges that demand optimization of performance, power and resiliency in unison. Scalability on such systems is of paramount importance, while power and reliability issues may change the execution environment in which a parallel application runs. To solve these challenges exascale systems will require an introspective system software that combines system and application observations across all system stack layers with online feedback and adaptation mechanisms. In this paper we propose the design of a novel self-aware, selfadaptive system software in which a kernel-level Monitor, which continuously inspects the evolution of the target system through observation of Sensors, is combined with a user-level Controller, which reacts to changes in the execution environment, explores opportunities to increase performance, save power and adapts applications to new execution scenarios. We show that the monitoring system accurately monitors the evolution of parallel applications with a runtime overhead below 1-2%. As a test case, we design and implement a runtime system that aims at optimizing applications performance and system power consumption on complex hierarchical architectures. Our results show that our adaptive system reaches 98% of performance efficiency of manually-tuned applications.

high performance computing and communications | 2016

Idle Period Propagation in Message-Passing Applications

Ivy Bo Peng; Stefano Markidis; Erwin Laure; Gokcen Kestor; Roberto Gioiosa

Idle periods on different processes of Message Passing applications are unavoidable. While the origin of idle periods on a single process is well understood as the effect of system and architectural random delays, yet it is unclear how these idle periods propagate from one process to another. It is important to understand idle period propagation in Message Passing applications as it allows application developers to design communication patterns avoiding idle period propagation and the consequent performance degradation in their applications. To understand idle period propagation, we introduce a methodology to trace idle periods when a process is waiting for data from a remote delayed process in MPI applications. We apply this technique in an MPI application that solves the heat equation to study idle period propagation on three different systems. We confirm that idle periods move between processes in the form of waves and that there are different stages in idle period propagation. Our methodology enables us to identify a self-synchronization phenomenon that occurs on two systems where some processes run slower than the other processes.

international conference on cluster computing | 2015

On the Application Task Granularity and the Interplay with the Scheduling Overhead in Many-Core Shared Memory Systems

Dana Akhmetova; Gokcen Kestor; Roberto Gioiosa; Stefano Markidis; Erwin Laure

Task-based programming models are considered one of the most promising programming model approaches for exascale supercomputers because of their ability to dynamically react to changing conditions and reassign work to processing elements. One question, however, remains unsolved: what should the task granularity of task-based applications be? Fine-grained tasks offer more opportunities to balance the system and generally result in higher system utilization. However, they also induce in large scheduling overhead. The impact of scheduling overhead on coarse-grained tasks is lower, but large systems may result imbalanced and underutilized. In this work we propose a methodology to analyze the interplay between application task granularity and scheduling overhead. Our methodology is based on three main points: 1) a novel task algorithm that analyzes an application directed acyclic graph (DAG) and aggregates tasks, 2) a fast and precise emulator to analyze the application behavior on systems with up to 1,024 cores, 3) a comprehensive sensitivity analysis of application performance and scheduling overhead breakdown. Our results show that there is an optimal task granularity between 1.2×104 and 10×104 cycles for the representative schedulers. Moreover, our analysis indicates that a suitable scheduler for exascale task-based applications should employ a best-effort local scheduler and a sophisticated remote scheduler to move tasks across worker threads.

Explore More