Publication


Featured research published by Raul Esteban Silvera.


international conference on parallel architectures and compilation techniques | 2012

Evaluation of Blue Gene/Q hardware support for transactional memories

Amy Wang; Matthew Gaudet; Peng Wu; José Nelson Amaral; Martin Ohmacht; Christopher Barton; Raul Esteban Silvera; Maged M. Michael

This paper describes an end-to-end system implementation of the transactional memory (TM) programming model on top of the hardware transactional memory (HTM) of the Blue Gene/Q (BG/Q) machine. The TM programming model supports most C/C++ programming constructs on top of a best-effort HTM with the help of a complete software stack including the compiler, the kernel, and the TM runtime. An extensive evaluation of the STAMP benchmarks on BG/Q is the first of its kind in understanding characteristics of running coarse-grained TM workloads on HTMs. The study reveals several interesting insights on the overhead and the scalability of BG/Q HTM with respect to sequential execution, coarse-grain locking, and software TM.
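As a hedged illustration only (not code from the paper): the TM programming model described here lets ordinary C/C++ code run inside transactions, which the IBM XL compiler for BG/Q expressed with a `tm_atomic` pragma. The shared-counter example below is invented.

```c
#include <stdio.h>

static long counter = 0;

int main(void)
{
    #pragma omp parallel for
    for (int i = 0; i < 1000000; i++) {
        /* Runs as a hardware transaction; on repeated conflict or
         * capacity overflow the TM runtime falls back to a lock. */
        #pragma tm_atomic
        {
            counter++;
        }
    }
    printf("counter = %ld\n", counter);
    return 0;
}
```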


international workshop on openmp | 2003

Is the schedule clause really necessary in OpenMP?

Eduard Ayguadé; Bob Blainey; Alejandro Duran; Jesús Labarta; Francisco Martínez; Xavier Martorell; Raul Esteban Silvera

Choosing the appropriate assignment of loop iterations to threads is one of the most important decisions that must be made when parallelizing loops, the main source of parallelism in numerical applications. This is not an easy task, even for expert programmers, and it can potentially consume a large amount of time. OpenMP offers the schedule clause, with a set of predefined iteration-scheduling strategies, to specify how (and when) this assignment of iterations to threads is done. In some cases, the best schedule depends on architectural characteristics of the target machine, the input data, and other factors, making the code less portable. Even worse, the best schedule can change during execution, depending on dynamic changes in the behavior of the loop or in the resources available in the system. Also, for certain types of imbalanced loops, the schedulers already proposed in the literature are unable to extract the maximum parallelism because they do not appropriately trade off load balancing and data locality. This paper proposes a new scheduling strategy that derives, at run time, the best scheduling policy for each parallel loop in the program, based on information gathered by the runtime library itself.
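For context, the schedule clause the paper questions is written like this in OpenMP C code (the loop body and chunk size are hypothetical):

```c
#include <omp.h>

void scale(double *a, const double *b, int n)
{
    /* The programmer must pick a strategy up front: static,
     * dynamic, or guided, each trading balance against locality. */
    #pragma omp parallel for schedule(dynamic, 64)
    for (int i = 0; i < n; i++)
        a[i] = 2.0 * b[i];
}
```

OpenMP's schedule(runtime) defers the choice to an environment variable, but a human must still pick a policy; the strategy proposed here instead lets the runtime derive the policy from its own measurements.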


international symposium on memory management | 2008

MPADS: memory-pooling-assisted data splitting

Stephen Curial; Peng Zhao; José Nelson Amaral; Yaoqing Gao; Shimin Cui; Raul Esteban Silvera; Roch Georges Archambault

This paper describes Memory-Pooling-Assisted Data Splitting (MPADS), a framework that combines data structure splitting with memory pooling. Although MPADS may call to mind memory padding, a distinction of this framework is that it does not insert padding. MPADS relies on pointer analysis to ensure that splitting is safe and applicable to type-unsafe languages; it makes no assumption about type safety. The analysis can identify cases in which the transformation could lead to incorrect code, and MPADS abandons those cases. To make data structure splitting efficient in a commercial compiler, MPADS is designed with great attention to reducing the number of instructions required to access the data after the splitting. Moreover, the implementation of MPADS reveals that architectural details should be considered carefully when rearranging data allocation. For instance, one of the most significant gains from introducing data-structure splitting in code targeting the IBM POWER architecture is a dramatic decrease in the amount of data prefetched by the hardware prefetch engine, without a noticeable decrease in cache utilization. Triggering fewer hardware prefetch streams frees memory bandwidth and cache space. Fewer prefetch streams also reduce the interference between the data accessed by multiple cores in modern multicore processors.
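A hand-written sketch of the transformation MPADS automates (the struct and field names are invented; the real framework applies this in the compiler, guided by pointer analysis): an array of wide structs becomes per-field arrays carved out of one pooled allocation, with no padding inserted between them.

```c
#include <stdlib.h>

/* Original layout: one array of wide structs. */
struct particle { double x, y, z; int id; };

/* Split layout: per-field arrays drawn from a single memory pool. */
struct particle_split {
    double *x, *y, *z;
    int    *id;
};

int split_alloc(struct particle_split *p, size_t n)
{
    /* One pooled allocation, partitioned by field; accesses to a hot
     * field such as x now form one contiguous, prefetch-friendly stream. */
    char *pool = malloc(n * (3 * sizeof(double) + sizeof(int)));
    if (!pool) return -1;
    p->x  = (double *)pool;
    p->y  = p->x + n;
    p->z  = p->y + n;
    p->id = (int *)(p->z + n);
    return 0;
}
```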


ACM Transactions on Programming Languages and Systems | 2007

Forma: A framework for safe automatic array reshaping

Peng Zhao; Shimin Cui; Yaoqing Gao; Raul Esteban Silvera; José Nelson Amaral

This article presents Forma, a practical, safe, and automatic data reshaping framework that reorganizes arrays to improve data locality. Forma splits large aggregated data-types into smaller ones to improve data locality. Arrays of these large data types are then replaced by multiple arrays of the smaller types. These new arrays form natural data streams that have smaller memory footprints, better locality, and are more suitable for hardware stream prefetching. Forma consists of a field-sensitive alias analyzer, a data type checker, a portable structure reshaping planner, and an array reshaper. An extensive experimental study compares different data reshaping strategies in two dimensions: (1) how the data structure is split into smaller ones (maximal partition × frequency-based partition × affinity-based partition); and (2) how partitioned arrays are linked to preserve program semantics (address-arithmetic-based reshaping × pointer-based reshaping). This study exposes important characteristics of array reshaping. First, a practical data reshaper needs not only an inter-procedural analysis but also a data-type checker to make sure that array reshaping is safe. Second, the performance improvement due to array reshaping can be dramatic: standard benchmarks can run up to 2.1 times faster after array reshaping. Array reshaping may also result in some performance degradation for certain benchmarks. An extensive micro-architecture-level performance study identifies the causes for this degradation. Third, the seemingly naive maximal partition achieves the best or close-to-best performance in the benchmarks studied. This article presents an analysis that explains this surprising result. Finally, address-arithmetic-based reshaping always performs better than its pointer-based counterpart.
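To make the second comparison axis concrete, here is a hedged sketch of the two linking strategies (all names invented):

```c
#include <stdlib.h>

/* Address-arithmetic-based reshaping: both field streams share one
 * allocation; element i of the second stream is base[n + i], so no
 * extra pointer load is needed on access. */
double *reshape_arith(size_t n)
{
    return malloc(2 * n * sizeof(double));
}

/* Pointer-based reshaping: each field stream sits behind its own
 * pointer, so every access pays an additional indirection; the paper
 * finds this form consistently slower. */
struct reshaped {
    double *hot;    /* frequently accessed field */
    double *cold;   /* rarely accessed field */
};
```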


conference of the centre for advanced studies on collaborative research | 2008

OpenMP tasks in IBM XL compilers

Xavier Teruel; Priya Unnikrishnan; Xavier Martorell; Eduard Ayguadé; Raul Esteban Silvera; Guansong Zhang; Ettore Tiotto

Tasking is the most significant feature included in the new OpenMP 3.0 standard. It was introduced to handle unstructured parallelism and broaden the range of applications that can be parallelized by OpenMP. This paper presents the design and implementation of the task model in the IBM XL parallelizing compilers. The task construct is significantly different from other OpenMP constructs. This paper discusses some of the unique challenges in implementing the task construct and its associated synchronization constructs in the compiler. We also present a performance evaluation of our implementation on a set of benchmarks and applications. We identify limitations in the current implementation and propose solutions for further improvement.
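A minimal example of the OpenMP 3.0 constructs this implementation covers; the recursive Fibonacci kernel is a customary illustration, not one taken from the paper:

```c
#include <omp.h>
#include <stdio.h>

long fib(int n)
{
    long a, b;
    if (n < 2) return n;
    #pragma omp task shared(a)        /* child task for fib(n-1) */
    a = fib(n - 1);
    #pragma omp task shared(b)        /* child task for fib(n-2) */
    b = fib(n - 2);
    #pragma omp taskwait              /* wait for both children */
    return a + b;
}

int main(void)
{
    long r;
    #pragma omp parallel
    #pragma omp single                /* one thread seeds the task tree */
    r = fib(30);
    printf("fib(30) = %ld\n", r);
    return 0;
}
```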


international workshop on openmp | 2004

Runtime adjustment of parallel nested loops

Alejandro Duran; Raul Esteban Silvera; Julita Corbalan; Jesús Labarta

OpenMP allows programmers to specify nested parallelism in parallel applications. In scientific applications, parallel loops are the most important source of parallelism. In this paper we present an automatic mechanism to dynamically detect the best way to exploit the parallelism when nested parallel loops are present. This mechanism is based on the number of threads, the problem size, and the number of iterations of the loop. To that end, we claim that programmers should specify the potential parallelism of the application and give the runtime the responsibility of deciding the best way to exploit it. We have implemented this mechanism inside the IBM XL runtime library. Evaluation shows that our mechanism dynamically adapts the generated parallelism to the application and runtime parameters, reaching the same speedup as the best static parallelization (with a priori information).
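The situation the mechanism targets, written directly in OpenMP (dimensions are hypothetical); choosing which of the two levels to exploit, and with how many threads, is exactly the decision the paper moves from the programmer to the runtime:

```c
#include <omp.h>

void scale_rows(double **m, int rows, int cols)
{
    omp_set_nested(1);                 /* allow nested parallel regions */
    #pragma omp parallel for           /* outer level of parallelism */
    for (int i = 0; i < rows; i++) {
        #pragma omp parallel for       /* inner level: worthwhile only
                                          for large cols or few rows */
        for (int j = 0; j < cols; j++)
            m[i][j] *= 2.0;
    }
}
```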


IEEE Transactions on Computers | 2015

Software Support and Evaluation of Hardware Transactional Memory on Blue Gene/Q

Amy Wang; Matthew Gaudet; Peng Wu; Martin Ohmacht; José Nelson Amaral; Christopher Barton; Raul Esteban Silvera; Maged M. Michael

This paper describes an end-to-end system implementation of a transactional memory (TM) programming model on top of the hardware transactional memory (HTM) of the Blue Gene/Q (BG/Q) machine. The TM programming model supports most C/C++ programming constructs on top of a best-effort HTM with the help of a complete software stack including the compiler, the kernel, and the TM runtime. An extensive evaluation of the STAMP and RMS-TM benchmark suites on BG/Q is the first of its kind in understanding the characteristics of running TM workloads on real hardware TM. The study reveals several interesting insights on the overhead and the scalability of BG/Q HTM with respect to sequential execution, coarse-grain locking, and software TM.
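The coarse-grain-locking baseline used in the comparison amounts to guarding a whole shared structure with a single lock, as in this hedged sketch (the table helper is invented); the transactional version replaces the lock with a transaction, as in the sketch under the 2012 paper above:

```c
#include <omp.h>

void table_insert(int key, int val);   /* hypothetical shared-table update */

static omp_lock_t table_lock;          /* initialize with omp_init_lock() */

void insert_locked(int key, int val)
{
    omp_set_lock(&table_lock);         /* one global lock serializes every
                                          update, however disjoint the keys */
    table_insert(key, val);
    omp_unset_lock(&table_lock);
}
```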


international conference on parallel processing | 2012

A practical approach to DOACROSS parallelization

Priya Unnikrishnan; Jun Shirako; Kit Barton; Sanjay Chatterjee; Raul Esteban Silvera; Vivek Sarkar

Loops with cross-iteration dependences (doacross loops) often contain significant amounts of parallelism that can potentially be exploited on modern manycore processors. However, most production-strength compilers focus their automatic parallelization efforts on doall loops, and consider doacross parallelism to be impractical due to the space inefficiencies and the synchronization overheads of past approaches. This paper presents a novel and practical approach to automatically parallelizing doacross loops for execution on manycore-SMP systems. We introduce a compiler-and-runtime optimization called dependence folding that bounds the number of synchronization variables allocated per worker thread (processor core) to be at most the maximum depth of a loop nest being considered for automatic parallelization. Our approach has been implemented in a development version of the IBM XL Fortran V13.1 commercial parallelizing compiler and runtime system. For four benchmarks where automatic doall parallelization was largely ineffective (speedups of under 2×), our implementation delivered speedups of 6.5×, 9.0×, 17.3×, and 17.5× on a 32-core IBM Power7 SMP system, thereby showing that doacross parallelization can be a valuable technique to complement doall parallelization.
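OpenMP 4.5 later standardized doacross loops through the ordered(depend) clauses; as a hedged sketch of the dependence pattern this compiler parallelizes automatically (the per-iteration function is hypothetical; dependence folding bounds the synchronization state such loops require per thread):

```c
#include <omp.h>

double f(double x);   /* hypothetical per-iteration work */

void doacross(double *a, int n)
{
    /* ordered(1): one level of cross-iteration dependences */
    #pragma omp parallel for ordered(1)
    for (int i = 1; i < n; i++) {
        #pragma omp ordered depend(sink: i - 1)   /* wait on iteration i-1 */
        a[i] = f(a[i - 1]);                       /* carried dependence */
        #pragma omp ordered depend(source)        /* release iteration i */
    }
}
```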


european conference on object oriented programming | 2013

Simple profile rectifications go a long way

Bo Wu; Mingzhou Zhou; Xipeng Shen; Yaoqing Gao; Raul Esteban Silvera; Graham Yiu

Feedback-driven program optimization (FDO) is common in modern compilers, including Just-In-Time compilers increasingly adopted for object-oriented or scripting languages. This paper describes a systematic study in understanding and alleviating the effects of sampling errors on the usefulness of the obtained profiles for FDO. Taking a statistical approach, it offers a series of counter-intuitive findings, and identifies two kinds of profile errors that affect FDO critically, namely zero-count errors and inconsistency errors. It further proposes statistical profile rectification, a simple approach to correcting profiling errors by leveraging statistical patterns in a profile. Experiments show that the simple approach enhances the effectiveness of sampled profile-based FDO dramatically, increasing the average FDO speedup from 1.16X to 1.3X, around 92% of what full profiles can yield.
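A deliberately simplified, hypothetical sketch of what rectifying zero-count errors can look like (the heuristic and all names are invented; the paper's statistical method is more sophisticated): sampled-zero blocks that sit on hot paths get a small floor count so FDO does not treat them as dead code.

```c
#include <stddef.h>

/* block_count[i] holds the sampled execution count of basic block i;
 * on_hot_path[i] is 1 if block i is reachable only through blocks whose
 * counts exceed `hot`. Both arrays are hypothetical inputs. */
void rectify_zero_counts(long *block_count, const int *on_hot_path,
                         size_t nblocks, long hot)
{
    for (size_t i = 0; i < nblocks; i++) {
        /* A zero count inside a hot region is almost certainly a
         * sampling artifact, not dead code: give it a floor value. */
        if (block_count[i] == 0 && on_hot_path[i])
            block_count[i] = hot / 100;   /* invented heuristic floor */
    }
}
```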


conference of the centre for advanced studies on collaborative research | 2009

OpenMP tasking analysis for programmers

Xavier Teruel; Christopher Barton; Alejandro Duran; Xavier Martorell; Eduard Ayguadé; Priya Unnikrishnan; Guansong Zhang; Raul Esteban Silvera

As of 2008, the OpenMP 3.0 standard includes task support allowing programmers to exploit irregular parallelism. Although several compilers provide support for this new feature, there has not been extensive investigation into the real possibilities of this extension. Several papers have discussed the programming model itself, while others have discussed its design and implementation on different platforms. There are also papers demonstrating performance results using well-known kernel applications. This paper presents an analysis of the possibilities of the OpenMP tasking model, using the IBM XL compiler implementation. Using different parameters such as the number of tasks, task granularity, and parallelism pattern, this paper explores how such parameters affect average performance and identifies the limits of the OpenMP tasking model.
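The granularity parameter such a study varies can be pictured as follows (the chunking scheme is invented for illustration): the same work split into many fine tasks or a few coarse ones, with task-creation overhead deciding which wins.

```c
#include <omp.h>

void process(double *a, int lo, int hi);  /* hypothetical kernel */

/* Creates n/chunk tasks: a small `chunk` means many fine-grained tasks
 * (more overhead), a large one means few coarse tasks (less overlap). */
void run_tasks(double *a, int n, int chunk)
{
    #pragma omp parallel
    #pragma omp single                    /* one thread spawns the tasks */
    for (int lo = 0; lo < n; lo += chunk) {
        int hi = lo + chunk < n ? lo + chunk : n;
        #pragma omp task firstprivate(lo, hi)
        process(a, lo, hi);
    }
}
```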
