Is this you? Create Your Porfile

Roger Ferrer

Barcelona Supercomputing Center

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Roger Ferrer is active.

Explore More

Publication

Featured researches published by Roger Ferrer.

international conference on parallel processing | 2009

Barcelona OpenMP Tasks Suite: A Set of Benchmarks Targeting the Exploitation of Task Parallelism in OpenMP

Alejandro Duran; Xavier Teruel; Roger Ferrer; Xavier Martorell; Eduard Ayguadé

Traditional parallel applications have exploited regular parallelism, based on parallel loops. Only a few applications exploit sections parallelism. With the release of the new OpenMP specification (3.0), this programming model supports tasking. Parallel tasks allow the exploitation of irregular parallelism, but there is a lack of benchmarks exploiting tasks in OpenMP. With the current (and projected) multicore architectures that offer many more alternatives to execute parallel applications than traditional SMP machines, this kind of parallelism is increasingly important. And so, the need to have some set of benchmarks to evaluate it. In this paper, we motivate the need of having such a benchmarks suite, for irregular and/or recursive task parallelism. We present our proposal, the Barcelona OpenMP Tasks Suite (BOTS), with a set of applications exploiting regular and irregular parallelism, based on tasks. We present an overall evaluation of the BOTS benchmarks in an Altix system and we discuss some of the different experiments that can be done with the different compilation and runtime alternatives of the benchmarks.

International Journal of Parallel Programming | 2010

Extending OpenMP to Survive the Heterogeneous Multi-Core Era

Eduard Ayguadé; Rosa M. Badia; Pieter Bellens; Daniel Cabrera; Alejandro Duran; Roger Ferrer; Marc Gonzàlez; Francisco D. Igual; Daniel Jiménez-González; Jesus Labarta; Luis Martinell; Xavier Martorell; Rafael Mayo; Josep M. Perez; Judit Planas; Enrique S. Quintana-Ortí

This paper advances the state-of-the-art in programming models for exploiting task-level parallelism on heterogeneous many-core systems, presenting a number of extensions to the OpenMP language inspired in the StarSs programming model. The proposed extensions allow the programmer to write portable code easily for a number of different platforms, relieving him/her from developing the specific code to off-load tasks to the accelerators and the synchronization of tasks. Our results obtained from the StarSs instantiations for SMPs, the Cell, and GPUs report reasonable parallel performance. However, the real impact of our approach in is the productivity gains it yields for the programmer.

conference of the centre for advanced studies on collaborative research | 2007

Support for OpenMP tasks in Nanos v4

Xavier Teruel; Xavier Martorell; Alejandro Duran; Roger Ferrer; Eduard Ayguadé

In this paper we describe an implementation overview of Nanos v4: an OpenMP Run Time Library (RTL) based on the nano-threads programming model. Our main goal is to discuss different aspects of the library development focusing on the implementation of a new feature introduced in the last OpenMP release: task support. We compare the performance of our prototype implementation and the workqueuing model available on the Intel compiler with a set of kernel applications.

International Journal of Parallel Programming | 2009

A proposal to extend the OpenMP tasking model with dependent tasks

Alejandro Duran; Roger Ferrer; Eduard Ayguadé; Rosa M. Badia; Jesús Labarta

Tasking in OpenMP 3.0 has been conceived to handle the dynamic generation of unstructured parallelism. New directives have been added allowing the user to identify units of independent work (tasks) and to define points to wait for the completion of tasks (task barriers). In this document we propose extensions to allow the runtime detection of dependencies between generated tasks, broading the range of applications that can benefit from tasking or improving the performance when load balancing or locality are critical issues for performance. The proposed extensions are evaluated on a SGI Altix multiprocessor architecture using a couple of small applications and a prototype runtime system implementation.

languages and compilers for parallel computing | 2010

Optimizing the exploitation of multicore processors and GPUs with OpenMP and OpenCL

Roger Ferrer; Judit Planas; Pieter Bellens; Alejandro Duran; Marc Gonzàlez; Xavier Martorell; Rosa M. Badia; Eduard Ayguadé; Jesús Labarta

In this paper, we present OMPSs, a programming model based on OpenMP and StarSs, that can also incorporate the use of OpenCL or CUDA kernels. We evaluate the proposal on three different architectures, SMP, Cell/B.E. and GPUs, showing the wide usefulness of the approach. The evaluation is done with four different benchmarks, Matrix Multiply, BlackScholes, Perlin Noise, and Julia Set. We compare the results obtained with the execution of the same benchmarks written in OpenCL, in the same architectures. The results show that OMPSs greatly outperforms the OpenCL environment. It is more flexible to exploit multiple accelerators. And due to the simplicity of the annotations, it increases programmers productivity.

International Journal of Parallel Programming | 2011

ACOTES project: Advanced compiler technologies for embedded streaming

Eduard Ayguadé; Cédric Bastoul; Paul M. Carpenter; Zbigniew Chamski; Albert Cohen; Marco Cornero; Philippe Dumont; Marc Duranton; Mohammed Fellahi; Roger Ferrer; Razya Ladelsky; Menno Lindwer; Xavier Martorell; Cupertino Miranda; Dorit Nuzman; Andrea Ornstein; Antoniu Pop; Sebastian Pop; Louis-Noël Pouchet; Alex Ramirez; David Ródenas; Erven Rohou; Ira Rosen; Uzi Shvadron; Konrad Trifunovic; Ayal Zaks

Streaming applications are built of data-driven, computational components, consuming and producing unbounded data streams. Streaming oriented systems have become dominant in a wide range of domains, including embedded applications and DSPs. However, programming efficiently for streaming architectures is a challenging task, having to carefully partition the computation and map it to processes in a way that best matches the underlying streaming architecture, taking into account the distributed resources (memory, processing, real-time requirements) and communication overheads (processing and delay). These challenges have led to a number of suggested solutions, whose goal is to improve the programmer’s productivity in developing applications that process massive streams of data on programmable, parallel embedded architectures. StreamIt is one such example. Another more recent approach is that developed by the ACOTES project (Advanced Compiler Technologies for Embedded Streaming). The ACOTES approach for streaming applications consists of compiler-assisted mapping of streaming tasks to highly parallel systems in order to maximize cost-effectiveness, both in terms of energy and in terms of design effort. The analysis and transformation techniques automate large parts of the partitioning and mapping process, based on the properties of the application domain, on the quantitative information about the target systems, and on programmer directives. This paper presents the outcomes of the ACOTES project, a 3-year collaborative work of industrial (NXP, ST, IBM, Silicon Hive, NOKIA) and academic (UPC, INRIA, MINES ParisTech) partners, and advocates the use of Advanced Compiler Technologies that we developed to support Embedded Streaming.

international workshop on openmp | 2007

Transactional Memory and OpenMP

Milos Milovanovic; Roger Ferrer; Osman S. Unsal; Adrián Cristal; Xavier Martorell; Eduard Ayguadé; Jesús Labarta; Mateo Valero

Future generations of Chip Multiprocessors (CMP) will provide dozens or even hundreds of cores inside the chip. Writing applications that benefit from the massive computational power offered by these chips is not going to be an easy task for mainstream programmers who are used to sequential algorithms rather than parallel ones. This paper explores the possibility of using Transactional Memory (TM) in OpenMP, the industrial standard for writing parallel programs on shared-memory architectures, for C, C++, and Fortran. One of the major complexities in writing OpenMP applications is the use of critical regions (locks), atomic regions and barriers to synchronize the execution of parallel activities in threads. TM has been proposed as a mechanism that abstracts some of the complexities associated with concurrent access to shared data while enabling scalable performance. The paper presents a first proof-of-concept implementation of OpenMP with TM. Some extensions to the language are proposed to express transactions. These extensions are handled by our source-to-source OpenMP Mercurium compiler and our Software Transactional Memory (STM) library Nebelung that supports the code generated by Mercurium. The current implementation of the library has no support at the hardware level, so it is a proof-of-concept implementation. Hardware Transactional Memory (HTM) or Hardware-assisted STM (HaSTM) are seen as possible paths to make the tandem TM-OpenMP more usable. The paper finishes with a set of open issues that still need to be addressed, either in OpenMP or in the hardware/software implementations of TM.

languages and compilers for parallel computing | 2009

Unrolling loops containing task parallelism

Roger Ferrer; Alejandro Duran; Xavier Martorell; Eduard Ayguadé

Classic loop unrolling allows to increase the performance of sequential loops by reducing the overheads of the non-computational parts of the loop. Unfortunately, when the loop contains parallelism inside most compilers will ignore it or perform a naive transformation. We propose to extend the semantics of the loop unrolling transformation to cover loops that contain task parallelism. In these cases, the transformation will try to aggregate the multiple tasks that appear after a classic unrolling phase to reduce the overheads per iteration. We present an implementation of such extended loop unrolling for OpenMP tasks with two phases: a classical unroll followed by a task aggregation phase. Our aggregation technique covers the special cases where task parallelism appears inside branches or where the loop is uncountable. Our experimental results show that using this extended unroll allows loops with fine-grained tasks to reduce the overheads associated with task creation and obtain a much better scaling.

memory performance dealing with applications systems and architecture | 2007

Multithreaded software transactional memory and OpenMP

Milos Milovanovic; Roger Ferrer; Vladimir Gajinov; Osman S. Unsal; Adrián Cristal; Eduard Ayguadé; Mateo Valero

Transactional Memory (TM) is a key future technology for emerging many-cores. On the other hand, OpenMP provides a vast established base for writing parallel programs, especially for scientific applications. Combining TM with OpenMP provides a rich, enhanced programming environment and an attractive solution to the many-core software productivity problem. In this paper, we discuss the first multithreaded runtime environment for supporting our combined TM and OpenMP framework. We present the extensions of OpenMP for using Transactional Memory. We then present the novel multithreaded STM design with a dedicated thread for eager asynchronous conflict detection. Conflict detection will be executed in a separate thread so the transaction will not waste time on that. On the other hand, eager conflict detection is asynchronous which will increase speculative parallel execution. We also include an initial performance analysis of the runtime system and possible gains which can be achieved.

memory performance dealing with applications systems and architecture | 2008

Evaluation of memory performance on the cell BE with the SARC programming model

Roger Ferrer; Marc Gonzàlez; Federico Silla; Xavier Martorell; Eduard Ayguadé

With the advent of multicore architectures, especially with the heterogeneous ones, both computational and memory top performance are difficult to obtain using traditional programming models. Usually, programmers have to fully reorganize the code and data of their applications in order to maximize resource usage, and work with the low-level interfaces offered by the vendor-provided SDKs, to obtain high computational and memory performances. In this paper, we present the evaluation of the SARC programming model on the Cell BE architecture, with respect to memory performance. We show how we have annotated the HPL STREAM and RandomAccess applications, and the memory bandwidth obtained. Results indicate that the programming model provides good productivity and competitive performance on this kind of architectures.

Explore More