Josep M. Perez
Barcelona Supercomputing Center
Publication
Featured research published by Josep M. Perez.
conference on high performance computing (supercomputing) | 2006
Pieter Bellens; Josep M. Perez; Rosa M. Badia; Jesús Labarta
In this work we present Cell superscalar (CellSs), which addresses the automatic exploitation of the functional parallelism of a sequential program across the different processing elements of the Cell BE architecture. The focus is on the simplicity and flexibility of the programming model. Based on a simple annotation of the source code, a source-to-source compiler generates the necessary code and a runtime library exploits the existing parallelism by building a task dependency graph at runtime. The runtime takes care of task scheduling and data handling between the different processors of this heterogeneous architecture. In addition, locality-aware task scheduling has been implemented to reduce the overhead of data transfers. The approach has been implemented and tested with a set of examples, and the results obtained so far are promising.
international conference on cluster computing | 2008
Josep M. Perez; Rosa M. Badia; Jesus Labarta
Parallel programming on SMP and multi-core architectures is hard. In this paper we present a programming model for those environments based on automatic function level parallelism that strives to be easy, flexible, portable, and performant. Its main trait is its ability to exploit task level parallelism by analyzing task dependencies at run time. We present the programming environment in the context of algorithms from several domains and pinpoint its benefits compared to other approaches. We discuss its execution model and its scheduler. Finally we analyze its performance and demonstrate that it offers reasonable performance without tuning, and that it can rival highly tuned libraries with minimal tuning effort.
IBM Journal of Research and Development | 2007
Josep M. Perez; Pieter Bellens; Rosa M. Badia; Jesús Labarta
With the appearance of new multicore processor architectures, there is a need for new programming paradigms, especially for heterogeneous devices such as the Cell Broadband Engine™ (Cell/B.E.) processor. CellSs is a programming model that addresses the automatic exploitation of functional parallelism from a sequential application with annotations. The focus is on the flexibility and simplicity of the programming model. Although the concept and programming model are general enough to be extended to other devices, the current implementation has been tailored to the Cell/B.E. device. This paper presents an overview of CellSs and a newly implemented scheduling algorithm. An analysis of the results, covering both performance measurements and a detailed study with performance analysis tools, was performed and is presented here.
international workshop on openmp | 2009
Eduard Ayguadé; Rosa M. Badia; Daniel Cabrera; Alejandro Duran; Marc Gonzàlez; Francisco D. Igual; Daniel Jimenez; Jesús Labarta; Xavier Martorell; Rafael Mayo; Josep M. Perez; Enrique S. Quintana-Ortí
OpenMP has recently evolved towards expressing unstructured parallelism, targeting the parallelization of a broader range of applications in the current multicore era. Homogeneous multicore architectures from major vendors have become mainstream, but there are clear indications that a better performance/power ratio can be achieved using more specialized hardware (accelerators), such as SSE-based units or GPUs, clearly deviating from the easy-to-understand shared-memory homogeneous architectures. This paper investigates whether OpenMP can survive in this new scenario and proposes a possible way to extend the current specification to reasonably integrate heterogeneity while preserving simplicity and portability. The paper builds on a previous proposal that extended tasking with dependencies. The runtime is in charge of data movement, task scheduling based on these data dependencies, and the appropriate selection of the target accelerator depending on system configuration and resource availability.
international workshop on openmp | 2008
Alejandro Duran; Josep M. Perez; Eduard Ayguadé; Rosa M. Badia; Jesús Labarta
Tasking in OpenMP 3.0 has been conceived to handle the dynamic generation of unstructured parallelism. New directives have been added allowing the user to identify units of independent work (tasks) and to define points to wait for the completion of tasks (task barriers). In this paper we propose an extension to allow the runtime detection of dependencies between generated tasks, broadening the range of applications that can benefit from tasking and improving performance when load balancing or locality are critical issues. Furthermore, the paper describes our proof-of-concept implementation (SMP Superscalar) and shows preliminary performance results on an SGI Altix 4700.
International Journal of Parallel Programming | 2010
Eduard Ayguadé; Rosa M. Badia; Pieter Bellens; Daniel Cabrera; Alejandro Duran; Roger Ferrer; Marc Gonzàlez; Francisco D. Igual; Daniel Jiménez-González; Jesus Labarta; Luis Martinell; Xavier Martorell; Rafael Mayo; Josep M. Perez; Judit Planas; Enrique S. Quintana-Ortí
This paper advances the state of the art in programming models for exploiting task-level parallelism on heterogeneous many-core systems, presenting a number of extensions to the OpenMP language inspired by the StarSs programming model. The proposed extensions allow the programmer to easily write portable code for a number of different platforms, relieving them of the burden of developing the specific code to off-load tasks to the accelerators and to synchronize those tasks. Our results obtained from the StarSs instantiations for SMPs, the Cell, and GPUs report reasonable parallel performance. However, the real impact of our approach is the productivity gain it yields for the programmer.
Concurrency and Computation: Practice and Experience | 2009
Rosa M. Badia; José R. Herrero; Jesús Labarta; Josep M. Perez; Enrique S. Quintana-Ortí; Gregorio Quintana-Ortí
The promise of future many-core processors, with hundreds of threads running concurrently, has led the developers of linear algebra libraries to rethink their design in order to extract more parallelism, further exploit data locality, attain better load balance, and pay careful attention to the critical path of computation. In this paper we describe how existing serial libraries such as (C)LAPACK and FLAME can be easily parallelized using the SMPSs tools, consisting of a few OpenMP-like pragmas and a run-time system. In the LAPACK case, this usually requires the development of blocked algorithms for simple BLAS-level operations, which expose concurrency at a finer grain. For better performance, our experimental results indicate that the column-major order employed by this library needs to be abandoned in favor of a block data layout. This will require a deeper rewrite of LAPACK or, alternatively, a dynamic conversion of the storage pattern at run time. The parallelization of FLAME routines using SMPSs is simpler, as this library already includes blocked algorithms (or algorithms-by-blocks in the FLAME argot) for most operations, and storage-by-blocks (or block data layout) is already in place.
international conference on supercomputing | 2010
Josep M. Perez; Rosa M. Badia; Jesus Labarta
The emergence of multicore processors has increased the need for simple parallel programming models usable by nonexperts. The ability to specify subparts of a bigger data structure is an important trait of High Productivity Programming Languages. Such a concept can also be applied to dependency-aware task-parallel programming models. In those paradigms, tasks may have data dependencies, and those are used for scheduling them in parallel. However, calculating dependencies between subparts of bigger data structures is challenging. Accessed data may be strided, and can fully or partially overlap the accesses of other tasks. Techniques that are too approximate may produce too many extra dependencies and limit parallelism. Techniques that are too precise may be impractical in terms of time and space. We present the abstractions, data structures and algorithms to calculate dependencies between tasks with strided and possibly different memory access patterns. Our technique is performed at run time from a description of the inputs and outputs of each task and is not affected by pointer arithmetic nor reshaping. We demonstrate how it can be applied to increase programming productivity. We also demonstrate that scalability is comparable to other solutions and in some cases higher due to better parallelism extraction.
ieee international conference on high performance computing data and analytics | 2009
Pieter Bellens; Josep M. Perez; Felipe Cabarcas; Alex Ramirez; Rosa M. Badia; Jesús Labarta
Cell Superscalar's (CellSs) main goal is to provide a simple, flexible and easy programming approach for the Cell Broadband Engine (Cell/B.E.) that automatically exploits the inherent concurrency of applications at the task level. The CellSs environment is based on a source-to-source compiler that translates annotated C or Fortran code and a runtime library tailored for the Cell/B.E. that takes care of the concurrent execution of the application. The first efforts for task scheduling in CellSs derived from very simple heuristics. This paper presents new scheduling techniques developed for CellSs to improve application performance. Additionally, the design of a new scheduling algorithm is detailed and the algorithm is evaluated. The CellSs scheduler takes into account an extension of the memory hierarchy for the Cell/B.E., with a cache memory shared between the SPEs. All new scheduling practices have been evaluated, showing improved behavior of our system.
Concurrency and Computation: Practice and Experience | 2006
Raül Sirvent; Josep M. Perez; Rosa M. Badia; Jesús Labarta
GRID superscalar is a Grid programming environment that enables one to parallelize the execution of sequential applications on computational Grids. The run-time library automatically builds a task data-dependence graph of the application and can be seen as an implicit workflow system. The current interface supports C/C++ and Perl applications. The run-time library is based on Globus Toolkit 2.x, using the GRAM and GSIFTP services. In this document we describe the GRID superscalar basics, emphasizing those aspects related to Grid workflow, in particular the flexibility of using an imperative language to describe the application.