Sergi Mateo
Barcelona Supercomputing Center
Publications
Featured research published by Sergi Mateo.
Symposium on Computer Architecture and High Performance Computing | 2014
Florentino Sainz; Sergi Mateo; Vicenç Beltran; José Luis Bosque; Xavier Martorell; Eduard Ayguadé
CUDA and OpenCL are the most widely used programming models to exploit hardware accelerators. Both provide a C-based programming language to write accelerator kernels and a host API used to glue the host and kernel parts. Although this model is a clear improvement over a low-level, ad-hoc programming model for each hardware accelerator, it is still too complex and cumbersome for general adoption. For large and complex applications using several accelerators, the main problem becomes the explicit coordination and management of resources between the host and the hardware accelerators, which introduces a new family of issues (scheduling, data transfers, synchronization, etc.) that the programmer must take into account. In this paper, we propose a simple extension to OmpSs -- a data-flow programming model -- that dramatically simplifies the integration of accelerated code, in the form of CUDA or OpenCL kernels, into any C, C++ or Fortran application. Our proposal fully replaces the CUDA and OpenCL host APIs with a few pragmas, so we can leverage any kernel written in CUDA C or OpenCL C without any performance impact. Our compiler generates all the boilerplate code, while our runtime system takes care of kernel scheduling, data transfers between host and accelerators, and synchronization between the host and kernel parts. To evaluate our approach, we have ported several native CUDA and OpenCL applications to OmpSs by replacing all the CUDA or OpenCL API calls with a few pragmas. The OmpSs versions of these applications have competitive performance and scalability, but with significantly lower complexity than the originals.
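For flavor, a minimal sketch of the pragma style the paper describes, applied to a SAXPY kernel (clause spellings follow OmpSs conventions and may differ from the paper's exact syntax):

    /* CUDA kernel, left untouched in its .cu file */
    __global__ void saxpy(int n, float a, const float *x, float *y);

    /* OmpSs interface declaration: these pragmas replace the CUDA host API */
    #pragma omp target device(cuda) ndrange(1, n, 128) copy_deps
    #pragma omp task in([n]x) inout([n]y)
    __global__ void saxpy(int n, float a, const float *x, float *y);

    /* Host code: a plain call becomes an asynchronous task; the runtime
       handles data transfers, kernel launch and synchronization */
    void scale_add(int n, float a, float *x, float *y) {
        saxpy(n, a, x, y);
        #pragma omp taskwait
    }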
International Workshop on OpenMP | 2014
Jan Ciesko; Sergi Mateo; Xavier Teruel; Vicenç Beltran; Xavier Martorell; Rosa M. Badia; Eduard Ayguadé; Jesús Labarta
The wide adoption of parallel processing hardware in mainstream computing, as well as the rising interest in efficient parallel programming in the developer community, increases the demand for parallel programming model support for common algorithmic patterns. In this work we present an extension to the OpenMP task construct that adds support for reductions in while-loops and general-recursive algorithms. Furthermore, we discuss the implications for the OpenMP standard and present a prototype implementation in OmpSs. Benchmark results confirm the applicability of this approach and its scalability on current SMP systems.
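The pattern at stake is a reduction whose task count is unknown at loop entry, such as a linked-list traversal. As a hedged illustration, this is the shape of the problem in the OpenMP 5.0 syntax that later standardized the idea (the paper's own proposal, prototyped in OmpSs, predates and differs from it):

    typedef struct node { long value; struct node *next; } node_t;

    long sum_list(node_t *head) {
        long sum = 0;
        /* taskgroup scopes the reduction over an unbounded set of tasks */
        #pragma omp taskgroup task_reduction(+: sum)
        for (node_t *p = head; p != NULL; p = p->next) {
            #pragma omp task in_reduction(+: sum) firstprivate(p)
            sum += p->value;
        }
        return sum;
    }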
IEEE International Conference on High Performance Computing, Data and Analytics | 2014
Alejandro Fernández; Vicenç Beltran; Sergi Mateo; Tomasz Patejko; Eduard Ayguadé
Developing complex scientific applications on high-performance systems requires both domain knowledge and expertise in parallel and distributed programming models. In addition, modern high-performance systems are heterogeneous, composed of multicores and accelerators, which, despite being efficient and powerful, are harder to program. Domain-Specific Languages (DSLs) are a promising approach to hide the complexity of HPC systems and boost programmers' productivity. However, the huge cost and complexity of implementing efficient and scalable DSLs on HPC systems is hindering their adoption in most domains. Addressing these problems, we present Data Flow Language (DFL), a DSL designed to exploit distributed and heterogeneous HPC systems. DFL abstracts the key concepts of such systems as SMP tasks for multicores, kernels for accelerators, and high-level operations for distributed computing. In addition, DFL leverages the hybrid MPI/OmpSs data-flow programming model to efficiently implement these concepts. All of these features make DFL suitable as the target language for other DSLs. However, it is also suitable as a fast prototyping language for developing distributed applications on heterogeneous systems.
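DFL's concrete syntax is defined in the paper; as a rough sketch of the hybrid MPI/OmpSs pattern it builds on, consider one time step of a distributed computation (compute_block is a hypothetical application kernel; the [n]local array-section notation is OmpSs's dependence syntax):

    #include <mpi.h>

    void compute_block(double *buf, int n);   /* placeholder kernel */

    void step(double *local, int n, int rank, int nranks) {
        /* SMP task on the multicore host, tracked by data-flow dependences */
        #pragma omp task inout([n]local)
        compute_block(local, n);
        #pragma omp taskwait

        /* distributed part: exchange the block with the neighbour rank */
        MPI_Sendrecv_replace(local, n, MPI_DOUBLE,
                             (rank + 1) % nranks, 0,
                             (rank + nranks - 1) % nranks, 0,
                             MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }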
International Workshop on OpenMP | 2017
Antoni Navarro; Sergi Mateo; Josep M. Perez; Vicenç Beltran; Eduard Ayguadé
In the last few decades, applications have become larger and more complex, and with them has grown developers' need for a simple way to identify units of work. Tasking models satisfy this need by making the scheduling of units of work much more user-friendly. However, tasking models also introduce the problem of granularity management. Discovering an application's optimal granularity is a frequent and sometimes challenging task for a wide range of recursive algorithms, and finding it often yields a substantial increase in performance.
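The knob in question is typically a cutoff below which no further tasks are created. A minimal example in standard OpenMP, using the final clause on a recursive Fibonacci (CUTOFF is a hypothetical threshold of the kind an approach like this would tune):

    #define CUTOFF 20   /* hypothetical granularity threshold */

    long fib(int n) {
        if (n < 2) return n;
        long a, b;
        /* below the cutoff, tasks become final: no further task creation */
        #pragma omp task shared(a) final(n < CUTOFF)
        a = fib(n - 1);
        #pragma omp task shared(b) final(n < CUTOFF)
        b = fib(n - 2);
        #pragma omp taskwait
        return a + b;
    }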
International Workshop on OpenMP | 2016
Guray Ozen; Sergi Mateo; Eduard Ayguadé; Jesús Labarta; James Beyer
The use of GPU accelerators is becoming common in HPC platforms due to their effective performance and energy efficiency. In addition, new generations of multicore processors are being designed with wider vector units and/or larger hardware thread counts, also contributing to the peak performance of the whole system. Although current directive-based paradigms, such as OpenMP or OpenACC, support both accelerators and multicore-based hosts, they do not provide an effective and efficient way to use them concurrently, usually resulting in accelerated programs in which the potential computational performance of the host is not exploited. In this paper we propose an extension to the OpenMP 4.5 directive-based programming model to support the specification and execution of multiple instances of task regions on different devices (i.e., accelerators in conjunction with the vector and heavily multithreaded capabilities of multicore processors). The compiler is responsible for generating device-specific code for each device kind, delegating to the runtime system the dynamic scheduling of tasks to the available devices. The new proposed clause conveys useful insight to guide the scheduler while keeping a clean, abstract and machine-independent programmer interface. The potential of the proposal is analyzed in a prototype implementation in the OmpSs compiler and runtime infrastructure. Performance evaluation is done using three kernels (N-Body, tiled matrix multiply and Stream) on different GPU-capable systems based on ARM, Intel x86 and IBM Power8. From the evaluation we observe speed-ups in the 8-20% range compared to versions in which only the GPU is used, reaching 96% of the additional peak performance thanks to the reduction of data transfers and the benefits introduced by the OmpSs NUMA-aware scheduler.
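The baseline limitation is easy to see in standard OpenMP 4.5, where a target region sends the whole loop to a single device, as in this Stream-style triad (the proposal instead lets the runtime split such work between host and accelerators; its new clause is not reproduced here):

    void triad(int n, float scalar, float *a, const float *b, const float *c) {
        /* everything runs on one device; the multicore host stays idle */
        #pragma omp target teams distribute parallel for \
                map(to: b[0:n], c[0:n]) map(tofrom: a[0:n])
        for (int i = 0; i < n; ++i)
            a[i] = b[i] + scalar * c[i];
    }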
International Workshop on OpenMP | 2016
Christian Terboven; Jonas Hahnfeld; Xavier Teruel; Sergi Mateo; Alejandro Duran; Michael Klemm; Stephen L. Olivier; Bronis R. de Supinski
OpenMP tasking supports parallelization of irregular algorithms. Recent OpenMP specifications extended tasking to increase functionality and to support optimizations, for instance with the taskloop construct. However, task scheduling remains opaque, which leads to inconsistent performance on NUMA architectures. We assess design issues for task affinity and explore several approaches to enable it. We evaluate these proposals with implementations in the Nanos++ and LLVM OpenMP runtimes that improve performance up to 40% and significantly reduce execution time variation.
IEEE High Performance Extreme Computing Conference | 2015
Jan Ciesko; Sergi Mateo; Xavier Teruel; Vicenç Beltran; Xavier Martorell; Jesús Labarta
Array-type reductions represent a frequently occurring algorithmic pattern in many scientific applications. A special case occurs if array elements are accessed in an irregular, often random manner, making their concurrent and scalable execution difficult. In this work we present a new approach that consists of language and runtime support and targets popular parallel programming models such as OpenMP. Its runtime support implements Privatization with In-lined, Block-Ordered Reductions (PIBOR), a new approach that trades processor cycles to increase locality and bandwidth efficiency for such algorithms. A reference implementation in OmpSs, a task-parallel programming model, shows promising results on current multicore systems.
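The targeted pattern, for reference, is an element-wise reduction at data-dependent indices, as in a histogram. Standard OpenMP array-section reductions (4.5+) handle it by privatizing the whole array per thread, which is the scalability problem PIBOR's block-ordered privatization addresses differently:

    void histogram(const int *key, int m, long *hist, int nbins) {
        /* indices are data-dependent, so element accesses are irregular */
        #pragma omp parallel for reduction(+: hist[0:nbins])
        for (int i = 0; i < m; ++i)
            hist[key[i] % nbins] += 1;
    }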
Symposium on Code Generation and Optimization | 2018
Sandra Macià; Sergi Mateo; Pedro J. Martínez-Ferrer; Vicenç Beltran; Daniel Mira; Eduard Ayguadé
Nowadays, high-performance computing is taking an increasingly central role in scientific research, while computer architectures are becoming more heterogeneous and complex, with different parallel programming models and techniques. In this scenario, successfully exploiting a high-performance computing system requires that computer and domain scientists work closely together to produce applications that solve domain problems while ensuring productivity and performance. To this end, we present Saiph, a domain-specific language designed to ease the simulation of complex systems governed by partial differential equations, with a focus on computational fluid dynamics. Saiph allows users to model real physical systems, providing numerical-methods and high-performance computing expertise in a transparent and automated fashion.
International Workshop on OpenMP | 2018
Jannis Klinkenberg; Philipp Samfass; Christian Terboven; Alejandro Duran; Michael Klemm; Xavier Teruel; Sergi Mateo; Stephen L. Olivier; Matthias S. Müller
In modern shared-memory NUMA systems, which typically consist of two or more multi-core processor packages with local memory, affinity of data to computation is crucial for achieving high performance with an OpenMP program. OpenMP 3.0 introduced support for task-parallel programs in 2008 and has continued to extend its applicability and expressiveness. However, the ability to support data affinity of tasks is missing. In this paper, we investigate several approaches for task-to-data affinity that combine locality-aware task distribution and task stealing. We introduce the task affinity clause that will be part of OpenMP 5.0 and provide the reasoning behind its design. Evaluation with our experimental implementation in the LLVM OpenMP runtime shows that task affinity improves execution performance up to 4.5x on an 8-socket NUMA machine and significantly reduces the runtime variability of OpenMP tasks. Our results demonstrate that a variety of applications can benefit from task affinity and that the presented clause closes the gap of task-to-data affinity in OpenMP 5.0.
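For reference, the clause as adopted in OpenMP 5.0 is a hint that a task should execute close to where its data resides, e.g. the NUMA node that first touched it (work_on is a placeholder):

    void work_on(double *part, int n);   /* placeholder */

    void process(double *data, int n, int nparts) {
        int chunk = n / nparts;
        for (int p = 0; p < nparts; ++p) {
            double *part = data + p * chunk;
            /* hint: schedule this task near the memory backing part[] */
            #pragma omp task affinity(part[0:chunk]) firstprivate(part)
            work_on(part, chunk);
        }
        #pragma omp taskwait
    }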
Symposium on Computer Architecture and High Performance Computing | 2017
Borja Pérez; Esteban Stafford; José Luis Bosque; Ramón Beivide; Sergi Mateo; Xavier Teruel; Xavier Martorell; Eduard Ayguadé
Heterogeneous systems offer very high potential performance but are difficult to program. OmpSs is a well-known framework for task-based parallel applications and an interesting tool for simplifying the programming of these systems. However, it does not support the co-execution of a single OpenCL kernel instance on several compute devices. To overcome this limitation, this paper presents an extension of the OmpSs framework that addresses two main objectives: the automatic division of datasets among several devices and the management of their memory address spaces. To adapt to different kinds of applications, the data division can be performed by the novel HGuided load-balancing algorithm or by the well-known Static and Dynamic algorithms. All this is accomplished with negligible impact on programmability. Experimental results reveal that there is always one load-balancing algorithm that improves the performance and energy consumption of the system.
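The HGuided algorithm itself is described in the paper; as a hedged sketch of the simplest of the three policies, a Static division splits a kernel's index space among devices in proportion to an estimate of their relative speed:

    #include <stddef.h>

    /* Split `global` work-items among ndev devices, proportionally to
       speed[] (an assumed per-device performance estimate). */
    void split_static(size_t global, int ndev, const float *speed,
                      size_t *offset, size_t *count) {
        float total = 0.0f;
        for (int d = 0; d < ndev; ++d) total += speed[d];
        size_t given = 0;
        for (int d = 0; d < ndev; ++d) {
            count[d] = (d == ndev - 1) ? global - given
                                       : (size_t)(global * speed[d] / total);
            offset[d] = given;
            given += count[d];
        }
    }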