
Publication


Featured research published by Manuel Arenaz.


ACM Transactions on Programming Languages and Systems | 2008

XARK: An extensible framework for automatic recognition of computational kernels

Manuel Arenaz; Juan Touriño; Ramón Doallo

The recognition of program constructs that are frequently used by software developers is a powerful mechanism for optimizing and parallelizing compilers to improve the performance of the object code. The development of techniques for the automatic recognition of computational kernels such as inductions, reductions and array recurrences was an intensive research area in compiler technology during the 1990s. This article presents a new compiler framework that, unlike previous techniques that focus on specific and isolated kernels, recognizes a comprehensive collection of computational kernels that appear frequently in full-scale real applications. The XARK compiler operates on top of the Gated Single Assignment (GSA) form of a high-level intermediate representation (IR) of the source code. Recognition is carried out through a demand-driven analysis of this high-level IR at two different levels. First, the dependences between the statements that compose the strongly connected components (SCCs) of the data-dependence graph of the GSA form are analyzed. As a result of this intra-SCC analysis, the computational kernels corresponding to the execution of the statements of the SCCs are recognized. Second, the dependences between statements of different SCCs are examined in order to recognize more complex kernels that result from combining simpler kernels in the same code. Overall, the XARK compiler builds a hierarchical representation of the source code as kernels and dependence relationships between those kernels. This article describes in detail the collection of computational kernels recognized by the XARK compiler. In addition, the internals of the recognition algorithms are presented. The design of the algorithms makes it possible to extend the recognition capabilities of XARK to cope with new kernels, and provides an advanced symbolic analysis framework for running other compiler techniques on demand.
Finally, extensive experiments showing the effectiveness of XARK for a collection of benchmarks from different application domains are presented. In particular, the SparsKit-II library for the manipulation of sparse matrices, the Perfect benchmarks, the SPEC CPU2000 collection and the PLTMG package for solving elliptic partial differential equations are analyzed in detail.
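To make the kernel classes concrete, the sketch below (illustrative only, not XARK code) shows the two simplest kernels such a recognizer classifies: a linear induction variable and a scalar reduction.

```python
def induction_kernel(n, start=0, step=3):
    """i is an induction variable: i_{k+1} = i_k + step (closed form start + k*step)."""
    vals = []
    i = start
    for _ in range(n):
        vals.append(i)
        i += step          # linear induction
    return vals

def reduction_kernel(a):
    """s is a scalar reduction: an accumulation with an associative operator (+)."""
    s = 0
    for x in a:
        s += x             # reduction: parallelizable via partial sums
    return s
```

A recognizer classifies each kernel from the shape of its dependences, not from variable names or syntax.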


Concurrency and Computation: Practice and Experience | 2007

Automated and accurate cache behavior analysis for codes with irregular access patterns

Diego Andrade; Manuel Arenaz; Basilio B. Fraguela; Juan Touriño; Ramón Doallo

The memory hierarchy plays an essential role in the performance of current computers, so good analysis tools that help in predicting and understanding its behavior are required. Analytical modeling is the ideal basis for such tools if its traditional limitations in accuracy and scope of application can be overcome. While there has been extensive research on the modeling of codes with regular access patterns, less attention has been paid to codes with irregular patterns due to the increased difficulty in analyzing them. Nevertheless, many important applications exhibit this kind of pattern, and their lack of locality makes them more cache-demanding, which makes their study more relevant. The focus of this paper is the automation of the Probabilistic Miss Equations (PME) model, an analytical model of cache behavior that provides fast and accurate predictions for codes with irregular access patterns. The information requirements of the PME model are defined, and its integration in the XARK compiler, a research compiler oriented to automatic kernel recognition in scientific codes, is described. We show how to exploit the powerful information-gathering capabilities provided by this compiler to allow the automated modeling of loop-oriented scientific codes. Experimental results that validate the correctness of the automated PME model are also presented.
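The PME model is analytical, but its predictions are typically validated against trace-driven cache simulation. The toy simulator below (an assumption-laden sketch, not part of the PME framework) counts the misses of a fully associative LRU cache on an access trace, the kind of reference result such a model is checked against.

```python
from collections import OrderedDict

def lru_misses(trace, capacity):
    """Count misses of a fully associative LRU cache with `capacity` lines."""
    cache = OrderedDict()
    misses = 0
    for line in trace:
        if line in cache:
            cache.move_to_end(line)        # hit: refresh recency
        else:
            misses += 1
            if len(cache) >= capacity:
                cache.popitem(last=False)  # evict least recently used line
        cache[line] = True
    return misses
```

An irregular trace that keeps touching new lines defeats reuse: the same four accesses cost twice as many misses once they exceed the cache capacity.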


International Conference on Supercomputing | 2003

A GSA-based compiler infrastructure to extract parallelism from complex loops

Manuel Arenaz; Juan Touriño; Ramón Doallo

This paper presents a new approach for the detection of coarse-grain parallelism in loop nests that contain complex computations, including subscripted subscripts as well as conditional statements that introduce complex control flows at run-time. The approach is based on the recognition of the computational kernels calculated in a loop without considering the semantics of the code. The detection is carried out on top of the Gated Single Assignment (GSA) program representation at two different levels. First, the use-def chains between the statements that compose the strongly connected components (SCCs) of the GSA use-def chain graph are analyzed (intra-SCC analysis). As a result, the kernel computed in each SCC is recognized. Second, the use-def chains between statements of different SCCs are examined (inter-SCC analysis). This second abstraction level enables the detection of more complex computational kernels by the compiler. A prototype was implemented using the infrastructure provided by the Polaris compiler. Experimental results that show the effectiveness of our approach for the detection of coarse-grain parallelism in a suite of real codes are presented.
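To make the targeted loop class concrete, the hypothetical sketch below (not taken from the paper) combines a subscripted subscript with a run-time condition, the two features that defeat purely static dependence analysis.

```python
def conditional_irregular_kernel(b, ind, m, threshold=0.0):
    """An irregular reduction: the write target a[ind[i]] and the control
    flow are both decided at run-time, so compile-time dependence tests fail."""
    a = [0.0] * m
    for i in range(len(b)):
        if b[i] > threshold:       # control flow known only at run-time
            a[ind[i]] += b[i]      # subscripted subscript: target known only at run-time
    return a
```

Kernel recognition sidesteps the failed dependence test: once the loop is classified as an irregular reduction, a parallelization strategy for that kernel class can be applied.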


Concurrency and Computation: Practice and Experience | 2013

A multi‐GPU shallow‐water simulation with transport of contaminants

Moisés Viñas; Jacobo Lobeiras; Basilio B. Fraguela; Manuel Arenaz; Margarita Amor; José A. García; Manuel J. Castro; Ramón Doallo

This work presents cost-effective multi-graphics processing unit (GPU) parallel implementations of a finite-volume numerical scheme for solving pollutant transport problems in bidimensional domains. The fluid is modeled by 2D shallow-water equations, whereas the transport of pollutant is modeled by a transport equation. The 2D domain is discretized using a first-order Roe finite-volume scheme. Specifically, this paper presents multi-GPU implementations of both a solution that exploits recomputation on the GPU and an optimized solution based on a ghost cell decoupling approach. Our multi-GPU implementations have been optimized using nonblocking communications, overlapping of communications and computations, and ghost cell expansion to minimize communications. The fastest one reached a speedup of 78× using four GPUs on an InfiniBand network with respect to a parallel execution on a multicore CPU with six cores and two-way hyperthreading per core. Such performance, measured using a realistic problem, enabled the calculation of solutions not only in real time but also orders of magnitude faster than the simulated time.
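The ghost cell idea can be illustrated in one dimension: each subdomain keeps a copy (a ghost cell) of its neighbor's boundary value, so a purely local stencil update reproduces the global one. The sketch below is a minimal 1D analogue, not the paper's 2D finite-volume code.

```python
def stencil_step(u):
    """One 3-point averaging step; the two end values are held fixed."""
    return [u[0]] + [(u[i - 1] + u[i] + u[i + 1]) / 3.0
                     for i in range(1, len(u) - 1)] + [u[-1]]

def decomposed_step(u, mid):
    """Same update computed on two subdomains with one ghost cell each."""
    left, right = u[:mid], u[mid:]
    lg = left + [right[0]]     # left subdomain plus a ghost copied from the right
    rg = [left[-1]] + right    # ghost copied from the left plus right subdomain
    # Each subdomain updates independently; the ghost entries are discarded.
    return stencil_step(lg)[:-1] + stencil_step(rg)[1:]
```

Exchanging ghost cells once per step is what makes the subdomain updates independent, and hence mappable to different GPUs.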


IEEE Transactions on Education | 2005

A grid portal for an undergraduate parallel programming course

Juan Touriño; María J. Martín; Jacobo Tarrío; Manuel Arenaz

This paper describes an experience of designing and implementing a portal that provides students enrolled in an undergraduate parallel programming course with transparent remote access to supercomputing facilities. As these facilities are heterogeneous, located at different sites, and belong to different institutions, grid computing technologies have been used to overcome this heterogeneity. The result is a grid portal based on a modular and easily extensible software architecture that provides a uniform and user-friendly interface for students to work on their programming laboratory assignments.


International Symposium on Parallel and Distributed Processing and Applications | 2004

An inspector-executor algorithm for irregular assignment parallelization

Manuel Arenaz; Juan Touriño; Ramón Doallo

A loop with irregular assignment computations contains loop-carried output data dependences that can only be detected at run-time. In this paper, a load-balanced method based on the inspector-executor model is proposed to parallelize this loop pattern. The basic idea lies in splitting the iteration space of the sequential loop into sets of conflict-free iterations that can be executed concurrently on different processors. As will be demonstrated, this method outperforms existing techniques. Irregular access patterns with different load-balancing and reusability properties are considered in the experiments.
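A minimal sketch of the inspector-executor idea (illustrative; the paper's method additionally balances the load across processors): the inspector splits the iterations into sets in which no two iterations write the same element, and the executor runs the sets in order, preserving the sequential last-write semantics.

```python
def inspector(targets):
    """Split iterations into conflict-free sets: the k-th write to any
    element lands in set k, so within a set all write targets are distinct."""
    count = {}
    sets = []
    for i, t in enumerate(targets):
        k = count.get(t, 0)        # how many earlier iterations wrote t
        count[t] = k + 1
        if k == len(sets):
            sets.append([])
        sets[k].append(i)
    return sets

def executor(targets, values, m, sets):
    """Run the sets in order; iterations inside a set could run concurrently."""
    a = [None] * m
    for s in sets:
        for i in s:                # conflict-free: distinct a[targets[i]]
            a[targets[i]] = values[i]
    return a
```

Running the sets in order is what preserves the loop-carried output dependences that made the original loop unparallelizable.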


Parallel Computing | 2013

A Novel Compiler Support for Automatic Parallelization on Multicore Systems

José M. Andión; Manuel Arenaz; Gabriel Rodríguez; Juan Touriño

The widespread use of multicore processors is not a consequence of significant advances in parallel programming. Rather, multicore processors arise due to the complexity of building power-efficient, high-clock-rate, single-core chips. Automatic parallelization of sequential applications is the ideal solution for making parallel programming as easy as writing programs for sequential computers. However, automatic parallelization remains a grand challenge due to its need for complex program analysis and the existence of unknowns during compilation. This paper proposes a new method for converting a sequential application into a parallel counterpart that can be executed on current multicore processors. It hinges on an intermediate representation based on the concept of domain-independent kernel (e.g., assignment, reduction, recurrence). Such a kernel-centric view hides the complexity of the implementation details, enabling the construction of the parallel version even when the source code of the sequential application contains different syntactic variations of the computations (e.g., pointers, arrays, complex control flows). Experiments that evaluate the effectiveness and performance of our approach with respect to state-of-the-art compilers are also presented. The benchmark suite consists of synthetic codes that represent common domain-independent kernels, dense/sparse linear algebra and image processing routines, and full-scale applications from SPEC CPU2000.
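Once a loop is classified as, say, a reduction kernel, its parallel version follows from the kernel's algebraic properties rather than from the loop's syntax. A hedged sketch of that final step (not the paper's code generator), using Python threads over chunked partial sums:

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_sum(data, nchunks=4):
    """Parallelize a reduction via partial sums: associativity of `+`
    lets each chunk be reduced independently and the partials combined."""
    n = len(data)
    bounds = [n * k // nchunks for k in range(nchunks + 1)]
    chunks = [data[bounds[k]:bounds[k + 1]] for k in range(nchunks)]
    with ThreadPoolExecutor(max_workers=nchunks) as pool:
        partials = list(pool.map(sum, chunks))   # independent chunk reductions
    return sum(partials)                          # combine the partial results
```

The same transformation applies whether the source loop uses array indexing or pointer arithmetic, which is exactly what the kernel-centric view abstracts away.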


Parallel Computing | 2001

Efficient parallel numerical solver for the elastohydrodynamic Reynolds-Hertz problem

Manuel Arenaz; Ramón Doallo; Juan Touriño; Carlos Vázquez

This work presents a parallel version of a complex numerical algorithm for solving an elastohydrodynamic piezoviscous lubrication problem studied in tribology. The numerical algorithm combines regula falsi, fixed-point techniques, finite elements and duality methods. The execution of the sequential program on a workstation requires significant CPU time and memory resources. Thus, in order to reduce the computational cost, we have applied parallelization techniques to the most costly parts of the original source code. Some blocks of the sequential code were also redesigned for execution on a multicomputer. In this paper, our parallel version is described in detail, execution times that show its efficiency in terms of speedup are presented, and new numerical results that establish the convergence of the algorithm for higher imposed load values when using finer meshes are reported. As a whole, this paper illustrates the difficulties involved in parallelizing and optimizing complex numerical algorithms based on finite elements.
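Of the numerical building blocks mentioned (regula falsi, fixed-point techniques, finite elements, duality methods), the simplest to sketch is regula falsi. The generic implementation below is illustrative, not the paper's code.

```python
def regula_falsi(f, a, b, tol=1e-10, max_iter=100):
    """False-position root finding on [a, b], assuming f(a) and f(b) differ
    in sign: the next iterate is where the secant through the endpoints
    crosses zero, and the sign-change bracket is kept."""
    fa, fb = f(a), f(b)
    assert fa * fb < 0, "f must change sign on [a, b]"
    c = a
    for _ in range(max_iter):
        c = b - fb * (b - a) / (fb - fa)   # secant intersection with zero
        fc = f(c)
        if abs(fc) < tol:
            break
        if fa * fc < 0:
            b, fb = c, fc                  # root lies in [a, c]
        else:
            a, fa = c, fc                  # root lies in [c, b]
    return c
```

Unlike bisection, the interpolation step uses the function values themselves, which typically converges faster on smooth problems.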


IEEE International Conference on High Performance Computing, Data, and Analytics | 2013

Parallelization of shallow water simulations on current multi-threaded systems

Jacobo Lobeiras; Moisés Viñas; Margarita Amor; Basilio B. Fraguela; Manuel Arenaz; José Antonio Orosa García; Manuel J. Castro

In this work, several parallel implementations of a numerical model of pollutant transport on a shallow water system are presented. These parallel implementations are developed in two phases. First, the sequential code is rewritten to exploit the stream programming model. Second, the streamed code is targeted at current multi-threaded systems, in particular multi-core CPUs and modern GPUs. The performance is evaluated on a multi-core CPU using OpenMP, and on a GPU using both the streaming-oriented programming language Brook+ and the standard language for heterogeneous systems, OpenCL.


International Conference on High Performance Computing and Simulation | 2011

Simulation of pollutant transport in shallow water on a CUDA architecture

Moisés Viñas; Jacobo Lobeiras; Basilio B. Fraguela; Manuel Arenaz; Margarita Amor; Ramón Doallo

Shallow water simulation enables the study of problems such as dam breaks, river, canal and coastal hydrodynamics, as well as the transport of inert substances, such as pollutants, in a fluid. This article describes an efficient and cost-effective CUDA implementation for GPUs of a finite-volume numerical scheme for solving pollutant transport problems in bidimensional domains. The fluid is modeled by 2D shallow water equations, while the transport of pollutant is modeled by a transport equation. The 2D domain is discretized using a first-order finite-volume scheme. The evaluation using a realistic problem shows that the implementation makes good use of the computational resources, being very efficient for real-life complex simulations. The speedup reached allowed us to complete a simulation in 2 hours, in contrast with the 239 hours (10 days) required by a sequential execution on a standard CPU.
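As a 1D toy analogue of such a transport discretization (an assumption for illustration; the paper uses a 2D first-order finite-volume scheme), a first-order upwind update for the transport equation reads:

```python
def upwind_step(c, v, dx, dt):
    """One explicit first-order upwind step for dc/dt + v*dc/dx = 0 (v > 0).
    The inflow value c[0] is held fixed; lam = v*dt/dx must satisfy lam <= 1
    (the CFL condition) for stability."""
    lam = v * dt / dx
    return [c[0]] + [c[i] - lam * (c[i] - c[i - 1]) for i in range(1, len(c))]
```

Each cell update reads only its own value and its upwind neighbor, which is why finite-volume schemes of this kind map so well onto one-thread-per-cell GPU kernels.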
