Publications


Featured research published by Eduard Ayguadé.


Parallel Processing Letters | 2011

OmpSs: A Proposal for Programming Heterogeneous Multi-core Architectures

Alejandro Duran; Eduard Ayguadé; Rosa M. Badia; Jesús Labarta; Luis Martinell; Xavier Martorell; Judit Planas

In this paper we present OmpSs, a programming model based on OpenMP and StarSs that can also incorporate the use of OpenCL or CUDA kernels. We evaluate the proposal on different architectures (SMP, GPU, and hybrid SMP/GPU environments), showing the wide applicability of the approach. The evaluation is done with six benchmarks: Matrix Multiply, BlackScholes, Perlin Noise, Julia Set, PBPI and FixedGrid. We compare the results with executions of the same benchmarks written in OpenCL or OpenMP on the same architectures. The results show that OmpSs greatly outperforms both environments. OmpSs provides a more flexible programming environment than traditional approaches for exploiting multiple accelerators, and the simplicity of its annotations increases programmer productivity.
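
The sketch below illustrates the annotation style OmpSs builds on: a plain C function is declared as a task with data-directionality clauses, and the runtime derives dependences and schedules the tasks. It is an illustrative sketch only; the clause spellings have varied across OmpSs versions, and vadd_block/vadd are made-up names. A compiler without OmpSs support ignores the pragmas and the code simply runs sequentially.

    #include <stddef.h>

    #define BS 256

    /* Task with data directionality; an alternative implementation could be
       bound to a CUDA or OpenCL kernel via a target/device annotation. */
    #pragma omp task in(a[0;BS], b[0;BS]) out(c[0;BS])
    void vadd_block(const float *a, const float *b, float *c)
    {
        for (int i = 0; i < BS; i++)
            c[i] = a[i] + b[i];
    }

    void vadd(const float *a, const float *b, float *c, size_t n)
    {
        for (size_t i = 0; i + BS <= n; i += BS)
            vadd_block(a + i, b + i, c + i);   /* each call spawns a task */
        #pragma omp taskwait                   /* wait for all spawned tasks */
    }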


IEEE Transactions on Parallel and Distributed Systems | 2009

The Design of OpenMP Tasks

Eduard Ayguadé; Nawal Copty; Alejandro Duran; Jay Hoeflinger; Yuan Lin; Federico Massaioli; Xavier Teruel; Priya Unnikrishnan; Guansong Zhang

OpenMP has been very successful in exploiting structured parallelism in applications. With increasing application complexity, there is a growing need for addressing irregular parallelism in the presence of complicated control structures. This is evident in various efforts by the industry and research communities to provide a solution to this challenging problem. One of the primary goals of OpenMP 3.0 is to define a standard dialect to express and efficiently exploit unstructured parallelism. This paper presents the design of the OpenMP tasking model by members of the OpenMP 3.0 tasking sub-committee which was formed for this purpose. The paper summarizes the efforts of the sub-committee (spanning over two years) in designing, evaluating and seamlessly integrating the tasking model into the OpenMP specification. In this paper, we present the design goals and key features of the tasking model, including a rich set of examples and an in-depth discussion of the rationale behind various design choices. We compare a prototype implementation of the tasking model with existing models, and evaluate it on a wide range of applications. The comparison shows that the OpenMP tasking model provides expressiveness, flexibility, and huge potential for performance and scalability.
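
The pointer-chasing loop below is the textbook case of the irregular parallelism the tasking model targets (a standard OpenMP 3.0 pattern, not an example taken from the paper; node_t, process and traverse are illustrative names): a single thread walks the list and creates one task per element, and the team executes the tasks concurrently.

    #include <stdio.h>

    typedef struct node { int value; struct node *next; } node_t;

    static void process(node_t *p) { printf("%d\n", p->value); }

    void traverse(node_t *head)
    {
        #pragma omp parallel
        #pragma omp single
        {
            for (node_t *p = head; p != NULL; p = p->next) {
                #pragma omp task firstprivate(p)   /* one task per element */
                process(p);
            }
            /* all tasks are complete at the barrier that closes the region */
        }
    }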


International Conference on Parallel Processing | 2009

Barcelona OpenMP Tasks Suite: A Set of Benchmarks Targeting the Exploitation of Task Parallelism in OpenMP

Alejandro Duran; Xavier Teruel; Roger Ferrer; Xavier Martorell; Eduard Ayguadé

Traditional parallel applications have exploited regular parallelism, based on parallel loops; only a few applications exploit sections parallelism. With the release of the new OpenMP specification (3.0), this programming model supports tasking. Parallel tasks allow the exploitation of irregular parallelism, but there is a lack of benchmarks exploiting tasks in OpenMP. With current (and projected) multicore architectures offering many more alternatives for executing parallel applications than traditional SMP machines, this kind of parallelism is increasingly important, and so is the need for a set of benchmarks to evaluate it. In this paper we motivate the need for such a benchmark suite for irregular and/or recursive task parallelism. We present our proposal, the Barcelona OpenMP Tasks Suite (BOTS), a set of applications exploiting regular and irregular parallelism based on tasks. We present an overall evaluation of the BOTS benchmarks on an Altix system and discuss some of the experiments that can be done with the different compilation and runtime alternatives of the benchmarks.
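
The kind of experiment BOTS enables can be seen in the recursive pattern below (an illustrative sketch, not actual BOTS source; fib_task and CUTOFF are made-up names): each recursive call becomes a task, and a manual depth cutoff serializes the deepest subtrees, one of the compilation/runtime alternatives the suite is designed to study.

    #define CUTOFF 12

    long fib_task(int n, int depth)
    {
        long x, y;
        if (n < 2)
            return n;
        if (depth >= CUTOFF)                  /* serialize deep subtrees */
            return fib_task(n - 1, depth) + fib_task(n - 2, depth);

        #pragma omp task shared(x)
        x = fib_task(n - 1, depth + 1);
        #pragma omp task shared(y)
        y = fib_task(n - 2, depth + 1);
        #pragma omp taskwait                  /* join before combining */
        return x + y;
    }

    /* Typical driver: wrap the first call in "#pragma omp parallel" and
       "#pragma omp single" so one thread seeds the task pool. */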


IEEE International Conference on High Performance Computing, Data and Analytics | 2009

Hierarchical Task-Based Programming With StarSs

Judit Planas; Rosa M. Badia; Eduard Ayguadé; Jesús Labarta

Programming models for multicore and many-core systems are listed as one of the main near-term challenges for computing research. These programming models should be able to exploit the underlying platform, but should also offer good programmability to enable programmer productivity. They should take the heterogeneity and hierarchy of the underlying platforms into account while shielding the programmer from the complexity of the hardware. In this paper we present an extension of the StarSs syntax to support task hierarchy. A motivation for such a hierarchical approach is presented through experimentation with CellSs. We present a prototype implementation of a hierarchical task-based programming model that combines a first task level with SMPSs and a second task level with CellSs. Preliminary results obtained when executing a matrix multiplication and a Cholesky factorization show the viability and potential of the approach, as well as the issues it currently raises.
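
As a rough illustration of what two-level tasking looks like in source code, the sketch below uses StarSs-style annotations (the pragma spelling and the tile_gemm/block_gemm functions are hypothetical, not the exact SMPSs/CellSs syntax from the paper): a coarse first-level task operates on a tile of blocks, and its body spawns the fine-grained second-level tasks that a Cell-like backend would map to the accelerator cores.

    #define NB 8      /* blocks per tile edge    */
    #define BS 64     /* elements per block edge */

    /* Second-level (fine-grained) task: one block product. */
    #pragma css task input(a, b) inout(c)
    void block_gemm(const float a[BS][BS], const float b[BS][BS],
                    float c[BS][BS])
    {
        for (int i = 0; i < BS; i++)
            for (int j = 0; j < BS; j++)
                for (int k = 0; k < BS; k++)
                    c[i][j] += a[i][k] * b[k][j];
    }

    /* First-level (coarse) task: its body creates the second-level tasks. */
    #pragma css task input(A, B) inout(C)
    void tile_gemm(const float A[NB][NB][BS][BS],
                   const float B[NB][NB][BS][BS],
                   float C[NB][NB][BS][BS])
    {
        for (int i = 0; i < NB; i++)
            for (int j = 0; j < NB; j++)
                for (int k = 0; k < NB; k++)
                    block_gemm(A[i][k], B[k][j], C[i][j]);
    }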


IEEE Computer Architecture Letters | 2006

Performance, Power Efficiency and Scalability of Asymmetric Cluster Chip Multiprocessors

Tomer Y. Morad; Uri C. Weiser; A. Kolodny; Mateo Valero; Eduard Ayguadé

This paper evaluates asymmetric cluster chip multiprocessor (ACCMP) architectures as a mechanism to achieve the highest performance for a given power budget. ACCMPs execute serial phases of multithreaded programs on large high-performance cores, whereas parallel phases are executed on a mix of large and many small simple cores. Theoretical analysis reveals a performance upper bound for symmetric multiprocessors, which is surpassed by asymmetric configurations at certain power ranges. Our emulations show that asymmetric multiprocessors can reduce power consumption by more than two thirds while maintaining performance similar to that of symmetric multiprocessors.
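
A standard Amdahl-style model makes the trade-off concrete (a sketch in the spirit of the analysis above, not the paper's exact formulation; W, f, P_L, P_S, m and the power terms are assumed symbols): a program with parallel fraction f runs its serial phase on one large core of performance P_L and its parallel phase on that core plus m small cores of performance P_S each, under a fixed power budget.

    \[
      T_{\mathrm{asym}} \;=\; \frac{(1-f)\,W}{P_L} \;+\; \frac{f\,W}{P_L + m\,P_S},
      \qquad \text{subject to} \qquad
      \phi_L + m\,\phi_S \;\le\; \Phi_{\mathrm{budget}},
    \]

where W is the total work and \phi_L, \phi_S are the per-core power costs. Because small cores deliver more performance per watt, spending the remaining budget on many of them shortens the parallel phase more than adding large cores would, while the single large core keeps the serial phase fast; this is the intuition behind asymmetric configurations surpassing the symmetric upper bound at certain power ranges.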


European Conference on Parallel Processing | 2009

An Extension of the StarSs Programming Model for Platforms with Multiple GPUs

Eduard Ayguadé; Rosa M. Badia; Francisco D. Igual; Jesús Labarta; Rafael Mayo; Enrique S. Quintana-Ortí

While general-purpose homogeneous multi-core architectures are becoming ubiquitous, there are clear indications that, for a number of important applications, a better performance/power ratio can be attained using specialized hardware accelerators. These accelerators require specific SDKs or programming languages that are not always easy to use. Thus, the impact of the new programming paradigms on programmer productivity will determine their success in the high-performance computing arena. In this paper we present GPU Superscalar (GPUSs), an extension of the Star Superscalar programming model that targets the parallelization of applications on platforms consisting of a general-purpose processor connected with multiple graphics processors. GPUSs deals with architecture heterogeneity and separate memory address spaces, while preserving simplicity and portability. Preliminary experimental results for a well-known operation in numerical linear algebra illustrate the correct adaptation of the runtime to a multi-GPU system, attaining notable performance results.
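
The sketch below illustrates the kind of bookkeeping a runtime needs to bridge separate host and GPU address spaces transparently (hypothetical helpers in the spirit of the description above, not the actual GPUSs implementation; block_t, acquire_for_read, release_after_write, gpu_alloc and gpu_memcpy_h2d are made-up names standing in for CUDA-style calls): blocks are copied to the device only when a task reads them there, and device-side writes are flushed back lazily.

    #include <stddef.h>

    void *gpu_alloc(size_t bytes);                     /* hypothetical wrappers */
    void  gpu_memcpy_h2d(void *dst, const void *src, size_t bytes);

    typedef struct {
        void  *host_ptr;      /* canonical copy in host memory               */
        void  *dev_ptr;       /* copy in the GPU address space, or NULL      */
        size_t bytes;
        int    valid_on_dev;  /* device copy is up to date                   */
        int    dirty_on_dev;  /* device copy must be flushed before host use */
    } block_t;

    /* Before a GPU task reads a block: allocate and transfer only if needed,
       so repeated use of the same block avoids redundant copies. */
    void acquire_for_read(block_t *b)
    {
        if (!b->dev_ptr)
            b->dev_ptr = gpu_alloc(b->bytes);
        if (!b->valid_on_dev) {
            gpu_memcpy_h2d(b->dev_ptr, b->host_ptr, b->bytes);
            b->valid_on_dev = 1;
        }
    }

    /* After a GPU task writes a block: mark it dirty; the copy back to the
       host is deferred until the host (or another device) needs the data. */
    void release_after_write(block_t *b)
    {
        b->valid_on_dev = 1;
        b->dirty_on_dev = 1;
    }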


International Conference on Parallel Architectures and Compilation Techniques | 1996

Swing Modulo Scheduling: A Lifetime-Sensitive Approach

Josep Llosa; Antonio González; Eduard Ayguadé; Mateo Valero

This paper presents a novel software pipelining approach called Swing Modulo Scheduling (SMS). It generates schedules that are near optimal in terms of initiation interval, register requirements and stage count. Swing Modulo Scheduling is a heuristic approach with a low computational cost. The paper describes the technique and evaluates it on the Perfect Club benchmark suite. SMS is compared with other heuristic methods, showing that it outperforms them in terms of the quality of the obtained schedules and compilation time. SMS is also compared with an integer linear programming approach that generates optimum schedules but at a huge computational cost, which makes it feasible only for very small loops. For a set of small loops, SMS obtained the optimum initiation interval in all cases, and its schedules required only 5% more registers and a 1% higher stage count than the optimum.
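
The initiation interval the abstract refers to is bounded below by resource and recurrence constraints; the bounds below are the standard modulo-scheduling lower bounds (general theory, not notation taken from the paper):

    \[
      \mathit{ResMII} = \max_{r} \left\lceil \frac{N_r}{F_r} \right\rceil,
      \qquad
      \mathit{RecMII} = \max_{c} \left\lceil \frac{\delta(c)}{\lambda(c)} \right\rceil,
      \qquad
      \mathit{MII} = \max(\mathit{ResMII},\, \mathit{RecMII}),
    \]

where N_r operations use resource r, F_r units of resource r exist, and each dependence cycle c has total latency \delta(c) and iteration distance \lambda(c). SMS searches for a schedule at or near MII while ordering the nodes so that value lifetimes, and therefore register pressure, stay low.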


International Middleware Conference | 2011

Resource-Aware Adaptive Scheduling for MapReduce Clusters

Jorda Polo; Claris Castillo; David Carrera; Yolanda Becerra; Ian Whalley; Malgorzata Steinder; Jordi Torres; Eduard Ayguadé

We present a resource-aware scheduling technique for MapReduce multi-job workloads that aims at improving resource utilization across machines while observing completion time goals. Existing MapReduce schedulers define a static number of slots to represent the capacity of a cluster, creating a fixed number of execution slots per machine. This abstraction works for homogeneous workloads, but fails to capture the different resource requirements of individual jobs in multi-user environments. Our technique leverages job profiling information to dynamically adjust the number of slots on each machine, as well as workload placement across them, to maximize the resource utilization of the cluster. In addition, our technique is guided by user-provided completion time goals for each job. Source code of our prototype is available at [1].
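
One way to picture how completion-time goals can drive slot allocation is the back-of-the-envelope estimate below (an illustrative sketch, not the paper's algorithm; job_profile_t and slots_needed are made-up names): from the profiled mean task duration, compute how many sequential waves of tasks fit before the deadline, and hence how many concurrent slots the job currently needs.

    #include <math.h>

    typedef struct {
        long   pending_tasks;        /* tasks not yet started                */
        double mean_task_seconds;    /* from profiling completed tasks       */
        double seconds_to_deadline;  /* user-provided completion-time goal   */
    } job_profile_t;

    int slots_needed(const job_profile_t *j)
    {
        if (j->seconds_to_deadline <= 0.0)
            return (int)j->pending_tasks;            /* late: run everything */
        double waves = j->seconds_to_deadline / j->mean_task_seconds;
        if (waves < 1.0)
            waves = 1.0;                             /* at least one wave    */
        return (int)ceil((double)j->pending_tasks / waves);
    }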


International Conference on Supercomputing | 2010

Decomposable and Responsive Power Models for Multicore Processors Using Performance Counters

Ramon Bertran; Marc Gonzàlez; Xavier Martorell; Nacho Navarro; Eduard Ayguadé

Power modeling based on performance monitoring counters (PMCs) has attracted the interest of researchers because it is a quick approach to understanding and analysing power behavior on real systems. As a result, several power-aware policies use power models to guide their decisions and to trigger low-level mechanisms such as voltage and frequency scaling. Hence, power models that are informative, accurate and capable of detecting power phases are critical to advancing power-aware research and to improving the success of power-saving techniques based on them. In addition, the design of current processors has changed considerably with the inclusion of multiple cores that share some resources on a single die. As a result, PMC-based power models warrant further investigation on current energy-efficient multicore processors. In this paper, we present a methodology to produce decomposable PMC-based power models on current multicore architectures. Apart from estimating power consumption accurately, the models provide per-component power consumption, supplying extra insight into power behavior. Moreover, we validate their responsiveness (the capacity to detect power phases). Specifically, we produce a set of power models for an Intel® Core™ 2 Duo, modeling one and two cores for a wide set of DVFS configurations. The models are empirically validated using the SPEC CPU2006 benchmark suite, and we compare them to models built using existing approaches. Overall, we demonstrate that the proposed methodology produces more accurate and responsive power models. Concretely, our models show an error range of 1.89% to 6% and almost 100% accuracy in detecting phase variations above 0.5 watts.
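
A decomposable PMC-based model of the kind described can be pictured as a static term plus a weighted sum of per-component activity ratios, as in the sketch below (component names and weights are illustrative assumptions, not the paper's fitted models; one model instance would be fitted per DVFS configuration):

    enum { FRONTEND = 0, INT_UNIT, FP_UNIT, L1, L2, MEM, NCOMP };

    typedef struct {
        double static_watts;     /* idle/leakage power at this DVFS point    */
        double weight[NCOMP];    /* per-component power at full activity (W) */
    } power_model_t;

    /* activity[i] in [0,1]: component utilization derived from PMC rates */
    double estimate_power(const power_model_t *m, const double activity[NCOMP])
    {
        double watts = m->static_watts;
        for (int i = 0; i < NCOMP; i++)
            watts += m->weight[i] * activity[i];  /* per-component breakdown */
        return watts;
    }

Because each term is attributable to a component, the same model that predicts total power also explains where the power goes, which is what makes it decomposable.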


International Parallel and Distributed Processing Symposium | 2012

Productive Programming of GPU Clusters with OmpSs

Javier Bueno; Judit Planas; Alejandro Duran; Rosa M. Badia; Xavier Martorell; Eduard Ayguadé; Jesús Labarta

Clusters of GPUs are emerging as a new computational scenario. Programming them requires hybrid models that increase the complexity of applications, reducing programmer productivity. We present the implementation of OmpSs for clusters of GPUs, which supports asynchrony and heterogeneity for task parallelism. It is based on annotating a serial application with directives that are translated by the compiler. With it, the same program that runs sequentially on a node with a single GPU can run in parallel on multiple GPUs, either local (single node) or remote (cluster of GPUs). Besides performing a task-based parallelization, the runtime system moves data as needed between the different nodes and GPUs, minimizing the impact of communication through affinity scheduling, caching, and overlapping communication with computation. We show several applications programmed with OmpSs and their performance with multiple GPUs in a local node and on remote nodes. The results show a good tradeoff between performance and programmer effort.

Collaboration


Eduard Ayguadé's most frequent co-authors and their affiliations.

Top Co-Authors

Jesús Labarta (Barcelona Supercomputing Center)
Mateo Valero (Polytechnic University of Catalonia)
Xavier Martorell (Polytechnic University of Catalonia)
Jordi Torres (Polytechnic University of Catalonia)
Marc Gonzàlez (Polytechnic University of Catalonia)
Nacho Navarro (Polytechnic University of Catalonia)
David Carrera (Polytechnic University of Catalonia)
Rosa M. Badia (Barcelona Supercomputing Center)
Vicenç Beltran (Barcelona Supercomputing Center)
Josep Llosa (Polytechnic University of Catalonia)