Network


Latest external collaborations at the country level. Dive into the details by clicking on the dots.

Hotspot


Dive into the research topics where Alejandro Duran is active.

Publication


Featured research published by Alejandro Duran.


International Workshop on OpenMP | 2008

Extending the OpenMP tasking model to allow dependent tasks

Alejandro Duran; Josep M. Perez; Eduard Ayguadé; Rosa M. Badia; Jesús Labarta

Tasking in OpenMP 3.0 has been conceived to handle the dynamic generation of unstructured parallelism. New directives have been added allowing the user to identify units of independent work (tasks) and to define points to wait for the completion of tasks (task barriers). In this paper we propose an extension to allow the runtime detection of dependencies between generated tasks, broadening the range of applications that can benefit from tasking and improving performance when load balancing or locality are critical issues. Furthermore, the paper describes our proof-of-concept implementation (SMP Superscalar) and shows preliminary performance results on an SGI Altix 4700.
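
The runtime-detected task dependences proposed here were later standardized in OpenMP 4.0 as the depend clause. As a minimal sketch using that later standardized syntax (not the paper's original proposal), a producer/consumer pair over array elements could be expressed as:

    // Minimal sketch using the OpenMP 4.0 depend clause, which standardized
    // the idea proposed in this paper; the paper's own syntax differed.
    #include <stdio.h>
    #define N 4

    int main(void) {
        int block[N];
        #pragma omp parallel
        #pragma omp single
        for (int i = 0; i < N; i++) {
            #pragma omp task depend(out: block[i])   // producer task
            block[i] = i * i;

            #pragma omp task depend(in: block[i])    // consumer runs only after
            printf("block %d -> %d\n", i, block[i]); // its producer has finished
        }
        return 0;
    }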


International Conference on High Performance Computing and Simulation | 2012

The Intel® Many Integrated Core Architecture

Alejandro Duran; Michael Klemm

In recent years, an observable trend in High Performance Computing (HPC) architectures has been the inclusion of accelerators, such as GPUs and field-programmable gate arrays (FPGAs), to improve the performance of scientific applications. To rise to this challenge, Intel announced the Intel® Many Integrated Core Architecture (Intel® MIC Architecture). In contrast with other accelerated platforms, the Intel MIC Architecture is a general-purpose, manycore coprocessor that improves the programmability of such devices by supporting the well-known shared-memory execution model that underlies most nodes in HPC machines. In this presentation, we introduce key properties of the Intel MIC Architecture and cover programming models for the parallelization and vectorization of applications targeting this architecture.
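
The shared-memory execution model mentioned above means that parallel loops written for multicore hosts carry over to the coprocessor largely unchanged. A hedged sketch, using the OpenMP 4.0 target directives that were standardized after this presentation (the presentation's own examples are not reproduced here):

    // Sketch only: offload a parallel, vectorized loop to an attached
    // coprocessor/accelerator using OpenMP 4.0 target directives; the same
    // loop runs unchanged on the host when no device is available.
    void saxpy(int n, float a, const float *x, float *y) {
        #pragma omp target map(to: x[0:n]) map(tofrom: y[0:n])
        #pragma omp parallel for simd
        for (int i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }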


International Conference on Parallel Processing | 2011

Productive cluster programming with OmpSs

Javier Bueno; Luis Martinell; Alejandro Duran; Montse Farreras; Xavier Martorell; Rosa M. Badia; Eduard Ayguadé; Jesús Labarta

Clusters of SMPs are ubiquitous. They have traditionally been programmed using MPI, but the productivity of MPI programmers is low because of the complexity of expressing parallelism and communication, and the difficulty of debugging. To ease the burden on the programmer, new programming models have tried to give the illusion of a global shared address space (e.g., UPC, Co-array Fortran). Unfortunately, these models do not support the increasingly common irregular forms of parallelism that require asynchronous task parallelism. Other models, such as X10 or Chapel, provide this asynchronous parallelism, but the programmer is required to rewrite the application entirely. We present the implementation of OmpSs for clusters, a variant of OpenMP extended to support asynchrony, heterogeneity, and data movement for task parallelism. Like OpenMP, it is based on decorating an existing serial version with compiler directives that are translated into calls to a runtime system that manages parallelism extraction and data coherence and movement. Thus, the same OmpSs program can run on a regular SMP machine or on a cluster of SMPs, and the serial version can even be used for debugging. The runtime uses the information provided by the programmer to distribute work across the cluster while optimizing communication using affinity scheduling and data caching. We have evaluated our proposal with a set of kernels, and the OmpSs versions obtain performance comparable, or even superior, to that of the equivalent MPI versions.
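
To give a flavor of the programming style described above, here is a minimal OmpSs-style sketch; the clause spellings (in, out, and the [start;size] array-region syntax) follow later OmpSs documentation and may not match the prototype evaluated in the paper:

    // Illustrative OmpSs-style sketch: each task declares the array regions
    // it reads and writes, and the runtime derives dependences, schedules the
    // tasks across cluster nodes, and moves/caches the data as needed.
    void block_add(int bs, const double *a, const double *b, double *c) {
        for (int i = 0; i < bs; i++)
            c[i] = a[i] + b[i];
    }

    void add_all(int n, int bs, double *a, double *b, double *c) {
        for (int j = 0; j < n; j += bs) {
            #pragma omp task in(a[j;bs], b[j;bs]) out(c[j;bs])
            block_add(bs, &a[j], &b[j], &c[j]);
        }
        #pragma omp taskwait   // wait for all blocks before returning
    }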


International Journal of Parallel Programming | 2010

Extending OpenMP to Survive the Heterogeneous Multi-Core Era

Eduard Ayguadé; Rosa M. Badia; Pieter Bellens; Daniel Cabrera; Alejandro Duran; Roger Ferrer; Marc Gonzàlez; Francisco D. Igual; Daniel Jiménez-González; Jesús Labarta; Luis Martinell; Xavier Martorell; Rafael Mayo; Josep M. Perez; Judit Planas; Enrique S. Quintana-Ortí

This paper advances the state of the art in programming models for exploiting task-level parallelism on heterogeneous many-core systems, presenting a number of extensions to the OpenMP language inspired by the StarSs programming model. The proposed extensions allow the programmer to easily write portable code for a number of different platforms, relieving them from developing the specific code to offload tasks to the accelerators and to synchronize those tasks. Our results, obtained from the StarSs instantiations for SMPs, the Cell, and GPUs, report reasonable parallel performance. However, the real impact of our approach is the productivity gain it yields for the programmer.


International Symposium on Performance Analysis of Systems and Software | 2011

Trace-driven simulation of multithreaded applications

Alejandro Rico; Alejandro Duran; Felipe Cabarcas; Yoav Etsion; Alex Ramirez; Mateo Valero

Over the past few years, computer architecture research has moved towards execution-driven simulation, due to the inability of traces to capture timing-dependent thread interleaving. However, trace-driven simulation has many advantages over execution-driven simulation that are lost in multithreaded application simulations. We present a methodology to properly simulate multithreaded applications using trace-driven environments. We distinguish the intrinsic application behavior from the computation used to manage parallelism. Application traces capture the intrinsic behavior of the sections of code that are independent of the dynamic multithreaded nature of the application, along with the points where parallelism-management computation occurs. The simulation framework is composed of a trace-driven simulation engine and a dynamic-behavior component that implements the parallelism-management operations for the application. At simulation time, these operations are reproduced by invoking their implementation in the dynamic-behavior component. The decisions made by these operations are based on the simulated architecture, allowing sections of code taken from the trace to be dynamically rescheduled onto the simulated components. As the captured sections of code are independent of the parallel state of the application, they can be simulated on the trace-driven engine, while the parallelism-management operations, which must be re-executed, are carried out by the execution-driven component, thus achieving the best of both the trace- and execution-driven worlds. This simulation methodology creates several new research opportunities, including research on scheduling and other parallelism-management techniques for future architectures, and on hardware support for programming models.
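
The separation described above can be summarized in a short sketch. This is a hypothetical illustration only; the type and function names are invented here and are not the paper's actual interfaces:

    #include <stdio.h>

    /* Hypothetical sketch of the hybrid simulation loop: intrinsic code
       sections are replayed from the trace, while parallelism-management
       operations are re-executed so they can react to the simulated machine. */
    typedef enum { TRACE_SECTION, RUNTIME_OP } event_kind;

    typedef struct {
        event_kind kind;
        long       cycles;  /* recorded cost of an intrinsic section         */
        int        op_id;   /* which parallelism-management op to re-execute */
    } trace_event;

    static long sim_time = 0;                 /* simulated clock */

    static void replay_section(long cycles) {
        sim_time += cycles;                   /* timing comes from the trace */
    }

    static void execute_runtime_op(int op_id) {
        /* a real implementation would call into the scheduling/runtime model
           here, possibly rescheduling later trace sections */
        printf("runtime op %d at cycle %ld\n", op_id, sim_time);
    }

    void simulate(const trace_event *trace, int n_events) {
        for (int i = 0; i < n_events; i++) {
            if (trace[i].kind == TRACE_SECTION)
                replay_section(trace[i].cycles);      /* trace-driven part     */
            else
                execute_runtime_op(trace[i].op_id);   /* execution-driven part */
        }
    }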


International Journal of Parallel Programming | 2009

A proposal to extend the OpenMP tasking model with dependent tasks

Alejandro Duran; Roger Ferrer; Eduard Ayguadé; Rosa M. Badia; Jesús Labarta

Tasking in OpenMP 3.0 has been conceived to handle the dynamic generation of unstructured parallelism. New directives have been added allowing the user to identify units of independent work (tasks) and to define points to wait for the completion of tasks (task barriers). In this document we propose extensions to allow the runtime detection of dependencies between generated tasks, broadening the range of applications that can benefit from tasking and improving performance when load balancing or locality are critical issues. The proposed extensions are evaluated on an SGI Altix multiprocessor architecture using a couple of small applications and a prototype runtime system implementation.


Languages and Compilers for Parallel Computing | 2010

Optimizing the exploitation of multicore processors and GPUs with OpenMP and OpenCL

Roger Ferrer; Judit Planas; Pieter Bellens; Alejandro Duran; Marc Gonzàlez; Xavier Martorell; Rosa M. Badia; Eduard Ayguadé; Jesús Labarta

In this paper, we present OMPSs, a programming model based on OpenMP and StarSs that can also incorporate the use of OpenCL or CUDA kernels. We evaluate the proposal on three different architectures, SMP, Cell/B.E., and GPUs, showing the broad applicability of the approach. The evaluation is done with four different benchmarks: Matrix Multiply, BlackScholes, Perlin Noise, and Julia Set. We compare the results with the execution of the same benchmarks written in OpenCL on the same architectures. The results show that OMPSs greatly outperforms the OpenCL environment, is more flexible in exploiting multiple accelerators, and, thanks to the simplicity of its annotations, increases programmer productivity.
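
As a rough illustration of how a CUDA kernel can be pulled into this model, the sketch below annotates a kernel declaration as an OMPSs task; the directive spellings (target device, ndrange, copy_deps) are taken from later OmpSs documentation and may differ from the exact syntax evaluated in the paper:

    /* Illustrative sketch only; requires the OmpSs toolchain, and the kernel
       body itself would live in a separate .cu file. The runtime copies the
       declared regions to the GPU and launches the kernel as a task. */
    #pragma omp target device(cuda) ndrange(1, n, 128) copy_deps
    #pragma omp task in(x[0;n]) inout(y[0;n])
    __global__ void saxpy_kernel(int n, float a, const float *x, float *y);

    void saxpy(int n, float a, const float *x, float *y) {
        saxpy_kernel(n, a, x, y);   /* becomes an asynchronous GPU task */
        #pragma omp taskwait        /* wait for the device task to finish */
    }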


International Workshop on OpenMP | 2012

Extending OpenMP* with vector constructs for modern multicore SIMD architectures

Michael Klemm; Alejandro Duran; Xinmin Tian; Hideki Saito; Diego Caballero; Xavier Martorell

In order to obtain maximum performance, many applications need to extend parallelism from multi-threading to the instruction-level (SIMD) parallelism that exists in many current (and future) multi-core architectures. While auto-vectorization technology has been used to exploit this SIMD level, it is not always enough due to OpenMP semantics and compiler technology limitations. In those cases, programmers need to resort to low-level intrinsics or vendor-specific directives. We propose a new OpenMP directive: the simd directive. This directive allows programmers to guide the vectorization process, enabling a more productive and portable exploitation of the SIMD level. Our performance results show significant improvements over the current auto-vectorization technology of the Intel® Composer XE 2011.
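
The simd directive proposed here was subsequently adopted in OpenMP 4.0. A minimal sketch of its use, assuming that later standardized spelling:

    // Minimal sketch of the proposed simd construct (shown with the spelling
    // later adopted in OpenMP 4.0): the directive tells the compiler the loop
    // is safe to vectorize, instead of relying on auto-vectorization
    // heuristics or vendor-specific intrinsics.
    float dot(int n, const float *restrict a, const float *restrict b) {
        float sum = 0.0f;
        #pragma omp simd reduction(+:sum)
        for (int i = 0; i < n; i++)
            sum += a[i] * b[i];
        return sum;
    }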


Languages and Compilers for Parallel Computing | 2009

Unrolling loops containing task parallelism

Roger Ferrer; Alejandro Duran; Xavier Martorell; Eduard Ayguadé

Classic loop unrolling increases the performance of sequential loops by reducing the overhead of the non-computational parts of the loop. Unfortunately, when the loop contains parallelism inside, most compilers will either ignore it or perform a naive transformation. We propose to extend the semantics of the loop-unrolling transformation to cover loops that contain task parallelism. In these cases, the transformation tries to aggregate the multiple tasks that appear after a classic unrolling phase in order to reduce the overhead per iteration. We present an implementation of such extended loop unrolling for OpenMP tasks with two phases: a classical unroll followed by a task-aggregation phase. Our aggregation technique covers the special cases where task parallelism appears inside branches or where the loop is uncountable. Our experimental results show that this extended unroll allows loops with fine-grained tasks to reduce the overheads associated with task creation and to achieve much better scaling.
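
The transformation is easiest to see on a trivial loop. The hand-written before/after sketch below only illustrates the idea; the paper's compiler performs the unrolling and task aggregation automatically and also handles branches and uncountable loops:

    /* Both functions are assumed to be called from inside a parallel/single
       region so the generated tasks are executed by the thread team. */

    // Before: one fine-grained task per iteration
    void scale_tasks(int n, float *a) {
        for (int i = 0; i < n; i++) {
            #pragma omp task firstprivate(i)
            a[i] *= 2.0f;
        }
    }

    // After unrolling by 4 and aggregating the resulting tasks: one task per
    // four iterations, reducing task-creation overhead (remainder loop
    // omitted; assumes n is a multiple of 4)
    void scale_tasks_unrolled(int n, float *a) {
        for (int i = 0; i < n; i += 4) {
            #pragma omp task firstprivate(i)
            {
                a[i]     *= 2.0f;
                a[i + 1] *= 2.0f;
                a[i + 2] *= 2.0f;
                a[i + 3] *= 2.0f;
            }
        }
    }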


IEEE International Conference on High Performance Computing, Data, and Analytics | 2016

Effective use of large high-bandwidth memory caches in HPC stencil computation via temporal wave-front tiling

Charles R. Yount; Alejandro Duran

Stencil computation is an important class of algorithms used in a large variety of scientific-simulation applications. The performance of stencil calculations is often bounded by memory bandwidth. High-bandwidth memory (HBM) on devices such as those in the Intel® Xeon Phi™ x200 processor family (code-named Knights Landing) can thus provide additional performance. In a traditional sequential time-step approach, the additional bandwidth is best utilized when the stencil data fits into the HBM, restricting the problem sizes that can be undertaken and under-utilizing the larger DDR memory on the platform. As problem sizes become significantly larger than the HBM, the effective bandwidth approaches that of the DDR, degrading performance. This paper explores the use of temporal wave-front tiling to add an additional layer of cache blocking that allows efficient use of both the HBM bandwidth and the DDR capacity. Details of the cache-blocking and wave-front tiling algorithms are given, and results on a Xeon Phi processor are presented, comparing performance across problem sizes and among four experimental configurations. Analyses of the bandwidth utilization and HBM-cache hit rates are also provided, illustrating the correlation between these metrics and performance. It is demonstrated that temporal wave-front tiling can provide a 2.4x speedup compared to using the HBM cache without temporal tiling and a 3.3x speedup compared to using only DDR memory for large problem sizes.
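
To show the loop structure that temporal tiling implies, here is a deliberately simplified 1-D, single-threaded sketch; the paper's implementation targets 3-D stencils on Knights Landing, sizes the outer blocks to the HBM cache, and runs the tiles with OpenMP, none of which is reproduced here:

    static int imin(int a, int b) { return a < b ? a : b; }
    static int imax(int a, int b) { return a > b ? a : b; }

    /* Simplified 1-D, 3-point Jacobi stencil with temporal tiling: each
       spatial tile is skewed (x' = x + t) and advanced bt time steps while
       its data stays in fast memory, instead of streaming the whole grid
       through memory once per time step. The result ends up in a or b
       depending on the parity of tsteps. */
    void stencil_tiled(int n, int tsteps, int bx, int bt,
                       double *restrict a, double *restrict b) {
        for (int tb = 0; tb < tsteps; tb += bt) {                 /* temporal tile */
            int tend = imin(tb + bt, tsteps);
            for (int xs = 1 + tb; xs < (n - 1) + tend; xs += bx) {/* skewed tile   */
                for (int t = tb; t < tend; t++) {
                    double *cur = (t % 2 == 0) ? a : b;
                    double *nxt = (t % 2 == 0) ? b : a;
                    int x_lo = imax(1, xs - t);
                    int x_hi = imin(n - 1, xs + bx - t);
                    for (int x = x_lo; x < x_hi; x++)
                        nxt[x] = (cur[x - 1] + cur[x] + cur[x + 1]) / 3.0;
                }
            }
        }
    }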

Collaboration


Dive into Alejandro Duran's collaborations.

Top Co-Authors

Xavier Martorell, Polytechnic University of Catalonia
Eduard Ayguadé, Barcelona Supercomputing Center
Roger Ferrer, Barcelona Supercomputing Center
Jesús Labarta, Barcelona Supercomputing Center
Rosa M. Badia, Barcelona Supercomputing Center
Diego Caballero, Barcelona Supercomputing Center
Bronis R. de Supinski, Lawrence Livermore National Laboratory
Stephen L. Olivier, Sandia National Laboratories