Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Rafael Asenjo is active.

Publication


Featured research published by Rafael Asenjo.


parallel computing | 2000

Automatic parallelization of irregular applications

Eladio Gutiérrez; Rafael Asenjo; Oscar G. Plata; Emilio L. Zapata

Parallel computers are present in a variety of fields, having reached a high degree of architectural maturity. However, there is still a lack of convenient software support for implementing efficient parallel applications. This is especially true for the class of irregular applications, whose computational constructs hardly fit current parallel architectures. In fact, contemporary automatic parallelizers generally produce poor parallel code from these applications. This paper discusses techniques and methods to help improve the quality of automatically generated parallel code. We focus on two issues: parallelism detection and parallelism implementation. The first refers to the detection of specific irregular computation constructs or data access patterns. The second considers the case in which some frequent construct has been detected but suboptimally parallelized. Both issues are dealt with in depth, in the context of sparse computations (for the first) and irregular histogram reductions (for the second).
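As a rough illustration of the second issue, a standard way to parallelize an irregular histogram reduction is to privatize the reduction array per thread and merge the copies afterwards, so the conflicting indirect accesses become race-free. The OpenMP sketch below is my own minimal example of that pattern, not the paper's code:

```cpp
#include <omp.h>
#include <vector>
#include <cstdio>

int main() {
    const int N = 1000000, BINS = 256;
    std::vector<int> ind(N);
    for (int i = 0; i < N; ++i) ind[i] = (i * 37) % BINS;  // irregular index pattern

    std::vector<long> hist(BINS, 0);

    #pragma omp parallel
    {
        std::vector<long> local(BINS, 0);       // private copy: no data races
        #pragma omp for nowait
        for (int i = 0; i < N; ++i)
            local[ind[i]]++;                    // irregular reduction, now race-free
        #pragma omp critical                    // merge step: one thread at a time
        for (int b = 0; b < BINS; ++b)
            hist[b] += local[b];
    }
    std::printf("hist[0] = %ld\n", hist[0]);
}
```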


parallel computing | 1995

Sparse Block and Cyclic Data Distributions for Matrix Computations

Rafael Asenjo; Luis F. Romero; Manuel Ujaldon; Emilio L. Zapata

A significant part of scientific codes consists of sparse matrix computations. In this work we propose two new pseudoregular data distributions for sparse matrices. The Multiple Recursive Decomposition (MRD) partitions the data using the prime factors of the dimensions of a multiprocessor network with mesh topology. Furthermore, we introduce a new storage scheme, storage-by-row-of-blocks, that significantly increases the efficiency of the Scatter distribution; we name this new variant the Block Row Scatter (BRS) distribution. The MRD and BRS methods achieve better results than the other methods analyzed, while being easier to implement. In fact, the data distributions resulting from the MRD and BRS methods are a generalization of the Block and Cyclic distributions used for dense matrices.
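To make the underlying idea concrete, the sketch below (an assumed illustration, not the paper's implementation) shows how Block and Cyclic (Scatter) style mappings assign a matrix entry (i, j) to a processor in a P x Q mesh from its coordinates alone, which is what makes such distributions attractive for sparse nonzeros:

```cpp
#include <cstdio>

struct Mesh { int P, Q; };  // processor mesh dimensions

// Cyclic/Scatter owner: entry (i,j) goes to processor (i mod P, j mod Q).
int ownerCyclic(Mesh m, int i, int j) {
    return (i % m.P) * m.Q + (j % m.Q);
}

// Block owner for an M x N matrix: contiguous row/column panels.
int ownerBlock(Mesh m, int M, int N, int i, int j) {
    int bi = i / ((M + m.P - 1) / m.P);   // row panel index
    int bj = j / ((N + m.Q - 1) / m.Q);   // column panel index
    return bi * m.Q + bj;
}

int main() {
    Mesh m{2, 2};
    std::printf("cyclic owner of (5,3): %d\n", ownerCyclic(m, 5, 3));
    std::printf("block owner of (5,3) in 8x8: %d\n", ownerBlock(m, 8, 8, 5, 3));
}
```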


international conference on supercomputing | 1999

New shape analysis techniques for automatic parallelization of C codes

Francisco Corbera; Rafael Asenjo; Emilio L. Zapata

Automatic parallelization of codes with complex data structures is becoming very important. These complex, and often recursive, data structures are widely used in scientific computing. Shape analysis is one of the key steps in the automatic parallelization of such codes. In this paper we extend the Static Shape Graph (SSG) method to enable the successful and accurate detection of complex doubly linked structures. These techniques have been implemented in a compiler, which has been validated on several C codes. In particular, we present the results the compiler achieves for the C sparse LU factorization algorithm. The output SSG for this case study perfectly describes the complex data structure used in the LU code.


The Journal of Supercomputing | 2014

Strategies for maximizing utilization on multi-CPU and multi-GPU heterogeneous architectures

Angeles G. Navarro; Antonio Vilches; Francisco Corbera; Rafael Asenjo

This paper explores the possibility of efficiently executing a single application using multicores simultaneously with multiple GPU accelerators under a parallel task programming paradigm. In particular, we address the challenge of extending a parallel_for template to allow its exploitation on heterogeneous architectures. Due to the asymmetry of the computing resources, we propose a dynamic scheduling strategy coupled with an adaptive partitioning scheme that resizes chunks to prevent underutilization and load imbalance of CPUs and GPUs. We also address the underutilization of the CPU core where a host thread operates. To solve it, we propose two different approaches: (1) a collaborative host thread strategy, in which the host thread, instead of busy-waiting for the GPU to complete, carries out useful chunk processing; and (2) a host thread blocking strategy combined with oversubscription, which delegates to the OS the duty of scheduling threads to available CPU cores in order to guarantee that all cores are doing useful work. Using two benchmarks we evaluate the overhead introduced by our scheduling and partitioning algorithms, finding it negligible. We also evaluate the efficiency of the proposed strategies, finding that allowing oversubscription controlled by the OS can be beneficial under certain scenarios.
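A minimal sketch of the dynamic chunking idea follows, with simulated devices and made-up chunk sizes (the paper's actual work extends a real parallel_for template over CPUs and GPUs). A shared atomic counter hands out chunks on demand, so the faster device naturally claims more work and load imbalance is avoided:

```cpp
#include <algorithm>
#include <atomic>
#include <thread>
#include <vector>
#include <cstdio>

std::atomic<long> next_iter{0};
const long N = 1'000'000;

void worker(const char* name, long chunk, std::vector<double>& a) {
    long total = 0;
    for (;;) {
        long begin = next_iter.fetch_add(chunk);   // claim the next chunk
        if (begin >= N) break;
        long end = std::min(begin + chunk, N);
        for (long i = begin; i < end; ++i) a[i] = 2.0 * i;  // the parallel_for body
        total += end - begin;
    }
    std::printf("%s processed %ld iterations\n", name, total);
}

int main() {
    std::vector<double> a(N);
    // "GPU" thread uses large chunks to amortize offload overhead;
    // CPU threads use small chunks to keep the tail balanced.
    std::thread gpu(worker, "gpu", 65536, std::ref(a));
    std::thread cpu0(worker, "cpu0", 4096, std::ref(a));
    std::thread cpu1(worker, "cpu1", 4096, std::ref(a));
    gpu.join(); cpu0.join(); cpu1.join();
}
```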


symposium on computer architecture and high performance computing | 2012

Global Data Re-allocation via Communication Aggregation in Chapel

Alberto Sanz; Rafael Asenjo; Juan Torres López; Rafael Larrosa; Angeles G. Navarro; Vassily Litvinov; Sung-Eun Choi; Bradford L. Chamberlain

Chapel is a parallel programming language designed to improve the productivity and ease of use of conventional and parallel computers. The language currently delivers suboptimal performance when executing codes that perform global data re-allocation operations on distributed memory architectures. This is mainly due to data communication being done without aggregation (one message for each remote array element). In this work, we analyze Chapel's standard Block and Cyclic distribution modules and optimize the communication routines for array assignments by performing aggregation. Thanks to the expressive power of Chapel, the compiler and runtime have enough information to perform communication aggregation without user intervention. The runtime relies on the low-level GASNet networking layer, whose one-sided bulk put/get routines with stride support are particularly useful for us. Experimental results conducted on Hector (a Cray XE6) and Jaguar (a Cray XK6) reveal that the implemented techniques can lead to significant reductions in communication time.
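Conceptually, aggregation replaces one message per remote element with a single bulk transfer of a packed buffer. The C++ sketch below is only a mock illustration of that trade-off (remote_put is invented for the example; the real mechanism lives in Chapel's runtime over GASNet):

```cpp
#include <vector>
#include <cstdio>

static long messages = 0;
// Mock of a one-sided put; a real implementation would move bytes over the network.
void remote_put(const double* buf, size_t n) { ++messages; (void)buf; (void)n; }

int main() {
    const size_t N = 1024, stride = 4;
    std::vector<double> local(N * stride, 1.0);

    messages = 0;                               // naive: one put per element
    for (size_t i = 0; i < N; ++i) remote_put(&local[i * stride], 1);
    std::printf("element-wise: %ld messages\n", messages);

    messages = 0;                               // aggregated: pack, then one bulk put
    std::vector<double> buf(N);
    for (size_t i = 0; i < N; ++i) buf[i] = local[i * stride];
    remote_put(buf.data(), N);
    std::printf("aggregated:   %ld messages\n", messages);
}
```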


IEEE Transactions on Parallel and Distributed Systems | 2004

A framework to capture dynamic data structures in pointer-based codes

Francisco Corbera; Rafael Asenjo; Emilio L. Zapata

To successfully exploit all the possibilities of current computer/multicomputer architectures, optimizing compiler techniques are a must. However, for codes based on pointers and dynamic data structures, these optimization techniques must be carried out after identifying the characteristics and properties of the data structures used in the code. We describe the framework and the analyzer we have implemented to capture complex data structures generated, traversed, and modified in pointer-based codes. Our method assigns a reduced set of reference shape graphs (RSRSG) to each statement to approximate the shape of the data structure after the execution of that statement. With the properties and operations that define the behavior of our RSRSG, the method can accurately detect complex recursive data structures such as a doubly linked list of pointers to trees where the leaves point to additional lists. Several experiments are carried out with real codes to validate the capabilities of our analyzer.
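For reference, the kind of structure mentioned (a doubly linked list of pointers to trees whose leaves point to additional lists) might be declared as follows; this is an illustrative rendering, not the paper's code:

```cpp
struct List;                 // forward declaration for the mutual references
struct Tree {
    double value;
    Tree *left, *right;
    List *leafData;          // leaves point to additional lists
};
struct List {
    Tree *tree;              // each list node points to a tree
    List *next, *prev;       // doubly linked
};

int main() {
    Tree t{1.0, nullptr, nullptr, nullptr};
    List head{&t, nullptr, nullptr};
    (void)head;
    return 0;
}
```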


high performance computing and communications | 2010

Evaluation of the Task Programming Model in the Parallelization of Wavefront Problems

Antonio J. Dios; Rafael Asenjo; Angeles G. Navarro; Francisco Corbera; Emilio L. Zapata

This paper analyzes the applicability of the task programming model to the parallelization of generic wavefront problems. Computations in this type of problem are characterized by a data dependency pattern across a data space, which can produce a variable number of independent tasks during the traversal of that space. We argue that it is better to formulate the parallelization of these wavefront-based programs in terms of logical tasks rather than threads, for several reasons: more efficient matching of computations to available resources, faster task start-up and creation times, improved load balancing, and higher-level thinking. To implement the parallel wavefront we have used two state-of-the-art task libraries: TBB and OpenMP 3.0. We highlight the differences between the two implementations, both from a programmer standpoint and from the performance point of view. To that end, we conduct several experiments to identify the factors that can limit performance in each case. We also present a task-based wavefront template that simplifies the coding of parallel wavefront programs. We have validated this template with three real dynamic programming algorithms, finding that the TBB-coded template always outperforms the OpenMP-based one.
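A hedged sketch of a 2D wavefront expressed with tasks is shown below. Note it uses OpenMP task dependences, a mechanism introduced after the OpenMP 3.0 tasking model the paper evaluates: cell (i, j) becomes ready once its north neighbor (i-1, j) and west neighbor (i, j-1) have completed:

```cpp
#include <omp.h>
#include <vector>
#include <cstdio>

int main() {
    const int N = 256;
    std::vector<double> grid(N * N, 1.0);
    double* a = grid.data();

    #pragma omp parallel
    #pragma omp single
    for (int i = 1; i < N; ++i)
        for (int j = 1; j < N; ++j) {
            // cell (i,j) waits for its north and west neighbors
            #pragma omp task depend(in: a[(i-1)*N + j], a[i*N + j-1]) depend(out: a[i*N + j]) firstprivate(i, j)
            a[i*N + j] = 0.5 * (a[(i-1)*N + j] + a[i*N + j-1]);
        }
    // all tasks complete before the parallel region ends
    std::printf("a[N*N-1] = %f\n", a[N*N - 1]);
}
```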


IEEE Transactions on Parallel and Distributed Systems | 2016

Mapping Streaming Applications on Commodity Multi-CPU and GPU On-Chip Processors

Antonio Vilches; Angeles G. Navarro; Rafael Asenjo; Francisco Corbera; Ruben Gran; María Jesús Garzarán

In this paper, we consider the problem of efficiently executing streaming applications on commodity processors composed of several cores and an on-chip GPU. Streaming applications, such as those in vision and video analytics, consist of a pipeline of stages and are good candidates to take advantage of this type of platform. We also consider that characteristics of the input may change while the application is running. Therefore, we propose a framework that adaptively finds the optimal mapping of the pipeline stages. The core of the framework is an analytical model, coupled with information collected at runtime, used to dynamically map each pipeline stage to the most efficient device, taking into consideration both performance and energy. Our experimental results show that for the evaluated applications running on two different architectures, our model always predicts the best configuration among the evaluated alternatives, and significantly reduces the amount of information that needs to be collected at runtime. This best configuration has, on average, 20 percent higher throughput than the configuration recommended by a baseline state-of-the-art approach, while the throughput/energy ratio is 43 percent higher. We have measured improvements in throughput and throughput/energy of up to 81 and 204 percent, respectively, when the model is used to adapt to a video that changes from low to high definition.
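As a toy illustration of the mapping decision (the numbers and the brute-force enumeration are assumptions for the example, not the paper's analytical model), one can rank CPU/GPU assignments of the pipeline stages by estimated throughput and throughput per energy:

```cpp
#include <cstdio>
#include <vector>
#include <string>

struct Stage {
    double cpu_items_s, gpu_items_s;   // measured throughput per device
    double cpu_watts,  gpu_watts;      // measured power per device
};

int main() {
    std::vector<Stage> st = {
        {100, 400, 30, 60},   // stage 0: much faster on the GPU
        {250, 200, 30, 60},   // stage 1: slightly faster on the CPU
    };
    int nmaps = 1 << st.size();
    for (int m = 0; m < nmaps; ++m) {      // bit i == 1 -> stage i on GPU
        double tput = 1e30, watts = 0;
        std::string desc;
        for (size_t i = 0; i < st.size(); ++i) {
            bool gpu = m & (1 << i);
            double t = gpu ? st[i].gpu_items_s : st[i].cpu_items_s;
            watts += gpu ? st[i].gpu_watts : st[i].cpu_watts;
            if (t < tput) tput = t;        // the pipeline runs at its slowest stage
            desc += gpu ? 'G' : 'C';
        }
        std::printf("mapping %s: %.0f items/s, %.2f items/J\n",
                    desc.c_str(), tput, tput / watts);
    }
}
```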


ieee international conference on high performance computing data and analytics | 2004

A new dependence test based on shape analysis for pointer-based codes

Angeles G. Navarro; Francisco Corbera; Rafael Asenjo; Adrian Tineo; Oscar G. Plata; Emilio L. Zapata

The approach presented in this paper focuses on detecting data dependences induced by heap-directed pointers in loops that access dynamic data structures. Knowledge about the shape of the data structure accessible from a heap-directed pointer provides critical information for disambiguating the heap accesses originating from it. Our approach is based on a previously developed shape analysis that maintains topological information about the connections among the different nodes (memory locations) in the data structure. The novelty is that our approach carries out abstract interpretation of the statements being analyzed and lets us annotate the memory locations reached by each statement with read/write information. This information is later used to find dependences via a highly accurate dependence test, which we introduce in this paper.
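For intuition, consider the loop below (an assumed example, not taken from the paper): whether its iterations are independent hinges on the shape of the heap structure reachable from head, which is exactly what the shape analysis establishes. If the analysis proves the structure is an acyclic, unaliased singly linked list, each iteration writes a distinct node and the loop is parallelizable:

```cpp
struct Node { double data; Node* next; };

// Iterations touch disjoint memory only if no two list positions alias.
void scale(Node* head, double k) {
    for (Node* p = head; p != nullptr; p = p->next)
        p->data *= k;
}

int main() {
    Node b{2.0, nullptr}, a{1.0, &b};   // two-node list: a -> b
    scale(&a, 3.0);
    return 0;
}
```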


parallel computing | 1999

Data-parallel support for numerical irregular problems

Emilio L. Zapata; Oscar G. Plata; G.P. Trabado; Rafael Asenjo

A large class of intensive numerical applications show an irregular structure, exhibiting unpredictable runtime behavior. Two kinds of irregularity can be distinguished in these applications. First, irregular control structures, derived from the use of conditional statements on data only known at runtime. Second, irregular data structures, derived from computations involving sparse matrices, grids, trees, graphs, etc. Many of these applications exhibit a large amount of parallelism, but the above features usually make exploiting that parallelism a very difficult task. This paper discusses the effective parallelization of numerical irregular codes, focusing on the definition and use of data-parallel extensions to express the parallelism they exhibit. We show that combining data distributions with storage structures allows efficient parallel codes to be obtained. Codes dealing with sparse matrices, finite element methods, and molecular dynamics (MD) simulations are taken as working examples.

Collaboration


Dive into Rafael Asenjo's collaboration.
