Raúl de la Cruz | Researchain

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Raúl de la Cruz is active.

Explore More

Publication

Featured researches published by Raúl de la Cruz.

ieee international conference on high performance computing data and analytics | 2009

3D seismic imaging through reverse-time migration on homogeneous and heterogeneous multi-core processors

Mauricio Araya-Polo; Felix Rubio; Raúl de la Cruz; Mauricio Hanzich; José María Cela; Daniele Paolo Scarpazza

Reverse-Time Migration (RTM) is a state-of-the-art technique in seismic acoustic imaging, because of the quality and integrity of the images it provides. Oil and gas companies trust RTM with crucial decisions on multi-million-dollar drilling investments. But RTM requires vastly more computational power than its predecessor techniques, and this has somewhat hindered its practical success. On the other hand, despite multi-core architectures promise to deliver unprecedented computational power, little attention has been devoted to mapping efficiently RTM to multi-cores. In this paper, we present a mapping of the RTM computational kernel to the IBM Cell/B.E. processor that reaches close-to-optimal performance. The kernel proves to be memory-bound and it achieves a 98% utilization of the peak memory bandwidth. Our Cell/B.E. implementation outperforms a traditional processor (PowerPC 970MP) in terms of performance (with an 15.0× speedup) and energy-efficiency (with a 10.0× increase in the GFlops/W delivered). Also, it is the fastest RTM implementation available to the best of our knowledge. These results increase the practical usability of RTM. Also, the RTM-Cell/B.E. combination proves to be a strong competitor in the seismic arena.

field-programmable technology | 2009

Exploiting memory customization in FPGA for 3D stencil computations

Muhammad Shafiq; Miquel Pericàs; Raúl de la Cruz; Mauricio Araya-Polo; Nacho Navarro; Eduard Ayguadé

3D stencil computations are compute-intensive kernels often appearing in high-performance scientific and engineering applications. The key to efficiency in these memory-bound kernels is full exploitation of data reuse. This paper explores the design aspects for 3D-Stencil implementations that maximize the reuse of all input data on a FPGA architecture. The work focuses on the architectural design of 3D stencils with the form n × (n + 1) × n, where n = {2, 4, 6, 8, ...}. The performance of the architecture is evaluated using two design approaches, ¿Multi-Volume¿ and ¿Single-Volume¿. When n = 8, the designs achieve a sustained throughput of 55.5 GFLOPS in the ¿Single-Volume¿ approach and 103 GFLOPS in the ¿Multi-Volume¿ design approach in a 100-200 MHz multi-rate implementation on a Virtex-4 LX200 FPGA. This corresponds to a stencil data delivery of 1500 bytes/cycle and 2800 bytes/cycle respectively. The implementation is analyzed and compared to two CPU cache approaches and to the statically scheduled local stores on the IBM PowerXCell 8i. The FPGA approaches designed here achieve much higher bandwidth despite the FPGA device being the least recent of the chips considered. These numbers show how a custom memory organization can provide large data throughput when implementing 3D stencil kernels.

international conference on conceptual structures | 2011

Towards a Multi-Level Cache Performance Model for 3D Stencil Computation

Raúl de la Cruz; Mauricio Araya-Polo

Abstract It is crucial to optimize stencil computations since they are the core (and most computational demanding segment) of many Scientific Computing applications, therefore reducing overall execution time. This is not a simple task, actually it is lengthy and tedious. It is lengthy because the large number of stencil optimizations combinations to test, which might consume days of computing time, and the process is tedious due to the slightly different versions of code to implement. Alternatively, models that predict performance can be built without any actual stencil execution, thus reducing the cumbersome optimization task. Previous works have proposed cache misses and execution time models for specific stencil optimizations. Furthermore, most of them have been designed for 2D datasets or stencil sizes that only suit low order numerical schemes. We propose a flexible and accurate model for a wide range of stencil sizes up to high order schemes, that captures the behavior of 3D stencil computations using platform parameters. The model has been tested in a group of representative hardware architectures, using realistic dataset sizes. Our model predicts successfully stencil execution times and cache misses. However, predictions accuracy depends on the platform, for instance on x86 architectures prediction errors ranges between 1-20%. Therefore, the model is reliable and can help to speed up the stencil computation optimization process. To that end, other stencil optimization techniques can be added to this model, thus essentially providing a framework which covers most of the state-of-the-art.

ACM Transactions on Mathematical Software | 2014

Algorithm 942: Semi-Stencil

Raúl de la Cruz; Mauricio Araya-Polo

Finite Difference (FD) is a widely used method to solve Partial Differential Equations (PDE). PDEs are the core of many simulations in different scientific fields, such as geophysics, astrophysics, etc. The typical FD solver performs stencil computations for the entire computational domain, thus solving the differential operators. In general terms, the stencil computation consists of a weighted accumulation of the contribution of neighbor points along the cartesian axis. Therefore, optimizing stencil computations is crucial in reducing the application execution time. Stencil computation performance is bounded by two main factors: the memory access pattern and the inefficient reuse of the accessed data. We propose a novel algorithm, named Semi-stencil, that tackles these two problems. The main idea behind this algorithm is to change the way in which the stencil computation progresses within the computational domain. Instead of accessing all required neighbors and adding all their contributions at once, the Semi-stencil algorithm divides the computation into several updates. Then, each update gathers half of the axis neighbors, partially computing at the same time the stencil in a set of closely located points. As Semi-stencil progresses through the domain, the stencil computations are completed on precomputed points. This computation strategy improves the memory access pattern and efficiently reuses the accessed data. Our initial target architecture was the Cell/B.E., where the Semi-stencil in a SPE was 44% faster than the naive stencil implementation. Since then, we have continued our research on emerging multicore architectures in order to assess and extend this work on homogeneous architectures. The experiments presented combine the Semi-stencil strategy with space- and time-blocking algorithms used in hierarchical memory architectures. Two x86 (Intel Nehalem and AMD Opteron) and two POWER (IBM POWER6 and IBM BG/P) platforms are used as testbeds, where the best improvements for a 25-point stencil range from 1.27 to 1.76× faster. The results show that this novel strategy is a feasible optimization method which may be integrated into auto-tuning frameworks. Also, since all current architectures are multicore based, we have introduced a brief section where scalability results on IBM POWER7-, Intel Xeon-, and MIC-based systems are presented. In a nutshell, the algorithm scales as well as or better than other stencil techniques. For instance, the scalability of Semi-stencil on MIC for a certain testcase reached 93.8 × over 244 threads.

International Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems | 2014

Modeling Stencil Computations on Modern HPC Architectures

Raúl de la Cruz; Mauricio Araya-Polo

Stencil computations are widely used for solving Partial Differential Equations (PDEs) explicitly by Finite Difference schemes. The stencil solver alone -depending on the governing equation- can represent up to 90 % of the overall elapsed time, of which moving data back and forth from memory to CPU is a major concern. Therefore, the development and analysis of source code modifications that can effectively use the memory hierarchy of modern architectures is crucial. Performance models help expose bottlenecks and predict suitable tuning parameters in order to boost stencil performance on any given platform. To achieve that, the following two considerations need to be accurately modeled: first, modern architectures, such as Intel Xeon Phi, sport multi- or many-core processors with shared multi-level caches featuring one or several prefetching engines. Second, algorithmic optimizations, such as spatial blocking or Semi-stencil, have complex behaviors that follow the intricacy of the above described modern architectures. In this work, a previously published performance model is extended to effectively capture these architectural and algorithmic characteristics. The extended model results show an accuracy error ranging from 5–15 %.

computing frontiers | 2018

LEGaTO: towards energy-efficient, secure, fault-tolerant toolset for heterogeneous computing

Adrian Cristal; Osman S. Unsal; Xavier Martorell; Paul M. Carpenter; Raúl de la Cruz; Leonardo Bautista; Daniel A. Jiménez; Carlos Álvarez; Behzad Salami; Sergi Madonar; Miquel Pericàs; Pedro Trancoso; Micha vor dem Berge; Gunnar Billung-Meyer; Stefan Krupop; Wolfgang Christmann; Frank Klawonn; Amani Mihklafi; Tobias Becker; Georgi Gaydadjiev; Hans Salomonsson; Devdatt Dubhashi; Oron Port; Yoav Etsion; Vesna Nowack; Christof Fetzer; Jens Hagemeyer; Thorsten Jungeblut; Nils Kucza; Martin Kaiser

LEGaTO is a three-year EU H2020 project which started in December 2017. The LEGaTO project will leverage task-based programming models to provide a software ecosystem for Made-in-Europe heterogeneous hardware composed of CPUs, GPUs, FPGAs and dataflow engines. The aim is to attain one order of magnitude energy savings from the edge to the converged cloud/HPC.

Computers & Geosciences | 2016

Optimization of atmospheric transport models on HPC platforms

Raúl de la Cruz; Arnau Folch; Pau Farré; Javier Cabezas; Nacho Navarro; José María Cela

The performance and scalability of atmospheric transport models on high performance computing environments is often far from optimal for multiple reasons including, for example, sequential input and output, synchronous communications, work unbalance, memory access latency or lack of task overlapping. We investigate how different software optimizations and porting to non general-purpose hardware architectures improve code scalability and execution times considering, as an example, the FALL3D volcanic ash transport model. To this purpose, we implement the FALL3D model equations in the WARIS framework, a software designed from scratch to solve in a parallel and efficient way different geoscience problems on a wide variety of architectures. In addition, we consider further improvements in WARIS such as hybrid MPI-OMP parallelization, spatial blocking, auto-tuning and thread affinity. Considering all these aspects together, the FALL3D execution times for a realistic test case running on general-purpose cluster architectures (Intel Sandy Bridge) decrease by a factor between 7 and 40 depending on the grid resolution. Finally, we port the application to Intel Xeon Phi (MIC) and NVIDIA GPUs (CUDA) accelerator-based architectures and compare performance, cost and power consumption on all the architectures. Implications on time-constrained operational model configurations are discussed. HighlightsImproving the performance and scalability of atmospheric transport models, using the volcanic ash dispersal FALL3D model as an example.Porting of the application to Intel Xeon Phi (MIC) and NVIDIA GPUs (CUDA) accelerator-based architectures and comparison of performance, cost and power consumption on all the architectures.Implications on time-constrained operational model configurations are discussed.

Numerical Mathematics and Advanced Applications - ENUMATH 2013 | 2015

Unveiling WARIS Code, a Parallel and Multi-purpose FDM Framework

Raúl de la Cruz; Mauricio Hanzich; Arnau Folch; Guillaume Houzeaux; José María Cela

WARIS is an in-house multi-purpose framework focused on solving scientific problems using Finite Difference Methods as numerical scheme. Its framework was designed from scratch to solve in a parallel and efficient way Earth Science and Computational Fluid Dynamic problems on a wide variety of architectures. WARIS uses structured meshes to discretize the problem domains, as these are better suited for optimization in accelerator-based architectures. To succeed in such challenge, WARIS framework was initially designed to be modular in order to ease development cycles, portability, reusability and future extensions of the framework. In order to assess its performance, a code that solves the vectorial Advection-Diffusion-Sedimentation equation has been ported to the WARIS framework. This problem appears in many geophysical applications, including atmospheric transport of passive substances. As an application example, we focus on atmospheric dispersion of volcanic ash, a case in which operational code performance is critical given the threat posed by this substance on aircraft engines. Preliminary results are very promising, performance has been improved by 8.2× with respect to the baseline code using a realistic case. This opens new perspectives for operational setups, including efficient ensemble forecast.

ieee international conference on high performance computing data and analytics | 2010

Exploring a Novel Gathering Method for Finite Element Codes on the Cell/B.E. Architecture

Mohammad Jowkar; Raúl de la Cruz; José María Cela

Indirect addressing is known for being slow on conventional architectures, due to the extra step of gathering data before computations can be done. There have been proposed many methods for optimizing indirect addressing. However, these almost exclusively, merely try to change the order in which data is accessed, so as to better utilize the cache. Furthermore, vector instructions can not be used, since data is not accessed continuously, and therefore valuable processing power can not be exploited. The Cell/B.E. architecture has multiple powerful DMA engines, suitable for gathering scattered data. Unfortunately, at fine data granularity, they have several constraints which make them inefficient. In this paper, a novel solution called DMA list Interlacing (DLI) is explored, which overcomes the DMA constraints and enables the usage of vector instructions, without any extra effort. It is shown that DLI can achieve speedups of several orders of magnitude, compared to conventional processors.

Computers & Fluids | 2013