Mauricio Araya-Polo
Royal Dutch Shell
Publications
Featured research published by Mauricio Araya-Polo.
IEEE Transactions on Parallel and Distributed Systems | 2011
Mauricio Araya-Polo; Javier Cabezas; Mauricio Hanzich; Miquel Pericàs; Felix Rubio; Isaac Gelado; Muhammad Shafiq; Enric Morancho; Nacho Navarro; Eduard Ayguadé; José María Cela; Mateo Valero
Oil and gas companies trust Reverse Time Migration (RTM), the most advanced seismic imaging technique, with crucial decisions on drilling investments. The economic value of the oil reserves that require RTM to be localized is on the order of 10^13 dollars. But RTM requires vast computational power, which has somewhat hindered its practical success. Although accelerator-based architectures deliver enormous computational power, little attention has been devoted to assessing the effort of implementing RTM on them. The aim of this paper is to identify the major limitations imposed by different accelerators during RTM implementations, as well as potential bottlenecks regarding architecture features. Moreover, we suggest a wish list of features that, from our experience, should be included in the next generation of accelerators to cope with the requirements of applications like RTM. We present an RTM algorithm mapping to the IBM Cell/B.E., NVIDIA Tesla and an FPGA platform modeled after the Convey HC-1. All three implementations outperform a traditional processor (Intel Harpertown) in terms of performance (10x), but at the cost of huge development effort, mainly due to immature development frameworks and the lack of well-suited programming models. These results show that accelerators are well-positioned platforms for this kind of workload. Because our RTM implementation is based on an explicit high-order finite-difference scheme, some of the conclusions of this work can be extrapolated to applications with similar numerical schemes, for instance magneto-hydrodynamics or atmospheric flow simulations.
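To make the workload concrete, the following is a minimal sketch of the kind of explicit high-order finite-difference kernel that sits at the core of RTM: second order in time, eighth order in space, for the constant-density acoustic wave equation. It is illustrative only (boundary conditions, source injection and the imaging condition are omitted), and all names are ours, not the paper's.

    /* One time step of an 8th-order-in-space, 2nd-order-in-time acoustic
     * wave-equation update. FD coefficients c[] are assumed already folded
     * with 1/h^2 for a cubic grid of spacing h. */
    #define HALO 4  /* 8th-order scheme: 4-point halo per side */

    void fd_step(int nx, int ny, int nz,
                 const float *prev,      /* wavefield at t-1 */
                 const float *curr,      /* wavefield at t   */
                 float *next,            /* wavefield at t+1 */
                 const float *vel2dt2,   /* v^2 * dt^2 per cell */
                 const float c[HALO + 1])
    {
        size_t plane = (size_t)nx * ny;
        for (int z = HALO; z < nz - HALO; ++z)
            for (int y = HALO; y < ny - HALO; ++y)
                for (int x = HALO; x < nx - HALO; ++x) {
                    size_t i = (size_t)z * plane + (size_t)y * nx + x;
                    float lap = 3.0f * c[0] * curr[i];  /* central point, 3 axes */
                    for (int k = 1; k <= HALO; ++k)
                        lap += c[k] * (curr[i - k]              + curr[i + k]               /* x */
                                     + curr[i - k * (size_t)nx] + curr[i + k * (size_t)nx]  /* y */
                                     + curr[i - k * plane]      + curr[i + k * plane]);     /* z */
                    next[i] = 2.0f * curr[i] - prev[i] + vel2dt2[i] * lap;
                }
    }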
IEEE International Conference on High Performance Computing, Data and Analytics | 2009
Mauricio Araya-Polo; Felix Rubio; Raúl de la Cruz; Mauricio Hanzich; José María Cela; Daniele Paolo Scarpazza
Reverse-Time Migration (RTM) is a state-of-the-art technique in seismic acoustic imaging because of the quality and integrity of the images it provides. Oil and gas companies trust RTM with crucial decisions on multi-million-dollar drilling investments. But RTM requires vastly more computational power than its predecessor techniques, and this has somewhat hindered its practical success. Meanwhile, although multi-core architectures promise to deliver unprecedented computational power, little attention has been devoted to efficiently mapping RTM onto them. In this paper, we present a mapping of the RTM computational kernel to the IBM Cell/B.E. processor that reaches close-to-optimal performance. The kernel proves to be memory-bound, yet it achieves 98% utilization of the peak memory bandwidth. Our Cell/B.E. implementation outperforms a traditional processor (PowerPC 970MP) in terms of performance (a 15.0× speedup) and energy efficiency (a 10.0× increase in delivered GFlops/W). To the best of our knowledge, it is also the fastest RTM implementation available. These results increase the practical usability of RTM, and the RTM-Cell/B.E. combination proves to be a strong competitor in the seismic arena.
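A back-of-the-envelope roofline argument shows why such a kernel is memory-bound. The peak figures and per-point traffic below are illustrative assumptions, not numbers from the paper:

    /* Roofline sketch: attainable performance is capped by arithmetic
     * intensity times memory bandwidth. All figures are hypothetical. */
    #include <stdio.h>

    int main(void) {
        double peak_gflops = 200.0;  /* hypothetical peak compute, GFLOPS */
        double peak_bw     = 25.0;   /* hypothetical memory bandwidth, GB/s */
        /* A 25-point stencil performs roughly 2 flops per neighbor; with
         * imperfect reuse, assume ~6 floats (24 bytes) move per point. */
        double ai = 50.0 / 24.0;                 /* flops per byte */
        double roof = ai * peak_bw;              /* bandwidth-limited ceiling */
        if (roof > peak_gflops) roof = peak_gflops;
        printf("AI = %.2f flops/byte -> attainable %.1f GFLOPS (peak %.1f)\n",
               ai, roof, peak_gflops);           /* ~52 GFLOPS: memory-bound */
        return 0;
    }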
Field-Programmable Technology | 2009
Muhammad Shafiq; Miquel Pericàs; Raúl de la Cruz; Mauricio Araya-Polo; Nacho Navarro; Eduard Ayguadé
3D stencil computations are compute-intensive kernels often appearing in high-performance scientific and engineering applications. The key to efficiency in these memory-bound kernels is full exploitation of data reuse. This paper explores design aspects of 3D stencil implementations that maximize the reuse of all input data on an FPGA architecture. The work focuses on the architectural design of 3D stencils of the form n × (n + 1) × n, where n = {2, 4, 6, 8, ...}. The performance of the architecture is evaluated using two design approaches, "Multi-Volume" and "Single-Volume". When n = 8, the designs achieve a sustained throughput of 55.5 GFLOPS in the "Single-Volume" approach and 103 GFLOPS in the "Multi-Volume" approach in a 100-200 MHz multi-rate implementation on a Virtex-4 LX200 FPGA. This corresponds to a stencil data delivery of 1500 bytes/cycle and 2800 bytes/cycle, respectively. The implementation is analyzed and compared to two CPU cache approaches and to the statically scheduled local stores of the IBM PowerXCell 8i. The FPGA approaches designed here achieve much higher bandwidth despite the FPGA device being the least recent of the chips considered. These numbers show how a custom memory organization can provide large data throughput when implementing 3D stencil kernels.
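A software analogue may help convey the reuse scheme: the FPGA designs keep a window of input planes in on-chip memory so that each word is fetched from external memory only once. The sketch below (our notation, z-axis contribution only) mimics that with a sliding window of 2R+1 planes:

    enum { R = 4 };  /* stencil radius; hypothetical value */

    void sweep_z(int nx, int ny, int nz, const float *vol, float *out,
                 const float c[R + 1])
    {
        size_t plane = (size_t)nx * ny;
        const float *window[2 * R + 1];  /* models on-chip plane buffers */
        for (int z = R; z < nz - R; ++z) {
            /* In hardware only the newest plane is actually loaded; the
             * rest are retained in block RAM from previous iterations. */
            for (int d = -R; d <= R; ++d)
                window[d + R] = vol + (size_t)(z + d) * plane;
            for (int y = 0; y < ny; ++y)
                for (int x = 0; x < nx; ++x) {
                    size_t i = (size_t)y * nx + x;
                    float acc = c[0] * window[R][i];
                    for (int k = 1; k <= R; ++k)
                        acc += c[k] * (window[R - k][i] + window[R + k][i]);
                    out[(size_t)z * plane + i] = acc;  /* z-axis terms only */
                }
        }
    }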
International Conference of the Chilean Computer Science Society | 2009
Javier Cabezas; Mauricio Araya-Polo; Isaac Gelado; Nacho Navarro; Enric Morancho; José María Cela
Partial Differential Equations (PDE) are at the heart of most simulations in many scientific fields, from fluid mechanics to astrophysics. One of the most popular mathematical schemes to solve a PDE is Finite Difference (FD). In this work we map a PDE-FD algorithm called Reverse Time Migration to a GPU using CUDA. This seismic imaging (geophysics) algorithm is widely used in the oil industry. GPUs are natural contenders in the aftermath of the clock race, in particular for High-Performance Computing (HPC). Due to GPU characteristics, the parallelism paradigm shifts from the classical threads plus SIMD to Single Program Multiple Data (SPMD). The NVIDIA GTX 280 implementation outperforms homogeneous CPUs by up to 9x (Intel Harpertown E5420) and up to 14x (IBM PPC 970). These preliminary results confirm that GPUs are a real option for HPC, from performance to programmability.
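As a rough illustration of the SPMD mapping (not the paper's actual kernel, which adds further optimizations such as shared-memory tiling), a 2D grid of CUDA threads can cover the x-y plane while each thread marches along z:

    #define R 4                   /* hypothetical 8th-order halo */
    __constant__ float c[R + 1];  /* FD coefficients (folded with 1/h^2),
                                     uploaded by the host via
                                     cudaMemcpyToSymbol() */

    __global__ void rtm_step(int nx, int ny, int nz,
                             const float *prev, const float *curr,
                             float *next, const float *vel2dt2)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x < R || x >= nx - R || y < R || y >= ny - R) return;

        size_t plane = (size_t)nx * ny;
        for (int z = R; z < nz - R; ++z) {       /* one column per thread */
            size_t i = (size_t)z * plane + (size_t)y * nx + x;
            float lap = 3.0f * c[0] * curr[i];
            for (int k = 1; k <= R; ++k)
                lap += c[k] * (curr[i - k]              + curr[i + k]
                             + curr[i - k * (size_t)nx] + curr[i + k * (size_t)nx]
                             + curr[i - k * plane]      + curr[i + k * plane]);
            next[i] = 2.0f * curr[i] - prev[i] + vel2dt2[i] * lap;
        }
    }
    /* Launch sketch: dim3 block(32, 8), grid((nx+31)/32, (ny+7)/8);
     * rtm_step<<<grid, block>>>(nx, ny, nz, d_prev, d_curr, d_next, d_v2dt2); */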
SEG Technical Program Expanded Abstracts | 2008
Francisco Ortigosa; Mauricio Araya-Polo; Felix Rubio; Mauricio Hanzich; Raúl de la Cruz; José María Cela
Reverse Time Migration (RTM) has become the latest chapter in seismic imaging for geologically complex subsurface areas. In particular, it has proven very useful for the subsalt oil plays of the US Gulf of Mexico. However, RTM cannot be applied extensively due to its extreme computational demand. The recent availability of multi-core processors, both homogeneous and heterogeneous, may provide the required compute power. In this paper, we benchmark an effective RTM algorithm on several HPC platforms to assess the viability of the hardware.
International Conference on Computational Science | 2011
Raúl de la Cruz; Mauricio Araya-Polo
It is crucial to optimize stencil computations, since they are the core (and most computationally demanding segment) of many scientific computing applications, and optimizing them reduces overall execution time. This is not a simple task; in fact, it is lengthy and tedious. It is lengthy because of the large number of stencil optimization combinations to test, which may consume days of computing time, and it is tedious because of the many slightly different code versions to implement. Alternatively, models that predict performance can be built without any actual stencil execution, thus reducing the cumbersome optimization task. Previous works have proposed cache-miss and execution-time models for specific stencil optimizations. Furthermore, most of them have been designed for 2D datasets or stencil sizes that only suit low-order numerical schemes. We propose a flexible and accurate model for a wide range of stencil sizes, up to high-order schemes, that captures the behavior of 3D stencil computations using platform parameters. The model has been tested on a group of representative hardware architectures, using realistic dataset sizes. Our model successfully predicts stencil execution times and cache misses. However, prediction accuracy depends on the platform; for instance, on x86 architectures prediction errors range between 1% and 20%. Therefore, the model is reliable and can help speed up the stencil-optimization process. To that end, other stencil optimization techniques can be added to the model, essentially providing a framework that covers most of the state of the art.
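The paper's model is considerably more detailed (per-cache-level behavior, stencil size, and optimization variants), but a toy predictor of the same general shape, with entirely hypothetical names and parameters, looks like this:

    /* Toy stencil-performance predictor: execution time is bounded by the
     * larger of a compute term and a memory term built from modelled cache
     * misses. Illustrative only, not the paper's model. */
    double predict_time(double points, double flops_per_point,
                        double gflops_rate,    /* sustained compute rate */
                        double est_misses,     /* modelled cache misses  */
                        double miss_penalty_ns)
    {
        double t_comp = points * flops_per_point / (gflops_rate * 1e9);
        double t_mem  = est_misses * miss_penalty_ns * 1e-9;
        return t_comp > t_mem ? t_comp : t_mem;  /* assumes full overlap */
    }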
ACM Transactions on Mathematical Software | 2014
Raúl de la Cruz; Mauricio Araya-Polo
Finite Difference (FD) is a widely used method to solve Partial Differential Equations (PDE). PDEs are the core of many simulations in different scientific fields, such as geophysics and astrophysics. The typical FD solver performs stencil computations over the entire computational domain, thus solving the differential operators. In general terms, a stencil computation consists of a weighted accumulation of the contributions of neighboring points along the Cartesian axes. Therefore, optimizing stencil computations is crucial to reducing application execution time. Stencil computation performance is bounded by two main factors: the memory access pattern and the inefficient reuse of the accessed data. We propose a novel algorithm, named Semi-stencil, that tackles these two problems. The main idea behind this algorithm is to change the way in which the stencil computation progresses within the computational domain. Instead of accessing all required neighbors and adding all their contributions at once, the Semi-stencil algorithm divides the computation into several updates. Each update gathers half of the axis neighbors, partially computing, at the same time, the stencil at a set of closely located points. As Semi-stencil progresses through the domain, the stencil computations are completed on precomputed points. This computation strategy improves the memory access pattern and efficiently reuses the accessed data. Our initial target architecture was the Cell/B.E., where the Semi-stencil on an SPE was 44% faster than the naive stencil implementation. Since then, we have continued our research on emerging multicore architectures in order to assess and extend this work on homogeneous architectures. The experiments presented combine the Semi-stencil strategy with the space- and time-blocking algorithms used in hierarchical memory architectures. Two x86 (Intel Nehalem and AMD Opteron) and two POWER (IBM POWER6 and IBM BG/P) platforms are used as testbeds, where the best improvements for a 25-point stencil range from 1.27× to 1.76×. The results show that this novel strategy is a feasible optimization method which may be integrated into auto-tuning frameworks. Also, since all current architectures are multicore based, we include a brief section presenting scalability results on IBM POWER7-, Intel Xeon-, and MIC-based systems. In a nutshell, the algorithm scales as well as or better than other stencil techniques; for instance, the scalability of Semi-stencil on MIC for a certain test case reached 93.8× over 244 threads.
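A one-dimensional sketch conveys the idea (our notation; the paper treats the full 3D case and interleaves the two phases in a single sweep so that every loaded value feeds two partially computed output points):

    enum { L = 4 };  /* stencil radius */

    /* Semi-stencil for a symmetric stencil of radius L, shown here as two
     * separate phases for clarity. */
    void semi_stencil_1d(int n, const float *in, float *out,
                         const float c[L + 1])
    {
        /* Forward phase: pre-compute the head (right-hand) contribution. */
        for (int i = L; i < n - L; ++i) {
            float head = 0.0f;
            for (int k = 1; k <= L; ++k)
                head += c[k] * in[i + k];
            out[i] = head;                 /* partial result */
        }
        /* Backward phase: add the centre and tail (left-hand) terms. */
        for (int i = L; i < n - L; ++i) {
            float tail = c[0] * in[i];
            for (int k = 1; k <= L; ++k)
                tail += c[k] * in[i - k];
            out[i] += tail;                /* stencil now complete at i */
        }
    }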
SEG Technical Program Expanded Abstracts | 2008
Anne-Cecile Lesage; Hongbo Zhou; Mauricio Araya-Polo; Jose-Maria Cela; Francisco Ortigosa
We propose a 3D Reverse-Time Migration (RTM) using a hybrid Finite Difference (FD) pseudospectral algorithm to solve the two-way acoustic equation. The method mainly consists of forward-backward 2D FFTs in the lateral dimensions (x-y plane) and a 1D FD scheme in the depth dimension. This algorithm attains high-order accuracy and simplifies the computation of cross derivatives. Our RTM can therefore deal with 3D isotropic media, VTI media (Zhou et al., 2006b) and 3D TTI media. The 3D TTI case relies on a new anisotropic wave-equation system (Lesage et al., 2008), which extends the 3D VTI (Zhou et al., 2006b) and 2D TTI (Zhou et al., 2006a) equations. This system is based on the combination of two 3D rotations, which permits us to deduce the 3D TTI equations from the 3D VTI equations. In this work, we recall the formulation for 3D isotropic, 3D VTI and 3D TTI media, and we describe the implementation of the method to solve the proposed 3D TTI equations. To validate our proposal, we carry out impulse-response experiments for modeling and migration.
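In outline (our rendering, shown for the isotropic constant-velocity case for brevity), the hybrid operator evaluates the lateral derivatives spectrally and the depth derivative by finite differences, inside a standard second-order time stepper:

    \[
      (\partial_{xx} + \partial_{yy})\,u
        = \mathcal{F}^{-1}_{xy}\!\left[-(k_x^2 + k_y^2)\,\mathcal{F}_{xy}[u]\right],
      \qquad
      \partial_{zz} u \;\approx\; \frac{1}{\Delta z^{2}}
        \sum_{k=-\ell}^{\ell} a_k\, u(x, y, z + k\,\Delta z),
    \]
    \[
      u^{n+1} = 2u^{n} - u^{n-1} + \Delta t^{2}\, v^{2}\,
                \left(\partial_{xx} + \partial_{yy} + \partial_{zz}\right) u^{n}.
    \]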
Irregular Applications: Architectures and Algorithms | 2013
Cheng Wang; Mauricio Araya-Polo; Sunita Chandrasekaran; Amik St-Cyr; Barbara M. Chapman; Detlef Hohl
The Fast Fourier Transform (FFT) is a widely used numerical algorithm. When N input data points lead to only k << N non-zero coefficients in the transformed domain, the algorithm is clearly inefficient: the FFT performs O(N log N) operations on N input data points in order to calculate only k non-zero or large coefficients and N − k zero or negligibly small ones. The recently developed sparse FFT (sFFT) algorithm provides a solution to this problem. Like the FFT, sFFT algorithms are complex and still computationally challenging; the difficulties are mainly due to memory access patterns that are irregular and dynamically changing. Modern compute platforms are exclusively based on multi-core processors, so a natural path to enhancing the sFFT's performance is to exploit parallelism. This is the approach chosen in this work. We have analyzed in detail and parallelized the most time-consuming segments of the algorithm. Our parallel sFFT (PsFFT) implementation achieves approximately 60% parallel efficiency on a single 8-core Intel Sandy Bridge socket for relevant test cases. In addition, we apply techniques such as index coalescing, data-affiliated loops and multi-level blocking to alleviate memory access congestion and increase performance.
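To see the scale of the waste, compare operation counts for an example with N = 2^22 samples and k = 100 significant coefficients (our illustrative arithmetic; the exact sFFT bound depends on the algorithm variant used):

    \[
      \text{FFT: } O(N \log N) \approx 2^{22} \cdot 22 \approx 9.2 \times 10^{7}
      \qquad\text{vs.}\qquad
      \text{sFFT: } O\!\big(k \log N \,\log(N/k)\big)
        \approx 100 \cdot 22 \cdot 15 \approx 3.3 \times 10^{4}.
    \]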
IEEE International Symposium on Parallel & Distributed Processing, Workshops and PhD Forum | 2013
Rengan Xu; Mauricio Araya-Polo; Barbara M. Chapman
The growing speed gap between CPU and memory makes I/O the main bottleneck of many industrial applications. Some applications must frequently perform I/O operations on very large volumes of data, which seriously harms performance. This work is motivated by geophysical applications used for oil and gas exploration, which process terabyte-size datasets in HPC facilities. The datasets represent subsurface models and field-recorded data. In general terms, these applications read huge amounts of data as input and write huge amounts as intermediate or final results, where the underlying algorithms implement seismic imaging techniques. Traditional sequential I/O, even when coupled with advanced storage systems, cannot complete all I/O operations for such large data volumes in an acceptable time. Parallel I/O is the general strategy to solve such problems. However, because of the dynamic nature of many of these applications, each parallel process does not know the size of the data it needs to write until its computation is done, nor can it identify the position in the file to write to. In order to write correctly and efficiently, communication and synchronization are required among all processes to fully exploit the parallel I/O paradigm. To tackle these issues, we use a dynamic load-balancing framework that is general enough for most of these applications. To reduce the expensive synchronization and communication overhead, we introduce an I/O node that handles only I/O requests, letting the compute nodes perform I/O operations in parallel. Using both POSIX I/O and memory-mapping interfaces, our experiments indicate that the approach is scalable: with 16 processes, the parallel-read bandwidth reaches the theoretical peak (2.5 GB/s) of the storage infrastructure, and parallel writing is up to 4.68× (POSIX I/O) and 7.23× (memory mapping) faster than the serial I/O implementation. Since most geophysical applications are I/O bound, these results positively impact overall application performance and confirm the chosen strategy as a path to follow.
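The offset-coordination problem the abstract describes can be sketched in a few lines: since each process learns its output size only after computing, non-overlapping file offsets must be agreed upon before anyone writes. One plausible realization (our code, assuming an initialized MPI program and the POSIX I/O path; the paper's framework additionally uses a dedicated I/O node and a memory-mapped variant):

    #include <mpi.h>
    #include <fcntl.h>
    #include <unistd.h>

    /* Each rank writes my_bytes of buf to a shared file at an offset equal
     * to the sum of the sizes of all lower-ranked processes. */
    void write_my_chunk(const char *path, const void *buf, long my_bytes)
    {
        long offset = 0;
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        /* Exclusive prefix sum of per-rank sizes yields disjoint offsets. */
        MPI_Exscan(&my_bytes, &offset, 1, MPI_LONG, MPI_SUM, MPI_COMM_WORLD);
        if (rank == 0) offset = 0;   /* MPI_Exscan leaves rank 0 undefined */

        int fd = open(path, O_WRONLY | O_CREAT, 0644);
        pwrite(fd, buf, (size_t)my_bytes, (off_t)offset);  /* independent writes */
        close(fd);
    }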