Amik St-Cyr
Royal Dutch Shell
Publications
Featured research published by Amik St-Cyr.
Journal of Computational Physics | 2012
Natasha Flyer; Erik Lehto; Sébastien Blaise; Grady B. Wright; Amik St-Cyr
The current paper establishes the computational efficiency and accuracy of the RBF-FD method for large-scale geoscience modeling, with comparisons to state-of-the-art methods such as high-order discontinuous Galerkin and spherical harmonics, the latter using expansions with close to 300,000 bases. The test cases are demanding fluid-flow problems on the sphere that exhibit numerical challenges such as Gibbs phenomena, sharp gradients, and complex vortical dynamics with rapid energy transfer from large to small scales over short time periods. The computations were made both feasible and highly competitive by the use of hyperviscosity on large RBF stencils (corresponding roughly to 6th- to 9th-order methods) with up to O(10^5) nodes on the sphere. The RBF-FD method scaled as O(N) per time step, where N is the total number of nodes on the sphere. Appendix A gives guidelines on how to choose parameters when using RBF-FD to solve hyperbolic PDEs.
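To fix ideas, here is a minimal sketch of how RBF-FD differentiation weights are computed on a single stencil, in one dimension with Gaussian RBFs. This is illustrative only: the paper's operators live on the sphere, are augmented with hyperviscosity, and node sets and shape parameters are chosen as described in its Appendix A; the function name and parameter values below are this sketch's own.

```python
# Minimal RBF-FD sketch (illustrative; not the paper's spherical, hyperviscous setup).
import numpy as np

def rbf_fd_weights(x_stencil, x_center, eps=3.0):
    """Weights w so that sum_j w[j] * u(x_j) ~ u'(x_center),
    using Gaussian RBFs phi(r) = exp(-(eps*r)^2)."""
    x = np.asarray(x_stencil, dtype=float)
    # Interpolation matrix: A[k, j] = phi(|x_j - x_k|)
    A = np.exp(-(eps * (x[:, None] - x[None, :]))**2)
    # Right-hand side: d/dx phi(|x - x_k|) evaluated at x = x_center
    b = -2.0 * eps**2 * (x_center - x) * np.exp(-(eps * (x_center - x))**2)
    return np.linalg.solve(A, b)

# Example: 5-node stencil approximating d/dx of sin(x) at x = 0.3
nodes = 0.3 + 0.1 * np.array([-2, -1, 0, 1, 2])
w = rbf_fd_weights(nodes, 0.3)
print(w @ np.sin(nodes), np.cos(0.3))  # RBF-FD approximation vs exact derivative
```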
SIAM Journal on Scientific Computing | 2012
Olivier Dubois; Martin J. Gander; Sébastien Loisel; Amik St-Cyr; Daniel B. Szyld
Optimized Schwarz methods (OSMs) use Robin transmission conditions across the subdomain interfaces; the Robin parameter can then be optimized to obtain the fastest convergence. A new formulation with a coarse-grid correction is presented. The optimal parameter is computed for a model problem on a cylinder, together with the corresponding convergence factor, which is smaller than that of classical Schwarz methods. A new coarse space suitable for OSMs is presented. Numerical experiments illustrating the effectiveness of OSM with a coarse-grid correction, both as an iteration and as a preconditioner, are reported.
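As a sketch of the Robin transmission mechanism behind OSMs (without the paper's coarse-grid correction, optimized parameter, or cylinder geometry), the following solves -u'' = 1 on (0, 1) with two non-overlapping subdomains exchanging Robin traces; the parameter p is hand-picked here rather than optimized, and the first-order interface stencils are a simplification.

```python
# 1D optimized-Schwarz sketch: two subdomains of -u'' = 1, u(0) = u(1) = 0,
# coupled through Robin conditions du/dn + p*u = g at the interface x = a.
import numpy as np

m, a, p = 50, 0.5, 10.0          # nodes per subdomain, interface, Robin parameter
h = a / m

def solve_subdomain(g, side):
    """Dirichlet outer boundary, Robin interface condition (n = outward normal)."""
    n = m + 1
    A, b = np.zeros((n, n)), np.ones(n)          # rhs f = 1
    for i in range(1, n - 1):                    # interior second differences
        A[i, i-1], A[i, i], A[i, i+1] = -1/h**2, 2/h**2, -1/h**2
    if side == "left":                           # u(0) = 0; at x = a: du/dx + p*u = g
        A[0, 0], b[0] = 1.0, 0.0
        A[-1, -1], A[-1, -2], b[-1] = 1/h + p, -1/h, g
    else:                                        # at x = a: -du/dx + p*u = g; u(1) = 0
        A[0, 0], A[0, 1], b[0] = 1/h + p, -1/h, g
        A[-1, -1], b[-1] = 1.0, 0.0
    return np.linalg.solve(A, b)

g1 = g2 = 0.0
for it in range(100):
    u1 = solve_subdomain(g1, "left")
    u2 = solve_subdomain(g2, "right")
    # Exchange Robin traces: each new g uses the neighbour's latest solution.
    g1_new = (u2[1] - u2[0]) / h + p * u2[0]       #  du2/dx + p*u2 at x = a
    g2_new = -(u1[-1] - u1[-2]) / h + p * u1[-1]   # -du1/dx + p*u1 at x = a
    if abs(u1[-1] - u2[0]) < 1e-10:                # interface values agree: converged
        break
    g1, g2 = g1_new, g2_new

# Iterations used, and interface value vs exact u(0.5) = 1/8
# (O(h) discrepancy from the first-order one-sided interface stencils).
print(it, abs(u1[-1] - 0.125))
```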
Geophysical Journal International | 2015
Axel Modave; Amik St-Cyr; Wim A. Mulder; Tim Warburton
Improving both the accuracy and the computational performance of numerical tools is a major challenge for seismic imaging, and generally requires specialized implementations to make full use of modern parallel architectures. We present a computational strategy for reverse-time migration (RTM) on accelerator-aided clusters. A new imaging condition computed from the pressure and velocity fields is introduced. The solver is based on a high-order discontinuous Galerkin time-domain (DGTD) method for the pressure–velocity system with unstructured meshes and multirate local time stepping. We adopted the MPI+X approach for distributed programming, where X is a threaded programming model; in this work we chose OCCA, a unified framework that makes use of the major multithreading languages (e.g. CUDA and OpenCL) and offers the flexibility to run on several hardware architectures. DGTD schemes are well suited to efficient computation with accelerators thanks to their localized element-to-element coupling and the dense algebraic operations required for each element. Moreover, compared to high-order finite-difference schemes, the thin halo inherent to DGTD methods reduces both the amount of data exchanged between MPI processes and the storage required by RTM procedures. The amount of data recorded during simulation is further reduced by storing only boundary values in memory, rather than full wavefields on disk, and recreating the forward wavefields. Computational results indicate that these methods exhibit strong scalability up to at least 32 GPUs for a three-dimensional RTM case.
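For orientation, the sketch below shows the conventional zero-lag cross-correlation imaging condition commonly used in RTM codes; the paper's new condition, built from both pressure and velocity fields, refines this idea and is not reproduced here. The function name and toy data are this sketch's own.

```python
# Conventional zero-lag cross-correlation imaging condition (pressure-only form,
# shown only to fix ideas; the paper introduces a pressure-velocity variant).
import numpy as np

def crosscorr_image(forward, adjoint):
    """forward, adjoint: wavefield snapshots of shape (nt, nx, nz).
    Returns the image I(x) = sum over t of p_fwd(x, t) * p_adj(x, t)."""
    return np.einsum('txz,txz->xz', forward, adjoint)

# Toy usage with random arrays standing in for simulated wavefields.
rng = np.random.default_rng(0)
p_fwd = rng.standard_normal((100, 64, 64))
p_adj = rng.standard_normal((100, 64, 64))
image = crosscorr_image(p_fwd, p_adj)   # shape (64, 64)
```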
Computers & Geosciences | 2016
Axel Modave; Amik St-Cyr; Tim Warburton
Finite-element schemes based on discontinuous Galerkin methods possess features amenable to massively parallel computing accelerated with general-purpose graphics processing units (GPUs). However, the computational performance of such schemes strongly depends on their implementation. In the past, several implementation strategies have been proposed: they are based either exclusively on specialized compute kernels tuned for each operation, or they leverage BLAS libraries that provide optimized routines for basic linear-algebra operations. In this paper, we present and analyze up-to-date performance results for different implementations, tested in a unified framework on a single NVIDIA GTX 980 GPU. We show that specialized kernels written with a one-node-per-thread strategy are competitive for polynomial bases up to the fifth and seventh degrees for acoustic and elastic models, respectively. For higher degrees, a strategy that makes use of the NVIDIA cuBLAS library provides better results, reaching a net arithmetic throughput of 35.7% of the theoretical peak value.
Highlights: Several GPU implementations for time-domain wave simulations are compared. The numerical schemes are based on a high-order discontinuous finite-element method. The implementations are profiled using the roofline model to highlight bottlenecks. The best implementation depends on the polynomial degree of the basis functions.
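The appeal of the BLAS strategy can be seen from the usual nodal-DG data layout: every element applies the same dense reference-element operator, so stacking elements column-wise turns a volume kernel into one large matrix-matrix product. A minimal numpy sketch (numpy's GEMM standing in for cuBLAS, with placeholder data):

```python
# Why a BLAS-based DG implementation works: applying a dense reference-element
# operator to all elements at once is a single GEMM.
# Np = nodes per element, K = number of elements.
import numpy as np

Np, K = 56, 10000                 # e.g. degree-5 tetrahedra have Np = 56 nodes
Dr = np.random.rand(Np, Np)       # dense reference derivative matrix (placeholder)
u = np.random.rand(Np, K)         # nodal values, one column per element

du_dr = Dr @ u                    # one GEMM applies the operator to every element
```

At low polynomial degree the matrices are small and kernel-launch and memory traffic dominate, which is why the hand-tuned one-node-per-thread kernels win there.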
Irregular Applications: Architectures and Algorithms | 2013
Cheng Wang; Mauricio Araya-Polo; Sunita Chandrasekaran; Amik St-Cyr; Barbara M. Chapman; Detlef Hohl
The Fast Fourier Transform (FFT) is a widely used numerical algorithm. When N input data points lead to only k << N non-zero coefficients in the transformed domain, the algorithm is clearly inefficient: the FFT performs O(N log N) operations on N input data points in order to calculate only k non-zero or large coefficients, along with N - k zero or negligibly small ones. The recently developed sparse FFT (sFFT) algorithm provides a solution to this problem. Like FFT algorithms, sFFT algorithms are complex and still computationally challenging; the difficulties stem mainly from memory access patterns that are irregular and dynamically changing. Modern compute platforms are based on multi-core processors, so a natural path to enhancing sFFT performance is to exploit parallelism, which is the approach chosen in this work. We have analyzed in detail and parallelized the most time-consuming segments of the algorithm. Our parallel sFFT (PsFFT) implementation achieves approximately 60% parallel efficiency on a single 8-core Intel Sandy Bridge socket for relevant test cases. In addition, we apply techniques such as index coalescing, data-affiliated loops, and multi-level blocking to alleviate memory access congestion and increase performance.
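One building block behind sFFT-style algorithms is aliasing-based bucketization: folding the signal in time lets a small FFT evaluate a uniform subsample of the large spectrum cheaply. The sketch below shows only this identity, not the full algorithm with its filters, hashing, and estimation loop.

```python
# Bucketization building block of sparse-FFT algorithms: folding the signal
# aliases the length-N spectrum onto B buckets, so a B-point FFT evaluates
# X[0], X[N/B], X[2N/B], ... in O(N + B log B) work instead of O(N log N).
import numpy as np

N, B = 2**20, 2**6                     # signal length and bucket count (B divides N)
x = np.random.rand(N)

y = x.reshape(N // B, B).sum(axis=0)   # fold: y[j] = sum over i of x[j + i*B]
Y = np.fft.fft(y)                      # small B-point FFT of the folded signal

X = np.fft.fft(x)                      # full FFT, computed here only to verify
assert np.allclose(Y, X[:: N // B])    # Y[k] == X[k * N / B]
```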
arXiv: Numerical Analysis | 2015
David Medina; Amik St-Cyr; Timothy Warburton
High-order finite-difference methods are commonly used in wave propagators for industrial subsurface imaging algorithms. Computational aspects of the reduced linear-elastic vertical transversely isotropic (VTI) propagator are considered. Thread-parallel algorithms suitable for implementing this propagator on multi-core and many-core processing devices are introduced. Portability is addressed through the use of the OCCA runtime programming interface. Finally, performance results are shown for various architectures on a representative synthetic test case.
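As a sketch of the kind of kernel such propagators are built from, here is an eighth-order centered second-derivative stencil (the standard centered-difference coefficients) applied along one grid line; the actual propagator applies 3D anisotropic variants of this through OCCA kernels rather than numpy.

```python
# Core stencil of a high-order FD wave propagator: eighth-order centered
# approximation of d2u/dx2, with the standard coefficients.
import numpy as np

c = np.array([-1/560, 8/315, -1/5, 8/5, -205/72, 8/5, -1/5, 8/315, -1/560])

def d2_8th(u, h):
    """Eighth-order second derivative of u sampled with spacing h."""
    return np.convolve(u, c, mode='valid') / h**2   # c is symmetric, so this is correlation

x = np.linspace(0, 2*np.pi, 1000)
u = np.sin(x)
err = np.max(np.abs(d2_8th(u, x[1] - x[0]) + np.sin(x[4:-4])))
print(err)   # tiny (roundoff-dominated here); truncation error scales as h**8
```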
International Parallel and Distributed Processing Symposium | 2014
Wei Ding; Ligang Lu; Mauricio Araya-Polo; Amik St-Cyr; Detlef Hohl; Barbara M. Chapman
Graphics processing units (GPUs) have been increasingly adopted by the high-performance computing community. Their hardware architecture supports hundreds or thousands of lightweight threads in a more power-efficient manner than traditional CPUs, with higher overall performance, which motivates porting highly parallel applications to GPUs. Programming GPUs is not a trivial task, in particular for programmers familiar with x86-like architectures. CUDA and OpenCL are two low-level programming APIs designed to ease GPU programming. Unfortunately, the resulting GPU codes depart greatly from traditional codes in both syntax and structure, making them hard to maintain. In order to keep the original code structure, directive-based programming models have been developed (OpenACC, HMPP, etc.). In such programming models, the code is augmented with directives (as when using OpenMP) that guide the compiler to generate CUDA/OpenCL code automatically. To optimize performance, code restructuring is still needed to make full and specific use of GPU hardware advantages, e.g. GPU shared memory. In this paper, we explore various directive-based approaches to port a well-known oil and gas industry algorithm, reverse time migration (RTM), to GPUs, balancing code portability against performance maximization. Our HMPP implementation achieves 85% of the performance of the highly optimized CUDA version available at the time of this work, in the summer of 2013.
Geophysics | 2015
Bradley Martin; Bengt Fornberg; Amik St-Cyr
SEG Technical Program Expanded Abstracts | 2014
Bradley Martin; Bengt Fornberg; Amik St-Cyr; Natasha Flyer
Archive | 2015
Axel Modave; Amik St-Cyr; Wim A. Mulder; Tim Warburton