Network


Latest external collaborations at the country level. Dive into the details by clicking on the dots.

Hotspot


Dive into the research topics where Philippe Thierry is active.

Publication


Featured research published by Philippe Thierry.


High Performance Parallelism Pearls: Multicore and Many-Core Programming Approaches | 2015

Characterization and Optimization Methodology Applied to Stencil Computations

Cedric Andreolli; Philippe Thierry; Leonardo Borges; Gregg Skinner; Chuck Yount

This chapter describes the characterization and optimization methodology applied to a 3D finite-differences (3DFD) algorithm used to solve the constant- or variable-density isotropic acoustic wave equation (Iso-3DFD).
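For context, a minimal sketch of an Iso-3DFD-style time step (2nd order in time, 8th order in space, constant density) is given below. The array names, padding convention and the assumption of unit grid spacing are illustrative only and do not come from the chapter itself.

```cpp
#include <cstddef>

constexpr int R = 4;                       // spatial stencil radius (8th order)
// Standard 8th-order central-difference coefficients for the second derivative,
// assuming unit grid spacing (1/h^2 would otherwise be folded in).
constexpr float c[R + 1] = {-2.847222222f, 1.6f, -0.2f,
                            0.025396825f, -0.001785714f};

// prev/curr/next: wavefield at t-dt, t, t+dt; vel2dt2 holds v^2 * dt^2.
// All arrays are nx*ny*nz, padded so that i,j,k +/- R stay in bounds.
void iso3dfd_step(const float* prev, const float* curr, float* next,
                  const float* vel2dt2, int nx, int ny, int nz)
{
    auto idx = [=](int i, int j, int k) -> std::size_t {
        return (std::size_t(k) * ny + j) * nx + i;
    };
    for (int k = R; k < nz - R; ++k)
        for (int j = R; j < ny - R; ++j)
            for (int i = R; i < nx - R; ++i) {
                // Discrete Laplacian: one central term per axis plus R neighbours each side.
                float lap = 3.0f * c[0] * curr[idx(i, j, k)];
                for (int r = 1; r <= R; ++r)
                    lap += c[r] * (curr[idx(i - r, j, k)] + curr[idx(i + r, j, k)] +
                                   curr[idx(i, j - r, k)] + curr[idx(i, j + r, k)] +
                                   curr[idx(i, j, k - r)] + curr[idx(i, j, k + r)]);
                // Leapfrog update of the acoustic wave equation.
                next[idx(i, j, k)] = 2.0f * curr[idx(i, j, k)]
                                   - prev[idx(i, j, k)]
                                   + vel2dt2[idx(i, j, k)] * lap;
            }
}
```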


Geophysics | 2010

High-performance 3D first-arrival traveltime tomography

Mark Noble; Philippe Thierry; Cédric Taillandier; Henri Calandra

The determination of the correct velocity structure of the near surface is a crucial step in seismic data processing and depth imaging. Generally, first-arrival traveltime tomography based on refraction data or diving waves is used to assess a velocity model of the subsurface that best explains the data. Such first-arrival traveltime tomography algorithms are very attractive for land data processing because early events in the seismic records are very often dominated by noise, and reflected events are very difficult or even impossible to identify. On the other hand, first arrivals can generally be identified quite clearly and are very often the only data available to reconstruct the near-surface velocity structure.


ieee international conference on high performance computing data and analytics | 2014

Genetic Algorithm Based Auto-Tuning of Seismic Applications on Multi and Manycore Computers

C. Andreolli; Philippe Thierry; L. Borges; C. Yount; G. Skinner

Complex computer systems exhibit so many different characteristics that the best parameter choice becomes impossible to define. The range of parameters impacting performance (manual tuning choices, the influence of the domain decomposition, compiler capabilities and hardware effects) is too large to be explored by simple trial and error. Auto-tuning therefore appears as an elegant solution to optimize source code before compilation, using different compiler flags, or at run time by tuning the input parameters. Starting from a basic implementation of a 3D finite-differences kernel, we first describe the methodology used to estimate the best performance an algorithm can deliver. To get close to this theoretically achievable performance, we present several tuning steps, from the basic version up to a full intrinsics implementation, in order to improve parallelism, vectorization and data locality. Then, to find the best set of parameters, we introduce an auto-tuning methodology based on a genetic-algorithm search. We are able to optimize cache-blocking sizes, domain decomposition shapes, prefetching flags and even power consumption, among others. From the unoptimized to the most optimized version, we achieved more than a 6x performance improvement on the E5-2697v2 and almost a 30x improvement on Xeon Phi.
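As an illustration of the kind of search described above, here is a minimal sketch of a genetic algorithm tuning cache-block sizes. The Genome fields, the population handling and the toy benchmark_gflops() stand-in are assumptions for the sketch, not the authors' tool.

```cpp
#include <algorithm>
#include <cmath>
#include <random>
#include <vector>

struct Genome { int bx, by, bz; double fitness; };

// Stand-in for running the real blocked kernel and measuring GFLOP/s;
// this toy model simply favours blocks whose working set is near 256 KiB.
double benchmark_gflops(int bx, int by, int bz)
{
    double bytes = double(bx) * by * bz * sizeof(float);
    return 100.0 / (1.0 + std::abs(bytes - 256.0 * 1024.0) / (256.0 * 1024.0));
}

Genome autotune(int generations, int population, std::mt19937& rng)
{
    std::uniform_int_distribution<int> gene(1, 8);      // block size = 16 * gene
    auto random_genome = [&] {
        return Genome{16 * gene(rng), 16 * gene(rng), 16 * gene(rng), 0.0};
    };

    std::vector<Genome> pop(population);
    for (auto& g : pop) g = random_genome();

    for (int gen = 0; gen < generations; ++gen) {
        for (auto& g : pop) g.fitness = benchmark_gflops(g.bx, g.by, g.bz);
        std::sort(pop.begin(), pop.end(),
                  [](const Genome& a, const Genome& b) { return a.fitness > b.fitness; });

        // Keep the fitter half, rebuild the rest by crossover plus mutation.
        std::uniform_int_distribution<int> parent(0, population / 2 - 1);
        for (int i = population / 2; i < population; ++i) {
            const Genome& a = pop[parent(rng)];
            const Genome& b = pop[parent(rng)];
            pop[i] = Genome{a.bx, b.by, (rng() % 2) ? a.bz : b.bz, 0.0};
            if (rng() % 10 == 0) pop[i] = random_genome();   // occasional mutation
        }
    }
    return pop.front();    // best (bx, by, bz) found
}
```

The same structure extends to other genes mentioned in the abstract, such as domain decomposition shapes or prefetching flags, by widening the genome and the fitness call.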


Geophysics | 2010

Trends for high-performance scientific computing

William J. Camp; Philippe Thierry

Among all scientific domains, geophysics is certainly one of the most computationally demanding, with probably the broadest requirements for performance and scalability.


high performance computing and communications | 2015

Communication-Avoiding Seismic Numerical Kernels on Multicore Processors

Fabrice Dupros; Faiza Boulahya; Hideo Aochi; Philippe Thierry

The finite-difference method is routinely used to simulate seismic wave propagation, both in the oil and gas industry and in strong-motion analysis in seismology. This numerical method also lies at the heart of a significant fraction of numerical solvers in other fields. In terms of computational efficiency, one of the main difficulties is the disadvantageous ratio between the limited pointwise computation and the intensive memory access required, leading to a memory-bound situation. Naive sequential implementations offer poor cache reuse and in general achieve a low fraction of the processor's peak performance. The situation is worse on multicore computing nodes with several levels of memory hierarchy, where each cache miss corresponds to a costly memory access. Additionally, the memory bandwidth available on multicore chips improves slowly relative to the number of computing cores, which induces a dramatic reduction of the expected parallel performance. In this article, we introduce a cache-efficient algorithm for stencil-based computations using a decomposition along both the space and the time directions. We report a maximum speedup of 3.59x over the standard implementation.
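A minimal sketch of the spatial part of such cache blocking (loop tiling over a simple 7-point stencil) is shown below. The block sizes, array layout and stencil coefficients are illustrative, and the paper's space-time decomposition is not reproduced here.

```cpp
#include <algorithm>
#include <cstddef>

// Spatially blocked sweep of a 7-point stencil: each (BX, BY, BZ) tile is
// small enough to stay resident in cache, so neighbour loads mostly hit.
void stencil_blocked(const float* in, float* out,
                     int nx, int ny, int nz,
                     int BX, int BY, int BZ)
{
    auto idx = [=](int i, int j, int k) -> std::size_t {
        return (std::size_t(k) * ny + j) * nx + i;
    };
    for (int kk = 1; kk < nz - 1; kk += BZ)
        for (int jj = 1; jj < ny - 1; jj += BY)
            for (int ii = 1; ii < nx - 1; ii += BX)
                for (int k = kk; k < std::min(kk + BZ, nz - 1); ++k)
                    for (int j = jj; j < std::min(jj + BY, ny - 1); ++j)
                        for (int i = ii; i < std::min(ii + BX, nx - 1); ++i)
                            out[idx(i, j, k)] =
                                0.4f * in[idx(i, j, k)] +
                                0.1f * (in[idx(i - 1, j, k)] + in[idx(i + 1, j, k)] +
                                        in[idx(i, j - 1, k)] + in[idx(i, j + 1, k)] +
                                        in[idx(i, j, k - 1)] + in[idx(i, j, k + 1)]);
}
```

Temporal blocking goes one step further by advancing several time steps inside each tile before moving on, which is where the communication-avoiding gain of the paper comes from.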


ieee international conference on high performance computing data and analytics | 2015

OpenVec Portable SIMD Intrinsics

P. Souza; L. Borges; C. Andreolli; Philippe Thierry

Today, the widest vector units found on a mass-production processor are in the Intel Xeon Phi coprocessor, with its 512-bit vector registers. These vector units have a theoretical single-precision peak performance gain of 16x for single-flop operations. In practice, due to limiting factors such as memory access latency, I/O demand, serial code sections and global synchronization, the real performance improvement is typically much lower. In this work, we present a solution to take advantage of vector units across various processor SIMD architectures with a single, portable source code. This is accomplished by adding a vector type and hardware intrinsics support to the C/C++ language through a header file that is compatible with gcc and commercially available compilers in general. We hide different hardware/compiler feature sets under a common portable programming syntax. In addition, the implementation supports a scalar backend as an alternative to target unknown architectures. This implementation has been successfully demonstrated on multiple SIMD architectures, including Intel SSE/AVX/AVX-512/IMCI, ARM NEON and IBM Power VSX, using only a common header file to enable the compiler to generate highly optimized code with proper SIMD instructions for the given underlying architecture.
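The portable-vector-type idea can be sketched as follows. The pv_float type and pv_* helper names are hypothetical and are not the actual OpenVec API; only an AVX backend and the scalar fallback are shown.

```cpp
#if defined(__AVX__)
  #include <immintrin.h>
  struct pv_float {                        // 8 packed floats on an AVX target
      __m256 v;
      static constexpr int width = 8;
  };
  inline pv_float pv_set1(float s)               { return {_mm256_set1_ps(s)}; }
  inline pv_float pv_load(const float* p)        { return {_mm256_loadu_ps(p)}; }
  inline void     pv_store(float* p, pv_float a) { _mm256_storeu_ps(p, a.v); }
  inline pv_float pv_add(pv_float a, pv_float b) { return {_mm256_add_ps(a.v, b.v)}; }
  inline pv_float pv_mul(pv_float a, pv_float b) { return {_mm256_mul_ps(a.v, b.v)}; }
#else
  struct pv_float {                        // scalar backend for unknown targets
      float v;
      static constexpr int width = 1;
  };
  inline pv_float pv_set1(float s)               { return {s}; }
  inline pv_float pv_load(const float* p)        { return {*p}; }
  inline void     pv_store(float* p, pv_float a) { *p = a.v; }
  inline pv_float pv_add(pv_float a, pv_float b) { return {a.v + b.v}; }
  inline pv_float pv_mul(pv_float a, pv_float b) { return {a.v * b.v}; }
#endif

// One portable loop written against the abstraction: y[i] += a * x[i].
// The same source maps to AVX instructions or to plain scalar code.
inline void pv_axpy(float a, const float* x, float* y, int n)
{
    const pv_float va = pv_set1(a);
    int i = 0;
    for (; i + pv_float::width <= n; i += pv_float::width)
        pv_store(&y[i], pv_add(pv_mul(va, pv_load(&x[i])), pv_load(&y[i])));
    for (; i < n; ++i)                     // scalar remainder
        y[i] += a * x[i];
}
```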


ieee international conference on high performance computing data and analytics | 2014

Speeding-up FWI by One Order of Magnitude

V. Etienne; T. Tonellot; Philippe Thierry; V. Berthoumieux; C. Andreolli

We present several strategies to speed up full waveform inversion via specific optimizations of the time-domain finite-difference modeling. Efficient vectorization of the computations is achieved on the Intel Xeon computing core by modifying the absorbing boundaries. We also propose to increase the computational speed by using high orders in space and by solving the second-order wave equation instead of the first-order formulation. Combined, these strategies allow for a reduction of the computation time by a factor of 27 for modeling with the SEG SEAM II model. Finally, we show that the optimized algorithm has quasi-perfect scalability on one dual-socket computing node.
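For reference, the second-order formulation mentioned above and its standard leapfrog discretization read as follows; this is the textbook form, not necessarily the exact scheme of the paper.

```latex
\frac{\partial^2 p}{\partial t^2} = v^2 \,\nabla^2 p + s
\qquad\Longrightarrow\qquad
p_{i,j,k}^{\,n+1} = 2\,p_{i,j,k}^{\,n} - p_{i,j,k}^{\,n-1}
  + v_{i,j,k}^{2}\,\Delta t^{2}\,\bigl(\nabla_h^{2} p^{\,n}\bigr)_{i,j,k}
  + \Delta t^{2}\, s_{i,j,k}^{\,n}
```

Here \(\nabla_h^2\) is a high-order spatial finite-difference approximation of the Laplacian, which is where the "high orders in space" of the abstract enter.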


ieee international conference on high performance computing data and analytics | 2014

Reverse Time Migration with Heterogeneous Multicore and Manycore Clusters

P. Souza; T. Teixeira; L. Borges; A. Neto; C. Andreolli; Philippe Thierry

In this work we propose a parallel implementation of RTM based on cooperative work between CPUs and coprocessors that proved to be competitive with other accelerated solutions available. This implementation can run with any number of coprocessors (from zero up to the maximum allowed by the computer vendor specifications) and is very scalable in a cluster environment. Because it is based on standard programming models, it will also be portable without modification to any future configuration of Xeon and Xeon Phi, or any X-CPU/Y-CPU combination that supports MPI, OpenMP and the C language. Here we describe our unified programming model for optimized code. We also discuss load balancing of the heterogeneous cluster configuration and validate the performance and scalability of the current implementation. In the current configuration, with 4 Xeon Phi cards with 16 GB of GDDR5 each (64 GB total), we can migrate full shot gathers on a single node. This proposed node configuration also frees memory in the 2-socket host for RTM formulations that might require saving snapshots for cross-correlation and any other auxiliary arrays between iterations of the algorithm.
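One common pattern behind such CPU/coprocessor cooperation is shot-level MPI distribution with threading inside each rank, sketched below. The migrate_shot() kernel, the survey and image sizes and the round-robin assignment are placeholders for the sketch, not the paper's load-balancing scheme.

```cpp
#include <mpi.h>
#include <cstdio>
#include <vector>

// Placeholder kernel: in a real RTM code this is the forward plus backward
// propagation and the imaging condition for one shot.
void migrate_shot(int shot_id, std::vector<float>& image)
{
    #pragma omp parallel for
    for (long i = 0; i < (long)image.size(); ++i)
        image[i] += 1e-6f * shot_id;       // dummy contribution
}

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);
    int rank = 0, size = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int n_shots = 1024;              // illustrative survey size
    const int n_cells = 1 << 20;           // illustrative image size
    std::vector<float> local_image(n_cells, 0.0f);

    // Round-robin shot distribution across MPI ranks; OpenMP threads (and,
    // on an offload-capable node, coprocessor threads) work inside the kernel.
    for (int shot = rank; shot < n_shots; shot += size)
        migrate_shot(shot, local_image);

    // Stack the partial images into the final image on rank 0.
    std::vector<float> image(rank == 0 ? n_cells : 0);
    MPI_Reduce(local_image.data(), rank == 0 ? image.data() : nullptr,
               n_cells, MPI_FLOAT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0) std::printf("stacked %d shots over %d ranks\n", n_shots, size);
    MPI_Finalize();
    return 0;
}
```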


ieee international conference on high performance computing data and analytics | 2018

A Novel Multi-level Integrated Roofline Model Approach for Performance Characterization

Tuomas Koskela; Zakhar Matveev; Charlene Yang; Adetokunbo Adedoyin; Roman Belenov; Philippe Thierry; Zhengji Zhao; Rahulkumar Gayatri; Hongzhang Shan; Leonid Oliker; Jack Deslippe; Ron Green; Samuel Williams

With energy-efficient architectures, including accelerators and many-core processors, gaining traction, application developers face the challenge of optimizing their applications for multiple hardware features, including many-core parallelism, wide vector processing units and on-chip high-bandwidth memory. In this paper, we discuss the development and utilization of a new application performance tool based on an extension of the classical roofline model that simultaneously profiles multiple levels of the cache-memory hierarchy. This tool presents a powerful visual aid for the developer and can be used to frame the many-dimensional optimization problem in a tractable way. We show case studies of real scientific applications that have gained insights from the Integrated Roofline Model.
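The quantity such a tool plots can be sketched with the classical roofline bound evaluated per memory level; the peak FLOP rate, bandwidths and arithmetic intensities below are made-up examples, not measurements from the paper.

```cpp
#include <algorithm>
#include <cstdio>

struct MemLevel { const char* name; double bandwidth_gbs; };

// Attainable GFLOP/s at one level: min(peak compute, arithmetic intensity * bandwidth).
double roofline(double peak_gflops, double ai_flops_per_byte, double bw_gbs)
{
    return std::min(peak_gflops, ai_flops_per_byte * bw_gbs);
}

int main()
{
    const double peak_gflops = 2000.0;                   // example machine peak
    const MemLevel levels[] = {{"L1", 4000.0}, {"L2", 1500.0},
                               {"LLC", 700.0}, {"DRAM", 100.0}};
    // Example kernel: arithmetic intensity measured against each level,
    // e.g. from per-level traffic counters.
    const double ai[] = {0.5, 1.2, 3.0, 8.0};

    for (int i = 0; i < 4; ++i)
        std::printf("%-4s bound: %.0f GFLOP/s\n", levels[i].name,
                    roofline(peak_gflops, ai[i], levels[i].bandwidth_gbs));
    // The tightest of these per-level bounds is what limits the kernel overall.
    return 0;
}
```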


ieee international conference on high performance computing data and analytics | 2017

High-Performance Seismic Modeling with Finite-Difference Using Spatial and Temporal Cache Blocking

V. Etienne; T. Tonellot; T. Malas; H. Ltaief; S. Kortas; Philippe Thierry; D. Keyes

The time-domain finite-difference method (TD-FDM) has been used in geophysics for decades for modeling and imaging. It is used intensively for applications that require accurate solutions for the wave equation such as reverse time migration (RTM) or full waveform inversion (FWI). In this study, we investigate how spatial and temporal cache blocking techniques can speed up computation in TD-FDM on multi-core architectures. We conducted our analysis on the Shaheen II supercomputer at the King Abdullah University of Science and Technology (KAUST) and present the current and achievable performances by using a Cache Aware Roofline Model (CARM). We briefly discuss the implementations and the benefits of spatial and temporal cache blocking techniques individually, and we provide preliminary results, which pave the way for achieving the TD-FDM’s maximum efficiency.

Collaboration


Dive into Philippe Thierry's collaboration.

Top Co-Authors

Adetokunbo Adedoyin
Los Alamos National Laboratory

Charlene Yang
Lawrence Berkeley National Laboratory

Jack Deslippe
Lawrence Berkeley National Laboratory