Gheorghe-Teodor Bercea

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Gheorghe-Teodor Bercea is active.

Explore More

Publication

Featured researches published by Gheorghe-Teodor Bercea.

ACM Transactions on Mathematical Software | 2017

Firedrake: Automating the Finite Element Method by Composing Abstractions

Florian Rathgeber; David A. Ham; Lawrence Mitchell; Fabio Luporini; Andrew T. T. McRae; Gheorghe-Teodor Bercea; Graham Markall; Paul H. J. Kelly

Firedrake is a new tool for automating the numerical solution of partial differential equations. Firedrake adopts the domain-specific language for the finite element method of the FEniCS project, but with a pure Python runtime-only implementation centred on the composition of several existing and new abstractions for particular aspects of scientific computing. The result is a more complete separation of concerns which eases the incorporation of separate contributions from computer scientists, numerical analysts and application specialists. These contributions may add functionality, or improve performance. Firedrake benefits from automatically applying new optimisations. This includes factorising mixed function spaces, transforming and vectorising inner loops, and intrinsically supporting block matrix operations. Importantly, Firedrake presents a simple public API for escaping the UFL abstraction. This allows users to implement common operations that fall outside pure variational formulations, such as flux-limiters.

ACM Transactions on Architecture and Code Optimization | 2015

Cross-Loop Optimization of Arithmetic Intensity for Finite Element Local Assembly

Fabio Luporini; Ana Lucia Varbanescu; Florian Rathgeber; Gheorghe-Teodor Bercea; J. Ramanujam; David A. Ham; Paul H. J. Kelly

The numerical solution of partial differential equations using the finite element method is one of the key applications of high performance computing. Local assembly is its characteristic operation. This entails the execution of a problem-specific kernel to numerically evaluate an integral for each element in the discretized problem domain. Since the domain size can be huge, executing efficient kernels is fundamental. Their op- timization is, however, a challenging issue. Even though affine loop nests are generally present, the short trip counts and the complexity of mathematical expressions make it hard to determine a single or unique sequence of successful transformations. Therefore, we present the design and systematic evaluation of COF- FEE, a domain-specific compiler for local assembly kernels. COFFEE manipulates abstract syntax trees generated from a high-level domain-specific language for PDEs by introducing domain-aware composable optimizations aimed at improving instruction-level parallelism, especially SIMD vectorization, and register locality. It then generates C code including vector intrinsics. Experiments using a range of finite-element forms of increasing complexity show that significant performance improvement is achieved.

ieee international conference on high performance computing data and analytics | 2015

Performance analysis of OpenMP on a GPU using a CORAL proxy application

Gheorghe-Teodor Bercea; Carlo Bertolli; Samuel F. Antao; Arpith C. Jacob; Alexandre E. Eichenberger; Tong Chen; Zehra Sura; Hyojin Sung; Georgios Rokos; David Appelhans; Kevin O'Brien

OpenMP provides high-level parallel abstractions for programing heterogeneous systems based on acceleration technology. Active areas of research are looking to characterise the performance that can be expected from even the simplest combinations of directives and how they compare to versions manually implemented and tuned to a specific hardware accelerator. In this paper we analyze the performance of our implementation of the OpenMP 4.0 constructs on an NVIDIA GPU. For performance analysis we use LULESH, a complex proxy application provided by the Department of Energy as part of the CORAL benchmark suite. NVIDIA provides CUDA as a native programming model for GPUs. We compare the performance of an OpenMP 4.0 version of LULESH obtained from a pre-existing OpenMP implementation with a functionally equivalent CUDA implementation. Alongside our performance analysis we also present the tuning steps required to obtain good performance when porting existing applications to a new accelerator architecture. Based on the analysis of the performance characteristics of our application we present an extension to the compiler code-synthesis process for combined OpenMP 4.0 offloading directives. The results obtained using our OpenMP compilation toolchain show performance within as low as 10% of native CUDA C/C++ for application kernels with low register counts.

Proceedings of the Second Workshop on the LLVM Compiler Infrastructure in HPC | 2015

Integrating GPU support for OpenMP offloading directives into Clang

Carlo Bertolli; Samuel F. Antao; Gheorghe-Teodor Bercea; Arpith C. Jacob; Alexandre E. Eichenberger; Tong Chen; Zehra Sura; Hyojin Sung; Georgios Rokos; David Appelhans; Kevin O'Brien

The LLVM community is currently developing OpenMP 4.1 support, consisting of software improvements for Clang and new runtime libraries. OpenMP 4.1 includes offloading constructs that permit execution of user selected regions on generic devices, external to the main host processor. This paper describes our ongoing work towards delivering support for OpenMP offloading constructs for the OpenPower system into the LLVM compiler infrastructure. We previously introduced a design for a control loop scheme necessary to implement the OpenMP generic offloading model on NVIDIA GPUs. In this paper we show how we integrated the complexity of the control loop into Clang by limiting its support to OpenMP-related functionality. We also synthetically report the results of performance analysis on benchmarks and a complex application kernel. We show an optimization in the Clang code generation scheme for specific code patterns, alternative to the control loop, which delivers improved performance.

SIAM Journal on Scientific Computing | 2016

Automated generation and symbolic manipulation of tensor product finite elements

Andrew T. T. McRae; Gheorghe-Teodor Bercea; Lawrence Mitchell; David A. Ham; Colin J. Cotter

We describe and implement a symbolic algebra for scalar and vector-valued finite elements, enabling the computer generation of elements with tensor product structure on quadrilateral, hexahedral and triangular prismatic cells. The algebra is implemented as an extension to the domain-specific language UFL, the Unified Form Language. This allows users to construct many finite element spaces beyond those supported by existing software packages. We have made corresponding extensions to FIAT, the FInite element Automatic Tabulator, to enable numerical tabulation of such spaces. This tabulation is consequently used during the automatic generation of low-level code that carries out local assembly operations, within the wider context of solving finite element problems posed over such function spaces. We have done this work within the code-generation pipeline of the software package Firedrake; we make use of the full Firedrake package to present numerical examples.

international workshop on openmp | 2016

Early Experiences Porting Three Applications to OpenMP 4.5

Ian Karlin; Tom Scogland; Arpith C. Jacob; Samuel F. Antao; Gheorghe-Teodor Bercea; Carlo Bertolli; Bronis R. de Supinski; Erik W. Draeger; Alexandre E. Eichenberger; Jim Glosli; Holger E. Jones; Adam Kunen; David Poliakoff; David F. Richards

Many application developers need code that runs efficiently on multiple architectures, but cannot afford to maintain architecturally specific codes. With the addition of target directives to support offload accelerators, OpenMP now has the machinery to support performance portable code development. In this paper, we describe application ports of Kripke, Cardioid, and LULESH to OpenMP 4.5 and discuss our successes and failures. Challenges encountered include how OpenMP interacts with C++ including classes with virtual methods and lambda functions. Also, the lack of deep copy support in OpenMP increased code complexity. Finally, GPUs inability to handle virtual function calls required code restructuring. Despite these challenges we demonstrate OpenMP obtains performance within 10 % of hand written CUDA for memory bandwidth bound kernels in LULESH. In addition, we show with a minor change to the OpenMP standard that register usage for OpenMP code can be reduced by up to 10 %.

international parallel and distributed processing symposium | 2014

Generalizing Run-Time Tiling with the Loop Chain Abstraction

Michelle Mills Strout; Fabio Luporini; Christopher D. Krieger; Carlo Bertolli; Gheorghe-Teodor Bercea; Catherine Olschanowsky; J. Ramanujam; Paul H. J. Kelly

Many scientific applications are organized in a data parallel way: as sequences of parallel and/or reduction loops. This exposes parallelism well, but does not convert data reuse between loops into data locality. This paper focuses on this issue in parallel loops whose loop-to-loop dependence structure is data-dependent due to indirect references such as A[B[i]]. Such references are a common occurrence in sparse matrix computations, molecular dynamics simulations, and unstructured-mesh computational fluid dynamics (CFD). Previously, sparse tiling approaches were developed for individual benchmarks to group iterations across such loops to improve data locality. These approaches were shown to benefit applications such as moldyn, Gauss-Seidel, and the sparse matrix powers kernel, however the run-time routines for performing sparse tiling were hand coded per application. In this paper, we present a generalized full sparse tiling algorithm that uses the newly developed loop chain abstraction as input, improves inter-loop data locality, and creates a task graph to expose shared-memory parallelism at runtime. We evaluate the overhead and performance impact of the generalized full sparse tiling algorithm on two codes: a sparse Jacobi iterative solver and the Airfoil CFD benchmark.

Archive | 2016

PyOP2: Framework for performance-portable parallel computations on unstructured meshes

Florian Rathgeber; Hector Dearman; gbts; Gheorghe-Teodor Bercea; Kaho Sato; Simon W. Funke; Lawrence Mitchell; Miklós Homolya; Francis Russell; Christian T. Jacobs; David A. Ham; Andrew T. T. McRae; Graham Markall; Fabio Luporini

Geoscientific Model Development | 2016

A structure-exploiting numbering algorithm for finite elements on extruded meshes, and its performance evaluation in Firedrake

Gheorghe-Teodor Bercea; Andrew T. T. McRae; David A. Ham; Lawrence Mitchell; Florian Rathgeber; Luigi Nardi; Fabio Luporini; Paul H. J. Kelly

Archive | 2016

firedrake: an automated finite element system

Lawrence Mitchell; Colin J. Cotter; Gheorghe-Teodor Bercea; Asbjørn Nilsen Riseth; Simon W. Funke; Graham Markall; Eike Hermann Mueller; Tuomas Kärnä; Patrick E. Farrell; Geordie McBain; Miklós Homolya; Henrik Büsing; Anna Kalogirou; Christian T. Jacobs; David A. Ham; Andrew T. T. McRae; Florian Rathgeber; Hannah Rittich; Stephan C. Kramer; Fabio Luporini

Explore More