Publications


Featured research published by Carlo Bertolli.


IBM Journal of Research and Development | 2015

Active Memory Cube: A processing-in-memory architecture for exascale systems

Ravi Nair; Samuel F. Antao; Carlo Bertolli; Pradip Bose; José R. Brunheroto; Tong Chen; Chen-Yong Cher; Carlos H. Andrade Costa; J. Doi; Constantinos Evangelinos; Bruce M. Fleischer; Thomas W. Fox; Diego S. Gallo; Leopold Grinberg; John A. Gunnels; Arpith C. Jacob; P. Jacob; Hans M. Jacobson; Tejas Karkhanis; Choon Young Kim; Jaime H. Moreno; John Kevin Patrick O'Brien; Martin Ohmacht; Yoonho Park; Daniel A. Prener; Bryan S. Rosenburg; Kyung Dong Ryu; Olivier Sallenave; Mauricio J. Serrano; Patrick Siegl

Many studies point to the difficulty of scaling existing computer architectures to meet the needs of an exascale system (i.e., capable of executing 10^18 floating-point operations per second), consuming no more than 20 MW in power, by around the year 2020. This paper outlines a new architecture, the Active Memory Cube, which reduces the energy of computation significantly by performing computation in the memory module, rather than moving data through large memory hierarchies to the processor core. The architecture leverages a commercially demonstrated 3D memory stack called the Hybrid Memory Cube, placing sophisticated computational elements on the logic layer below its stack of dynamic random-access memory (DRAM) dies. The paper also describes an Active Memory Cube tuned to the requirements of a scientific exascale system. The computational elements have a vector architecture and are capable of performing a comprehensive set of floating-point and integer instructions, predicated operations, and gather-scatter accesses across memory in the Cube. The paper outlines the software infrastructure used to develop applications and to evaluate the architecture, and describes results of experiments on application kernels, along with performance and power projections.
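A quick sanity check on the figures quoted in the abstract (an editorial note, not from the paper): a machine delivering 10^18 FLOP/s within the stated 20 MW budget must average

$$ \frac{10^{18}\ \text{FLOP/s}}{2 \times 10^{7}\ \text{W}} = 5 \times 10^{10}\ \text{FLOP/s per watt} = 50\ \text{GFLOPS/W}, $$

which is the efficiency target that motivates performing computation inside the memory module rather than paying the energy cost of moving data to the processor cores.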


IEEE International Conference on High Performance Computing, Data and Analytics | 2012

PyOP2: A High-Level Framework for Performance-Portable Simulations on Unstructured Meshes

Florian Rathgeber; Graham Markall; Lawrence Mitchell; Nicolas Loriant; David A. Ham; Carlo Bertolli; Paul H. J. Kelly

Emerging many-core platforms are very difficult to program in a performance-portable manner whilst achieving high efficiency on a diverse range of architectures. We present work in progress on PyOP2, a high-level embedded domain-specific language for mesh-based simulation codes that executes numerical kernels in parallel over unstructured meshes. Just-in-time kernel compilation and parallel scheduling are delayed until runtime, when problem-specific parameters are available. Using generative metaprogramming, performance portability is achieved, while details of the parallel implementation are abstracted from the programmer. PyOP2 kernels for finite element computations can be generated automatically from equations given in the domain-specific Unified Form Language. Interfacing to the multi-phase CFD code Fluidity through a very thin layer on top of PyOP2 yields a general-purpose finite element solver with an input notation very close to mathematical formulae. Preliminary performance figures show speedups of up to 3.4× compared to Fluidity's built-in solvers when running in parallel.


Proceedings of the 2014 LLVM Compiler Infrastructure in HPC | 2014

Coordinating GPU threads for OpenMP 4.0 in LLVM

Carlo Bertolli; Samuel F. Antao; Alexandre E. Eichenberger; Kevin O'Brien; Zehra Sura; Arpith C. Jacob; Tong Chen; Olivier Sallenave

GPU devices are becoming critical building blocks of high-performance platforms for performance and energy-efficiency reasons. As a consequence, parallel programming environments such as OpenMP were extended to support offloading code to such devices. OpenMP compilers are faced with offering an efficient implementation of device-targeting constructs. One main issue in implementing OpenMP on a GPU is efficiently supporting sequential and parallel regions, as GPUs are only optimized to execute highly parallel workloads. Multiple solutions to this issue were proposed in previous research. In this paper, we propose a method to coordinate threads in an NVIDIA GPU that is both efficient and easily integrated as part of a compiler. To support our claims, we developed CUDA programs that mimic multiple coordination schemes and we compare their performance. We show that a scheme based on dynamic parallelism performs poorly compared to the inspector-executor schemes that we introduce in this paper. We also discuss how to integrate these schemes into the LLVM compiler infrastructure.
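To make the coordination problem concrete, here is a minimal, hypothetical OpenMP 4.0 target region of the kind the abstract alludes to (it is not taken from the paper). The statement between the two parallel loops is sequential, yet the whole region executes on the GPU, so the compiler must arrange for a single device thread to run it while all others wait:

```cpp
#include <cstdio>

int main() {
  const int n = 1024;
  double a[n], b[n];
  double sum = 0.0;

  // The whole region executes on the device (if one is available).
  #pragma omp target map(tofrom: a, b, sum)
  {
    // Parallel region: all device threads cooperate.
    #pragma omp parallel for
    for (int i = 0; i < n; ++i)
      a[i] = i * 0.5;

    // Sequential code inside the target region: the compiler must
    // ensure only one device thread executes this while the others
    // idle or wait -- this is the coordination problem studied above.
    sum = a[0] + a[n - 1];

    #pragma omp parallel for
    for (int i = 0; i < n; ++i)
      b[i] = a[i] + sum;
  }

  std::printf("b[n-1] = %f\n", b[n - 1]);
  return 0;
}
```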


IEEE Transactions on Parallel and Distributed Systems | 2016

Acceleration of a Full-Scale Industrial CFD Application with OP2

I. Z. Reguly; Gihan R. Mudalige; Carlo Bertolli; Michael B. Giles; Adam Betts; Paul H. J. Kelly; David Radford

Hydra is a full-scale industrial CFD application used for the design of turbomachinery at Rolls-Royce plc, capable of performing complex simulations over highly detailed unstructured mesh geometries. Hydra presents major challenges in data organization and movement that need to be overcome for continued high performance on emerging platforms. We present research in achieving this goal through the OP2 domain-specific high-level framework, demonstrating the viability of such a high-level programming approach. OP2 targets the domain of unstructured mesh problems and enables execution on a range of back-end hardware platforms. We chart the conversion of Hydra to OP2 and map out the key difficulties encountered in the process. Specifically, we show how different parallel implementations can be achieved with an active library framework, even for a highly complicated industrial application, and how different optimizations targeting contrasting parallel architectures can be applied to the whole application, seamlessly, reducing developer effort and increasing code longevity. Performance results demonstrate not only that the runtime performance of the hand-tuned original code can be matched, but that it can be significantly improved on both conventional processor systems and many-core systems. Our results provide evidence of how high-level frameworks such as OP2 enable portability across a wide range of contrasting platforms, and of their significant utility in achieving high performance without the intervention of the application programmer.


Parallel Computing | 2013

Design and initial performance of a high-level unstructured mesh framework on heterogeneous parallel systems

Gihan R. Mudalige; Michael B. Giles; Jeyarajan Thiyagalingam; I. Z. Reguly; Carlo Bertolli; Paul H. J. Kelly; Anne E. Trefethen

OP2 is a high-level domain-specific library framework for the solution of unstructured mesh-based applications. It utilizes source-to-source translation and compilation so that a single application code written using the OP2 API can be transformed into multiple parallel implementations for execution on a range of back-end hardware platforms. In this paper we present the design and performance of OP2's recent developments facilitating code generation and execution on distributed-memory heterogeneous systems. OP2 targets the solution of numerical problems based on static unstructured meshes. We discuss the main design issues in parallelizing this class of applications, including handling data dependencies in accessing indirectly referenced data, and design considerations in generating code for execution on a cluster of multi-threaded CPUs and GPUs. Two representative CFD applications, written using the OP2 framework, are utilized to provide a contrasting benchmarking and performance analysis study on a number of heterogeneous systems, including a large-scale Cray XE6 system and a large GPU cluster. A range of performance metrics are benchmarked, including runtime, scalability, achieved compute and bandwidth performance, runtime bottlenecks, and system energy consumption. We demonstrate that an application written once at a high level using OP2 is easily portable across a wide range of contrasting platforms and is capable of achieving near-optimal performance without the intervention of the domain application programmer.
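To illustrate the programming model the abstract describes, here is a minimal sketch in the style of the OP2 C/C++ API; the mesh, sizes, and kernel are invented for this example and not taken from the papers. The access descriptors (OP_READ, OP_INC) together with the edge-to-node mapping are what let OP2 detect the indirect data dependencies mentioned above and schedule conflict-free (colored) parallel execution:

```cpp
#include "op_seq.h"  // OP2 header, assumed available from an OP2 installation

// User kernel: one edge's contribution to its two end nodes.
void res(double* edge_w, double* node_a, double* node_b) {
  *node_a += 0.5 * (*edge_w);
  *node_b += 0.5 * (*edge_w);
}

int main(int argc, char** argv) {
  op_init(argc, argv, 2);

  // Illustrative sizes and host arrays (normally read from a mesh file).
  int    nnode = 4, nedge = 3;
  int    e2n[]  = {0,1, 1,2, 2,3};           // each edge maps to 2 nodes
  double w[]    = {1.0, 2.0, 3.0};
  double acc[]  = {0.0, 0.0, 0.0, 0.0};

  op_set nodes  = op_decl_set(nnode, "nodes");
  op_set edges  = op_decl_set(nedge, "edges");
  op_map e_map  = op_decl_map(edges, nodes, 2, e2n, "edge_to_node");
  op_dat weight = op_decl_dat(edges, 1, "double", w,   "weight");
  op_dat accum  = op_decl_dat(nodes, 1, "double", acc, "accum");

  // Parallel loop over edges; OP_INC on indirectly accessed node data
  // tells OP2 that races are possible, so edges are colored accordingly.
  op_par_loop(res, "res", edges,
              op_arg_dat(weight, -1, OP_ID, 1, "double", OP_READ),
              op_arg_dat(accum,   0, e_map, 1, "double", OP_INC),
              op_arg_dat(accum,   1, e_map, 1, "double", OP_INC));

  op_exit();
  return 0;
}
```

The same source, run through OP2's source-to-source translator, can be regenerated for CUDA, OpenMP, or MPI back-ends; the sequential op_seq.h build shown here is the developer version.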


Computing Frontiers | 2015

Data access optimization in a processing-in-memory system

Zehra Sura; Arpith C. Jacob; Tong Chen; Bryan S. Rosenburg; Olivier Sallenave; Carlo Bertolli; Samuel F. Antao; José R. Brunheroto; Yoonho Park; Kevin O'Brien; Ravi Nair

The Active Memory Cube (AMC) system is a novel heterogeneous computing system concept designed to provide high performance and power efficiency across a range of applications. The AMC architecture includes general-purpose host processors and specially designed in-memory processors (processing lanes) that would be integrated in a logic layer within 3D DRAM memory. The processing lanes have large vector register files but no power-hungry caches or local memory buffers. Performance depends on how well the resulting higher effective memory latency within the AMC can be managed. In this paper, we describe a combination of programming-language features, compiler techniques, operating system interfaces, and hardware design that can effectively hide memory latency for the processing lanes in an AMC system. We present experimental data to show how this approach improves the performance of a set of representative benchmarks important in high-performance computing applications. As a result, we are able to achieve high performance together with power efficiency using the AMC architecture.
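As a generic illustration of the kind of latency hiding a compiler can perform (plain C++, hypothetical, not AMC code), a load can be issued an iteration ahead of its use so that memory latency overlaps with computation, in the spirit of software pipelining:

```cpp
// Scale an array, issuing each load one iteration before it is consumed.
// Requires n >= 1.
void scale(const double* a, double* b, int n, double s) {
  double next = a[0];
  for (int i = 0; i < n - 1; ++i) {
    double cur = next;
    next = a[i + 1];   // load for the next iteration overlaps the multiply below
    b[i] = cur * s;
  }
  b[n - 1] = next * s;
}
```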


IEEE International Conference on High Performance Computing, Data and Analytics | 2015

Performance analysis of OpenMP on a GPU using a CORAL proxy application

Gheorghe-Teodor Bercea; Carlo Bertolli; Samuel F. Antao; Arpith C. Jacob; Alexandre E. Eichenberger; Tong Chen; Zehra Sura; Hyojin Sung; Georgios Rokos; David Appelhans; Kevin O'Brien

OpenMP provides high-level parallel abstractions for programming heterogeneous systems based on acceleration technology. Active areas of research are looking to characterise the performance that can be expected from even the simplest combinations of directives, and how they compare to versions manually implemented and tuned to a specific hardware accelerator. In this paper we analyze the performance of our implementation of the OpenMP 4.0 constructs on an NVIDIA GPU. For performance analysis we use LULESH, a complex proxy application provided by the Department of Energy as part of the CORAL benchmark suite. NVIDIA provides CUDA as a native programming model for GPUs. We compare the performance of an OpenMP 4.0 version of LULESH, obtained from a pre-existing OpenMP implementation, with a functionally equivalent CUDA implementation. Alongside our performance analysis we also present the tuning steps required to obtain good performance when porting existing applications to a new accelerator architecture. Based on the analysis of the performance characteristics of our application, we present an extension to the compiler code-synthesis process for combined OpenMP 4.0 offloading directives. The results obtained using our OpenMP compilation toolchain come within 10% of native CUDA C/C++ performance for application kernels with low register counts.
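The "combined offloading directives" mentioned above collapse the device, team, and thread levels into a single pragma. A minimal illustrative example of such a combined construct (a simple vector update, not a LULESH kernel):

```cpp
#include <cstdio>

int main() {
  const int n = 1 << 20;
  static double x[1 << 20], y[1 << 20];  // static to avoid stack overflow
  for (int i = 0; i < n; ++i) { x[i] = 1.0; y[i] = 2.0; }

  // Combined OpenMP 4.0 construct: offload to the device, create a
  // league of teams, distribute iterations across the teams, and run
  // each team's share with a parallel for. Compilers can map this
  // pattern directly onto a GPU grid of thread blocks.
  #pragma omp target teams distribute parallel for map(to: x) map(tofrom: y)
  for (int i = 0; i < n; ++i)
    y[i] = 3.0 * x[i] + y[i];

  std::printf("y[0] = %f\n", y[0]);
  return 0;
}
```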


International Conference on Autonomic Computing | 2009

Expressing Adaptivity and Context Awareness in the ASSISTANT Programming Model

Carlo Bertolli; Daniele Buono; Gabriele Mencagli; Marco Vanneschi

The Active Memory Cube (AMC) system is a novel heterogeneous computing system concept designed to provide high performance and power-efficiency across a range of applications. The AMC architecture includes general-purpose host processors and specially designed in-memory processors (processing lanes) that would be integrated in a logic layer within 3D DRAM memory. The processing lanes have large vector register files but no power-hungry caches or local memory buffers. Performance depends on how well the resulting higher effective memory latency within the AMC can be managed. In this paper, we describe a combination of programming language features, compiler techniques, operating system interfaces, and hardware design that can effectively hide memory latency for the processing lanes in an AMC system. We present experimental data to show how this approach improves the performance of a set of representative benchmarks important in high performance computing applications. As a result, we are able to achieve high performance together with power efficiency using the AMC architecture.
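ASSISTANT has its own notation, so as a purely hypothetical sketch of the underlying idea (one logical module with several versions selected at run time from the current context; all names invented here, this is not ASSISTANT code), the pattern looks roughly like this in plain C++:

```cpp
#include <functional>
#include <iostream>
#include <map>
#include <string>
#include <vector>

// Context a module might sense: available resources, environment state.
struct Context {
  int num_nodes;
  bool on_battery;
};

using Version = std::function<void(const std::vector<double>&)>;

// Pick the module version best suited to the current context.
Version pick_version(const Context& ctx,
                     const std::map<std::string, Version>& versions) {
  if (ctx.on_battery)    return versions.at("sequential");
  if (ctx.num_nodes > 1) return versions.at("data_parallel");
  return versions.at("task_farm");
}

int main() {
  std::map<std::string, Version> versions = {
    {"sequential",    [](const std::vector<double>&) { std::cout << "seq\n"; }},
    {"data_parallel", [](const std::vector<double>&) { std::cout << "map\n"; }},
    {"task_farm",     [](const std::vector<double>&) { std::cout << "farm\n"; }},
  };
  Context ctx{8, false};  // in a real system the context would be sensed
  pick_version(ctx, versions)(std::vector<double>(16, 1.0));
  return 0;
}
```

ASSISTANT expresses this version set and its switching rules at the language level rather than by hand, which is the point of the paper.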


Proceedings of the Second Workshop on the LLVM Compiler Infrastructure in HPC | 2015

Integrating GPU support for OpenMP offloading directives into Clang

Carlo Bertolli; Samuel F. Antao; Gheorghe-Teodor Bercea; Arpith C. Jacob; Alexandre E. Eichenberger; Tong Chen; Zehra Sura; Hyojin Sung; Georgios Rokos; David Appelhans; Kevin O'Brien



International Supercomputing Conference | 2013

Performance-Portable Finite Element Assembly Using PyOP2 and FEniCS

Graham Markall; Florian Rathgeber; Lawrence Mitchell; Nicolas Loriant; Carlo Bertolli; David A. Ham; Paul H. J. Kelly

