Samuel F. Antao
IBM
Publications
Featured research published by Samuel F. Antao.
IBM Journal of Research and Development | 2015
Ravi Nair; Samuel F. Antao; Carlo Bertolli; Pradip Bose; José R. Brunheroto; Tong Chen; Chen-Yong Cher; Carlos H. Andrade Costa; J. Doi; Constantinos Evangelinos; Bruce M. Fleischer; Thomas W. Fox; Diego S. Gallo; Leopold Grinberg; John A. Gunnels; Arpith C. Jacob; P. Jacob; Hans M. Jacobson; Tejas Karkhanis; Choon Young Kim; Jaime H. Moreno; John Kevin Patrick O'Brien; Martin Ohmacht; Yoonho Park; Daniel A. Prener; Bryan S. Rosenburg; Kyung Dong Ryu; Olivier Sallenave; Mauricio J. Serrano; Patrick Siegl
Many studies point to the difficulty of scaling existing computer architectures to meet the needs of an exascale system (i.e., capable of executing 10^18 floating-point operations per second), consuming no more than 20 MW in power, by around the year 2020. This paper outlines a new architecture, the Active Memory Cube, which reduces the energy of computation significantly by performing computation in the memory module, rather than moving data through large memory hierarchies to the processor core. The architecture leverages a commercially demonstrated 3D memory stack called the Hybrid Memory Cube, placing sophisticated computational elements on the logic layer below its stack of dynamic random-access memory (DRAM) dies. The paper also describes an Active Memory Cube tuned to the requirements of a scientific exascale system. The computational elements have a vector architecture and are capable of performing a comprehensive set of floating-point and integer instructions, predicated operations, and gather-scatter accesses across memory in the Cube. The paper outlines the software infrastructure used to develop applications and to evaluate the architecture, and describes results of experiments on application kernels, along with performance and power projections.
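For context, the two targets quoted above translate into a required machine efficiency of 10^18 FLOP/s ÷ (2 × 10^7 W) = 5 × 10^10 FLOP/s per watt, i.e. about 50 GFLOPS/W, which is the power budget motivating the in-memory approach.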
Proceedings of the 2014 LLVM Compiler Infrastructure in HPC | 2014
Carlo Bertolli; Samuel F. Antao; Alexandre E. Eichenberger; Kevin O'Brien; Zehra Sura; Arpith C. Jacob; Tong Chen; Olivier Sallenave
GPU devices are becoming critical building blocks of high-performance platforms for performance and energy-efficiency reasons. As a consequence, parallel programming environments such as OpenMP have been extended to support offloading code to such devices. OpenMP compilers are faced with offering an efficient implementation of device-targeting constructs. One main issue in implementing OpenMP on a GPU is efficiently supporting sequential and parallel regions, as GPUs are optimized to execute highly parallel workloads. Multiple solutions to this issue were proposed in previous research. In this paper, we propose a method to coordinate threads in an NVIDIA GPU that is both efficient and easily integrated as part of a compiler. To support our claims, we developed CUDA programs that mimic multiple coordination schemes and compare their performance. We show that a scheme based on dynamic parallelism performs poorly compared to the inspector-executor schemes that we introduce in this paper. We also discuss how to integrate these schemes into the LLVM compiler infrastructure.
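A minimal sketch of the kind of OpenMP 4.0 target region this coordination problem arises from (illustrative code, not taken from the paper): the region mixes statements that must run on a single device thread with a loop that should run on all of them.

```cpp
#include <cstdio>

int main() {
  const int n = 1 << 20;
  static double a[1 << 20];
  for (int i = 0; i < n; ++i) a[i] = 1.0;

  double sum = 0.0;
  #pragma omp target map(to: a[0:n]) map(tofrom: sum)
  {
    sum = 0.0;                                   // sequential: one device thread
    #pragma omp parallel for reduction(+: sum)   // parallel: all device threads
    for (int i = 0; i < n; ++i)
      sum += a[i];
    sum *= 0.5;                                  // sequential again
  }
  std::printf("sum = %f\n", sum);
  return 0;
}
```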
Computing Frontiers | 2015
Zehra Sura; Arpith C. Jacob; Tong Chen; Bryan S. Rosenburg; Olivier Sallenave; Carlo Bertolli; Samuel F. Antao; José R. Brunheroto; Yoonho Park; Kevin O'Brien; Ravi Nair
The Active Memory Cube (AMC) system is a novel heterogeneous computing system concept designed to provide high performance and power-efficiency across a range of applications. The AMC architecture includes general-purpose host processors and specially designed in-memory processors (processing lanes) that would be integrated in a logic layer within 3D DRAM memory. The processing lanes have large vector register files but no power-hungry caches or local memory buffers. Performance depends on how well the resulting higher effective memory latency within the AMC can be managed. In this paper, we describe a combination of programming language features, compiler techniques, operating system interfaces, and hardware design that can effectively hide memory latency for the processing lanes in an AMC system. We present experimental data to show how this approach improves the performance of a set of representative benchmarks important in high performance computing applications. As a result, we are able to achieve high performance together with power efficiency using the AMC architecture.
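The latency-hiding techniques themselves are specific to the AMC toolchain, but the underlying principle can be illustrated with a generic software double-buffering sketch (hypothetical host-side code, not AMC code): stage the next block of data while computing on the current one.

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

// Process one resident block (stand-in for a vectorized AMC-lane computation).
static double sum_squares_block(const double* buf, std::size_t n) {
  double s = 0.0;
  for (std::size_t i = 0; i < n; ++i) s += buf[i] * buf[i];
  return s;
}

// Double buffering: while block b is processed, block b+1 is staged into the
// other buffer. On real hardware the staging copy would be an asynchronous
// long-vector load or prefetch; here the overlap is only structural.
double sum_squares(const std::vector<double>& data, std::size_t block) {
  std::vector<double> buf[2] = {std::vector<double>(block), std::vector<double>(block)};
  const std::size_t nblocks = data.size() / block;
  if (nblocks == 0) return 0.0;
  std::memcpy(buf[0].data(), data.data(), block * sizeof(double));
  double total = 0.0;
  for (std::size_t b = 0; b < nblocks; ++b) {
    if (b + 1 < nblocks)  // stage the next block "in the background"
      std::memcpy(buf[(b + 1) & 1].data(), data.data() + (b + 1) * block,
                  block * sizeof(double));
    total += sum_squares_block(buf[b & 1].data(), block);
  }
  return total;
}
```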
IEEE International Conference on High Performance Computing, Data, and Analytics | 2015
Gheorghe-Teodor Bercea; Carlo Bertolli; Samuel F. Antao; Arpith C. Jacob; Alexandre E. Eichenberger; Tong Chen; Zehra Sura; Hyojin Sung; Georgios Rokos; David Appelhans; Kevin O'Brien
OpenMP provides high-level parallel abstractions for programming heterogeneous systems based on acceleration technology. Active areas of research are looking to characterise the performance that can be expected from even the simplest combinations of directives and how they compare to versions manually implemented and tuned to a specific hardware accelerator. In this paper we analyze the performance of our implementation of the OpenMP 4.0 constructs on an NVIDIA GPU. For performance analysis we use LULESH, a complex proxy application provided by the Department of Energy as part of the CORAL benchmark suite. NVIDIA provides CUDA as a native programming model for GPUs. We compare the performance of an OpenMP 4.0 version of LULESH obtained from a pre-existing OpenMP implementation with a functionally equivalent CUDA implementation. Alongside our performance analysis we also present the tuning steps required to obtain good performance when porting existing applications to a new accelerator architecture. Based on the analysis of the performance characteristics of our application we present an extension to the compiler code-synthesis process for combined OpenMP 4.0 offloading directives. The results obtained using our OpenMP compilation toolchain show performance within as low as 10% of native CUDA C/C++ for application kernels with low register counts.
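The combined offloading directives mentioned above collapse target, teams, distribute, and parallel for into a single construct; a minimal illustrative kernel (not actual LULESH code) looks like this:

```cpp
// One combined construct: the compiler can generate a single, fully parallel
// GPU kernel for this loop, with no sequential device code around it.
void add_scaled(double* x, const double* y, double alpha, int n) {
  #pragma omp target teams distribute parallel for \
      map(tofrom: x[0:n]) map(to: y[0:n])
  for (int i = 0; i < n; ++i)
    x[i] += alpha * y[i];
}
```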
Proceedings of the Second Workshop on the LLVM Compiler Infrastructure in HPC | 2015
Carlo Bertolli; Samuel F. Antao; Gheorghe-Teodor Bercea; Arpith C. Jacob; Alexandre E. Eichenberger; Tong Chen; Zehra Sura; Hyojin Sung; Georgios Rokos; David Appelhans; Kevin O'Brien
The LLVM community is currently developing OpenMP 4.1 support, consisting of software improvements for Clang and new runtime libraries. OpenMP 4.1 includes offloading constructs that permit execution of user-selected regions on generic devices, external to the main host processor. This paper describes our ongoing work towards delivering support for OpenMP offloading constructs for the OpenPower system in the LLVM compiler infrastructure. We previously introduced a design for a control loop scheme necessary to implement the OpenMP generic offloading model on NVIDIA GPUs. In this paper we show how we integrated the complexity of the control loop into Clang by limiting its support to OpenMP-related functionality. We also summarize the results of performance analysis on benchmarks and a complex application kernel. We show an optimization in the Clang code generation scheme for specific code patterns, alternative to the control loop, which delivers improved performance.
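The control loop scheme referred to above is, at its core, a master/worker state machine. The following host-side sketch (a std::thread analogy with illustrative names; the generated GPU code uses thread-block synchronization instead) conveys the idea: worker threads spin until the master announces which parallel region to execute.

```cpp
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

enum Region { IDLE, SQUARE, DONE };            // illustrative work descriptors
std::atomic<int> next_region{IDLE};
std::atomic<int> arrived{0};

void worker(int tid, int nthreads, std::vector<double>& a) {
  int r;
  while ((r = next_region.load()) == IDLE) {}  // wait for the master's dispatch
  if (r == DONE) return;
  for (std::size_t i = tid; i < a.size(); i += nthreads)
    a[i] *= a[i];                              // my slice of the parallel region
  ++arrived;                                   // report completion
  while (next_region.load() != DONE) {}        // wait to be released
}

int main() {
  const int nthreads = 4;
  std::vector<double> a(1024, 2.0);
  std::vector<std::thread> pool;
  for (int t = 0; t < nthreads; ++t)
    pool.emplace_back(worker, t, nthreads, std::ref(a));

  a[0] = 3.0;                                  // "sequential" code: master only
  next_region = SQUARE;                        // dispatch the parallel region
  while (arrived.load() != nthreads) {}        // barrier: wait for all workers
  next_region = DONE;                          // release and shut down the pool
  for (auto& t : pool) t.join();
  return 0;
}
```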
2016 Third Workshop on the LLVM Compiler Infrastructure in HPC (LLVM-HPC) | 2016
Samuel F. Antao; Alexey Bataev; Arpith C. Jacob; Gheorghe-Teodor Bercea; Alexandre E. Eichenberger; Georgios Rokos; Matt Martineau; Tian Jin; Guray Ozen; Zehra Sura; Tong Chen; Hyojin Sung; Carlo Bertolli; Kevin O'Brien
OpenMP 4.5 enables performance portability by allowing users to write a single application code and run it on multiple types of accelerators. Our goal is to deliver a high-performance implementation of OpenMP in the Clang/LLVM project. This paper describes our initial work to fully support code generation for OpenMP device offloading constructs. We describe a new driver implementation to handle compilation for multiple host and device types, which generalizes the current Clang CUDA implementation and supports OpenMP. It can also be extended to any offloading-based language, including OpenCL and OpenACC. We describe an implementation of the OpenMP offloading constructs in the runtime library, giving details on two critical aspects: first, how data mapping is implemented; second, how different device code sections in the binaries are handled to enable application execution on different devices without recompilation. We report initial performance on a prototype that extends the current LLVM trunk repositories with all our proposed patches plus future ones, showing near-CUDA performance of our solution.
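The data-mapping machinery described above is what keeps device copies alive across multiple target regions. A hedged sketch of the usage pattern it supports (illustrative names, not code from the paper):

```cpp
// 'target data' allocates/copies device buffers once; the enclosed target
// regions then reuse them, and 'u' is copied back only when the region ends.
void jacobi_sweeps(double* u, double* u_new, int n, int steps) {
  #pragma omp target data map(tofrom: u[0:n]) map(alloc: u_new[0:n])
  {
    for (int s = 0; s < steps; ++s) {
      #pragma omp target teams distribute parallel for
      for (int i = 1; i < n - 1; ++i)
        u_new[i] = 0.5 * (u[i - 1] + u[i + 1]);

      #pragma omp target teams distribute parallel for
      for (int i = 1; i < n - 1; ++i)
        u[i] = u_new[i];
    }
  } // device copy of u is transferred back to the host here
}
```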
International Workshop on OpenMP | 2016
Ian Karlin; Tom Scogland; Arpith C. Jacob; Samuel F. Antao; Gheorghe-Teodor Bercea; Carlo Bertolli; Bronis R. de Supinski; Erik W. Draeger; Alexandre E. Eichenberger; Jim Glosli; Holger E. Jones; Adam Kunen; David Poliakoff; David F. Richards
Many application developers need code that runs efficiently on multiple architectures, but cannot afford to maintain architecture-specific codes. With the addition of target directives to support offload to accelerators, OpenMP now has the machinery to support performance-portable code development. In this paper, we describe application ports of Kripke, Cardioid, and LULESH to OpenMP 4.5 and discuss our successes and failures. Challenges encountered include how OpenMP interacts with C++, including classes with virtual methods and lambda functions. The lack of deep-copy support in OpenMP also increased code complexity, and the GPU's inability to handle virtual function calls required code restructuring. Despite these challenges, we demonstrate that OpenMP obtains performance within 10% of hand-written CUDA for memory-bandwidth-bound kernels in LULESH. In addition, we show that with a minor change to the OpenMP standard, register usage for OpenMP code can be reduced by up to 10%.
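The deep-copy limitation mentioned above shows up whenever a mapped class holds pointers: only the pointer value is copied, not what it points to. A minimal illustration with hypothetical types (not code from Kripke, Cardioid, or LULESH) and the usual workaround of flattening the access:

```cpp
struct Field {
  int n;
  double* v;   // mapping 'Field' alone copies this pointer bit-wise: a dangling host pointer on the device
};

void scale(Field& f, double alpha) {
  // Workaround: hoist the pointer and length into locals and map the array
  // section explicitly, instead of relying on (unsupported) deep copy.
  double* v = f.v;
  int n = f.n;
  #pragma omp target teams distribute parallel for map(tofrom: v[0:n])
  for (int i = 0; i < n; ++i)
    v[i] *= alpha;
}
```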
IEEE Circuits and Systems Magazine | 2016
Leonel Sousa; Samuel F. Antao; P.A.F. Martins
Cryptography plays a major role in assuring security in computation and communication. In particular, public-key cryptography enables the asymmetric ciphering of data along with the authentication of the parties attempting to share data. Asymmetric encryption is computationally costly, which has motivated extensive research into efficiently accelerating the most relevant algorithms and improving resistance against Side-Channel Attacks (SCAs), which exploit features exposed by cryptographic systems, such as power consumption and execution timing, to gain access to private information. Herein, we present a state-of-the-art overview of the use of the Residue Number System (RNS) to exploit parallelism in the computation of the most important public-key algorithms, and we address how it can be exploited to prevent side-channel attacks. The experimental results presented in the literature show that not only the currently used RSA and Elliptic Curve Cryptography (ECC) algorithms but also emerging post-quantum algorithms, namely those supporting Lattice-Based Cryptosystems (LBCs), can take advantage of the RNS. It enables the design of more efficient cryptographic systems and also reinforces the prevention of side-channel attacks, improving their security. Finally, we present the characteristics of the Computing with the Residue Number System (CRNS) framework, which aims to automate the design of fully functional RNS-based cryptographic accelerators.
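The parallelism the RNS offers comes from the Chinese Remainder Theorem: an integer is represented by its residues modulo pairwise coprime moduli, and additions and multiplications then proceed channel by channel with no carries crossing between channels. A toy sketch (tiny illustrative moduli; real RNS cryptosystems use large, carefully chosen bases and per-channel modular reduction):

```cpp
#include <cstddef>
#include <cstdint>
#include <iostream>
#include <vector>

using u64 = std::uint64_t;

// Pairwise coprime moduli; their product (96577) is the dynamic range.
static const std::vector<u64> M = {13, 17, 19, 23};

std::vector<u64> to_rns(u64 x) {
  std::vector<u64> r;
  for (u64 m : M) r.push_back(x % m);
  return r;
}

// Channel-wise multiplication: each residue channel is independent, so the
// channels can be computed in parallel (the source of RNS parallelism).
std::vector<u64> rns_mul(const std::vector<u64>& a, const std::vector<u64>& b) {
  std::vector<u64> c(M.size());
  for (std::size_t i = 0; i < M.size(); ++i) c[i] = (a[i] * b[i]) % M[i];
  return c;
}

// CRT reconstruction by exhaustive search: fine for toy moduli only.
u64 from_rns(const std::vector<u64>& r) {
  u64 range = 1;
  for (u64 m : M) range *= m;
  for (u64 x = 0; x < range; ++x) {
    bool match = true;
    for (std::size_t i = 0; i < M.size(); ++i) match = match && (x % M[i] == r[i]);
    if (match) return x;
  }
  return 0;  // unreachable for consistent residues
}

int main() {
  u64 a = 1234, b = 56;  // the product must stay below the dynamic range
  std::cout << from_rns(rns_mul(to_rns(a), to_rns(b))) << "\n";  // prints 69104
  return 0;
}
```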
IEEE International Conference on High Performance Computing, Data, and Analytics | 2016
Matt J. Martineau; Simon N. McIntosh-Smith; Carlo Bertolli; Arpith C. Jacob; Samuel F. Antao; Alexandre E. Eichenberger; Gheorghe-Teodor Bercea; Tong Chen; Tian Jin; Kevin O'Brien; Georgios Rokos; Hyojin Sung; Zehra Sura
International Workshop on OpenMP | 2015
Arpith C. Jacob; Ravi Nair; Alexandre E. Eichenberger; Samuel F. Antao; Carlo Bertolli; Tong Chen; Zehra Sura; Kevin O’Brien; Michael Wong