Ruchira Sasanka | Researchain

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Ruchira Sasanka is active.

Explore More

Publication

Featured researches published by Ruchira Sasanka.

Computer Physics Communications | 2017

An efficient and portable SIMD algorithm for charge/current deposition in Particle-In-Cell codes ☆

Henri Vincenti; Mathieu Lobet; R. Lehe; Ruchira Sasanka; Jean-Luc Vay

Author(s): Vincenti, H; Lobet, M; Lehe, R; Sasanka, R; Vay, JL | Abstract:

application-specific systems, architectures, and processors | 2016

Bridging the FPGA programmability-portability Gap via automatic OpenCL code generation and tuning

Konstantinos Krommydas; Ruchira Sasanka; Wu-chun Feng

Programming FPGAs has been an arduous task that requires extensive knowledge of hardware design languages (HDLs), such as Verilog or VHDL, and low-level hardware details. With OpenCL support for FPGAs, the design, prototyping and implementation of an FPGA is increasingly moving towards a much higher level of abstraction, when compared to the intrinsically low-level nature of HDLs. On the other hand, in the context of traditional (i.e., CPU) software development, OpenCL is still considered to be low-level and complex because the programmer needs to manually expose parallelism in the code. In this work, we present our approach to enhancing FPGA programmability via GLAF, a visual programming framework, to automatically generate synthesizable OpenCL code with an array of FPGA-specific optimizations. We find that our tool facilitates the development process and produces functionally correct and well-performing code on the FPGA for our molecular modeling, gene sequence search, and filtering algorithms.

international conference on parallel processing | 2015

GLAF: A Visual Programming and Auto-tuning Framework for Parallel Computing

Konstantinos Krommydas; Ruchira Sasanka; Wu-chun Feng

The past decades computing revolution has delivered parallel hardware to the masses. However, the ability to exploit its capabilities and ignite scientific breakthrough at a proportionate level remains a challenge due to the lack of parallel programming expertise. Although different solutions have been proposed to facilitate harvesting the seeds of parallel computing, most target seasoned programmers and ignore the special nature of a target audience like domain experts. This paper addresses the challenge of realizing a programming abstraction and implementing an integrated development framework for this audience. We present GLAF -- a grid-based language and auto-parallelizing, auto-tuning framework. Its key elements are its intuitive visual programming interface, which attempts to render expressing and validating an algorithm easier for domain experts, and its ability to automatically generate efficient serial and parallel Fortran and C code, including potentially beneficial code modifications (e.g., With respect to data layout). We find that the above features assist novice programmers to avoid common programming pitfalls and provide fast implementations.

international conference on high performance computing and simulation | 2016

Enhancing application performance using heterogeneous memory architectures on a many-core platform

Shuo Li; Karthik Raman; Ruchira Sasanka

The 2nd generation Intel® Xeon Phi processor (codenamed Knights Landing) is Intels first self-booting Xeon Phi processor that is aimed at the HPC market. Like its predecessor, KNL is a many-core, highly threaded processor featuring an innovative on-die mesh interconnect and an on-package high bandwidth memory MCDRAM in addition to DRAM DDR-2400, which makes it possible for many HPC applications to achieve much higher performance by leveraging heterogeneous memory configuration. In this paper, we look at the programming challenges for software developers to create and manipulate data using different memory modes and a heap management API to satisfy the ever-increasing demand for high bandwidth and low latency. We start with a functional KNL architecture introduction with an emphasis on the memory subsystem and memory usage model, followed by the utility tools required to run the applications under various scenarios. We then present a profiler-based heterogeneous memory optimization framework for all memory-bandwidth-intensive applications. The new memory object features in Intel® VTune™ Amplifier will be introduced and discussed. Finally, we show how to leverage different kinds of memory by using a user extensible memory heap management API also known as memkind API. Throughout our discussions, we will use a classic streaming application in quantitative finance, the Black-Scholes benchmark. We show how to highlight the memory bottleneck using the new memory profiling features in the Intel VTune Amplifier and how to achieve high bandwidth by removing the bottleneck and by allocating memory threads between different types of memory. In the end, we show the peak performance we can achieve on KNL by using a combination of MCDRAM and DDR.

ieee international conference on high performance computing, data, and analytics | 2016

Optimizing Excited-State Electronic-Structure Codes for Intel Knights Landing: A Case Study on the BerkeleyGW Software

Jack Deslippe; Felipe H. da Jornada; Derek Vigil-Fowler; Taylor Barnes; Nathan Wichmann; Karthik Raman; Ruchira Sasanka; Steven G. Louie

We profile and optimize calculations performed with the BerkeleyGW [2, 3] code on the Xeon-Phi architecture. BerkeleyGW depends both on hand-tuned critical kernels as well as on BLAS and FFT libraries. We describe the optimization process and performance improvements achieved. We discuss a layered parallelization strategy to take advantage of vector, thread and node-level parallelism. We discuss locality changes (including the consequence of the lack of L3 cache) and effective use of the on-package high-bandwidth memory. We show preliminary results on Knights-Landing including a roofline study of code performance before and after a number of optimizations. We find that the GW method is particularly well-suited for many-core architectures due to the ability to exploit a large amount of parallelism over plane-wave components, band-pairs, and frequencies.

international conference on parallel processing | 2018

A Framework for Auto-Parallelization and Code Generation: An Integrative Case Study with Legacy FORTRAN Codes

Konstantinos Krommydas; Paul Sathre; Ruchira Sasanka; Wu-chun Feng

GLAF, short for Grid-based Language and Auto-parallelization Framework, is a programming framework that seeks to democratize parallel programming by facilitating better productivity in parallel computing via an intuitive graphical programming interface (GPI) that automatically parallelizes and generates code in many languages. Originally, GLAF addressed program development from scratch via the GPI; but this unduly restricted GLAFs utility to creating new codes only. Thus, this paper extends GLAF by enabling program development from pre-existing kernels of interest, which can then be easily and transparently integrated into existing legacy codes. Specifically, we address the theoretical and practical limitations of integration and interoperability of auto-generated parallel code within existing FORTRAN codes; enhance GLAF to overcome these limitations; and present an integrative case study and evaluation of the enhanced GLAF via the implementation of important kernels in two NASA codes: (1) the Synoptic Surface & Atmospheric Radiation Budget (SARB), part of the Clouds and the Earths Radiant Energy System (CERES), and (2) the Fully Unstructured Navier-Stokes (FUN3D) suite for computational fluid dynamics.

Computer Physics Communications | 2018

Large-scale GW calculations on pre-exascale HPC systems

Mauro Del Ben; Felipe H. da Jornada; Andrew Canning; Nathan Wichmann; Karthik Raman; Ruchira Sasanka; Chao Yang; Steven G. Louie; Jack Deslippe

Abstract The ab initio GW approach is a rigorous Green’s-function-based framework that can be employed to compute electronic excitation properties of a wide variety of materials such as extended systems, molecules, as well as confined and nanostructured materials with a very satisfactory accuracy. However, GW calculations on complex systems are often hindered by the high computational cost associated with the method. Here, we demonstrate how to significantly speedup GW calculations with a novel algorithm for the computationally intense kernel based on a non-blocking chunked cyclic communication scheme that minimizes latency in MPI messages, allows for overlapping of communication and computation, and improves cache usage in matrix multiplication operations. The optimized version of the code, implemented in the BerkeleyGW software package, is capable of scaling well to the full Cori computer at NERSC (Cray XC40) and achieves over 11 Peta FLOP/s of sustained performance. We showcase our work by performing large-scale GW calculations of defect structures on silicon, which require simulation cells containing over 1700 atoms, and which can now be efficiently executed in just a few minutes on large pre-exascale high-performance computing systems.

Archive | 2012

Systems, apparatuses, and methods for a hardware and software system to automatically decompose a program to multiple parallel threads

David J. Sager; Ruchira Sasanka; Ron Gabor; Shlomo Raikin; Joseph Nuzman; Leeor Peled; Jason A. Domer; Ho-Seop Kim; Youfeng Wu; Koichi Yamada; Tin-Fook Ngai; Howard H. Chen; Jayaram Bobba; Jeffrey J. Cook; Osmar M. Shaikh; Suresh Srinivas

Archive | 2014