Abhishek Kulkarni
Indiana University Bloomington
Publication
Featured research published by Abhishek Kulkarni.
IEEE International Conference on High Performance Computing, Data, and Analytics | 2013
Ke Wang; Abhishek Kulkarni; Michael Lang; Dorian C. Arnold; Ioan Raicu
Owing to the significantly high rate of component failures at extreme scales, system services will need to be failure-resistant, adaptive and self-healing. A majority of HPC services are still designed around a centralized paradigm and hence are susceptible to scaling issues. Peer-to-peer services have proved themselves at scale for wide-area internet workloads. Distributed key-value stores (KVS) are widely used as a building block for these services, but are not prevalent in HPC services. In this paper, we simulate KVS for various service architectures and examine the design trade-offs as applied to HPC service workloads to support extreme-scale systems. The simulator is validated against existing distributed KVS-based services. Via simulation, we demonstrate how failure, replication, and consistency models affect performance at scale. Finally, we emphasize the general applicability of KVS to HPC services by feeding real HPC service workloads into the simulator and presenting a KVS-based distributed job launch prototype.
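For a concrete picture of the design trade-offs the abstract mentions (data distribution, replication, consistency, failure handling), the sketch below models a toy distributed key-value store with consistent hashing and N-way replication. It is illustrative Python only, not the authors' simulator; all class and parameter names are invented here.

```python
# Minimal sketch (not the paper's simulator): a consistent-hashing KVS with
# N-way replication, of the kind evaluated for extreme-scale HPC services.
import bisect
import hashlib

def _hash(value: str) -> int:
    return int(hashlib.sha1(value.encode()).hexdigest(), 16)

class ConsistentHashKVS:
    def __init__(self, servers, replicas=3):
        self.replicas = replicas
        self.ring = sorted((_hash(s), s) for s in servers)   # hash ring of servers
        self.stores = {s: {} for s in servers}               # per-server local store

    def _owners(self, key):
        """Return the `replicas` servers clockwise from the key's position on the ring."""
        hashes = [h for h, _ in self.ring]
        start = bisect.bisect(hashes, _hash(key)) % len(self.ring)
        return [self.ring[(start + i) % len(self.ring)][1] for i in range(self.replicas)]

    def put(self, key, value):
        for server in self._owners(key):      # eager replication to all owners
            self.stores[server][key] = value

    def get(self, key):
        for server in self._owners(key):      # fall through failed/empty replicas
            if key in self.stores[server]:
                return self.stores[server][key]
        raise KeyError(key)

kvs = ConsistentHashKVS([f"node{i}" for i in range(8)], replicas=3)
kvs.put("job:42", "running")
print(kvs.get("job:42"))
```

Varying the replica count, the consistency of `put`, and the failure behavior of `get` is the kind of configuration sweep a simulator like the one described can explore at scale.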
International Conference on Functional Programming | 2012
Adam Foltzer; Abhishek Kulkarni; Rebecca Swords; Sajith Sasidharan; Eric Jiang; Ryan R. Newton
Modern parallel computing hardware demands increasingly specialized attention to the details of scheduling and load balancing across heterogeneous execution resources that may include GPU and cloud environments, in addition to traditional CPUs. Many existing solutions address the challenges of particular resources, but do so in isolation, and in general do not compose within larger systems. We propose a general, composable abstraction for execution resources, along with a continuation-based meta-scheduler that harnesses those resources in the context of a deterministic parallel programming library for Haskell. We demonstrate performance benefits of combined CPU/GPU scheduling over either alone, and of combined multithreaded/distributed scheduling over existing distributed programming approaches for Haskell.
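The core idea is that execution resources are first-class values that compose into larger resources handled by one meta-scheduler. The sketch below transliterates that idea into Python for brevity, whereas the paper works in Haskell; the `Resource` abstraction and the toy CPU/GPU backends are invented for illustration and are not the paper's API.

```python
# Illustrative only: a composable "execution resource" abstraction with a tiny
# meta-scheduler policy (prefer one backend, fall back to another).
from dataclasses import dataclass
from typing import Callable

@dataclass
class Resource:
    name: str
    accepts: Callable[[dict], bool]   # can this backend run the task?
    run: Callable[[dict], object]     # backend-specific execution

    def compose(self, other: "Resource") -> "Resource":
        """Prefer `self`, fall back to `other`; the result is itself a Resource."""
        return Resource(
            name=f"{self.name}+{other.name}",
            accepts=lambda t: self.accepts(t) or other.accepts(t),
            run=lambda t: self.run(t) if self.accepts(t) else other.run(t),
        )

cpu = Resource("cpu", accepts=lambda t: True,                  # always available
               run=lambda t: sum(t["data"]))
gpu = Resource("gpu", accepts=lambda t: len(t["data"]) > 4,    # pretend only big tasks pay off
               run=lambda t: sum(t["data"]))                   # stand-in for a GPU kernel

meta = gpu.compose(cpu)   # combined CPU/GPU resource, usable anywhere a Resource is
print(meta.run({"data": [1, 2, 3]}))          # small task: falls back to the CPU backend
print(meta.run({"data": list(range(100))}))   # large task: routed to the "GPU" backend
```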
IEEE Transactions on Parallel and Distributed Systems | 2016
Ke Wang; Abhishek Kulkarni; Michael Lang; Dorian C. Arnold; Ioan Raicu
Owing to the extreme parallelism and the high component failure rates of tomorrow's exascale systems, high-performance computing (HPC) system software will need to be scalable, failure-resistant, and adaptive for sustained system operation and full system utilization. Much of the existing HPC system software is still designed around a centralized server paradigm and hence is susceptible to scaling issues and single points of failure. In this article, we explore the design tradeoffs for scalable system software at extreme scales. We propose a general system software taxonomy by deconstructing common HPC system software into their basic components. The taxonomy helps us reason about system software as follows: (1) it gives us a systematic way to architect scalable system software by decomposing them into their basic components; (2) it allows us to categorize system software based on the features of these components; and finally (3) it suggests the configuration space to consider for design evaluation via simulations or real implementations. Further, we evaluate different design choices of a representative system software, i.e., a key-value store, through simulations of up to millions of nodes. Finally, we show evaluation results of two distributed system software implementations, Slurm++ (a distributed HPC resource manager) and MATRIX (a distributed task execution framework), both developed based on insights from this work. We envision that the results in this article help lay the foundations for developing next-generation HPC system software for extreme scales.
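The point of such a taxonomy is that the cross product of component choices defines the design space to evaluate. The snippet below is a hedged sketch of that idea; the component and option names are illustrative placeholders, not the article's exact taxonomy.

```python
# Sketch: describe a system service by a few basic components, then enumerate
# the configuration space of designs a simulator could evaluate.
from itertools import product

TAXONOMY = {
    "service_architecture": ["centralized", "hierarchical", "fully_distributed"],
    "data_distribution":    ["single_server", "sharded", "consistent_hashing"],
    "replication":          ["none", "primary_backup", "n_way"],
    "consistency":          ["strong", "eventual"],
    "failure_recovery":     ["restart", "failover", "self_healing"],
}

def configuration_space(taxonomy):
    names = list(taxonomy)
    for combo in product(*taxonomy.values()):
        yield dict(zip(names, combo))

configs = list(configuration_space(TAXONOMY))
print(len(configs), "candidate designs to evaluate, e.g.:", configs[0])
```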
IEEE International Conference on High Performance Computing, Data, and Analytics | 2012
Abhishek Kulkarni; Adam Manzanares; Latchesar Ionkov; Michael Lang; Andrew Lumsdaine
Long-running HPC applications guard against node failures by writing checkpoints to parallel file systems. Writing these checkpoints with petascale-class machines has proven difficult, and the increased concurrency demands of exascale computing will exacerbate this problem. To meet checkpointing demands and sustain application-perceived throughput at exascale, multi-tiered hierarchical storage architectures involving solid-state burst buffers are being considered. In this paper, we describe the design and implementation of cento, a multi-level, content-addressable checkpoint file system for large-scale HPC systems. cento achieves in-flight checkpoint data reduction across all compute nodes through compression and elimination of duplicate blocks over a series of checkpoints. Through a detailed analysis of checkpoint dumps, we assess the benefits of data reduction for scientific applications that are representative of production workloads. We observe up to 40% data reduction within a limited sample of representative workloads. Finally, experiments on existing systems show a decrease in checkpoint commit latencies of 5 to 20%, reducing the load on the parallel file system.
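The mechanism behind content-addressable deduplication is straightforward to sketch: split each checkpoint into blocks, address blocks by a cryptographic hash, and store (compressed) only blocks not seen before. The block size and hash choice below are arbitrary for illustration, not cento's actual design.

```python
# Minimal sketch of content-addressable checkpointing with dedup + compression.
import hashlib
import zlib

BLOCK_SIZE = 4096
block_store = {}   # digest -> compressed block, shared across checkpoints

def write_checkpoint(data: bytes):
    """Return the checkpoint 'recipe': the ordered list of block digests."""
    recipe = []
    for off in range(0, len(data), BLOCK_SIZE):
        block = data[off:off + BLOCK_SIZE]
        digest = hashlib.sha256(block).hexdigest()
        if digest not in block_store:              # duplicate blocks are elided
            block_store[digest] = zlib.compress(block)
        recipe.append(digest)
    return recipe

def read_checkpoint(recipe):
    return b"".join(zlib.decompress(block_store[d]) for d in recipe)

ckpt1 = bytes(8192) + b"state-A" * 100
ckpt2 = bytes(8192) + b"state-B" * 100            # mostly identical to ckpt1
r1, r2 = write_checkpoint(ckpt1), write_checkpoint(ckpt2)
assert read_checkpoint(r1) == ckpt1 and read_checkpoint(r2) == ckpt2
print("unique blocks stored:", len(block_store))  # shared zero blocks stored once
```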
Proceedings of the 1st Annual Workshop on Functional Programming Concepts in Domain-Specific Languages | 2013
Abhishek Kulkarni; Ryan R. Newton
Domain-specific languages offer programming abstractions that enable higher efficiency, productivity and portability specific to a given application domain. Domain-specific languages such as StreamIt have valuable auto-parallelizing code-generators, but they require learning a new language and tool-chain and may not integrate easily with a larger application. One solution is to transform such standalone DSLs into embedded languages within a general-purpose host language. This prospect comes with its own challenges, namely the compile-time and runtime integration of the two languages. In this paper, we address these challenges, presenting our solutions in the context of a prototype embedding of StreamIt in Haskell. By demonstrating this methodology, we hope to encourage more reuse of DSL technology, and fewer short-lived reimplementations of existing techniques.
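To make the embedding idea concrete, the sketch below shows a tiny StreamIt-like pipeline of filters embedded in a host language; here it is written in Python for brevity, whereas the paper embeds StreamIt in Haskell, and the `Filter`/`pipeline` names are invented for this illustration. The payoff of embedding is visible even at this scale: stream programs are ordinary host-language values that compose with surrounding code.

```python
# Illustrative embedded stream "language": filters transform streams lazily,
# and pipeline() composes them into a runnable stream program.
class Filter:
    def __init__(self, work):
        self.work = work                       # per-item work function

    def __call__(self, stream):
        return (self.work(x) for x in stream)  # lazily transform the stream

def pipeline(*filters):
    def run(stream):
        for f in filters:
            stream = f(stream)
        return stream
    return run

# A tiny program in the embedded language: scale each item, then clamp at zero.
scale = Filter(lambda x: 2 * x)
threshold = Filter(lambda x: max(x, 0))
prog = pipeline(scale, threshold)

print(list(prog([-3, 1, 4])))   # [0, 2, 8]
```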
International Conference on Exascale Applications and Software | 2014
Thomas Sterling; Matthew Anderson; P. Kevin Bohan; Maciej Brodowicz; Abhishek Kulkarni; Bo Zhang
Achieving the performance potential of an Exascale machine depends on realizing both operational efficiency and scalability in high performance computing applications. This requirement has motivated the emergence of several new programming models which emphasize fine and medium grain task parallelism in order to address the aggravating effects of asynchrony at scale. The performance modeling of Exascale systems for these programming models requires the development of fundamentally new approaches due to the demands of both scale and complexity. This work presents a performance modeling case study of the Livermore Unstructured Lagrangian Explicit Shock Hydrodynamics (LULESH) proxy application where the performance modeling approach has been incorporated directly into a runtime system with two modalities of operation: computation and performance modeling simulation. The runtime system exposes performance sensitivities and projects operation to larger scales while also realizing the benefits of removing global barriers and extracting more parallelism from LULESH. Comparisons between the computation and performance modeling simulation results are presented.
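The "two modalities" idea can be sketched as a runtime that either executes a task or merely charges its estimated cost to a virtual clock, so the same task graph drives both computation and projection to larger scales. This is a hedged illustration only; the `Runtime` interface and the per-task cost below are made up and are not the paper's LULESH model.

```python
# Sketch: one runtime, two modalities -- execute tasks, or simulate their cost.
class Runtime:
    def __init__(self, mode="compute"):
        self.mode = mode
        self.virtual_time = 0.0

    def run_task(self, fn, cost_estimate, *args):
        if self.mode == "compute":
            return fn(*args)                  # modality 1: actually execute the task
        self.virtual_time += cost_estimate    # modality 2: only advance a virtual clock
        return None

def stencil_step(cells):
    return [0.5 * (cells[max(i - 1, 0)] + cells[min(i + 1, len(cells) - 1)])
            for i in range(len(cells))]

rt = Runtime(mode="model")
for _ in range(10):
    rt.run_task(stencil_step, 1e-3, [0.0] * 1000)   # 1e-3 s/step is a made-up cost
print("projected time for 10 steps:", rt.virtual_time, "s")
```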
International Workshop on Runtime and Operating Systems for Supercomputers | 2012
Abhishek Kulkarni; Andrew Lumsdaine; Michael Lang; Latchesar Ionkov
The execution of an SPMD application involves running multiple instances of a process with possibly varying arguments. With the widespread adoption of massively multicore processors, there has been a focus towards harnessing the abundant compute resources effectively in a power-efficient manner. Although much work has been done towards optimizing distributed process launch using hierarchical techniques, there has been a void in studying the performance of spawning processes within a single node. Reducing the latency to spawn a new process locally results in faster global job launch. Further, emerging dynamic and resilient execution models are designed on the premise of maintaining process pools for fault isolation and launching several processes in a relatively short period of time. Optimizing the latency and throughput of spawning processes would help improve the overall performance of runtime systems, enable adaptive process-replication-based reliability, and motivate the design and implementation of process management interfaces in future manycore operating systems. In this paper, we study several of the limiting factors for efficient spawning of processes on massively multicore architectures. We have developed a library to optimize launching multiple instances of the same executable. Our microbenchmarks show a 20-80% decrease in the process spawn time for multiple executables. We further discuss the effects of memory locality and propose NUMA-aware extensions to optimize launching processes with large memory-mapped segments, including dynamic shared libraries. Finally, we describe vector operating system interfaces for spawning a batch of processes from a given executable on specific cores. Our results show a 50x speedup over the traditional method of launching new processes using the fork and exec system calls.
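The cost structure being optimized can be seen with a rough microbenchmark: fork followed by exec pays to load and initialize a fresh executable, while forking an already-initialized process (as a pre-created process pool would) does not. The Unix-only sketch below is illustrative and is not the authors' library or benchmark.

```python
# Rough spawn-latency comparison: fork+exec vs. fork of an initialized process.
import os
import time

N = 200

def timed(label, spawn_one):
    start = time.perf_counter()
    for _ in range(N):
        pid = spawn_one()
        os.waitpid(pid, 0)                 # wait so children do not accumulate
    print(f"{label}: {(time.perf_counter() - start) / N * 1e6:.1f} us per process")

def fork_exec():
    pid = os.fork()
    if pid == 0:                           # child: load and run a new executable
        os.execvp("/bin/true", ["/bin/true"])
    return pid

def fork_only():
    pid = os.fork()
    if pid == 0:                           # child: address space already set up
        os._exit(0)
    return pid

timed("fork + exec", fork_exec)
timed("fork only  ", fork_only)
```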
High Performance Computing Systems and Applications | 2008
Abhishek Kulkarni; Andrew Lumsdaine
OSCAR has matured over the years into a very capable, yet simple, high-performance clustering platform. Ease of building and administering a cluster, effective resource management, and scalable performance are some of the issues that have received considerable attention in the HPC domain lately. This technical paper describes the integration of OSCAR and PERCEUS, which aims to enable the deployment of stateless nodes in a cluster. This promises to fill a much-needed niche by supporting diskless cluster installation with OSCAR. The single point of administration and ease of use will help novice users as well as experienced cluster administrators deploy clusters faster.
Languages and Compilers for Parallel Computing | 2015
Christopher S. Zakian; Timothy A. K. Zakian; Abhishek Kulkarni; Buddhika Chamith; Ryan R. Newton
Library and language support for scheduling non-blocking tasks has greatly improved, as have lightweight user threading packages. However, there is a significant gap between the two developments. In previous work -- and in today's software packages -- lightweight thread creation incurs much larger overheads than tasking libraries, even on tasks that end up never blocking. This limitation can be removed. To that end, we describe an extension to the Intel Cilk Plus runtime system, Concurrent Cilk, where tasks are lazily promoted to threads. Concurrent Cilk removes the overhead of thread creation on threads which end up calling no blocking operations, and is the first system to do so for C/C++ with legacy support for standard calling conventions and stack representations. We demonstrate that Concurrent Cilk adds negligible overhead to existing Cilk programs, while its promoted threads remain more efficient than OS threads in terms of context-switch overhead and blocking communication. Further, it enables development of blocking data structures that create non-fork-join dependence graphs -- which can expose more parallelism -- and better supports data-driven computations waiting on results from remote devices.
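The lazy-promotion idea can be approximated in a few lines: run a spawned task inline (no thread at all), and only when it signals that it must block, hand it to an OS thread. The Python sketch below is a loose transliteration; the real system extends the Intel Cilk Plus C/C++ runtime and captures the blocked task's continuation rather than restarting it as this toy version does. All names here are invented.

```python
# Sketch of lazy task-to-thread promotion: inline fast path, thread on blocking.
import threading
from concurrent.futures import Future

class WouldBlock(Exception):
    """Raised by a task just before it would perform a blocking operation."""

def spawn(fn, *args):
    fut = Future()
    try:
        fut.set_result(fn(*args, may_block=False))   # fast path: plain call, no thread
    except WouldBlock:
        def promoted():                              # slow path: promote to an OS thread
            fut.set_result(fn(*args, may_block=True))
        threading.Thread(target=promoted, daemon=True).start()
    return fut

def fetch(chan, may_block):
    if chan:                               # data already available: stays a cheap task
        return chan.pop()
    if not may_block:
        raise WouldBlock()                 # ask the runtime to promote this task
    return "value-from-blocking-wait"      # placeholder for a genuinely blocking read

print(spawn(fetch, ["ready-value"]).result())   # runs inline, never touches a thread
print(spawn(fetch, []).result())                # promoted; caller stays unblocked
```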
International Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems | 2013
Matthew Anderson; Maciej Brodowicz; Abhishek Kulkarni; Thomas Sterling
Conventional programming practices on multicore processors in high performance computing architectures are not universally effective in terms of efficiency and scalability for many algorithms in scientific computing. One possible solution for improving efficiency and scalability in applications on this class of machines is the use of a many-tasking runtime system employing many lightweight, concurrent threads. Yet a priori estimation of the potential performance and scalability impact of such runtime systems on existing applications developed around the bulk synchronous parallel (BSP) model is not well understood. In this work, we present a case study of a BSP particle-in-cell benchmark code which has been ported to a many-tasking runtime system. The 3-D Gyrokinetic Toroidal code (GTC) is examined in its original MPI form and compared with a port to the High Performance ParalleX 3 (HPX-3) runtime system. Phase overlap, oversubscription behavior, and work rebalancing in the implementation are explored. Results for GTC using the SST/macro simulator complement the implementation results. Finally, an analytic performance model for GTC is presented in order to guide future implementation efforts.
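An analytic model for a bulk synchronous code typically splits per-step time into divided compute plus per-phase communication, with overlap capturing what a many-tasking runtime can hide. The sketch below shows that model shape only; the functional form and every coefficient are illustrative placeholders, not the paper's calibrated GTC model.

```python
# Hedged sketch of a BSP-style analytic performance model with optional overlap.
def bsp_time(p, work=1.0e12, rate=1.0e9, phases=1000,
             latency=5.0e-5, bytes_per_rank=2.0e6, bandwidth=5.0e9, overlap=0.0):
    compute = work / (rate * p)                          # work evenly divided over p ranks
    comm = phases * (latency + bytes_per_rank / bandwidth)
    return compute + (1.0 - overlap) * comm              # overlap hides part of the comm

for p in (64, 1024, 16384):
    t_bsp = bsp_time(p)                    # global barriers, no compute/comm overlap
    t_amt = bsp_time(p, overlap=0.8)       # many-tasking: most communication overlapped
    print(f"p={p:6d}  BSP {t_bsp:8.4f} s   many-tasking {t_amt:8.4f} s")
```

At small scale compute dominates and the two curves track each other; at large scale the fixed communication term dominates the BSP variant, which is the scalability effect the paper's runtime-based model is built to expose.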