Milind Chabbi
Rice University
Publication
Featured research published by Milind Chabbi.
international parallel and distributed processing symposium | 2013
Sanjay Chatterjee; Sagnak Tasirlar; Zoran Budimlic; Vincent Cavé; Milind Chabbi; Max Grossman; Vivek Sarkar; Yonghong Yan
Effective combination of inter-node and intra-node parallelism is recognized to be a major challenge for future extreme-scale systems. Many researchers have demonstrated the potential benefits of combining both levels of parallelism, including increased communication-computation overlap, improved memory utilization, and effective use of accelerators. However, current “hybrid programming” approaches often require significant rewrites of application code and assume a high level of programmer expertise. Dynamic task parallelism has been widely regarded as a programming model that combines the best of performance and programmability for shared-memory programs. For distributed-memory programs, most users rely on efficient implementations of MPI. In this paper, we propose HCMPI (Habanero-C MPI), an integration of the Habanero-C dynamic task-parallel programming model with the widely used MPI message-passing interface. All MPI calls are treated as asynchronous tasks in this model, thereby enabling unified handling of messages and tasking constructs. For programmers unfamiliar with MPI, we introduce distributed data-driven futures (DDDFs), a new data-flow programming model that seamlessly integrates intra-node and inter-node data-flow parallelism without requiring any knowledge of MPI. Our novel runtime design for HCMPI and DDDFs uses a combination of dedicated communication and computation specific worker threads. We evaluate our approach on a set of micro-benchmarks as well as larger applications and demonstrate better scalability compared to the most efficient MPI implementations, while offering a unified programming model to integrate asynchronous task parallelism with distributed-memory parallelism.
acm sigplan symposium on principles and practice of parallel programming | 2015
Milind Chabbi; Mike Fagan; John M. Mellor-Crummey
Efficient locking mechanisms are critically important for high performance computers. On highly-threaded systems with a deep memory hierarchy, the throughput of traditional queueing locks, e.g., MCS locks, falls off due to NUMA effects. Two-level cohort locks perform better on NUMA systems, but fail to deliver top performance for deep NUMA hierarchies. In this paper, we describe a hierarchical variant of the MCS lock that adapts the principles of cohort locking for architectures with deep NUMA hierarchies. We describe analytical models for throughput and fairness of Cohort-MCS (C-MCS) and Hierarchical MCS (HMCS) locks that enable us to tailor these locks for high performance on any target platform without empirical tuning. Using these models, one can select parameters such that an HMCS lock will deliver better fairness than a C-MCS lock for a given throughput, or deliver better throughput for a given fairness. Our experiments show that, under high contention, a three-level HMCS lock delivers up to 7.6x higher lock throughput than a C-MCS lock on a 128-thread IBM Power 755 and a five-level HMCS lock delivers up to 72x higher lock throughput on a 4096-thread SGI UV 1000. On the K-means clustering code from the MineBench suite, a three-level HMCS lock reduces the running time by up to 55% compared to the C-MCS lock on an IBM Power 755.
acm sigplan symposium on principles and practice of parallel programming | 2016
Milind Chabbi; John M. Mellor-Crummey
Over the last decade, the growing use of cache-coherent NUMA architectures has spurred the development of numerous locality-preserving mutual exclusion algorithms. NUMA-aware locks such as HCLH, HMCS, and cohort locks exploit locality of reference among nearby threads to deliver high lock throughput under high contention. However, the hierarchical nature of these locality-aware locks increases latency, which reduces the throughput of uncontended or lightly-contended critical sections. To date, no lock design for NUMA systems has delivered both low latency under low contention and high throughput under high contention. In this paper, we describe the design and evaluation of an adaptive mutual exclusion scheme (AHMCS lock), which employs several orthogonal strategies: a hierarchical MCS (HMCS) lock for high throughput under high contention, Lamport's fast path approach for low latency under low contention, an adaptation mechanism that employs hysteresis to balance latency and throughput under moderate contention, and hardware transactional memory for lowest latency in the absence of contention. The result is a top-performing lock that has most properties of an ideal mutual exclusion algorithm. AHMCS exploits the strengths of multiple contention management techniques to deliver high performance over a broad range of contention levels. Our empirical evaluations demonstrate the effectiveness of AHMCS over prior art.
ieee international conference on high performance computing data and analytics | 2013
Milind Chabbi; K. V. R. Murthy; Mike Fagan; John M. Mellor-Crummey
Performance analysis of GPU-accelerated systems requires a system-wide view that considers both CPU and GPU components. In this paper, we describe how to extend system-wide, sampling-based performance analysis methods to GPU-accelerated systems. Since current GPUs do not support sampling, our implementation required careful coordination of instrumentation-based performance data collection on GPUs with sampling-based methods employed on CPUs. We also introduce a novel technique for analyzing systemic idleness in CPU/GPU systems. We demonstrate the effectiveness of our techniques with application case studies on Titan and Keeneland. Some of the highlights of our case studies are: 1) we improved performance for LULESH 1.0 by 30%, 2) we identified a hardware performance problem on Keeneland, 3) we identified a scaling problem in LAMMPS derived from CUDA initialization, and 4) we identified a performance problem that is caused by GPU synchronization operations that suffer delays due to blocking system calls.
symposium on code generation and optimization | 2012
Milind Chabbi; John M. Mellor-Crummey
Software systems often suffer from various kinds of performance inefficiencies resulting from data structure choice, lack of design for performance, and ineffective compiler optimization. Avoiding unnecessary operations, and in particular memory accesses, is desirable. In this paper, we describe DeadSpy, a tool that dynamically detects every dead write to memory in a given execution and provides actionable feedback to the programmer. This tool provides a methodical way to identify dead writes, which are a common symptom of performance inefficiencies. Our analysis of the SPEC CPU2006 benchmarks showed that the fraction of dead writes is surprisingly high. In fact, we observed that the SPEC CPU2006 gcc benchmark has 61% dead writes on average across its reference inputs. DeadSpy pinpoints source lines contributing to such inefficiencies. In several case studies with high dead writes, simple code restructuring to eliminate dead writes improved their performance significantly. For gcc, avoiding dead writes improved its running time by as much as 28% for some inputs and 14% on average. We recommend dead write elimination as an important step in performance tuning.
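The notion of a dead write used in the abstract above can be illustrated on a toy memory trace. The following is a minimal sketch and not DeadSpy's implementation: the trace format of (operation, address) pairs and the function name `count_dead_writes` are assumptions made for illustration. A write is counted as dead when the next access to the same address is another write, with no read in between.

```python
def count_dead_writes(trace):
    """Count writes that are overwritten before being read.

    `trace` is an iterable of (op, addr) pairs, where op is "R" or "W".
    This trace format is hypothetical and chosen only for illustration.
    Returns (dead_writes, total_writes).
    """
    last_op = {}        # addr -> most recent operation on that address
    dead = 0
    total_writes = 0
    for op, addr in trace:
        if op == "W":
            total_writes += 1
            # A write that directly follows a write to the same address
            # means the earlier value was never read: it was dead.
            if last_op.get(addr) == "W":
                dead += 1
        elif op != "R":
            raise ValueError(f"unknown operation: {op!r}")
        last_op[addr] = op
    return dead, total_writes


trace = [("W", 0x10), ("W", 0x10), ("R", 0x10), ("W", 0x20), ("W", 0x10)]
dead, total = count_dead_writes(trace)
print(f"{dead} of {total} writes are dead")
```

In this sketch a final write that is never read again is not counted as dead, because no later write kills it. A real tool would also need to attribute each dead write to source lines, which this example does not attempt.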
symposium on code generation and optimization | 2014
Milind Chabbi; Xu Liu; John M. Mellor-Crummey
Fine-grained binary instrumentation is a popular technique to monitor program execution. Intel's Pin is a leading dynamic binary instrumentation framework for building program measurement and analysis tools. A key feature missing in Pin is the ability to associate call paths with instructions as they execute. The availability of calling context information enables Pin tools to provide more detailed diagnostic feedback. This paper introduces CCTLib, a call path collection library that any Pin tool can use to obtain the full calling context at any and every machine instruction that executes. CCTLib not only associates any instruction with source code along the call path, but also points to the data object accessed by the instruction if it is a memory access. With CCTLib, we demonstrate that collecting call paths on each executed instruction is possible, even for reasonably long running programs. Prior art in call path collection for Pin has several limitations. Compared to other open-source Pin tools for call path collection, CCTLib provides richer information that is accurate even for programs with complex control flow and does so with about 30% less overhead. CCTLib enables attribution of metrics in Pin tools to both code and data.
IEEE Transactions on Parallel and Distributed Systems | 2016
Ashwin M. Aji; Lokendra S. Panwar; Feng Ji; K. V. R. Murthy; Milind Chabbi; Pavan Balaji; Keith R. Bisset; James Dinan; Wu-chun Feng; John M. Mellor-Crummey; Xiaosong Ma; Rajeev Thakur
Data movement in high-performance computing systems accelerated by graphics processing units (GPUs) remains a challenging problem. Data communication in popular parallel programming models, such as the Message Passing Interface (MPI), is currently limited to the data stored in the CPU memory space. Auxiliary memory systems, such as GPU memory, are not integrated into such data movement standards, thus providing applications with no direct mechanism to perform end-to-end data movement. We introduce MPI-ACC, an integrated and extensible framework that allows end-to-end data movement in accelerator-based systems. MPI-ACC provides productivity and performance benefits by integrating support for auxiliary memory spaces into MPI. MPI-ACC supports data transfer among CUDA, OpenCL and CPU memory spaces and is extensible to other offload models as well. MPI-ACC's runtime system enables several key optimizations, including pipelining of data transfers, scalable memory management techniques, and balancing of communication based on accelerator and node architecture. MPI-ACC is designed to work concurrently with other GPU workloads with minimum contention. We describe how MPI-ACC can be used to design new communication-computation patterns in scientific applications from domains such as epidemiology simulation and seismology modeling, and we discuss the lessons learned. We present experimental results on a state-of-the-art cluster with hundreds of GPUs, and we compare the performance and productivity of MPI-ACC with MVAPICH, a popular CUDA-aware MPI solution. MPI-ACC encourages programmers to explore novel application-specific optimizations for improved overall cluster utilization.
Proceedings of the 1st International Workshop on Adaptive Self-Tuning Computing Systems for the Exaflop Era | 2011
Milind Chabbi; John M. Mellor-Crummey; Keith D. Cooper
Most compilers apply optimizations in a fixed order regardless of input programs. However, it is well known that optimizations can have enabling and disabling interactions or equivalent effects. The effects of interference are program specific and hence no single sequence is universally appropriate for all input programs. In this paper, we explore the problem of searching for optimal sequences of compiler optimizations to apply for a given program and describe novel strategies that bring us a step closer to searching this problem space efficiently. We also construct models for accurately predicting the runtime performance of a program when a sequence of optimizations is applied to it. The early results of the models on a small set of input programs are encouraging and suggest that the approaches we describe are worthy of further consideration.
architectural support for programming languages and operating systems | 2017
Shasha Wen; Milind Chabbi; Xu Liu
Complex code bases with several layers of abstractions have abundant inefficiencies that affect the execution time. Value redundancy is a kind of inefficiency where the same values are repeatedly computed, stored, or retrieved over the course of execution. Not all redundancies can be easily detected or eliminated with compiler optimization passes due to the inherent limitations of static analysis. Microscopic observation of whole executions at instruction- and operand-level granularity breaks down abstractions and helps recognize redundancies that masquerade in complex programs. We have developed REDSPY, a fine-grained profiler to pinpoint and quantify redundant operations in program executions. Value redundancy may happen over time at the same locations or in adjacent locations, and thus it has temporal and spatial locality. REDSPY identifies both temporal and spatial value locality. Furthermore, REDSPY is capable of identifying values that are approximately the same, enabling optimization opportunities in HPC codes that often use floating point computations. REDSPY provides intuitive optimization guidance by apportioning redundancies to their provenance: source lines and execution calling contexts. REDSPY pinpointed a dramatically high volume of redundancies in programs that were optimization targets for decades, such as the SPEC CPU2006 suite, the Rodinia benchmark, and NWChem, a production computational chemistry code. Guided by REDSPY, we were able to eliminate redundancies that resulted in significant speedups.
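Temporal value redundancy, as described in the abstract above, can be illustrated with a small sketch. This is not REDSPY's implementation: the (address, value) store trace and the function name `redundant_store_fraction` are assumptions made for illustration, and only the temporal case (the same location rewritten with the same or nearly the same value) is covered, not the spatial one.

```python
import math


def redundant_store_fraction(stores, rel_tol=0.0):
    """Return the fraction of stores that rewrite the value already in place.

    `stores` is an iterable of (addr, value) pairs in execution order
    (a hypothetical trace format). With rel_tol > 0, a store also counts
    as redundant when the new value is approximately equal to the old one.
    """
    memory = {}         # addr -> last value stored there
    redundant = 0
    total = 0
    for addr, value in stores:
        total += 1
        if addr in memory and math.isclose(
            memory[addr], value, rel_tol=rel_tol, abs_tol=0.0
        ):
            redundant += 1
        memory[addr] = value
    return redundant / total if total else 0.0


stores = [(0, 1.0), (0, 1.0), (0, 1.0000001), (1, 2.0)]
print(redundant_store_fraction(stores))                 # exact matches only
print(redundant_store_fraction(stores, rel_tol=1e-6))   # approximate matches
```

With `rel_tol=0.0` the comparison reduces to exact equality, so the tolerance parameter is the only difference between the exact and approximate modes in this sketch.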
acm sigplan symposium on principles and practice of parallel programming | 2015
Milind Chabbi; W. Lavrijsen; Wibe A. de Jong; Koushik Sen; John M. Mellor-Crummey; Costin Iancu
Large scientific code bases are often composed of several layers of runtime libraries, implemented in multiple programming languages. In such situations, programmers often choose conservative synchronization patterns, leading to suboptimal performance. In this paper, we present context-sensitive dynamic optimizations that elide barriers that are redundant during program execution. In our technique, we perform data race detection alongside the program to identify redundant barriers in their calling contexts; after an initial learning phase, we start eliding all future instances of barriers occurring in the same calling context. We present an automatic on-the-fly optimization and a multi-pass guided optimization. We apply our techniques to NWChem, a 6-million-line computational chemistry code written in C/C++/Fortran that uses several runtime libraries such as Global Arrays, ComEx, DMAPP, and MPI. Our technique elides a surprisingly high fraction of barriers (as many as 63%) in production runs. This redundancy elimination translates to application speedups as high as 14% on 2048 cores. Our techniques also provided valuable insight about the application behavior, later used by NWChem developers. Overall, we demonstrate the value of holistic context-sensitive analyses that consider the domain science in conjunction with the associated runtime software stack.
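The learn-then-elide idea in the abstract above can be sketched as a small per-context state machine. This is an illustrative assumption, not the authors' implementation: the class `BarrierElider`, the `learning_instances` threshold, and the `is_redundant` callback (a stand-in for the data race detection the abstract mentions) are all hypothetical names. A calling context is modeled as a tuple of function names.

```python
from collections import defaultdict


class BarrierElider:
    """Toy model: learn per calling context whether a barrier is redundant."""

    def __init__(self, learning_instances=3):
        self.learning_instances = learning_instances
        self.observed = defaultdict(int)   # context -> instances observed
        self.needed = set()                # contexts where a barrier mattered

    def should_execute(self, context, is_redundant):
        """Return True if the barrier reached in `context` must execute.

        `is_redundant` is a zero-argument callable standing in for race
        detection; it is consulted only during the learning phase.
        """
        if context in self.needed:
            return True                    # seen as necessary: never elide
        if self.observed[context] >= self.learning_instances:
            return False                   # learned as redundant: elide
        self.observed[context] += 1
        if not is_redundant():
            self.needed.add(context)
        return True                        # always execute while learning


elider = BarrierElider(learning_instances=2)
ctx = ("main", "solve", "sync")
decisions = [elider.should_execute(ctx, lambda: True) for _ in range(4)]
print(decisions)  # [True, True, False, False]
```

Keying the decision on the full calling context rather than on the barrier's call site alone lets the same barrier be elided on one call path and kept on another. The sketch ignores the safety question of what happens if a barrier learned as redundant later becomes necessary.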