Keshav Pingali | Researchain

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Keshav Pingali is active.

Explore More

Publication

Featured researches published by Keshav Pingali.

ACM Transactions on Programming Languages and Systems | 1989

I-structures: data structures for parallel computing

Arvind; Rishiyur S. Nikhil; Keshav Pingali

It is difficult to achieve elegance, efficiency, and parallelism simultaneously in functional programs that manipulate large data structures. We demonstrate this through careful analysis of program examples using three common functional data-structuring approaches-lists using Cons, arrays using Update (both fine-grained operators), and arrays using make-array (a “bulk” operator). We then present I-structure as an alternative and show elegant, efficient, and parallel solutions for the program examples in Id, a language with I-structures. The parallelism in Id is made precise by means of an operational semantics for Id as a parallel reduction system. I-structures make the language nonfunctional, but do not lose determinacy. Finally, we show that even in the context of purely functional languages, I-structures are invaluable for implementing functional data abstractions.

programming language design and implementation | 1997

Data-centric multi-level blocking

Induprakas Kodukula; Nawaaz Ahmed; Keshav Pingali

We present a simple and novel framework for generating blocked codes for high-performance machines with a memory hierarchy. Unlike traditional compiler techniques like tiling, which are based on reasoning about the control flow of programs, our techniques are based on reasoning directly about the flow of data through the memory hierarchy. Our data-centric transformations permit a more direct solution to the problem of enhancing data locality than current control-centric techniques do, and generalize easily to multiple levels of memory hierarchy. We buttress these claims with performance numbers for standard benchmarks from the problem domain of dense numerical linear algebra. The simplicity and intuitive appeal of our approach should make it attractive to compiler writers as well as to library writers.

programming language design and implementation | 1989

Process decomposition through locality of reference

Anne Rogers; Keshav Pingali

In the context of sequential computers, it is common practice to exploit temporal locality of reference through devices such as caches and virtual memory. In the context of multiprocessors, we believe that it is equally important to exploit spatial locality of reference. We are developing a system which, given a sequential program and its domain decomposition, performs process decomposition so as to enhance spatial locality of reference. We describe an application of this method - generating code from shared-memory programs for the (distributed memory) Intel iPSC/2.

ieee international symposium on workload characterization | 2012

A quantitative study of irregular programs on GPUs

Martin Burtscher; Rupesh Nasre; Keshav Pingali

GPUs have been used to accelerate many regular applications and, more recently, irregular applications in which the control flow and memory access patterns are data-dependent and statically unpredictable. This paper defines two measures of irregularity called control-flow irregularity and memory-access irregularity, and investigates, using performance-counter measurements, how irregular GPU kernels differ from regular kernels with respect to these measures. For a suite of 13 benchmarks, we find that (i) irregularity at the warp level varies widely, (ii) control-flow irregularity and memory-access irregularity are largely independent of each other, and (iii) most kernels, including regular ones, exhibit some irregularity. A programs irregularity can change between different inputs, systems, and arithmetic precision but generally stays in a specific region of the irregularity space. Whereas some highly tuned implementations of irregular algorithms exhibit little irregularity, trading off extra irregularity for better locality or less work can improve overall performance.

acm sigplan symposium on principles and practice of parallel programming | 2003

Automated application-level checkpointing of MPI programs

Greg Bronevetsky; Daniel Marques; Keshav Pingali; Paul Stodghill

The running times of many computational science applications, such as protein-folding using ab initio methods, are much longer than the mean-time-to-failure of high-performance computing platforms. To run to completion, therefore, these applications must tolerate hardware failures.In this paper, we focus on the stopping failure model in which a faulty process hangs and stops responding to the rest of the system. We argue that tolerating such faults is best done by an approach called application-level coordinated non-blocking checkpointing, and that existing fault-tolerance protocols in the literature are not suitable for implementing this approach.We then present a suitable protocol, which is implemented by a co-ordination layer that sits between the application program and the MPI library. We show how this protocol can be used with a precompiler that instruments C/MPI programs to save application and MPI library state. An advantage of our approach is that it is independent of the MPI implementation. We present experimental results that argue that the overhead of using our system can be small.

symposium on operating systems principles | 2013

A lightweight infrastructure for graph analytics

Donald Nguyen; Andrew Lenharth; Keshav Pingali

Several domain-specific languages (DSLs) for parallel graph analytics have been proposed recently. In this paper, we argue that existing DSLs can be implemented on top of a general-purpose infrastructure that (i) supports very fine-grain tasks, (ii) implements autonomous, speculative execution of these tasks, and (iii) allows application-specific control of task scheduling policies. To support this claim, we describe such an implementation called the Galois system. We demonstrate the capabilities of this infrastructure in three ways. First, we implement more sophisticated algorithms for some of the graph analytics problems tackled by previous DSLs and show that end-to-end performance can be improved by orders of magnitude even on power-law graphs, thanks to the better algorithms facilitated by a more general programming model. Second, we show that, even when an algorithm can be expressed in existing DSLs, the implementation of that algorithm in the more general system can be orders of magnitude faster when the input graphs are road networks and similar graphs with high diameter, thanks to more sophisticated scheduling. Third, we implement the APIs of three existing graph DSLs on top of the common infrastructure in a few hundred lines of code and show that even for power-law graphs, the performance of the resulting implementations often exceeds that of the original DSL systems, thanks to the lightweight infrastructure.

programming language design and implementation | 1994

The program structure tree: computing control regions in linear time

Richard Johnson; David Pearson; Keshav Pingali

In this paper, we describe the program structure tree (PST), a hierarchical representation of program structure based on single entry single exit (SESE) regions of the control flow graph. We give a linear-time algorithm for finding SESE regions and for building the PST of arbitrary control flow graphs (including irreducible ones). Next, we establish a connection between SESE regions and control dependence equivalence classes, and show how to use the algorithm to find control regions in linear time. Finally, we discuss some applications of the PST. Many control flow algorithms, such as construction of Static Single Assignment form, can be speeded up by applying the algorithms in a divide-and-conquer style to each SESE region on its own. The PST is also used to speed up data flow analysis by exploiting “sparsity”. Experimental results from the Perfect Club and SPEC89 benchmarks confirm that the PST approach finds and exploits program structure.

international symposium on microarchitecture | 1993

Register renaming and dynamic speculation: an alternative approach

Mayan Moudgill; Keshav Pingali; Stamatis Vassiliadis

Presents a novel microparallel taxonomy for machines with multiple-instruction processing capabilities including VLIW, superscalar, and decoupled machines. The taxonomy is based upon the static or dynamic behavior of four abstract, operational stages that an instruction passes through. These stages are fetch, decode, execute, and retire. This two valued, four variable taxonomy results in sixteen ways that a processors microarchitecture can be specified. The paper categorizes different machine instances that are either actual implementations or proposed systems within the taxonomy framework. Four new processor microarchitectures are postulated which provide additional features and are instances of the remaining unexplored microparallel classifications.<<ETX>>

Proceedings of the IEEE | 2005

Is Search Really Necessary to Generate High-Performance BLAS?

Kamen Yotov; Xiaoming Li; Gang Ren; María Jesús Garzarán; David A. Padua; Keshav Pingali; Paul Stodghill

A key step in program optimization is the estimation of optimal values for parameters such as tile sizes and loop unrolling factors. Traditional compilers use simple analytical models to compute these values. In contrast, library generators like ATLAS use global search over the space of parameter values by generating programs with many different combinations of parameter values, and running them on the actual hardware to determine which values give the best performance. It is widely believed that traditional model-driven optimization cannot compete with search-based empirical optimization because tractable analytical models cannot capture all the complexities of modern high-performance architectures, but few quantitative comparisons have been done to date. To make such a comparison, we replaced the global search engine in ATLAS with a model-driven optimization engine and measured the relative performance of the code produced by the two systems on a variety of architectures. Since both systems use the same code generator, any differences in the performance of the code produced by the two systems can come only from differences in optimization parameter values. Our experiments show that model-driven optimization can be surprisingly effective and can generate code with performance comparable to that of code generated by ATLAS using global search.

programming language design and implementation | 1993

Dependence-based program analysis

Richard Johnson; Keshav Pingali

Program analysis and optimization can be speeded up through the use of the dependence flow graph (DFG), a representation of program dependences which generalizes def-use chains and static single assignment (SSA) form. In this paper, we give a simple graph-theoretic description of the DFG and show how the DFG for a program can be constructed in O(EV) time. We then show how forward and backward dataflow analyses can be performed efficiently on the DFG, using constant propagation and elimination of partial redundancies as examples. These analyses can be framed as solutions of dataflow equations in the DFG. Our construction algorithm is of independent interest because it can be used to construct a programs control dependence graph in O(E) time and its SSA representation in O(EV) time, which are improvements over existing algorithms.

Explore More