Publication


Featured research published by Vinod Kathail.


IEEE Computer | 2002

PICO: automatically designing custom computers

Vinod Kathail; Shail Aditya; Robert Schreiber; B. Ramakrishna Rau; Darren C. Cronquist; Mukund Sivaraman

The paper discusses the PICO (program in, chip out) project, a long-range HP Labs research effort that aims to automate the design of optimized, application-specific computing systems, thus enabling the rapid and cost-effective design of custom chips when no adequately specialized, off-the-shelf design is available. PICO research takes a systematic approach to the hierarchical design of complex systems and advances technologies for automatically designing custom nonprogrammable accelerators and VLIW processors. While skeptics often assume that automated design must emulate human designers who invent new solutions to problems, PICO's approach is to automatically pick the most suitable designs from a well-engineered space of designs. Such automation of embedded computer design promises an era of yet more growth in the number and variety of innovative smart products by lowering the barriers of design time, designer availability, and design cost.


Signal Processing Systems | 2002

PICO-NPA: High-Level Synthesis of Nonprogrammable Hardware Accelerators

Robert Schreiber; Shail Aditya; Scott A. Mahlke; Vinod Kathail; B. Ramakrishna Rau; Darren C. Cronquist; Mukund Sivaraman

The PICO-NPA system automatically synthesizes nonprogrammable accelerators (NPAs) to be used as co-processors for functions expressed as loop nests in C. The NPAs it generates consist of a synchronous array of one or more customized processor datapaths, their controller, local memory, and interfaces. The user, or a design space exploration tool that is part of the full PICO system, identifies within the application a loop nest to be implemented as an NPA, and indicates the performance required of the NPA by specifying the number of processors and the number of machine cycles that each processor uses per iteration of the inner loop. PICO-NPA emits synthesizable HDL that defines the accelerator at the register transfer level (RTL). The system also modifies the user's application software to make use of the generated accelerator. The main objective of PICO-NPA is to reduce design cost and time without significantly reducing design quality. Design of an NPA and its support software typically requires one or two weeks using PICO-NPA, which is a many-fold improvement over the industry norm. In addition, PICO-NPA can readily generate a wide range of implementations with scalable performance from a single specification. In an experimental comparison of NPAs of equivalent throughput, PICO-NPA designs are slightly more costly than hand-designed accelerators. Logic synthesis and place-and-route have been performed successfully on PICO-NPA designs, which have achieved high clock rates.
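The abstract describes synthesizing accelerators from "functions expressed as loop nests in C". A minimal sketch of the kind of input this suggests, assuming a 4-tap FIR filter as the example (the function name, sizes, and coefficients are illustrative, not taken from the paper):

```c
#include <assert.h>

/* Hypothetical example of a PICO-NPA-style input: a perfect C loop
 * nest with affine array accesses.  Names and sizes are illustrative. */
#define N 16   /* number of output samples */
#define T 4    /* number of filter taps */

void fir(const int x[N + T - 1], const int c[T], int y[N]) {
    for (int i = 0; i < N; i++) {       /* one output sample per outer iteration */
        int acc = 0;
        for (int j = 0; j < T; j++)     /* inner loop: multiply-accumulate */
            acc += c[j] * x[i + j];
        y[i] = acc;
    }
}
```

In PICO-NPA's terms, the performance of the generated NPA for such a nest would be set by choosing the number of processors and the cycles each spends per inner-loop iteration.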


Application-Specific Systems, Architectures, and Processors | 2000

High-level synthesis of nonprogrammable hardware accelerators

Robert Schreiber; Shail Aditya; B. Ramakrishna Rau; Vinod Kathail; Scott A. Mahlke; Santosh G. Abraham; Greg Snider

The PICO-N system automatically synthesizes embedded nonprogrammable accelerators to be used as co-processors for functions expressed as loop nests in C. The output is synthesizable VHDL that defines the accelerator at the register transfer level (RTL). The system generates a synchronous array of customized VLIW (very long instruction word) processors, their controller, local memory, and interfaces. The system also modifies the user's application software to make use of the generated accelerator. The user indicates the throughput to be achieved by specifying the number of processors and their initiation interval. In experimental comparisons, PICO-N designs are slightly more costly than hand-designed accelerators with the same performance.


International Symposium on Systems Synthesis | 1999

Automatic architectural synthesis of VLIW and EPIC processors

Shail Aditya; B. Ramakrishna Rau; Vinod Kathail

The paper describes a mechanism for the automatic design and synthesis of very long instruction word (VLIW) and, more generally, explicitly parallel instruction computing (EPIC) processor architectures starting from an abstract specification of their desired functionality. The process of architecture design makes concrete decisions regarding the number and types of functional units, the number of read/write ports on register files, the datapath interconnect, the instruction format, its decoding hardware, and the instruction unit datapath. The processor design is then automatically synthesized into a detailed structural model at the register transfer level in VHDL, along with an estimate of its area. The system also generates the corresponding detailed machine description and instruction format description that can be used to retarget a compiler and an assembler, respectively. All this is part of an overall design system, called Program-In, Chip-Out (PICO), which has the ability to perform automatic exploration of the architectural design space while customizing the architecture to a given application and making intelligent, quantitative cost-performance tradeoffs.


International Symposium on Microarchitecture | 1994

Height reduction of control recurrences for ILP processors

Michael S. Schlansker; Vinod Kathail; Sadun Anik

The performance of applications executing on processors with instruction level parallelism is often limited by control and data dependences. Performance bottlenecks caused by dependences can frequently be eliminated through transformations which reduce the height of critical paths through the program. While height reduction techniques are not always helpful, their utility can be demonstrated in a broad range of important situations. This paper focuses on the height reduction of control recurrences within loops with data-dependent exits. Loops with exits are transformed so as to alleviate performance bottlenecks resulting from control dependences. A compilation approach to effect these transformations is described. The techniques presented in this paper, used in combination with prior work on reducing the height of data dependences, provide a comprehensive approach to accelerating loops with conditional exits. In many cases, loops with conditional exits provide a degree of parallelism traditionally associated with vectorization. Multiple iterations of a loop can be retired in a single cycle on a processor with adequate instruction level parallelism, with no cost in code redundancy. In more difficult cases, height reduction requires redundant computation or may not be feasible.
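As a hedged illustration of the idea (a sketch in the spirit of the abstract, not the paper's algorithm), the control recurrence of a loop with a data-dependent exit can be height-reduced by evaluating several exit tests speculatively and combining them, so one wide step retires multiple iterations:

```c
#include <assert.h>

/* Baseline: each iteration's exit test depends on reaching that
 * iteration, so the critical path grows by one test per element. */
int find_seq(const int *a, int n, int key) {
    for (int i = 0; i < n; i++)
        if (a[i] == key) return i;
    return -1;
}

/* Height-reduced sketch: four exit conditions are computed as
 * independent operations and OR-combined, so four iterations can
 * retire per step on a sufficiently wide machine. */
int find_wide(const int *a, int n, int key) {
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        int e0 = (a[i]     == key), e1 = (a[i + 1] == key);
        int e2 = (a[i + 2] == key), e3 = (a[i + 3] == key);
        if (e0 | e1 | e2 | e3)                 /* one level of OR logic */
            return e0 ? i : e1 ? i + 1 : e2 ? i + 2 : i + 3;
    }
    for (; i < n; i++)                         /* remainder iterations */
        if (a[i] == key) return i;
    return -1;
}
```

The redundancy here is the speculative evaluation of tests past the true exit, which matches the abstract's note that height reduction can require redundant computation.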


Design Automation for Embedded Systems | 1999

Machine-Description Driven Compilers for EPIC and VLIW Processors

B. Ramakrishna Rau; Vinod Kathail; Shail Aditya

In the past, due to the restricted gate count available on an inexpensive chip, embedded DSPs have had limited parallelism, few registers and irregular, incomplete interconnectivity. More recently, with increasing levels of integration, embedded VLIW processors have started to appear. Such processors typically have higher levels of instruction-level parallelism, more registers, and a relatively regular interconnect between the registers and the functional units. The central challenges faced by a code generator for an EPIC (Explicitly Parallel Instruction Computing) or VLIW processor are quite different from those for the earlier DSPs and, consequently, so is the structure of a code generator that is designed to be easily retargetable. In this paper, we explain the nature of the challenges faced by an EPIC or VLIW compiler and present a strategy for performing code generation in an incremental fashion that is best suited to generating high-quality code efficiently. We also describe the Operation Binding Lattice, a formal model for incrementally binding the opcodes and register assignments in an EPIC code generator. As we show, this reflects the phase structure of the EPIC code generator. It also defines the structure of the machine-description database, which is queried by the code generator for the information that it needs about the target processor. Lastly, we discuss our implementation of these ideas and techniques in Elcor, our EPIC compiler research infrastructure.
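The Operation Binding Lattice can be pictured as follows; this is a hedged sketch of the general idea of incremental binding (the bitmask representation and the phase masks are assumptions for illustration, not the paper's data structures):

```c
#include <assert.h>

/* Sketch: an operation carries a set of still-admissible
 * (opcode, register-file) alternatives; each compiler phase refines
 * the binding by intersecting the set with its own constraints.
 * A binding is complete when exactly one alternative remains. */
typedef unsigned Binding;              /* one bit per alternative */

Binding refine(Binding b, Binding allowed) {
    return b & allowed;                /* lattice meet: keep common alternatives */
}

int alternatives(Binding b) {
    int n = 0;
    for (; b; b >>= 1) n += b & 1;     /* count set bits */
    return n;
}

int fully_bound(Binding b)      { return alternatives(b) == 1; }
int over_constrained(Binding b) { return b == 0; }   /* phases must avoid this */
```

The point of the lattice view is that phase ordering becomes a sequence of monotone refinements: each phase may only narrow the set, never widen it.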


Languages and Compilers for Parallel Computing | 1993

Acceleration of First and Higher Order Recurrences on Processors with Instruction Level Parallelism

Michael S. Schlansker; Vinod Kathail

This paper describes parallelization techniques for accelerating a broad class of recurrences on processors with instruction level parallelism. We introduce a new technique, called blocked back-substitution, which has lower operation count and higher performance than previous methods. The blocked back-substitution technique requires unrolling and non-symmetric optimization of innermost loop iterations. We present metrics to characterize the performance of software-pipelined loops and compare these metrics for a range of height reduction techniques and processor architectures.
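A minimal sketch of back-substitution on a first-order recurrence x[i] = a*x[i-1] + b[i] (illustrative only; the paper's blocked technique also covers higher-order recurrences and non-symmetric unrolling not shown here). Substituting the recurrence into itself four times gives a closed form for the last element of each block, so the serial chain advances by one long step per four iterations:

```c
#include <assert.h>

/* Sequential evaluation: one dependent multiply-add per element. */
void rec_seq(int a, const int *b, int *x, int n) {
    int s = 0;                         /* assume x[-1] = 0 */
    for (int i = 0; i < n; i++)
        x[i] = s = a * s + b[i];
}

/* Back-substituted by blocks of 4: the b-terms of a block combine in
 * a tree independently of the carried value s, leaving only one
 * dependent operation (a4*s + t) per block on the critical path. */
void rec_blocked(int a, const int *b, int *x, int n) {
    int a2 = a * a, a3 = a2 * a, a4 = a3 * a;
    int s = 0, i = 0;
    for (; i + 4 <= n; i += 4) {
        int t = a3 * b[i] + a2 * b[i + 1] + a * b[i + 2] + b[i + 3];
        x[i]     = a * s        + b[i];         /* filled in off the */
        x[i + 1] = a * x[i]     + b[i + 1];     /* critical path     */
        x[i + 2] = a * x[i + 1] + b[i + 2];
        s = x[i + 3] = a4 * s + t;              /* the carried chain */
    }
    for (; i < n; i++)                          /* remainder */
        x[i] = s = a * s + b[i];
}
```

The extra multiplies (a2, a3, a4 and the tree for t) are the higher operation count the abstract alludes to; the payoff is a four-fold shorter dependence chain.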


International Journal of Parallel Programming | 1996

Parallelization of Control Recurrences for ILP Processors

Michael S. Schlansker; Vinod Kathail; Sadun Anik

The performance of applications executing on processors with instruction level parallelism is often limited by control and data dependences. Performance bottlenecks caused by dependences can frequently be eliminated through transformations which reduce the height of critical paths through the program. The utility of these techniques can be demonstrated in an increasingly broad range of important situations. This paper focuses on the height reduction of control recurrences within loops with data-dependent exits. Loops with exits are transformed so as to alleviate performance bottlenecks resulting from control dependences. A compilation approach to effect these transformations is described. The techniques presented in this paper, used in combination with prior work on reducing the height of data dependences, provide a comprehensive approach to accelerating loops with conditional exits. In many cases, loops with conditional exits provide a degree of parallelism traditionally associated with vectorization. Multiple iterations of a loop can be retired in a single cycle on a processor with adequate instruction level parallelism, with no cost in code redundancy. In more difficult cases, height reduction requires redundant computation or may not be feasible.


Archive | 2008

Algorithmic Synthesis Using PICO

Shail Aditya; Vinod Kathail

The increasing SoC complexity and a relentless pressure to reduce time-to-market have left hardware and system designers with an enormous design challenge. The bulk of the effort in designing an SoC is focused on the design of product-defining application engines such as video codecs and wireless modems. Automatic synthesis of such application engines from a high-level algorithmic description can significantly reduce both design time and design cost. This chapter reviews high-level requirements for such a system and then describes the PICO (Program-In, Chip-Out) system, which provides an integrated framework for the synthesis and verification of application engines from high-level C algorithms. PICO's novel approach relies on aggressive compiler technology, a parallel execution model based on Kahn process networks, and a carefully designed hardware architecture template that is cost-efficient, provides high performance, and is sensitive to circuit-level and system-level design constraints. PICO addresses the complete hardware design flow, including architecture exploration, RTL design, RTL verification, system validation, and system integration. For a large class of modern embedded applications, PICO's approach has been shown to yield extremely competitive designs at a fraction of the resources used traditionally, thereby closing the proverbial design productivity gap.
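The Kahn-process-network execution model the abstract mentions can be sketched minimally in C (a sequential emulation under assumed names and sizes; this is not PICO's API): processes communicate only through FIFO channels and read them in order, which is what makes the network's result deterministic regardless of scheduling.

```c
#include <assert.h>

#define CAP 32                              /* channel capacity (assumption) */

typedef struct { int buf[CAP]; int head, tail; } Fifo;

void put(Fifo *f, int v)  { f->buf[f->tail++ % CAP] = v; }
int  get(Fifo *f)         { return f->buf[f->head++ % CAP]; }
int  avail(const Fifo *f) { return f->tail - f->head; }

/* Process 1: a source writing 8 tokens to its output channel. */
void producer(Fifo *out) {
    for (int i = 0; i < 8; i++)
        put(out, i);
}

/* Process 2: consumes every token exactly once, in order.  In a true
 * KPN, get() would block on an empty channel; here we emulate that by
 * draining the tokens that are available. */
int consumer(Fifo *in) {
    int sum = 0;
    while (avail(in) > 0) {
        int v = get(in);
        sum += v * v;
    }
    return sum;
}
```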


International Symposium on Microarchitecture | 1998

Meld scheduling: a technique for relaxing scheduling constraints

Santosh G. Abraham; Vinod Kathail; Brian L. Deitrich

Meld scheduling melds the schedules of neighboring scheduling regions to respect latencies of operations issued in one region but completing after control transfers to the other. In contrast, conventional schedulers ignore latency constraints from other regions, leading to potentially avoidable stalls in an interlocked (superscalar) machine or incorrect schedules for noninterlocked (VLIW) machines. Alternatively, schedulers that conservatively require all operations to complete before the branch takes effect produce inefficient schedules. In this paper, we present general data structures for maintaining latency constraint information at region boundaries. We present a meld scheduling algorithm for noninterlocked processors that generates latency constraints at the boundaries of scheduled regions and utilizes this information during the scheduling of other regions. We present a range of design options and describe the reasons behind our particular choices. We evaluate the performance of meld scheduling on a range of machine models on a set of SPEC92 and UNIX benchmarks.
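The boundary constraints can be sketched with a little arithmetic (an illustrative model, not the paper's data structures): an operation issued at cycle `issue` with latency `lat` in the predecessor region still needs `issue + lat - exit_cycle` cycles after control leaves at `exit_cycle`, and the successor region must not schedule a consumer of its result earlier than that residual.

```c
#include <assert.h>

/* Residual latency a predecessor region exports across its exit for
 * one operation's result.  Zero means the result is already complete
 * when control transfers, so it imposes no constraint. */
int residual(int issue, int lat, int exit_cycle) {
    int r = issue + lat - exit_cycle;
    return r > 0 ? r : 0;
}

/* Earliest cycle at which the successor region may read a value,
 * given the residual constraints arriving from all predecessors:
 * the meld must honor the worst (largest) one. */
int earliest_read(const int *residuals, int n) {
    int e = 0;
    for (int i = 0; i < n; i++)
        if (residuals[i] > e) e = residuals[i];
    return e;
}
```

Ignoring these residuals is what produces the stalls (interlocked machines) or incorrect schedules (noninterlocked machines) that the abstract contrasts meld scheduling against.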
