Harish Patil
Intel
Publications
Featured research published by Harish Patil.
Programming Language Design and Implementation | 2005
Chi-Keung Luk; Robert Cohn; Robert Muth; Harish Patil; Artur Klauser; P. Geoffrey Lowney; Steven Wallace; Vijay Janapa Reddi; Kim M. Hazelwood
Robust and powerful software instrumentation tools are essential for program analysis tasks such as profiling, performance evaluation, and bug detection. To meet this need, we have developed a new instrumentation system called Pin. Our goals are to provide easy-to-use, portable, transparent, and efficient instrumentation. Instrumentation tools (called Pintools) are written in C/C++ using Pin's rich API. Pin follows the model of ATOM, allowing the tool writer to analyze an application at the instruction level without the need for detailed knowledge of the underlying instruction set. The API is designed to be architecture independent whenever possible, making Pintools source compatible across different architectures. However, a Pintool can access architecture-specific details when necessary. Instrumentation with Pin is mostly transparent as the application and Pintool observe the application's original, uninstrumented behavior. Pin uses dynamic compilation to instrument executables while they are running. For efficiency, Pin uses several techniques, including inlining, register re-allocation, liveness analysis, and instruction scheduling, to optimize instrumentation. This fully automated approach delivers significantly better instrumentation performance than similar tools. For example, Pin is 3.3x faster than Valgrind and 2x faster than DynamoRIO for basic-block counting. To illustrate Pin's versatility, we describe two Pintools in daily use to analyze production software. Pin is publicly available for Linux platforms on four architectures: IA32 (32-bit x86), EM64T (64-bit x86), Itanium®, and ARM. In the ten months since Pin 2 was released in July 2004, there have been over 3000 downloads from its website.
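To give a flavor of the API described above, the sketch below is a minimal instruction-counting Pintool, modeled closely on the classic inscount example distributed with the Pin kit; build details vary by kit version, so treat it as illustrative rather than definitive.

```cpp
// Minimal Pintool: counts every executed instruction.
// Modeled on the "inscount" example from the Pin kit.
#include "pin.H"
#include <iostream>

static UINT64 icount = 0;

// Analysis routine: runs before every executed instruction.
static VOID DoCount() { icount++; }

// Instrumentation routine: Pin calls this once per instruction
// the first time that instruction is translated.
static VOID Instruction(INS ins, VOID *v)
{
    INS_InsertCall(ins, IPOINT_BEFORE, (AFUNPTR)DoCount, IARG_END);
}

static VOID Fini(INT32 code, VOID *v)
{
    std::cerr << "Instructions executed: " << icount << std::endl;
}

int main(int argc, char *argv[])
{
    if (PIN_Init(argc, argv)) return 1;        // parse Pin's command line
    INS_AddInstrumentFunction(Instruction, 0);
    PIN_AddFiniFunction(Fini, 0);
    PIN_StartProgram();                        // never returns
    return 0;
}
```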
International Symposium on Microarchitecture | 2004
Harish Patil; Robert Cohn; Mark J. Charney; Rajiv Kapoor; Andrew Sun; Anand Karunanidhi
Detailed modeling of the performance of commercial applications is difficult. The applications can take a very long time to run on real hardware and it is impractical to simulate them to completion on performance models. Furthermore, these applications have complex execution environments that cannot easily be reproduced on a simulator, making porting the applications to simulators difficult. We attack these problems using the well-known SimPoint methodology to find representative portions of an application to simulate, and a dynamic instrumentation framework called Pin to avoid porting altogether. Our system uses dynamic instrumentation instead of simulation to find representative portions - called PinPoints - for simulation. We have developed a toolkit that automatically detects PinPoints, validates whether they are representative using hardware performance counters, and generates traces for large Itanium® programs. We compared SimPoint-based selection to random selection of simulation points. We found that for 95% of the SPEC2000 programs we tested, the PinPoints prediction was within 8% of the actual whole-program CPI, as opposed to 18% for random selection. We measure the end-to-end error, comparing real hardware to a performance model, and have a simple and efficient methodology to determine the step that introduced the error. Finally, we evaluate the system in the context of multiple configurations of real hardware, commercial applications, and industrial-strength performance models to understand the behavior of a complete and practical workload collection system. We have successfully used our system with many commercial Itanium® programs, some running for trillions of instructions, and have used the resulting traces for predicting performance of those applications on future Itanium processors.
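The SimPoint step rests on basic-block vectors: per-interval counts of how often each basic block executes. Below is a minimal sketch, not the actual PinPoints toolkit, of collecting such vectors with Pin's trace-level API; the interval size, output file, and format are our own choices, and it assumes a single-threaded workload.

```cpp
// Sketch: per-interval basic-block vector (BBV) collection with Pin.
// SimPoint clusters these vectors to pick representative intervals.
#include "pin.H"
#include <fstream>
#include <map>

static const UINT64 INTERVAL = 100000000;   // instructions per interval (our choice)
static UINT64 insCount = 0;
static std::map<ADDRINT, UINT64> bbv;       // block address -> instructions executed
static std::ofstream out("bbv.txt");

// Analysis routine: credit this block's instructions to the current interval.
static VOID CountBlock(ADDRINT addr, UINT32 numIns)
{
    bbv[addr] += numIns;
    insCount += numIns;
    if (insCount >= INTERVAL) {             // interval done: emit and reset
        for (std::map<ADDRINT, UINT64>::iterator it = bbv.begin(); it != bbv.end(); ++it)
            out << std::hex << it->first << ":" << std::dec << it->second << " ";
        out << "\n";
        bbv.clear();
        insCount = 0;
    }
}

// Instrumentation routine: one callback per basic block in each trace.
static VOID Trace(TRACE trace, VOID *v)
{
    for (BBL bbl = TRACE_BblHead(trace); BBL_Valid(bbl); bbl = BBL_Next(bbl))
        BBL_InsertCall(bbl, IPOINT_BEFORE, (AFUNPTR)CountBlock,
                       IARG_ADDRINT, BBL_Address(bbl),
                       IARG_UINT32, BBL_NumIns(bbl), IARG_END);
}

int main(int argc, char *argv[])
{
    if (PIN_Init(argc, argv)) return 1;
    TRACE_AddInstrumentFunction(Trace, 0);
    PIN_StartProgram();
    return 0;
}
```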
Symposium on Code Generation and Optimization | 2010
Harish Patil; Cristiano Pereira; Mack Stallcup; Gregory Lueck; James Cownie
Analysis of parallel programs is hard mainly because their behavior changes from run to run. We present an execution capture and deterministic replay system that enables repeatable analysis of parallel programs. Our goal is to provide an easy-to-use framework for capturing, deterministically replaying, and analyzing execution of large programs with reasonable runtime and disk usage. Our system, called PinPlay, is based on the popular Pin dynamic instrumentation system and hence is very easy to use. PinPlay extends the capability of Pin-based analysis by providing a tool for capturing one execution instance of a program (as log files called pinballs) and by allowing Pin-based tools to run off the captured execution. Most Pintools can be trivially modified to work off pinballs, performing their usual analysis but with guaranteed repeatability. Furthermore, capture and replay work across operating systems (Windows to Linux) as the pinball format is independent of the operating system. We have used PinPlay to analyze and deterministically debug large parallel programs running trillions of instructions. This paper describes the design of PinPlay and its applications for analyses such as simulation point selection, tracing, and debugging.
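The core record/replay idea behind pinballs can be sketched independently of Pin: log every nondeterministic value on capture, and substitute the logged value on replay so the run repeats exactly. The class and file names below are our own illustration, not PinPlay's actual format.

```cpp
// Sketch of the record/replay kernel: values that can differ from run to
// run pass through Observe(), which logs them or replays them.
#include <cstdlib>
#include <ctime>
#include <fstream>
#include <iostream>
#include <string>

enum Mode { RECORD, REPLAY };

class ReplayLog {
    Mode mode;
    std::fstream log;
public:
    ReplayLog(Mode m, const char *path) : mode(m) {
        log.open(path, m == RECORD ? std::ios::out : std::ios::in);
    }
    // Route every nondeterministic value through here.
    long Observe(long live) {
        long v = live;
        if (mode == RECORD) log << live << "\n";  // capture the live value
        else                log >> v;             // replay the logged value
        return v;
    }
};

int main(int argc, char *argv[])
{
    Mode m = (argc > 1 && std::string(argv[1]) == "replay") ? REPLAY : RECORD;
    ReplayLog rl(m, "run.log");
    std::srand((unsigned)std::time(0));
    long x = rl.Observe(std::rand() % 100);   // a nondeterministic input
    std::cout << "value = " << x << "\n";     // identical on every replay
    return 0;
}
```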
Software: Practice and Experience | 1997
Harish Patil; Charles N. Fischer
Illegal pointer and array accesses are a major cause of failure for C programs. We present a technique called ‘guarding' to catch illegal array and pointer accesses. Our implementation of guarding for C programs works as a source‐to‐source translator. Auxiliary objects called guards are added to a user program to monitor pointer and array accesses at run time. Guards maintain attributes to catch out of bounds array accesses and accesses to deallocated memory. Our system has found a number of previously unreported errors in widely‐used Unix utilities and SPEC92 benchmarks. Many commonly used programs have bugs which may not always manifest themselves as a program crash, but may instead produce a subtly wrong answer. These programs are not routinely checked for run‐time errors because the increase in execution time due to run‐time checking can be very high. We present two techniques to handle the high cost of run‐time checking of pointer and array accesses in C programs: ‘customization' and ‘shadow processing'. Customization works by decoupling run‐time checking from the original computation. A user program is customized for guarding by throwing away computation not relevant for guarding. We have explored using program slicing for customization. Customization can cut the overhead of guarding by up to half. Shadow processing uses idle processors in multiprocessor workstations to perform run‐time checking in the background. A user program is instrumented to obtain a ‘main process' and a ‘shadow process'. The main process performs computations from the original program, occasionally communicating a few key values to the shadow process. The shadow process follows the main process, checking pointer and array accesses. The overhead to the main process, which is what the user sees, is very low: almost always less than 10%.
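A minimal sketch of the guard idea, rendered in C++ for brevity (the actual system is a source-to-source translator for C, so this is only an illustration): each allocation gets an auxiliary guard object recording its bounds and liveness, and every access is checked against it.

```cpp
// Sketch: a "fat pointer" carrying a guard object that records the bounds
// and liveness of its referent; every access is checked at run time.
#include <cstdio>
#include <cstdlib>

struct Guard {              // auxiliary object attached to each allocation
    char  *base;
    size_t size;
    bool   live;
};

struct GuardedPtr {
    char  *p;
    Guard *g;
    char &at(size_t i) const {
        if (!g->live || p + i < g->base || p + i >= g->base + g->size) {
            std::fprintf(stderr, "guard: illegal access caught\n");
            std::abort();
        }
        return p[i];
    }
};

GuardedPtr guarded_malloc(size_t n) {
    char *mem = static_cast<char *>(std::malloc(n));
    return GuardedPtr{mem, new Guard{mem, n, true}};
}

void guarded_free(GuardedPtr gp) {
    gp.g->live = false;      // later accesses through any copy are caught
    std::free(gp.g->base);
}

int main() {
    GuardedPtr a = guarded_malloc(4);
    a.at(3) = 'x';           // in bounds: fine
    a.at(4);                 // out of bounds: caught and reported
}
```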
IEEE Computer | 2010
Moshe Bach; Mark J. Charney; Robert Cohn; Elena Demikhovsky; Tevi Devor; Kim M. Hazelwood; Aamer Jaleel; Chi-Keung Luk; Gail Lyons; Harish Patil; Ady Tal
Software instrumentation provides the means to collect information on and efficiently analyze parallel programs. Using Pin, developers can build tools to detect and examine dynamic behavior including data races, memory system behavior, and parallelizable loops. Pin is a software system that performs runtime binary instrumentation of Linux and Microsoft Windows applications. Pin's aim is to provide an instrumentation platform for building a wide variety of program analysis tools, called pintools. By performing the instrumentation on the binary at runtime, Pin eliminates the need to modify or recompile the application's source and supports the instrumentation of programs that dynamically generate code.
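As an example of the memory-system analyses mentioned above, here is a sketch of a memory-address tracing Pintool, closely following the pinatrace example shipped with the Pin kit; a trace like this is the raw input for race detection or cache-behavior modeling.

```cpp
// Sketch of a memory-tracing Pintool, after Pin's "pinatrace" example:
// logs the instruction pointer and effective address of every load/store.
#include "pin.H"
#include <cstdio>

static FILE *trace;

static VOID RecordMemRead(VOID *ip, VOID *addr)  { fprintf(trace, "%p: R %p\n", ip, addr); }
static VOID RecordMemWrite(VOID *ip, VOID *addr) { fprintf(trace, "%p: W %p\n", ip, addr); }

static VOID Instruction(INS ins, VOID *v)
{
    // Predicated calls fire only when a predicated instruction executes.
    if (INS_IsMemoryRead(ins))
        INS_InsertPredicatedCall(ins, IPOINT_BEFORE, (AFUNPTR)RecordMemRead,
                                 IARG_INST_PTR, IARG_MEMORYREAD_EA, IARG_END);
    if (INS_IsMemoryWrite(ins))
        INS_InsertPredicatedCall(ins, IPOINT_BEFORE, (AFUNPTR)RecordMemWrite,
                                 IARG_INST_PTR, IARG_MEMORYWRITE_EA, IARG_END);
}

static VOID Fini(INT32 code, VOID *v) { fclose(trace); }

int main(int argc, char *argv[])
{
    if (PIN_Init(argc, argv)) return 1;
    trace = fopen("memtrace.out", "w");
    INS_AddInstrumentFunction(Instruction, 0);
    PIN_AddFiniFunction(Fini, 0);
    PIN_StartProgram();
    return 0;
}
```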
Measurement and Modeling of Computer Systems | 2006
Satish Narayanasamy; Cristiano Pereira; Harish Patil; Robert Cohn; Brad Calder
Modern architecture research relies heavily on application-level detailed pipeline simulation. A time-consuming part of building a simulator is correctly emulating the operating system effects, which is required even if the goal is to simulate just the application code, in order to achieve functional correctness of the application's execution. Existing application-level simulators require manually hand coding the emulation of each and every possible system effect (e.g., system call, interrupt, DMA transfer) that can impact the application's execution. Developing such an emulator for a given operating system is a tedious exercise, and maintaining it to support newer versions of that operating system can also be costly. Furthermore, porting the emulator to a completely different operating system might involve rebuilding it from scratch. In this paper, we describe a tool that can automatically log operating system effects to guide architecture simulation of application code. The benefits of our approach are: (a) we do not have to build or maintain any infrastructure for emulating the operating system effects, (b) we can support simulation of more complex applications on our application-level simulator, including those applications that use asynchronous interrupts, DMA transfers, etc., and (c) using the system effects logs collected by our tool, we can deterministically re-execute the application to guide architecture simulation that has reproducible results.
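A minimal sketch of what such a system-effect log might look like on the simulation side; the structure and names are our own, not the tool's actual format. Each entry records the bytes a system event wrote into application memory, so the simulator reinjects data instead of emulating the OS.

```cpp
// Sketch: a system-effect log entry records the memory a system event
// (syscall, interrupt, DMA) wrote, keyed by the instruction count at
// which it happened, so the simulator can replay the effect verbatim.
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

struct SysEffect {
    uint64_t icount;              // instruction count at which to inject
    uint64_t addr;                // target address in application memory
    std::vector<uint8_t> bytes;   // the data the OS wrote
};

// During simulation, apply every effect scheduled for this point.
void InjectEffects(const std::vector<SysEffect> &log, uint64_t icount, uint8_t *mem)
{
    for (size_t i = 0; i < log.size(); ++i)
        if (log[i].icount == icount)
            std::memcpy(mem + log[i].addr, log[i].bytes.data(), log[i].bytes.size());
}

int main() {
    uint8_t mem[64] = {0};                      // toy simulated memory
    std::vector<SysEffect> log;
    log.push_back(SysEffect{100, 8, {42, 43}}); // captured at record time
    InjectEffects(log, 100, mem);               // replayed by the simulator
    assert(mem[8] == 42 && mem[9] == 43);
}
```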
Symposium on Code Generation and Optimization | 2004
Chi-Keung Luk; Robert Muth; Harish Patil; Robert Cohn; P. Geoffrey Lowney
Ispike is a post-link optimizer developed for the Intel® Itanium Processor Family (IPF) processors. The IPF architecture poses both opportunities and challenges to post-link optimizations. IPF offers a rich set of performance counters to collect detailed profile information at a low cost, which is essential for post-link optimization to be practical. At the same time, the predication and bundling features on IPF make post-link code transformation more challenging than on other architectures. In Ispike, we have implemented optimizations like code layout, instruction prefetching, data layout, and data prefetching that exploit the IPF advantages, and strategies that cope with the IPF-specific challenges. Using SPEC CINT2000 as benchmarks, we show that Ispike improves performance by as much as 40% on the Itanium® 2 processor, with average improvements of 8.5% and 9.9% over executables generated by the Intel® Electron compiler and by the Gcc compiler, respectively. We also demonstrate that statistical profiles collected via IPF performance counters and complete profiles collected via instrumentation produce equal performance benefit, but the profiling overhead is significantly lower for performance counters.
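Of the optimizations listed, code layout is the easiest to sketch: given branch-edge counts from a profile, greedily chain blocks so the hottest edges become fall-throughs. The sketch below is a simplified take on classic profile-guided layout, our own illustration rather than Ispike's exact algorithm.

```cpp
// Sketch: greedy basic-block chaining from profiled edge counts, so the
// hottest branch edges fall through in the final code layout.
#include <algorithm>
#include <cstdio>
#include <map>
#include <vector>

struct Edge { int src, dst; long count; };

std::vector<int> Layout(int nblocks, std::vector<Edge> edges)
{
    std::sort(edges.begin(), edges.end(),
              [](const Edge &a, const Edge &b) { return a.count > b.count; });
    std::map<int, int> next, prev;            // chain links between blocks
    for (const Edge &e : edges) {             // consider hottest edges first
        if (next.count(e.src) || prev.count(e.dst) || e.src == e.dst) continue;
        int end = e.dst;                      // would this link close a cycle?
        while (next.count(end)) end = next[end];
        if (end == e.src) continue;
        next[e.src] = e.dst;
        prev[e.dst] = e.src;
    }
    std::vector<int> order;                   // emit chains head to tail
    for (int b = 0; b < nblocks; ++b) {
        if (prev.count(b)) continue;          // not a chain head
        for (int c = b; ; c = next[c]) {
            order.push_back(c);
            if (!next.count(c)) break;
        }
    }
    return order;
}

int main() {
    // Profile says block 0 mostly branches to 2, and 2 mostly to 1.
    std::vector<Edge> prof = {{0, 1, 10}, {0, 2, 90}, {2, 1, 80}};
    for (int b : Layout(3, prof)) std::printf("%d ", b);   // prints: 0 2 1
}
```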
International Conference on Supercomputing | 2002
Chi-Keung Luk; Robert Muth; Harish Patil; Richard Weiss; P. Geoffrey Lowney; Robert Cohn
Data prefetching is an effective approach to addressing the memory latency problem. While a few processors have implemented hardware-based data prefetching, the majority of modern processors support data-prefetch instructions and rely on compilers to automatically insert prefetches. However, most prefetching schemes in commercial compilers suffer from two limitations: (1) the source code must be available before prefetching can be applied, and (2) these prefetching schemes target only loops with statically-known strided accesses. In this study, we broaden the scope of software-controlled prefetching by addressing the above two limitations. We use profiling to discover strided accesses that frequently occur during program execution but are not determinable by the compiler. We then use the strides discovered to insert prefetches into the executable directly, without the need for re-compilation. Performance evaluation was done on an Alpha 21264-based system with a 64KB data cache and an 8MB secondary cache. We find that even with such large caches, our technique offers speedups ranging from 3% to 56% in 11 out of the 26 SPEC2000 benchmarks. Our technique has been incorporated into Pixie and Spike, two products in Compaq's Tru64 Unix.
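The profiling step can be sketched as a stride detector: for each load PC, track the difference between consecutive effective addresses and flag the load as a prefetch candidate once the stride repeats consistently. The confidence threshold and reporting below are our own choices, not the paper's.

```cpp
// Sketch: per-PC stride detection over an address profile. Loads whose
// stride repeats enough times in a row become prefetch candidates.
#include <cstdint>
#include <cstdio>
#include <map>

struct StrideState {
    uint64_t lastAddr = 0;
    int64_t  stride   = 0;
    int      hits     = 0;     // consecutive confirmations of the stride
};

std::map<uint64_t, StrideState> table;   // load PC -> detector state

void Observe(uint64_t pc, uint64_t addr)
{
    StrideState &s = table[pc];
    int64_t d = (int64_t)(addr - s.lastAddr);
    if (s.lastAddr != 0 && d == s.stride) s.hits++;
    else { s.stride = d; s.hits = 0; }       // stride changed: start over
    s.lastAddr = addr;
    if (s.hits == 8)                         // threshold: our own choice
        std::printf("PC %#llx: stride %lld -> prefetch candidate\n",
                    (unsigned long long)pc, (long long)s.stride);
}

int main() {
    for (uint64_t i = 0; i < 16; ++i)
        Observe(0x400100, 0x10000 + 64 * i);  // a load walking an array
}
```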
Programming Language Design and Implementation | 1999
Le Chun Wu; Rajiv Mirani; Harish Patil; Bruce A. Olsen; Wen-mei W. Hwu
With an increasing number of executable binaries generated by optimizing compilers today, providing a clear and correct source-level debugger for programmers to debug optimized code has become a necessity. In this paper, a new framework for debugging globally optimized code is proposed. This framework consists of a new code location mapping scheme, a data location tracking scheme, and an emulation-based forward recovery model. By taking over control early and emulating instructions selectively, the debugger can preserve and gather the required program state for the recovery of expected variable values at source breakpoints. The framework has been prototyped in the IMPACT compiler and GDB-4.16. Preliminary experiments conducted on several SPEC95 integer programs have yielded encouraging results. The extra time needed for the debugger to calculate the limits of the emulated region and to emulate instructions is hardly noticeable, while the increase in executable file size due to the extra debug information is on average 76% of that of the executable file with no debug information.
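The code-location mapping problem can be illustrated with a small sketch (our own, not the paper's exact scheme): after global optimization one source statement may map to several discontiguous instruction ranges, so a source breakpoint has to cover all of them.

```cpp
// Sketch: a code-location map for optimized code. A statement's code may
// be split across discontiguous ranges by global scheduling, so planting
// a breakpoint on a source line means covering every range.
#include <cstdint>
#include <cstdio>
#include <vector>

struct InsnRange { uint64_t lo, hi; };   // half-open [lo, hi)

struct StmtMap {
    int line;                            // source line
    std::vector<InsnRange> ranges;       // where its code ended up
};

// All addresses where a breakpoint on `line` must be planted.
std::vector<uint64_t> BreakpointAddrs(const std::vector<StmtMap> &map, int line)
{
    std::vector<uint64_t> addrs;
    for (const StmtMap &s : map)
        if (s.line == line)
            for (const InsnRange &r : s.ranges)
                addrs.push_back(r.lo);   // first instruction of each range
    return addrs;
}

int main() {
    // Line 42's code was split in two by global scheduling.
    std::vector<StmtMap> m = {{42, {{0x400010, 0x400018}, {0x400030, 0x400034}}}};
    for (uint64_t a : BreakpointAddrs(m, 42))
        std::printf("%#llx\n", (unsigned long long)a);
}
```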
Symposium on Code Generation and Optimization | 2014
Yan Wang; Harish Patil; Cristiano Pereira; Gregory Lueck; Rajiv Gupta; Iulian Neamtiu
We present a collection of tools, DrDebug, that greatly advances the state of the art in cyclic, interactive debugging of multi-threaded programs based upon the record and replay paradigm. The features of DrDebug significantly increase the efficiency of debugging by tailoring the scope of replay to a buggy execution region or an execution slice of a buggy region. In addition to supporting traditional debugger commands, DrDebug provides commands for recording, replaying, and dynamic slicing with several novel features. First, upon a user's request, a highly precise dynamic slice is computed that can then be browsed by the user by navigating the dynamic dependence graph with the assistance of our graphical user interface. Second, a dynamic slice of interest to the user can be used to compute an execution slice whose replay can then be carried out. Due to its narrow scope, the replay can be performed efficiently as execution of code segments that do not belong to the execution slice is skipped. We also provide the capability of allowing the user to step from the execution of one statement in the slice to the next while examining the values of variables. To the best of our knowledge, this capability cannot be found in any other slicing tool. We have also integrated DrDebug with the Maple tool that exposes bugs and records buggy executions for replay. Our experiments demonstrate DrDebug's practicality.
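The dynamic-slicing core can be sketched independently of DrDebug: record each executed statement's definition and uses, then walk the trace backward from the variable of interest, chasing reaching definitions. This toy version is our own and ignores control dependences, which a real slicer like DrDebug's also tracks.

```cpp
// Sketch: backward dynamic slicing over a recorded execution trace.
// Starting from a variable, collect every statement instance whose
// value flows into it via def-use chains.
#include <cstdio>
#include <map>
#include <set>
#include <string>
#include <vector>

struct Stmt { int id; std::string def; std::vector<std::string> uses; };

std::set<int> BackwardSlice(const std::vector<Stmt> &trace, const std::string &var)
{
    std::set<int> slice;
    std::set<std::string> wanted = {var};
    for (int i = (int)trace.size() - 1; i >= 0; --i) {    // newest first
        const Stmt &s = trace[i];
        if (wanted.count(s.def)) {
            slice.insert(s.id);
            wanted.erase(s.def);                          // reaching def found
            wanted.insert(s.uses.begin(), s.uses.end());  // now need its inputs
        }
    }
    return slice;
}

int main() {
    // Trace of: a=1; b=2; c=a+b; d=b*2
    std::vector<Stmt> t = {
        {1, "a", {}}, {2, "b", {}}, {3, "c", {"a", "b"}}, {4, "d", {"b"}},
    };
    for (int id : BackwardSlice(t, "c"))
        std::printf("stmt %d\n", id);   // prints 1, 2, 3 -- statement 4 is irrelevant
}
```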