Publication


Featured research published by Ramesh Peri.


Symposium on Code Generation and Optimization | 2007

Shadow Profiling: Hiding Instrumentation Costs with Parallelism

Tipp Moseley; Alex Shye; Vijay Janapa Reddi; Dirk Grunwald; Ramesh Peri

In profiling, a tradeoff exists between information and overhead. For example, hardware-sampling profilers incur negligible overhead, but the information they collect is consequently very coarse. Other profilers use instrumentation tools to gather temporal traces such as path profiles and hot memory streams, but they have high overhead. Runtime and feedback-directed compilation systems need detailed information to aggressively optimize, but the cost of gathering profiles can outweigh the benefits. Shadow profiling is a novel method for sampling long traces of instrumented code in parallel with normal execution, taking advantage of the trend of increasing numbers of cores. Each instrumented sample can be many millions of instructions in length. The primary goal is to incur negligible overhead, yet attain profile information that is nearly as accurate as a perfect profile. The profiler requires no modifications to the operating system or hardware, and is tunable to allow for greater coverage or lower overhead. We evaluate the performance and accuracy of this new profiling technique for two common types of instrumentation-based profiles: interprocedural path profiling and value profiling. Overall, profiles collected using the shadow profiling framework are 94% accurate versus perfect value profiles, while incurring less than 1% overhead. Consequently, this technique increases the viability of dynamic and continuous optimization systems by hiding the high overhead of instrumentation and enabling the online collection of many types of profiles that were previously too costly.


International Symposium on Computer Architecture | 2011

Demand-driven software race detection using hardware performance counters

Joseph L. Greathouse; Zhiqiang Ma; Matthew I. Frank; Ramesh Peri; Todd M. Austin

Dynamic data race detectors are an important mechanism for creating robust parallel programs. Software race detectors instrument the program under test, observe each memory access, and watch for inter-thread data sharing that could lead to concurrency errors. While this method of bug hunting can find races that are normally difficult to observe, it also suffers from high runtime overheads. It is not uncommon for commercial race detectors to experience 300× slowdowns, limiting their usage. This paper presents a hardware-assisted demand-driven race detector. We are able to observe cache events that are indicative of data sharing between threads by taking advantage of hardware available on modern commercial microprocessors. We use these to build a race detector that is only enabled when it is likely that inter-thread data sharing is occurring. When little sharing takes place, this demand-driven analysis is much faster than contemporary continuous-analysis tools without a large loss of detection accuracy. We modified the race detector in Intel® Inspector XE to utilize our hardware-based sharing indicator and were able to achieve performance increases of 3× and 10× in two parallel benchmark suites and 51× for one particular program.
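The demand-driven idea in this abstract can be illustrated with a toy simulation. This is only a sketch under stated assumptions, not the authors' implementation: the `last_writer` dictionary stands in for the hardware cache events the paper relies on, and the function name `demand_driven_filter`, the access tuple layout, and the `window` parameter are all hypothetical.

```python
def demand_driven_filter(accesses, window=2):
    """Return the indices of accesses a costly detector would analyze.

    accesses: list of (thread_id, address, is_write) tuples (assumed layout).
    The detector is switched on for `window` accesses whenever a sharing
    indicator fires, and stays off otherwise.
    """
    last_writer = {}  # address -> thread that last wrote it (software stand-in
                      # for a hardware sharing indicator)
    analyzed = []
    remaining = 0     # accesses left before the detector turns off again
    for i, (tid, addr, is_write) in enumerate(accesses):
        # Indicator: this thread touches data last written by another thread.
        shared = addr in last_writer and last_writer[addr] != tid
        if shared:
            remaining = window
        if remaining > 0:
            analyzed.append(i)
            remaining -= 1
        if is_write:
            last_writer[addr] = tid
    return analyzed


trace = [
    (0, 0x10, True),   # thread 0 writes private data
    (0, 0x10, False),
    (0, 0x20, True),
    (1, 0x10, False),  # thread 1 reads thread 0's data: indicator fires
    (1, 0x30, True),   # still inside the analysis window
    (1, 0x30, False),  # window expired, detector off again
]
print(demand_driven_filter(trace))
```

In this toy trace only two of the six accesses reach the detector, which is the source of the speedup when little sharing takes place.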


International Symposium on Performance Analysis of Systems and Software | 2015

Mosaic: cross-platform user-interaction record and replay for the fragmented android ecosystem

Matthew Halpern; Yuhao Zhu; Ramesh Peri; Vijay Janapa Reddi

In contrast to traditional computing systems, such as desktops and servers, that are programmed to perform “compute-bound” and “run-to-completion” tasks, mobile applications are designed for user interactivity. Factoring user interactivity into computer system design and evaluation is important, yet poses many challenges. In particular, systematically studying interactive mobile applications across the diverse set of mobile devices available today is difficult due to the mobile device fragmentation problem. At the time of writing, there are 18,796 distinct Android mobile devices on the market, and that number will only continue to increase in the future. Differences in screen sizes, resolutions and operating systems impose different interactivity requirements, making it difficult to uniformly study these systems. We present Mosaic, a cross-platform, timing-accurate record and replay tool for Android-based mobile devices. Mosaic overcomes device fragmentation through a novel virtual screen abstraction. User interactions are translated from a physical device into a platform-agnostic intermediate representation before translation to a target system. The intermediate representation is human-readable, which allows Mosaic users to modify previously recorded traces or even synthesize their own user interactive sessions from scratch. We demonstrate that Mosaic allows user interaction traces to be recorded on emulators, smartphones, tablets, and development boards and replayed on other devices. Using Mosaic we were able to replay 45 different Google Play applications across multiple devices, and also show that we can perform cross-platform performance comparisons between two different processors under identical user interactions.
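One way to picture a virtual screen abstraction is to normalize touch coordinates to a device-independent unit square and scale them back for the target device. The sketch below assumes exactly that; the abstract does not describe Mosaic's actual intermediate representation, so the `Screen` type, the `(time_ms, x, y)` event layout, and the function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Screen:
    width: int   # pixels
    height: int  # pixels


def to_virtual(x, y, src):
    """Map device pixels to a platform-agnostic unit square."""
    return x / src.width, y / src.height


def to_device(vx, vy, dst):
    """Map unit-square coordinates to pixels on the target device."""
    return round(vx * dst.width), round(vy * dst.height)


def translate(events, src, dst):
    """Translate (time_ms, x, y) touch events from src to dst.

    Timestamps pass through unchanged so the replay keeps the recorded timing.
    """
    translated = []
    for time_ms, x, y in events:
        vx, vy = to_virtual(x, y, src)
        translated.append((time_ms, *to_device(vx, vy, dst)))
    return translated


phone = Screen(1080, 1920)
tablet = Screen(1600, 2560)
recorded = [(0, 540, 960), (120, 0, 0)]
print(translate(recorded, phone, tablet))
```

A tap at the center of the phone screen lands at the center of the tablet screen, regardless of resolution or aspect ratio.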


ACM SIGARCH Computer Architecture News | 2005

Transparent debugging of dynamically instrumented programs

Naveen Kumar; Ramesh Peri

Dynamic instrumentation systems, used for program analysis, bug isolation, software security and simulations, are becoming increasingly popular. There exists a need to debug dynamically instrumented programs while keeping the presence of the dynamic instrumentation system hidden from debug users. Existing debuggers use debug information in program binaries that have been generated by a compiler at static compile time, to provide their debug support. Since dynamic instrumentation systems generate program code at run-time, existing debuggers are not able to provide the same kind of debug support. The most comprehensive existing debug techniques that satisfy this need, used by Tdb, require modification of existing debuggers. This paper provides an OS level approach that silently intercepts the communication between a debugger and the OS and uses a set of debug specifications to provide Tdb's transparent debugging. As a result, any existing debugger can be used to debug dynamically instrumented programs. The proposed techniques have been implemented on the x86/Linux platform for the dynamic instrumentation system Pin.


International Conference on Energy Aware Computing | 2012

What is eating up battery life on my SmartPhone: A case study

Grace Metri; Abhishek Agrawal; Ramesh Peri; Weisong Shi

Smartphones have emerged as the new necessary gadget for many. A smartphone can combine some or all functionalities of several other devices such as a personal computer, phone, personal game console, music player, radio, and/or GPS. Unlike most of the above listed technologies, which are switched ON on an as-needed basis, a smartphone is always ON. Since a smartphone can run background tasks even during idle mode and since it is limited by its battery life, it becomes necessary to understand what really happens in the background and how it affects the battery life and consequently how to improve it. To this end, we analyzed two smartphone platforms specifically looking at how the energy consumption varies depending on the background applications and network connection type. For instance, we show that you can increase the energy efficiency of an iPhone by up to 59% when streaming music on Wi-Fi as opposed to 3G. Also, when the phone is using 3G, we show that network applications running in the background can reduce the energy efficiency of an iPhone by up to 72% when compared to real idle state. Our observations shed light on what is eating up the battery life of a smartphone and led us to provide optimization techniques to increase the battery life.


IEEE International Symposium on Workload Characterization | 2007

Seekable Compressed Traces

Tipp Moseley; Dirk Grunwald; Ramesh Peri

Program traces are commonly used for purposes such as profiling, processor simulation, and program slicing. Uncompressed, these traces are often too large to exist on disk. Although existing trace compression algorithms achieve high compression rates, they sacrifice the accessibility of uncompressed traces; typical compressed traces must be traversed linearly to reach a desired position in the stream. This paper describes seekable compressed traces that allow arbitrary positioning in the compressed data stream. Furthermore, we enhance existing value prediction based techniques to achieve higher compression rates, particularly for difficult-to-compress traces. Our base algorithm achieves a harmonic mean compression rate for SPEC2000 memory address traces that is 3.47 times better than existing methods. We introduce the concept of seekpoints that enable fast seeking to positions evenly distributed throughout a compressed trace. Adding seekpoints enables rapid sampling and backwards traversal of compressed traces. At a granularity of every 10 M instructions, seekpoints only increase trace sizes by an average factor of 2.65.
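The seekpoint idea can be sketched by compressing a trace in independent blocks and recording where each block starts, so that reaching a position only requires decompressing one block. This is an illustration only: it uses `zlib` as a stand-in for the paper's value-prediction-based compressor, and the 64-bit record format, block size, and function names are assumptions.

```python
import struct
import zlib

RECORD = struct.Struct("<Q")  # one 64-bit address per record (assumed format)


def build_seekable_trace(addresses, records_per_block=4):
    """Compress the trace block by block; return (blob, seekpoints).

    seekpoints[i] is the byte offset in blob where block i starts, with one
    extra sentinel entry marking the end of the last block.
    """
    blob = bytearray()
    seekpoints = []
    for start in range(0, len(addresses), records_per_block):
        chunk = addresses[start:start + records_per_block]
        raw = b"".join(RECORD.pack(a) for a in chunk)
        seekpoints.append(len(blob))
        blob += zlib.compress(raw)
    seekpoints.append(len(blob))
    return bytes(blob), seekpoints


def read_record(blob, seekpoints, index, records_per_block=4):
    """Fetch one record by decompressing only the block that contains it."""
    block = index // records_per_block
    begin, end = seekpoints[block], seekpoints[block + 1]
    raw = zlib.decompress(blob[begin:end])
    offset = (index % records_per_block) * RECORD.size
    return RECORD.unpack_from(raw, offset)[0]


trace = [0x1000 + 8 * i for i in range(10)]
blob, seekpoints = build_seekable_trace(trace)
print(hex(read_record(blob, seekpoints, 7)))
```

Because each block is compressed independently, backwards traversal and sampling amount to choosing a different seekpoint. The cost is the trade-off the abstract mentions: smaller blocks mean faster seeks but a larger compressed trace.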


Workshop on I/O in Parallel and Distributed Systems | 2008

Software development tools for multi-core/parallel programming

Ramesh Peri

The new era of multi-core processors is bringing unprecedented computing power to mainstream desktop applications. In order to fully exploit this compute power, one has to delve into the world of parallel programming, which until today has been the exclusive domain of the high-performance computing community. This talk will focus on the current state of the art in parallel programming tools that are applicable for developers of mainstream parallel applications, with emphasis on software development tools like compilers, debuggers, performance analysis tools and correctness checking tools for parallel programs. I will share some of the challenges that developers face today in developing applications for multi-core systems containing a small number of homogeneous cores (2 to 8) and discuss the situation we will face with the advent of systems containing many more heterogeneous cores in the next few years.


International Conference on Parallel Architectures and Compilation Techniques | 2009

Chainsaw: Using Binary Matching for Relative Instruction Mix Comparison

Tipp Moseley; Dirk Grunwald; Ramesh Peri

With advances in hardware, instruction set architectures are undergoing continual evolution. As a result, compilers are under constant pressure to adapt and take full advantage of available features. However, current techniques for evaluating relative compiler performance only compare profiles at the application level, ignoring relative performance differences at finer granularities. To ensure that new features are put to good use, a more rigorous approach is necessary. A fundamental step in tuning compiler performance is identifying the specific examples that can be improved. To solve this problem, we present a compiler-independent binary matching technique to compare executions of differently compiled programs and identify intervals where the behavior can be meaningfully compared. Matched intervals can be automatically analyzed to identify anomalous segments of execution where one version performs significantly differently versus another. We present case studies using Chainsaw to identify significant performance anomalies between differently compiled codes.


International Symposium on Computer Architecture | 2004

Addressing mode driven low power data caches for embedded processors

Ramesh Peri; John Fernando; Ravi Kolagotla

The size and speed of first-level caches and SRAMs of embedded processors continue to increase in response to demands for higher performance. In power-sensitive devices like PDAs and cellular handsets, decreasing power consumption while increasing performance is desirable. Contemporary caches typically exploit locality in memory access patterns but do not exploit locality information encoded in the addressing modes used to access memory. We present two schemes that use locality information inherent in memory addressing modes to reduce the power consumption of the cache or SRAM nearest to the processor. The level-0 data buffer scheme introduces a set of data buffers controlled by the addressing mode to eliminate over a third of all reads to the next level of memory (cache or SRAM). These buffers can also reduce the load-use penalty in processors with long load pipelines. The address register tag-buffer scheme exploits the addressing mode to reduce tag array look-ups in set-associative first-level caches.


Computing Frontiers | 2007

Identifying potential parallelism via loop-centric profiling

Tipp Moseley; Daniel A. Connors; Dirk Grunwald; Ramesh Peri

Collaboration


Dive into Ramesh Peri's collaborations.

Top Co-Authors

Dirk Grunwald

University of Colorado Boulder

Tipp Moseley

University of Colorado Boulder

Grace Metri

Wayne State University

Naveen Kumar

University of Pittsburgh

Vijay Janapa Reddi

University of Texas at Austin

Weisong Shi

Wayne State University
