
Publications


Featured research published by Vijay S. Pai.


ACM Symposium on Parallel Algorithms and Architectures | 1997

Using speculative retirement and larger instruction windows to narrow the performance gap between memory consistency models

Parthasarathy Ranganathan; Vijay S. Pai; Sarita V. Adve

This paper studies techniques to improve the performance of memory consistency models for shared-memory multiprocessors with ILP processors. The first part of this paper extends earlier work by studying the impact of two current hardware optimizations for memory consistency implementations (hardware-controlled non-binding prefetching and speculative load execution) on the performance of the processor consistency (PC) memory model. We find that the optimized implementation of PC performs significantly better than the best implementation of sequential consistency (SC) in some cases because PC relaxes the store-to-load ordering constraint of SC. Nevertheless, release consistency (RC) provides significant benefits over PC in some cases, because PC suffers from the negative effects of premature store prefetches and insufficient memory queue sizes. The second part of the paper proposes and evaluates a new technique, speculative retirement, to improve the performance of SC. Speculative retirement alleviates the impact of the store-to-load constraint of SC by allowing loads and subsequent instructions to speculatively commit or retire, even while a previous store is outstanding. Speculative retirement needs additional hardware support (in the form of a history buffer) to recover from possible consistency violations due to such speculative retires. With a 64-element history buffer, speculative retirement reduces the execution time gap between SC and PC to within 11% for all our applications on our base architecture; a significant, though reduced, gap still remains between SC and RC. The third part of our paper evaluates the interactions of the various techniques with larger instruction window sizes. When increasing instruction window size, initially, the previous best implementations of all models generally improve in performance due to increased load and store overlap. With further increases, the performance of PC and RC stabilizes while that of SC often degrades (due to negative effects of previous optimizations), widening the gap between the models. At low base instruction window sizes, speculative retirement is sometimes outperformed by an equivalent increase in instruction window size (because the latter also provides load overlap). However, beyond the point where RC stabilizes, speculative retirement gives comparable or better benefit than an equivalent instruction window increase, with possibly less complexity.


ACM Multimedia | 2007

Improving VoD server efficiency with bittorrent

Yung Ryn Choe; Derek L. Schuff; Jagadeesh M. Dyaberi; Vijay S. Pai

This paper presents and evaluates Toast, a scalable Video-on-Demand (VoD) streaming system that combines the popular BitTorrent peer-to-peer (P2P) file-transfer technology with a simple dedicated streaming server to decrease server load and increase client transfer speed. Toast includes a modified version of BitTorrent that supports streaming data delivery and that communicates with a VoD server when the desired data cannot be delivered in real time by other peers. The results show that the default BitTorrent download strategy is not well suited to the VoD environment because it fetches pieces of the desired video from other peers without regard to when those pieces will actually be needed by the media viewer. Instead, strategies should favor downloading pieces of content that will be needed earlier, decreasing the chances that the clients will be forced to get the data directly from the VoD server. Such strategies allow Toast to operate much more efficiently than simple unicast distribution, reducing data transfer demands by up to 70-90% if clients remain in the system as seeds after viewing their content. Toast thus extends the aggregate throughput capability of a VoD service, offloading work from the server onto the P2P network in a scalable and demand-driven fashion.

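The core scheduling idea in this paper, favoring pieces the viewer will need sooner over the default BitTorrent strategy, can be illustrated with a small sketch. This is a hypothetical illustration, not Toast's actual code; the function name `pick_piece` and its parameters are invented for the example.

```python
# Hypothetical sketch of deadline-aware piece selection for streaming P2P,
# in the spirit of the strategy described above (not Toast's actual code).

def pick_piece(missing, available, playback_pos, window=16):
    """Choose the next piece to request from peers.

    missing      -- set of piece indices the client still needs
    available    -- set of piece indices some peer can currently serve
    playback_pos -- index of the piece the viewer will consume next
    window       -- how far ahead of playback to prioritize
    """
    candidates = missing & available
    if not candidates:
        return None
    # Urgent: pieces inside the playback window, earliest deadline first.
    urgent = [p for p in candidates if playback_pos <= p < playback_pos + window]
    if urgent:
        return min(urgent)
    # Otherwise fall back to any candidate (a rarest-first policy could go here).
    return min(candidates)

# Pieces 3 and 40 are both available; piece 3 is needed sooner, so it wins.
print(pick_piece({3, 40}, {3, 40, 99}, playback_pos=2))  # 3
```

A client that misses its window deadline would fall back to fetching the urgent piece directly from the VoD server, which is the case this policy tries to minimize.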

International Symposium on Computer Architecture | 1998

Analytic evaluation of shared-memory systems with ILP processors

Daniel J. Sorin; Vijay S. Pai; Sarita V. Adve; Mary K. Vernon; David A. Wood

This paper develops and validates an analytical model for evaluating various types of architectural alternatives for shared-memory systems with processors that aggressively exploit instruction-level parallelism. Compared to simulation, the analytical model is many orders of magnitude faster to solve, yielding highly accurate system performance estimates in seconds. The model input parameters characterize the ability of an application to exploit instruction-level parallelism as well as the interaction between the application and the memory system architecture. A trace-driven simulation methodology is developed that allows these parameters to be generated over 100 times faster than with a detailed execution-driven simulator. Finally, this paper shows that the analytical model can be used to gain insights into application performance and to evaluate architectural design trade-offs.


IEEE Computer | 2003

Challenges in computer architecture evaluation

Kevin Skadron; Margaret Martonosi; David I. August; Mark D. Hill; David J. Lilja; Vijay S. Pai

Computer architecture research increasingly focuses on problems suited to the current evaluation infrastructure. Current limitations and trends in evaluation techniques are troublesome and could noticeably slow the rate of computer system innovation. New research is recommended to help make quantitative evaluation of computer systems manageable. We support research in the areas of simulation frameworks, benchmarking methodologies, analytic methods, and validation techniques.


Architectural Support for Programming Languages and Operating Systems | 1996

An evaluation of memory consistency models for shared-memory systems with ILP processors

Vijay S. Pai; Parthasarathy Ranganathan; Sarita V. Adve; Tracy Harton

Relaxed consistency models have been shown to significantly outperform sequential consistency for single-issue, statically scheduled processors with blocking reads. However, current microprocessors aggressively exploit instruction-level parallelism (ILP) using methods such as multiple issue, dynamic scheduling, and non-blocking reads. Researchers have conjectured that two techniques, hardware-controlled non-binding prefetching and speculative loads, have the potential to equalize the hardware performance of memory consistency models on such processors. This paper performs the first detailed quantitative comparison of several implementations of sequential consistency and release consistency optimized for aggressive ILP processors. Our results indicate that hardware prefetching and speculative loads dramatically improve the performance of sequential consistency. However, the gap between sequential consistency and release consistency depends on the cache write policy and the complexity of the cache-coherence protocol implementation. In most cases, release consistency significantly outperforms sequential consistency, but for two applications, the use of a write-back primary cache and a more complex cache-coherence protocol nearly equalizes the performance of the two models. We also observe that the existing techniques, which require on-chip hardware modifications, enhance the performance of release consistency only to a small extent. We propose two new software techniques, fuzzy acquires and selective acquires, to achieve more overlap than allowed by the previous implementations of release consistency. To enhance methods for overlapping acquires, we also propose a technique to eliminate control dependences caused by an acquire loop, using a small amount of off-chip hardware called the synchronization buffer.


International Conference on Parallel Architectures and Compilation Techniques | 2010

Accelerating multicore reuse distance analysis with sampling and parallelization

Derek L. Schuff; Milind Kulkarni; Vijay S. Pai

Reuse distance analysis is a well-established tool for predicting cache performance, driving compiler optimizations, and assisting visualization and manual optimization of programs. Existing reuse distance analysis methods either do not account for the effects of multithreading, or suffer severe performance penalties. This paper presents a sampled, parallelized method of measuring reuse distance profiles for multithreaded programs, modeling private and shared cache configurations. The sampling technique allows it to spend much of its execution in a fast low-overhead mode, and allows the use of a new measurement method since sampled analysis does not need to consider the full state of the reuse stack. This measurement method uses O(1) data structures that may be made thread-private, allowing parallelization to reduce overhead in analysis mode. The performance of the resulting system is analyzed for a diverse set of parallel benchmarks and shown to generate accurate output compared to non-sampled full analysis, as well as good results for the common application of locating low-locality code in the benchmarks, all with a performance overhead comparable to the best single-threaded analysis techniques.

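Reuse distance itself is simple to define: the number of distinct addresses touched between two accesses to the same address. A minimal single-threaded sketch makes the quantity concrete; note this naive version pays the per-access cost that the paper's sampled, O(1), parallelized method is designed to avoid, and the function name is invented for the example.

```python
# Minimal single-threaded reuse distance sketch (naive cost per access);
# the paper's sampled, parallel method avoids exactly this overhead.

def reuse_distances(trace):
    """Return the reuse distance of each access in an address trace.

    Distance = number of distinct addresses touched since the previous
    access to the same address; None (infinite) on a first access.
    """
    last_pos = {}   # address -> index of its previous access
    dists = []
    for i, addr in enumerate(trace):
        if addr in last_pos:
            # Count distinct addresses in the window since the last access.
            window = trace[last_pos[addr] + 1:i]
            dists.append(len(set(window)))
        else:
            dists.append(None)   # cold (infinite) reuse distance
        last_pos[addr] = i
    return dists

print(reuse_distances(['a', 'b', 'c', 'a', 'b']))  # [None, None, None, 2, 2]
```

A fully associative LRU cache of capacity C hits exactly on accesses whose reuse distance is less than C, which is why these profiles predict cache performance.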

High-Performance Computer Architecture | 1997

The impact of instruction-level parallelism on multiprocessor performance and simulation methodology

Vijay S. Pai; Parthasarathy Ranganathan; Sarita V. Adve

Current microprocessors exploit high levels of instruction-level parallelism (ILP) through techniques such as multiple issue, dynamic scheduling, and non-blocking reads. This paper presents the first detailed analysis of the impact of such processors on shared-memory multiprocessors using a detailed execution-driven simulator. Using this analysis, we also examine the validity of common direct-execution simulation techniques that employ previous-generation processor models to approximate ILP-based multiprocessors. We find that ILP techniques substantially reduce CPU time in multiprocessors, but are less effective in reducing memory stall time. Consequently, despite the presence of inherent latency-tolerating techniques in ILP processors, memory stall time becomes a larger component of execution time and parallel efficiencies are generally poorer in ILP-based multiprocessors than in previous-generation multiprocessors. Examining the validity of direct-execution simulators with previous-generation processor models, we find that, with appropriate approximations, such simulators can reasonably characterize the behavior of applications with poor overlap of read misses. However, they can be highly inaccurate for applications with high overlap of read misses. For our applications, the errors in execution time with these simulators range from 26% to 192% for the most commonly used model, and from -8% to 73% for the most accurate model.


High-Performance Computer Architecture | 1999

Improving the accuracy vs. speed tradeoff for simulating shared-memory multiprocessors with ILP processors

Murthy Durbhakula; Vijay S. Pai; Sarita V. Adve

Previous simulators for shared-memory architectures have imposed a large tradeoff between simulation accuracy and speed. Most such simulators model simple processors that do not exploit common instruction-level parallelism (ILP) features, consequently exhibiting large errors when used to model current systems. A few newer simulators model current ILP processors in detail, but we find them to be about ten times slower. We propose a new simulation technique, based on a novel adaptation of direct execution, that alleviates this accuracy vs. speed tradeoff. We compare the speed and accuracy of our new simulator, DirectRSIM, with three other simulators: RSIM (a detailed simulator for multiprocessors with ILP processors) and two representative simple-processor-based simulators. Compared to RSIM, on average, DirectRSIM is 3.6 times faster and exhibits a relative error of only 1.3% in total execution time. Compared to the simple-processor-based simulators, DirectRSIM is far superior in accuracy, and yet is only 2.7 times slower.


High-Performance Computer Architecture | 2005

An efficient programmable 10 gigabit Ethernet network interface card

Paul Willmann; Hyong-youb Kim; Scott Rixner; Vijay S. Pai

This paper explores the hardware and software mechanisms necessary for an efficient programmable 10 Gigabit Ethernet network interface card. Network interface processing requires support for the following characteristics: a large volume of frame data, frequently accessed frame metadata, and high frame rate processing. This paper proposes three mechanisms to improve programmable network interface efficiency. First, a partitioned memory organization enables low-latency access to control data and high-bandwidth access to frame contents from a high-capacity memory. Second, a distributed task-queue mechanism enables parallelization of frame processing across many low-frequency cores, while using software to maintain total frame ordering. Finally, the addition of two new atomic read-modify-write instructions reduces frame ordering overheads by 50%. Combining these hardware and software mechanisms enables a network interface card to saturate a full-duplex 10 Gb/s Ethernet link by utilizing 6 processor cores and 4 banks of on-chip SRAM operating at 166 MHz, along with external 500 MHz GDDR SDRAM.

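The problem this NIC design solves in software, letting frames be processed in parallel on many cores while still leaving the card in their original order, can be mimicked with a per-frame sequence number and an in-order release stage. This is a schematic software analogy, not the card's firmware; the class name `OrderedReleaser` is invented for the example.

```python
# Hypothetical sketch of parallel frame processing with in-order release,
# mimicking the software-maintained total frame ordering described above.
import threading

class OrderedReleaser:
    """Frames finish processing in any order but are released in sequence."""

    def __init__(self):
        self.lock = threading.Lock()
        self.next_seq = 0    # sequence number of the next frame allowed out
        self.pending = {}    # seq -> frame, finished but not yet releasable
        self.released = []   # frames emitted, always in sequence order

    def complete(self, seq, frame):
        """Called by any worker core when it finishes processing a frame."""
        with self.lock:
            self.pending[seq] = frame
            # Drain every frame that is now in order behind this one.
            while self.next_seq in self.pending:
                self.released.append(self.pending.pop(self.next_seq))
                self.next_seq += 1

r = OrderedReleaser()
# Frames 1 and 2 finish before frame 0, yet release order stays 0, 1, 2.
for seq in (1, 2, 0):
    r.complete(seq, f"frame{seq}")
print(r.released)  # ['frame0', 'frame1', 'frame2']
```

The paper's atomic read-modify-write instructions serve the same role as the lock-protected counter here, but without the lock's serialization cost.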

Architectural Support for Programming Languages and Operating Systems | 2002

Increasing web server throughput with network interface data caching

Hyong-youb Kim; Vijay S. Pai; Scott Rixner

