Rakesh Krishnaiyer
Intel
Publications
Featured research published by Rakesh Krishnaiyer.
International Symposium on Microarchitecture | 2000
Rakesh Krishnaiyer; D. Kulkarni; D. Lavery; L. Wei; C.-C. Lim; J. Ng; D. Sehr
The IA-64 architecture's rich set of features enables aggressive high-level and scalar optimizations, supported by the latest analysis techniques, to improve integer and floating-point performance.
IEEE International Symposium on Parallel & Distributed Processing, Workshops and PhD Forum | 2013
Rakesh Krishnaiyer; Emre Kultursay; Pankaj Chawla; Serguei V. Preis; Anatoly Zvezdin; Hideki Saito
The Intel® Xeon Phi™ coprocessor has software prefetching instructions to hide memory latencies and special store instructions to save bandwidth on streaming non-temporal store operations. In this work, we provide details on compiler-based generation of these instructions and evaluate their impact on the performance of the Intel® Xeon Phi™ coprocessor using a wide range of parallel applications with different characteristics. Our results show that the Intel® Composer XE 2013 compiler can make effective use of these mechanisms to achieve significant performance improvements.
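The paper above concerns compiler-generated prefetches; the idea can be sketched by hand in portable C using the GCC/Clang `__builtin_prefetch` builtin. This is an illustrative sketch, not code from the paper: the function name `prefetched_sum` and the distance `PF_DIST` are made up for the example, and on the Xeon Phi coprocessor the compiler would instead emit dedicated `vprefetch` instructions (and streaming stores) automatically.

```c
#include <stddef.h>

/* Illustrative prefetch distance, in loop iterations; a real compiler
 * derives this from memory latency and the cost of one iteration. */
#define PF_DIST 16

/* Sum an array while issuing a software prefetch PF_DIST elements
 * ahead of the current access, hiding memory latency behind the
 * arithmetic of earlier iterations. */
double prefetched_sum(const double *a, size_t n) {
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + PF_DIST < n)
            /* rw=0: prefetch for read; locality=1: low temporal reuse. */
            __builtin_prefetch(&a[i + PF_DIST], 0, 1);
        sum += a[i];
    }
    return sum;
}
```

The guard `i + PF_DIST < n` keeps the prefetch address in bounds; prefetches of invalid addresses are harmless on most hardware but waste bandwidth, which is exactly what the special store instructions in the paper aim to conserve.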
International Parallel and Distributed Processing Symposium | 2005
Xinmin Tian; Rakesh Krishnaiyer; Hideki Saito; Milind Girkar; Wei Li
In this paper, we evaluate the benefits achievable from software data-prefetching techniques for OpenMP* C/C++ and Fortran benchmark programs, using the framework of the Intel production compiler for the Intel® Itanium® 2 processor. Prior work on software data prefetching has primarily focused on benchmark performance in the context of a few prefetching schemes developed in research compilers. In contrast, our study examines the impact of an extensive set of software data-prefetching schemes on the performance of multi-threaded execution, using the full set of SPEC OMPM2001 applications with a product compiler on a commercial multiprocessor system. This paper presents performance results showing that the compiler-based software data prefetching supported in the Intel compiler yields significant performance gains: 11.88% to 99.85% for 6 of the 11 applications and 3.83% to 6.96% for 4 of the 11, with only one application gaining less than 1%, on an Intel® Itanium® 2 processor-based SGI Altix* 32-way shared-memory multiprocessor system.
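A central parameter in any software data-prefetching scheme like those the paper evaluates is the prefetch distance: how many iterations ahead to issue the prefetch. A common rule of thumb, shown here with illustrative numbers rather than measurements from the paper, is to divide the memory latency by the cost of one loop iteration and round up.

```c
/* Prefetch distance heuristic: issue the prefetch far enough ahead
 * that the line arrives from memory before the loop reaches it.
 * Rounds up so the prefetch is never late.  The cycle counts passed
 * in are hypothetical, for illustration only. */
unsigned prefetch_distance(unsigned latency_cycles,
                           unsigned cycles_per_iter) {
    return (latency_cycles + cycles_per_iter - 1) / cycles_per_iter;
}
```

For example, with a 200-cycle memory latency and 8 cycles of work per iteration, the heuristic prefetches 25 iterations ahead. Production compilers refine this with profitability analysis, which is part of what the study above compares across schemes.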
Compiler Construction | 2003
Somnath Ghosh; Abhay S. Kanhere; Rakesh Krishnaiyer; Dattatraya Kulkarni; Wei Li; Chu-Cheow Lim; John L. Ng
The High-Level Optimizer (HLO) is a key part of the compiler technology that enabled the Itanium™ and Itanium™ 2 processors to deliver leading floating-point performance at their introduction. In this paper, we discuss our design and implementation experience in integrating diverse optimizations in the HLO module. In particular, we describe the decisions made in designing HLO for the Itanium processor family and provide empirical data to validate those decisions. Since HLO was implemented in a production compiler, we made certain engineering trade-offs; we discuss these trade-offs and outline the key lessons learned from our experience.
Communications of the ACM | 2015
Nadathur Satish; Changkyu Kim; Jatin Chhugani; Hideki Saito; Rakesh Krishnaiyer; Mikhail Smelyanskiy; Milind Girkar; Pradeep Dubey
Current processor trends of integrating more cores with wider SIMD units, along with a deeper and more complex memory hierarchy, have made it increasingly challenging to extract performance from applications. It is believed by some that traditional approaches to programming do not apply to these modern processors and hence radical new languages must be discovered. In this paper, we question this thinking and offer evidence in support of traditional programming methods and the performance-vs-programming-effort effectiveness of common multi-core processors and upcoming manycore architectures in delivering significant speedup and close-to-optimal performance for commonly used parallel computing workloads. We first quantify the extent of the “Ninja gap”, the performance gap between naively written C/C++ code that is parallelism unaware (often serial) and best-optimized code on modern multi-/many-core processors. Using a set of representative throughput computing benchmarks, we show that there is an average Ninja gap of 24X (up to 53X) for a recent 6-core Intel® Core™ i7 X980 Westmere CPU, and that this gap, if left unaddressed, will inevitably increase. We show how a set of well-known algorithmic changes coupled with advancements in modern compiler technology can bring down the Ninja gap to an average of just 1.3X. These changes typically require low programming effort, as compared to the very high effort of producing Ninja code. We also discuss hardware support for programmability that can reduce the impact of these changes and further increase programmer productivity. We show equally encouraging results for the upcoming Intel® Many Integrated Core architecture (Intel® MIC), which has more cores and wider SIMD. We thus demonstrate that we can contain the otherwise uncontrolled growth of the Ninja gap and offer more stable and predictable performance growth over future architectures, offering strong evidence that radical language changes are not required.
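One of the "well-known algorithmic changes" this line of work discusses is converting an array-of-structures (AoS) layout to a structure-of-arrays (SoA) layout, so that per-field loops become unit-stride and the compiler can generate SIMD loads without gathers. The sketch below is a generic illustration of that transformation, not code taken from the paper; the type and function names are invented for the example.

```c
#include <stddef.h>

/* AoS: fields of one point are adjacent, so accessing only .x
 * strides through memory, which is hard to vectorize well. */
struct PointAoS { float x, y, z; };

/* SoA: each field lives in its own contiguous array, so a loop over
 * one field is unit-stride and SIMD-friendly. */
struct PointsSoA { float *x, *y, *z; };

/* Unit-stride reduction over the x field; a vectorizing compiler can
 * turn this loop into packed SIMD adds. */
float sum_x_soa(const struct PointsSoA *p, size_t n) {
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++)
        sum += p->x[i];
    return sum;
}
```

The same reduction over `struct PointAoS[]` would load three floats per point and use one, tripling memory traffic; the layout change alone often recovers a large fraction of the Ninja gap before any intrinsics are written.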
Languages and Compilers for Parallel Computing | 2014
Ashay Rane; Rakesh Krishnaiyer; Chris J. Newburn; James C. Browne; Leonardo Fialho; Zakhar Matveev
Modern compilers execute sophisticated static analyses to enable optimization across a wide spectrum of code patterns. However, there are many cases where even the most sophisticated static analysis is insufficient, or where computational complexity makes complete static analysis impractical. It is often possible in these cases to discover further opportunities for optimization from dynamic profiling and provide this information to the compiler, either by adding directives or pragmas to the source, or by modifying the source algorithm or implementation. For current and emerging generations of chips, vectorization is one of the most important of these optimizations. This paper defines, implements, and applies a systematic process for combining the information acquired by static analysis in modern compilers with information acquired by a targeted, high-resolution, low-overhead dynamic profiling tool to enable additional and more effective vectorization. Opportunities for more effective vectorization are frequent, and the performance gains obtained are substantial: we show a geometric-mean speedup of over 1.5x across several benchmarks on the Intel Xeon Phi coprocessor.
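The "directives or pragmas" route described above can be sketched as follows. When dynamic profiling shows that a loop the compiler could not prove dependence-free in fact never aliases, the programmer can pass that fact back with `restrict` qualifiers or an ignore-vector-dependences pragma (`#pragma GCC ivdep` in GCC; the Intel compiler spells it `#pragma ivdep`). This is a generic illustration, not the paper's tool output; the function name is invented.

```c
#include <stddef.h>

/* Profiling (hypothetically) confirmed dst and src never overlap, so
 * we assert independence: `restrict` promises no aliasing, and the
 * ivdep pragma tells the vectorizer to ignore assumed dependences. */
void scaled_copy(float *restrict dst, const float *restrict src,
                 float k, size_t n) {
    #pragma GCC ivdep
    for (size_t i = 0; i < n; i++)
        dst[i] = k * src[i];
}
```

Without the annotation, the compiler must assume `dst` may overlap `src` and either emit a runtime overlap check or skip vectorization; the annotation removes that uncertainty, which is precisely the kind of information the paper's profiling tool recovers.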
International Symposium on Computer Architecture | 2012
Nadathur Satish; Changkyu Kim; Jatin Chhugani; Hideki Saito; Rakesh Krishnaiyer; Mikhail Smelyanskiy; Milind Girkar; Pradeep Dubey
Archive | 2001
Rakesh Krishnaiyer; Somnath Ghosh; Wei Li
Compiler Construction | 2002
Youfeng Wu; Mauricio J. Serrano; Rakesh Krishnaiyer; Wei Li; Jesse Fang
Archive | 2002
Rakesh Krishnaiyer; Wei Li