Publications


Featured research published by Ramesh Radhakrishnan.


international symposium on microarchitecture | 1998

Evaluating MMX technology using DSP and multimedia applications

Ravi Bhargava; Lizy Kurian John; Brian L. Evans; Ramesh Radhakrishnan

Many current general-purpose processors use extensions to the instruction set architecture to enhance the performance of digital signal processing (DSP) and multimedia applications. In this paper, we evaluate the x86 architecture's multimedia extension (MMX) instruction set on a set of benchmarks. Our benchmark suite includes kernels (filtering, fast Fourier transforms, and vector arithmetic) and applications (JPEG compression, Doppler radar processing, imaging, and G.722 speech encoding). Each benchmark has at least one non-MMX version in C and an MMX version that makes calls to an MMX assembly library. The versions differ in the implementation of filtering, vector arithmetic, and other relevant kernels. The observed speedup for the MMX versions of the suite ranges from less than 1.0 to 6.1. In addition to quantifying the speedup, we perform detailed instruction-level profiling using Intel's VTune profiling tool. Using VTune, we profile static and dynamic instructions, microarchitecture operations, and data references to isolate the specific reasons for speedup or the lack thereof. This analysis allows one to understand which aspects of native signal processing instruction sets are most useful, what the current limitations are, and how they can be utilized most efficiently.
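
To illustrate the kind of operation MMX accelerates in these kernels, the following is a minimal Java sketch (hypothetical; the paper's benchmarks are written in C with MMX assembly) of a packed unsigned saturating byte add, which the single MMX instruction PADDUSB performs across all eight lanes at once:

```java
// Scalar emulation of MMX's packed unsigned saturating byte add (PADDUSB).
// One MMX instruction performs all eight lane additions simultaneously,
// which is where the observed kernel speedups come from. Hypothetical
// illustration; not code from the paper.
public class PackedSaturatingAdd {
    // Adds two 8-byte "registers" lane by lane, clamping each lane at 255.
    static void paddusb(int[] a, int[] b, int[] out) {
        for (int lane = 0; lane < 8; lane++) {
            int sum = a[lane] + b[lane];
            out[lane] = Math.min(sum, 255); // saturate instead of wrapping
        }
    }

    public static void main(String[] args) {
        int[] a = {250, 10, 100, 0, 255, 1, 2, 3};
        int[] b = {10, 10, 200, 0, 1, 1, 2, 3};
        int[] out = new int[8];
        paddusb(a, b, out);
        System.out.println(java.util.Arrays.toString(out)); // [255, 20, 255, 0, 255, 2, 4, 6]
    }
}
```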


IEEE Transactions on Computers | 2001

Java runtime systems: characterization and architectural implications

Ramesh Radhakrishnan; Narayanan Vijaykrishnan; Lizy Kurian John; Anand Sivasubramaniam; Juan Rubio; Jyotsna Sabarinathan

The Java Virtual Machine (JVM) is the cornerstone of Java technology and its efficiency in executing the portable Java bytecodes is crucial for the success of this technology. Interpretation, Just-in-Time (JIT) compilation, and hardware realization are well-known solutions for a JVM and previous research has proposed optimizations for each of these techniques. However, each technique has its pros and cons and may not be uniformly attractive for all hardware platforms. Instead, an understanding of the architectural implications of JVM implementations with real applications can be crucial to the development of enabling technologies for efficient Java runtime system development on a wide range of platforms. Toward this goal, this paper examines architectural issues from both the hardware and JVM implementation perspectives. The paper starts by identifying the important execution characteristics of Java applications from a bytecode perspective. It then explores the potential of a smart JIT compiler strategy that can dynamically interpret or compile based on associated costs and investigates the CPU and cache architectural support that would benefit JVM implementations. We also study the available parallelism during the different execution modes using applications from the SPECjvm98 benchmarks. At the bytecode level, it is observed that fewer than 5 out of the 256 bytecodes constitute 90 percent of the dynamic bytecode stream. Method sizes fall into a tri-nodal distribution with peaks at 1, 9, and 26 bytecodes across all benchmarks. The architectural issues explored in this study show that, when Java applications are executed with a JIT compiler, selective translation using good heuristics can improve performance, but the saving is only 10-15 percent at best. The instruction and data cache performance of Java applications is seen to be better than that of C/C++ applications except in the case of data cache performance in the JIT mode. Write misses resulting from installation of JIT compiler output dominate the misses and deteriorate the data cache performance in JIT mode. A study on the available parallelism shows that Java programs executed using JIT compilers have parallelism comparable to C/C++ programs for small window sizes, but fall behind when the window size is increased. Java programs executed using the interpreter have very little parallelism due to the stack nature of the JVM instruction set, which is dominant in the interpreted execution mode. In addition, this work gives revealing insights and architectural proposals for designing an efficient Java runtime system.
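
As a sketch of how the dynamic bytecode statistic above can be computed, the following hypothetical Java snippet counts opcode frequencies in an execution trace and reports how few distinct opcodes cover 90 percent of the stream; the trace contents are toy data, not the SPECjvm98 traces used in the paper:

```java
import java.util.Arrays;

// Hypothetical sketch: given a dynamic bytecode trace (one opcode value per
// executed bytecode), find how few distinct opcodes cover 90% of the stream,
// the style of measurement behind the "fewer than 5 of 256" observation.
public class BytecodeCoverage {
    static int opcodesFor90Percent(int[] trace) {
        long[] counts = new long[256];           // one counter per JVM opcode
        for (int op : trace) counts[op & 0xFF]++;
        Arrays.sort(counts);                      // ascending; hottest at the end
        long covered = 0, threshold = (long) (trace.length * 0.9);
        int distinct = 0;
        for (int i = counts.length - 1; i >= 0 && covered < threshold; i--) {
            covered += counts[i];
            distinct++;
        }
        return distinct;
    }

    public static void main(String[] args) {
        // Toy trace dominated by a handful of opcodes (e.g., aload_0, iload).
        int[] trace = {42, 42, 42, 42, 21, 21, 180, 42, 21, 42};
        System.out.println(opcodesFor90Percent(trace) + " opcodes cover 90%");
    }
}
```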


high performance computer architecture | 2000

Architectural issues in Java runtime systems

Ramesh Radhakrishnan; Narayanan Vijaykrishnan; Lizy Kurian John; Anand Sivasubramaniam

The Java Virtual Machine (JVM) is the cornerstone of Java technology, and its efficiency in executing portable Java bytecodes is crucial for the success of this technology. Interpretation, just-in-time (JIT) compilation, and hardware realization are well-known solutions for a JVM, and previous research has proposed optimizations for each of these techniques. However, each technique has its pros and cons and may not be uniformly attractive for all hardware platforms. Instead, an understanding of the architectural implications of JVM implementations with real applications can be crucial to the development of enabling technologies for efficient Java runtime system development on a wide range of platforms (from resource-rich servers to resource-constrained hand-held/embedded systems). Towards this goal, this paper examines architectural issues from both the hardware and JVM implementation perspectives. It specifically explores the potential of a smart JIT compiler strategy that can dynamically interpret or compile based on associated costs, investigates the CPU and cache architectural support that would benefit JVM implementations, and examines the synchronization support for enhancing performance, using applications from the SPECjvm98 benchmarks.
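
The smart JIT strategy hinges on a simple cost comparison: compile a method only when the one-time translation cost is expected to be amortized over future invocations. Below is a minimal sketch of such a policy; the cycle costs and predicted invocation counts are illustrative assumptions, not values from the paper:

```java
// Hypothetical sketch of the cost model behind a "smart JIT" decision: compile
// a method only when the expected future interpretation cost exceeds the
// one-time translation cost plus the compiled execution cost.
public class SmartJitPolicy {
    static boolean shouldCompile(long predictedCalls,
                                 long interpretCyclesPerCall,
                                 long compileCyclesOnce,
                                 long compiledCyclesPerCall) {
        long interpretTotal = predictedCalls * interpretCyclesPerCall;
        long compileTotal   = compileCyclesOnce + predictedCalls * compiledCyclesPerCall;
        return compileTotal < interpretTotal;
    }

    public static void main(String[] args) {
        // A method predicted to run 1,000 times: compilation pays off.
        System.out.println(shouldCompile(1_000, 500, 200_000, 50));  // true
        // A method predicted to run twice: interpretation is cheaper.
        System.out.println(shouldCompile(2, 500, 200_000, 50));      // false
    }
}
```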


international conference on computer design | 1999

Characterization of Java applications at bytecode and UltraSPARC machine code levels

Ramesh Radhakrishnan; Juan Rubio; Lizy Kurian John

The paper identifies some of the most important execution characteristics of a recent suite of Java benchmarks (SPEC JVM98) from a bytecode perspective while running in an interpreted environment on the Sun UltraSPARC-II. We instrumented the Java Virtual Machine (JVM) to obtain detailed traces and developed a Java bytecode analyzer environment called Jaba to characterize the applications at the bytecode level. Utilizing Jaba and SPARC profiling tools, we analyze bytecode locality, instruction mix, and dynamic method sizes. It is observed that fewer than 45 out of the 250 Java bytecodes constitute 90% of the bytecode stream. A tri-nodal distribution with peaks at 1, 10, and 27 bytecodes is observed for method size across all benchmarks in the JVM98 suite. For most of the applications, one bytecode is seen to translate into approximately 25 SPARC instructions.
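
A hypothetical sketch of the kind of histogram a Jaba-style analysis produces for dynamic method sizes appears below; the method sizes here are toy data standing in for the instrumented JVM traces, and the tri-nodal distribution would show up as three clusters:

```java
import java.util.TreeMap;

// Hypothetical sketch of the method-size histogram that exposes the tri-nodal
// distribution (peaks near 1, 10, and 27 bytecodes). Input sizes are toy data;
// the real counts come from Jaba's instrumented JVM traces.
public class MethodSizeHistogram {
    public static void main(String[] args) {
        int[] methodSizes = {1, 1, 1, 9, 10, 10, 11, 26, 27, 27, 28, 1, 10};
        TreeMap<Integer, Integer> histogram = new TreeMap<>();
        for (int size : methodSizes)
            histogram.merge(size, 1, Integer::sum);  // count methods per size
        histogram.forEach((size, count) ->
            System.out.printf("%3d bytecodes: %s%n", size, "*".repeat(count)));
    }
}
```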


international conference on supercomputing | 2001

Improving Java performance using hardware translation

Ramesh Radhakrishnan; Ravi Bhargava; Lizy Kurian John

State-of-the-art Java Virtual Machines with Just-In-Time (JIT) compilers make use of advanced compiler techniques, run-time profiling, and adaptive compilation to improve performance. However, these techniques for alleviating performance bottlenecks are more effective in long-running workloads, such as server applications. Short-running Java programs, or client workloads, spend a large fraction of their execution time in compilation instead of useful execution when run using JIT compilers. In short-running Java programs, the benefits of runtime translation do not compensate for the overhead. We propose using hardware support to perform efficient Java translation coupled with a lightweight runtime environment. The additional hardware performs the translation of Java bytecodes to native code, thus eliminating much of the overhead of software translation. A translated code buffer is used to hold the translated code, enabling reuse at the bytecode level. The proposed hardware can be used in any general-purpose processor without degrading the performance of native code. The proposed technique is extremely effective for short-running client workloads: a performance improvement of 2.8 times to 7.7 times over a software interpreter is achieved. When compared to a JIT compiler, all SPECjvm98 benchmarks except one show a performance improvement ranging from 2.7 times to 5.0 times; a performance degradation (0.58 times) is observed for the one benchmark that is long-running. Allowing the hardware translator to perform optimizations similar to those of JIT compilers and Java processors would let it execute long-running programs more efficiently and provide speedups similar to those for client workloads.
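
The reuse argument rests on the translated code buffer: a method is translated once, and every later invocation fetches the cached native code. The following is a software model of that hit/miss behavior (the paper's translator and buffer are hardware; the names and placeholder translation here are illustrative):

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the translated-code-buffer idea: translate a method's
// bytecodes to native code once, cache the result, and reuse it on later calls.
public class TranslatedCodeBuffer {
    private final Map<String, byte[]> buffer = new HashMap<>(); // method -> native code
    private int translations = 0;

    byte[] fetch(String method, byte[] bytecodes) {
        return buffer.computeIfAbsent(method, m -> translate(bytecodes));
    }

    private byte[] translate(byte[] bytecodes) {
        translations++;              // stands in for the hardware translator
        return bytecodes.clone();    // placeholder "native code"
    }

    public static void main(String[] args) {
        TranslatedCodeBuffer tcb = new TranslatedCodeBuffer();
        byte[] body = {0x2A, (byte) 0xB4, (byte) 0xAC};  // toy bytecode sequence
        for (int call = 0; call < 1_000; call++)
            tcb.fetch("Foo.bar()V", body);               // 999 calls hit the buffer
        System.out.println("translations performed: " + tcb.translations); // 1
    }
}
```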


ieee international conference on high performance computing data and analytics | 1998

Execution characteristics of object oriented programs on the UltraSPARC-II

Ramesh Radhakrishnan; Lizy Kurian John

It is widely accepted that object-oriented design improves code reusability, facilitates code maintainability, and enables higher levels of abstraction. Although software developers and the software engineering community have embraced object-oriented programming for these benefits, there have been wide concerns about the performance overhead associated with this programming paradigm on modern processors. We characterize the performance of several C and C++ benchmarks on an UltraSPARC-II processor. Various architectural data related to the execution behavior of the benchmarks are collected using on-chip performance monitoring counters. Metrics including CPI, instruction and data cache misses, and processor stalls due to instruction cache misses and branch mispredictions are measured from real executions of several programs and presented. While previous research evaluates the behavioral differences between C and C++ programs based on profiling and simulation, we measure actual execution behavior. Results show that the programs in the C++ suite incur a higher CPI, higher i-cache miss rates, and higher branch misprediction rates than the programs in the C suite. A strong correlation was observed between CPI and branch mispredictions for the C++ application programs.
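
The derived metrics in this kind of study are simple ratios over the raw counter values. A minimal sketch with made-up counter readings (not measurements from the paper):

```java
// Hypothetical sketch of the derived metrics in this study: CPI and miss rates
// computed from raw on-chip performance-counter readings. The counter values
// below are illustrative numbers, not data from the paper.
public class CounterMetrics {
    public static void main(String[] args) {
        long cycles       = 1_500_000_000L;  // cycle counter
        long instructions = 1_000_000_000L;  // instruction counter
        long icacheRefs   =   200_000_000L;
        long icacheMisses =     6_000_000L;
        long branches     =   150_000_000L;
        long mispredicts  =    12_000_000L;

        double cpi         = (double) cycles / instructions;
        double icacheMiss  = 100.0 * icacheMisses / icacheRefs;
        double mispredRate = 100.0 * mispredicts / branches;

        System.out.printf("CPI: %.2f%n", cpi);                            // 1.50
        System.out.printf("I-cache miss rate: %.1f%%%n", icacheMiss);      // 3.0%
        System.out.printf("Branch mispredict rate: %.1f%%%n", mispredRate); // 8.0%
    }
}
```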


european conference on parallel processing | 1999

A Performance Study of Modern Web Server Applications

Ramesh Radhakrishnan; Lizy Kurian John

Web server programs are among the most popular computer applications in existence today. Our goal is to study the behavior of modern Web server application programs to understand how they interact with the underlying Web server, hardware, and operating system environment. We monitor and evaluate the performance of a Sun UltraSPARC system using hardware performance counters for different workloads. Our workloads include static requests (for HTML files and images) as well as dynamic requests in the form of CGI (Common Gateway Interface) scripts and Servlets. Our studies show that the dynamic workloads have CPIs (Cycles Per Instruction) approximately 20% higher than the static workloads. The major factors we could attribute this to were higher instruction and data cache miss rates compared to the static workloads, along with higher external (L2) cache miss rates.


international symposium on parallel and distributed processing and applications | 2004

Evaluating performance of BLAST on Intel Xeon and Itanium2 processors

Ramesh Radhakrishnan; Rizwan Ali; Garima Kochhar; Kalyana Chadalavada; Ramesh Rajagopalan; Jenwei Hsieh; Onur Celebioglu

High-performance computing has increasingly adopted the use of clustered Intel architecture-based servers. This adoption has been largely fueled by substantial improvements in Intel processor and memory technology over the past few years. This paper compares the performance characteristics of three Dell PowerEdge (PE) servers based on three different Intel processor technologies: the PE1750, an IA-32-based Xeon system; the PE1850, which uses the newer 90 nm Xeon processor at higher frequencies; and the PE3250, an Itanium2-based system. BLAST (Basic Local Alignment Search Tool), a high-performance computing application used in the field of biological research, is used as the workload for this study. The aim is to understand the performance impact of the different features associated with each processor/platform technology when running the BLAST workload.


Archive | 2002

Improving Java Performance in Embedded and General-Purpose Processors

Ramesh Radhakrishnan; Lizy Kurian John; Ravi Bhargava; Deepu Talla

Java is used to implement embedded and network computing applications such as Internet TVs, set-top boxes, and smart phones, as well as client and server applications. In this chapter, we propose microarchitectural techniques to improve the performance of Java applications executing on embedded Java processors and general-purpose processors. We propose the use of a fill unit that stores decoded bytecodes into a decoded bytecode cache to improve the performance of embedded Java processors. This mechanism improves the fetch and decode bandwidth of Java processors by 2 to 3 times. Out-of-order ILP exploitation is not investigated due to its prohibitive cost, but in-order dual issue with a 64-entry decoded bytecode cache is seen to result in a 10% to 14% improvement in execution cycles.
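
A rough software model of the proposed structure appears below, assuming a direct-mapped 64-entry cache indexed by bytecode PC; the organization details and the decode stand-in are illustrative assumptions, not the chapter's exact design:

```java
// Hypothetical sketch of the proposed decoded bytecode cache: a small
// direct-mapped structure, filled by a fill unit after the first decode,
// that lets later fetches skip the decode stage on a hit.
public class DecodedBytecodeCache {
    static final int ENTRIES = 64;
    private final long[] tags = new long[ENTRIES];
    private final Object[] decoded = new Object[ENTRIES];
    int hits = 0, misses = 0;

    Object lookup(long bytecodePc) {
        int index = (int) (bytecodePc % ENTRIES);      // direct-mapped index
        if (decoded[index] != null && tags[index] == bytecodePc) {
            hits++;
            return decoded[index];                     // skip fetch + decode
        }
        misses++;
        Object ops = "decoded@" + bytecodePc;          // stands in for decode logic
        tags[index] = bytecodePc;                      // fill unit installs the entry
        decoded[index] = ops;
        return ops;
    }

    public static void main(String[] args) {
        DecodedBytecodeCache cache = new DecodedBytecodeCache();
        for (int iter = 0; iter < 100; iter++)         // a hot 16-bytecode loop
            for (long pc = 0; pc < 16; pc++)
                cache.lookup(pc);
        System.out.println("hits=" + cache.hits + " misses=" + cache.misses); // 1584 / 16
    }
}
```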


Archive | 2000

Microarchitectural techniques to enable efficient Java execution

Ramesh Radhakrishnan; Lizy Kurian John

Collaboration


Dive into Ramesh Radhakrishnan's collaboration.

Top Co-Authors

Lizy Kurian John, University of Texas at Austin
Ravi Bhargava, University of Texas at Austin
Juan Rubio, University of Texas at Austin
Anand Sivasubramaniam, Pennsylvania State University
Brian L. Evans, University of Texas at Austin
Deepu Talla, University of Texas at Austin