Sudhakar Yalamanchili

Publication

Featured research published by Sudhakar Yalamanchili.

IEEE Computer | 1993

Adaptive routing protocols for hypercube interconnection networks

Patrick T. Gaughan; Sudhakar Yalamanchili

A taxonomy for characterizing adaptive routing protocols for hypercube interconnection networks (HINs) is presented. The taxonomy is based on classes of routing decisions common to any HIN. This taxonomy is used to discuss existing and proposed protocols. Rather than an exhaustive enumeration of related research, the protocols selected for discussion are intended to be representative of the classes defined by the taxonomy. These protocols are candidates for use in massively parallel architectures configured with HINs. To provide some insight into their behavior in very large HINs, results of simulation studies of representative protocols are presented.
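As background for the routing decisions mentioned above, the following minimal Python sketch shows the basic choice every hypercube router faces: which dimension to cross next. It is not taken from the paper and does not reproduce its taxonomy; the function names `minimal_hops` and `ecube_next` are hypothetical, and nodes are assumed to be labeled with `dims`-bit integers so that neighbors differ in exactly one bit.

```python
def minimal_hops(node, dst, dims):
    """Return the dimensions whose traversal moves one hop closer to dst.

    In a binary hypercube these are the bit positions where the
    current node label and the destination label differ.
    """
    diff = node ^ dst
    return [d for d in range(dims) if (diff >> d) & 1]


def ecube_next(node, dst, dims):
    """Deterministic (dimension-order) choice: always take the lowest
    differing dimension. An adaptive protocol could instead pick any
    entry of minimal_hops(), for example to avoid a congested link."""
    hops = minimal_hops(node, dst, dims)
    return node ^ (1 << hops[0]) if hops else node


if __name__ == "__main__":
    print(minimal_hops(0b000, 0b101, 3))  # candidate dimensions
    print(ecube_next(0b000, 0b101, 3))    # next node under dimension order
```

A deterministic router has one candidate per step, while an adaptive router may choose among all candidates (or, in non-minimal schemes, among other dimensions too).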

IEEE Transactions on Parallel and Distributed Systems | 1995

A family of fault-tolerant routing protocols for direct multiprocessor networks

Patrick T. Gaughan; Sudhakar Yalamanchili

Our goal is to reconcile the conflicting demands of performance and fault-tolerance in interprocessor communication. To this end, we propose a pipelined communication mechanism, pipelined circuit-switching (PCS), which is a variant of the well-known wormhole routing (WR) mechanism. PCS relaxes some of the routing constraints imposed by WR and as a result enables routing behavior that cannot otherwise be realized. This paper presents a new class of adaptive routing algorithms: misrouting backtracking with m misroutes (MB-m). This class of routing algorithms is made possible by PCS. We provide an analysis of the performance and static fault-tolerant properties of MB-m. The results of an experimental evaluation of PCS and MB-3 are also presented. This methodology provides performance approaching that of WR, while realizing a level of resilience to static faults that is difficult to achieve with WR.
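The idea of a path search that may misroute a bounded number of times and backtrack when blocked can be illustrated with a small sketch. This is a simplified illustration only, not the MB-m protocol as specified in the paper: it assumes a binary hypercube with statically faulty nodes, models the search as a recursive depth-first probe, and ignores PCS flow control entirely. The function name `mb_m_route` and its parameters are hypothetical.

```python
def mb_m_route(src, dst, dims, faulty, m):
    """Search for a fault-free path from src to dst in a binary hypercube.

    Profitable hops (toward dst) are tried first; up to m unprofitable
    hops (misroutes) are allowed; a blocked probe backtracks.
    Returns the path as a list of node labels, or None if none is found.
    """
    path = [src]
    on_path = {src}  # nodes on the current path, to avoid cycles

    def probe(node, misroutes_left):
        if node == dst:
            return True
        diff = node ^ dst
        profitable = [d for d in range(dims) if (diff >> d) & 1]
        unprofitable = [d for d in range(dims) if not (diff >> d) & 1]
        for d in profitable + unprofitable:
            cost = 0 if (diff >> d) & 1 else 1  # a misroute costs 1
            if cost > misroutes_left:
                continue
            nxt = node ^ (1 << d)
            if nxt in faulty or nxt in on_path:
                continue
            path.append(nxt)
            on_path.add(nxt)
            if probe(nxt, misroutes_left - cost):
                return True
            # Backtrack: release the hop and try the next dimension.
            path.pop()
            on_path.remove(nxt)
        return False

    return path if probe(src, m) else None


if __name__ == "__main__":
    # Both minimal first hops from 0 toward 3 are faulty.
    print(mb_m_route(0, 3, 3, {1, 2}, 0))  # no misroutes allowed
    print(mb_m_route(0, 3, 3, {1, 2}, 1))  # one misroute allowed
```

With no misroutes allowed, the search fails in this example because both minimal first hops are faulty; allowing a single misroute lets the probe step away from the destination and reach it around the faults.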

International Conference on Parallel Architectures and Compilation Techniques | 2010

Ocelot: a dynamic optimization framework for bulk-synchronous applications in heterogeneous systems

Gregory Frederick Diamos; Andrew Kerr; Sudhakar Yalamanchili; Nathan Clark

Ocelot is a dynamic compilation framework designed to map the explicitly data parallel execution model used by NVIDIA CUDA applications onto diverse multithreaded platforms. Ocelot includes a dynamic binary translator from Parallel Thread eXecution ISA (PTX) to many-core processors that leverages the Low Level Virtual Machine (LLVM) code generator to target x86 and other ISAs. The dynamic compiler is able to execute existing CUDA binaries without recompilation from source and supports switching between execution on an NVIDIA GPU and a many-core CPU at runtime. It has been validated against over 130 applications taken from the CUDA SDK, the UIUC Parboil benchmarks [1], the Virginia Rodinia benchmarks [2], the GPU-VSIPL signal and image processing library [3], the Thrust library [4], and several domain specific applications. This paper presents a high level overview of the implementation of the Ocelot dynamic compiler highlighting design decisions and trade-offs, and showcasing their effect on application performance. Several novel code transformations are explored that are applicable only when compiling explicitly parallel applications and traditional dynamic compiler optimizations are revisited for this new class of applications. This study is expected to inform the design of compilation tools for explicitly parallel programming models (such as OpenCL) as well as future CPU and GPU architectures.

High Performance Distributed Computing | 2008

Harmony: an execution model and runtime for heterogeneous many core systems

Gregory Frederick Diamos; Sudhakar Yalamanchili

The emergence of heterogeneous many core architectures presents a unique opportunity for delivering order of magnitude performance increases to high performance applications by matching certain classes of algorithms to specifically tailored architectures. Their ubiquitous adoption, however, has been limited by a lack of programming models and management frameworks designed to reduce the high degree of complexity of software development intrinsic to heterogeneous architectures. This paper proposes Harmony, a runtime supported programming and execution model that provides: (1) semantics for simplifying parallelism management, (2) dynamic scheduling of compute intensive kernels to heterogeneous processor resources, and (3) online monitoring driven performance optimization for heterogeneous many core systems. We are particularly concerned with simplifying development and ensuring binary portability and scalability across system configurations and sizes. Initial results from ongoing development demonstrate binary compatibility with a variable number of cores, as well as dynamic adaptation of schedules to data sets. We present preliminary results of key features for some benchmark applications.

IEEE International Symposium on Workload Characterization | 2009

A characterization and analysis of PTX kernels

Andrew Kerr; Gregory Frederick Diamos; Sudhakar Yalamanchili

General purpose application development for GPUs (GPGPU) has recently gained momentum as a cost-effective approach for accelerating data- and compute-intensive applications. It has been driven by the introduction of C-based programming environments such as NVIDIA's CUDA [1], OpenCL [2], and Intel's Ct [3]. While significant effort has been focused on developing and evaluating applications and software tools, comparatively little has been devoted to the analysis and characterization of applications to assist future work in compiler optimizations, application re-structuring, and micro-architecture design. This paper proposes a set of metrics for GPU workloads and uses these metrics to analyze the behavior of GPU programs. We report on an analysis of over 50 kernels and applications, including the full NVIDIA CUDA SDK and UIUC's Parboil Benchmark Suite, covering control flow, data flow, parallelism, and memory behavior. The analysis was performed using a full function emulator we developed that implements the NVIDIA virtual machine referred to as PTX (Parallel Thread eXecution architecture), a machine model and low level virtual ISA that is representative of ISAs for data parallel execution. The emulator can execute compiled kernels from the CUDA compiler, currently supports the full PTX 1.4 specification [4], and has been validated against the full CUDA SDK. The results quantify the importance of optimizations such as those for branch reconvergence, show the prevalence of sharing between threads, and highlight opportunities for additional parallelism.

Computing in Science and Engineering | 2011

Keeneland: Bringing Heterogeneous GPU Computing to the Computational Science Community

Jeffrey S. Vetter; Richard Glassbrook; Jack J. Dongarra; Karsten Schwan; Bruce Loftis; Stephen McNally; Jeremy S. Meredith; James H. Rogers; Philip C. Roth; Kyle Spafford; Sudhakar Yalamanchili

The Keeneland project's goal is to develop and deploy an innovative, GPU-based high-performance computing system for the NSF computational science community.

International Conference on Computer Design | 1997

Power constrained design of multiprocessor interconnection networks

Chirag S. Patel; Sek M. Chai; Sudhakar Yalamanchili; David E. Schimmel

The paper considers the power constrained design of orthogonal multiprocessor interconnection networks. The authors present a detailed model of message latency as a function of topology, technology architecture, and power. This model is then used to analyze a number of interesting scenarios, providing a sound engineering basis for interconnection network design in these cases. For example, they have observed that under a fixed power constraint, the network dimension which achieves minimal latency is a slowly growing function of system size. In addition, as they increase the available power per node for a fixed system size, the dimension at which message latency is minimized shifts towards higher dimensional networks.

International Symposium on Low Power Electronics and Design | 2010

An energy efficient cache design using spin torque transfer (STT) RAM

Mitchelle Rasquinha; Dhruv Choudhary; Subho Chatterjee; Saibal Mukhopadhyay; Sudhakar Yalamanchili

On-chip memory is a dominant source of power and energy consumption in modern and future processors. This paper explores the use of an emerging non-volatile memory technology, Spin Torque Transfer (STT) RAM, as a replacement for SRAM-based lower level caches. While STT RAM achieves a reduction in leakage energy of 90% compared to SRAM, the dynamic energy for a write operation is 2X that of SRAM. Consequently, we propose additional microarchitectural optimizations to reduce overall dynamic energy, which achieve an average reduction in dynamic energy over the base case of 30%, with a range of 16% to 60% across 10 benchmarks.

International Symposium on Microarchitecture | 2011

SIMD re-convergence at thread frontiers

Gregory Frederick Diamos; Benjamin Ashbaugh; Subramaniam Maiyuran; Andrew Kerr; Haicheng Wu; Sudhakar Yalamanchili

Hardware and compiler techniques for mapping data-parallel programs with divergent control flow to SIMD architectures have recently enabled the emergence of new GPGPU programming models such as CUDA, OpenCL, and DirectX Compute. The impact of branch divergence can be quite different depending upon whether the program's control flow is structured or unstructured. In this paper, we show that unstructured control flow occurs frequently in applications and can lead to significant code expansion when executed using existing approaches for handling branch divergence. This paper proposes a new technique for automatically mapping arbitrary control flow onto SIMD processors that relies on the concept of a Thread Frontier, which is a bounded region of the program containing all threads that have branched away from the current warp. This technique is evaluated on a GPU emulator configured to model i) a commodity GPU (Intel Sandybridge), and ii) custom hardware support not realized in current GPU architectures. It is shown that this new technique performs identically to the best existing method for structured control flow, and re-converges at the earliest possible point when executing unstructured control flow. This leads to i) reductions of 1.5% to 633.2% in dynamic instruction counts for several real applications, ii) simplification of the compilation process, and iii) the ability to efficiently add high level unstructured programming constructs (e.g., exceptions) to existing data-parallel languages.

Architectural Support for Programming Languages and Operating Systems | 2010

Modeling GPU-CPU workloads and systems

Andrew Kerr; Gregory Frederick Diamos; Sudhakar Yalamanchili

Heterogeneous systems, systems with multiple processors tailored for specialized tasks, are challenging programming environments. While it may be possible for domain experts to optimize a high performance application for a very specific and well documented system, it may not perform as well or even function on a different system. Developers who have less experience with either the application domain or the system architecture may devote a significant effort to writing a program that merely functions correctly. We believe that a comprehensive analysis and modeling framework is necessary to ease application development and automate program optimization on heterogeneous platforms. This paper reports on an empirical evaluation of 25 CUDA applications on four GPUs and three CPUs, leveraging the Ocelot dynamic compiler infrastructure which can execute and instrument the same CUDA applications on either target. Using a combination of instrumentation and statistical analysis, we record 37 different metrics for each application and use them to derive relationships between program behavior and performance on heterogeneous processors. These relationships are then fed into a modeling framework that attempts to predict the performance of similar classes of applications on different processors. Most significantly, this study identifies several non-intuitive relationships between program characteristics and demonstrates that it is possible to accurately model CUDA kernel performance using only metrics that are available before a kernel is executed.
