Network


Latest external collaborations at the country level.

Hotspot


Research topics in which Vikram K. Narayana is active.

Publication


Featured research published by Vikram K. Narayana.


International Conference on Parallel Processing | 2011

GPU Resource Sharing and Virtualization on High Performance Computing Systems

Teng Li; Vikram K. Narayana; Esam El-Araby; Tarek A. El-Ghazawi

Modern Graphics Processing Units (GPUs) are widely used as application accelerators in the High Performance Computing (HPC) field due to their massive floating-point computational capabilities and highly data-parallel computing architecture. Contemporary high performance computers equipped with co-processors such as GPUs primarily execute parallel applications using the Single Program Multiple Data (SPMD) model, which requires a balance of computing resources between microprocessors and co-processors to ensure full system utilization. While the inclusion of GPUs in HPC systems provides more computing resources and significant performance improvements, an asymmetry between the number of GPUs and the number of microprocessors can result in underutilization of the overall system computing resources. In this paper, we propose a GPU resource virtualization approach that allows underutilized microprocessors to share the GPUs. We analyze the factors affecting parallel execution performance on GPUs and conduct a theoretical performance estimation based on recent GPU architectures and the SPMD model. We then present the implementation details of the virtualization infrastructure, followed by an experimental verification of the proposed concepts using an NVIDIA Fermi GPU computing node. The results demonstrate a considerable performance gain over traditional SPMD execution without virtualization. Furthermore, the proposed solution enables full utilization of the asymmetrical system resources through the sharing of GPUs among microprocessors, while incurring low overheads from the virtualization layer.
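To make the sharing idea concrete, here is a minimal Python sketch (not the paper's implementation) in which several SPMD ranks funnel kernel requests through a single process that owns the GPU; the queue layout and the stand-in workload are invented for illustration.

```python
# Minimal sketch of GPU sharing: many SPMD processes funnel work through one
# GPU-owner process instead of each requiring a dedicated GPU. All names here
# (gpu_server, spmd_worker, run_on_gpu) are illustrative, not from the paper.
import multiprocessing as mp

def run_on_gpu(payload):
    # Stand-in for a real kernel launch (e.g., via CUDA); here just CPU math.
    return sum(x * x for x in payload)

def gpu_server(requests, results):
    """Single process that owns the GPU and serializes kernel launches."""
    while True:
        item = requests.get()
        if item is None:                 # shutdown sentinel
            break
        rank, payload = item
        results.put((rank, run_on_gpu(payload)))

def spmd_worker(rank, requests, results):
    """One of many CPU ranks; offloads its kernel through the shared queue."""
    requests.put((rank, list(range(rank, rank + 1000))))

if __name__ == "__main__":
    requests, results = mp.Queue(), mp.Queue()
    server = mp.Process(target=gpu_server, args=(requests, results))
    server.start()
    workers = [mp.Process(target=spmd_worker, args=(r, requests, results))
               for r in range(4)]        # 4 CPU ranks sharing 1 "GPU"
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    for _ in range(4):
        print(results.get())
    requests.put(None)
    server.join()
```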


ACM Transactions on Reconfigurable Technology and Systems | 2010

Reconfiguration and Communication-Aware Task Scheduling for High-Performance Reconfigurable Computing

Miaoqing Huang; Vikram K. Narayana; Harald Simmler; Olivier Serres; Tarek A. El-Ghazawi

High-performance reconfigurable computing involves acceleration of significant portions of an application using reconfigurable hardware. When the hardware tasks of an application cannot simultaneously fit in an FPGA, the task graph needs to be partitioned and scheduled into multiple FPGA configurations in a way that minimizes the total execution time. This article proposes the Reduced Data Movement Scheduling (RDMS) algorithm, which aims to improve the overall performance of hardware tasks by taking into account the reconfiguration time, data dependencies between tasks, inter-task communication, and task resource utilization. The proposed algorithm uses dynamic programming. A mathematical analysis shows that, in the worst case, the resulting execution time exceeds the optimal solution by a factor of about 1.6. Simulations on randomly generated task graphs indicate that the RDMS algorithm can reduce inter-configuration communication time by 11% and 44%, respectively, compared with two other approaches that consider only data dependency or only hardware resource utilization. The practicality and efficiency of the proposed algorithm over other approaches are demonstrated by simulating a task graph from a real-life application, N-body simulation, with bandwidth constraints and FPGA parameters taken from existing high-performance reconfigurable computers. Experiments on the SRC-6 reconfigurable computer validate the approach.
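The core dynamic-programming step can be illustrated on a simplified version of the problem, a linear task chain rather than a general task graph; the helper below and all of its numbers are illustrative assumptions, not the published RDMS algorithm.

```python
# Hedged sketch of the dynamic-programming idea behind RDMS-style scheduling,
# simplified to a linear task chain. All inputs are invented for illustration.
def schedule(areas, comm, fpga_area, reconfig_time):
    """areas[i]: logic area of task i; comm[i]: data moved across a cut placed
    after task i; returns min total reconfiguration + communication cost."""
    n = len(areas)
    INF = float("inf")
    best = [INF] * (n + 1)        # best[i] = min cost to schedule tasks 0..i-1
    best[0] = 0.0
    for i in range(1, n + 1):
        used = 0
        for j in range(i - 1, -1, -1):       # tasks j..i-1 in one configuration
            used += areas[j]
            if used > fpga_area:
                break                         # configuration no longer fits
            cut_cost = comm[j - 1] if j > 0 else 0.0
            best[i] = min(best[i], best[j] + reconfig_time + cut_cost)
    return best[n]

# Grouping tasks {0,1} and {2,3} avoids the two expensive cuts here.
print(schedule(areas=[3, 4, 2, 5], comm=[10.0, 2.0, 8.0, 0.0],
               fpga_area=7, reconfig_time=1.0))
```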


IEEE Photonics Journal | 2015

The Case for Hybrid Photonic Plasmonic Interconnects (HyPPIs): Low-Latency Energy-and-Area-Efficient On-Chip Interconnects

Shuai Sun; Abdel-Hameed A. Badawy; Vikram K. Narayana; Tarek A. El-Ghazawi; Volker J. Sorger

Moore's law for traditional electronic integrated circuits is facing growing challenges in both physics and economics. Among those challenges is the fact that the bandwidth per compute operation on chip is dropping, whereas the energy needed for data movement keeps rising. We benchmark various interconnect technologies, including electrical, photonic, and plasmonic options. We contrast them with hybrid photonic-plasmonic interconnects (HyPPIs), in which plasmonics is used for active manipulation devices and photonics for passive propagation elements, and we further propose another novel hybrid link that utilizes an on-chip laser for intrinsic modulation, thus bypassing electro-optic modulation. Our analysis shows that such hybridization overcomes the shortcomings of both pure photonic and pure plasmonic links. It also shows superiority across a variety of performance parameters, such as point-to-point latency, energy efficiency, throughput, energy-delay product, crosstalk coupling length, and bit flow density, a new metric we define to capture the tradeoff between footprint and performance. Our proposed HyPPIs show significantly superior performance compared with other links.
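As a rough illustration of how such links might be compared, the snippet below computes an energy-delay product and a throughput-per-footprint ratio in the spirit of the bit flow density metric; the device numbers are invented, and the paper's exact definitions may differ.

```python
# Back-of-the-envelope comparison of interconnect figures of merit.
# All device numbers below are made up for illustration only.
links = {
    #             latency (s), energy/bit (J), throughput (bit/s), footprint (m^2)
    "electrical": (5e-10, 1e-12, 1e10, 1e-12),
    "photonic":   (1e-10, 5e-13, 4e10, 1e-10),
    "hybrid":     (1e-10, 2e-13, 4e10, 2e-11),
}
for name, (lat, epb, bw, area) in links.items():
    edp = epb * lat          # energy-delay product per bit
    bfd = bw / area          # throughput per footprint ("bit flow density" spirit)
    print(f"{name:10s} EDP={edp:.2e} J*s  bit-flow-density={bfd:.2e} bit/s/m^2")
```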


Computing Frontiers | 2011

Scaling scientific applications on clusters of hybrid multicore/GPU nodes

Lingyuan Wang; Miaoqing Huang; Vikram K. Narayana; Tarek A. El-Ghazawi

Rapid advances in the performance and programmability of graphics accelerators have made GPU computing a compelling solution for a wide variety of application domains. However, the increased complexity resulting from architectural heterogeneity and imbalances in hardware resources poses significant programming challenges in harnessing the performance advantages of GPU-accelerated parallel systems. Moreover, the speedup derived from GPUs is often offset by longer communication latencies and inefficient task scheduling. To achieve the best possible performance, a suitable parallel programming model is therefore essential. In this paper, we explore a new hybrid parallel programming model that incorporates GPU acceleration into the Partitioned Global Address Space (PGAS) programming paradigm. As we demonstrate through a case study combining Unified Parallel C (UPC) and CUDA, this hybrid model offers programmers both enhanced programmability and powerful heterogeneous execution. Two application benchmarks, the NAS Parallel Benchmark (NPB) FT and MG, are used to show the effectiveness of the proposed hybrid approach. Experimental results indicate that both implementations achieve significantly better performance due to optimization opportunities offered by the hybrid model, such as the funneled execution mode and fine-grained overlapping of communication and computation.
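The funneled execution mode and the communication/computation overlap can be sketched with plain Python threads standing in for UPC threads and CUDA streams; this is only an illustration of the pipelining pattern, not the benchmarked UPC+CUDA code.

```python
# Funneled execution: one thread owns the accelerator, while the main thread
# "communicates" the next chunk of data, giving fine-grained overlap.
import threading, queue, time

chunks = queue.Queue()

def gpu_worker():
    """Funneled GPU work: a single thread owns the accelerator."""
    while True:
        chunk = chunks.get()
        if chunk is None:              # shutdown sentinel
            break
        time.sleep(0.01)               # stand-in for a CUDA kernel on this chunk
        print(f"computed chunk {chunk}")

t = threading.Thread(target=gpu_worker)
t.start()
for i in range(4):                     # while the worker computes chunk i-1,
    time.sleep(0.01)                   # the main thread "communicates" chunk i
    print(f"communicated chunk {i}")
    chunks.put(i)
chunks.put(None)
t.join()
```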


IEEE Transactions on Computers | 2012

Efficient Mapping of Task Graphs onto Reconfigurable Hardware Using Architectural Variants

Miaoqing Huang; Vikram K. Narayana; Mohamed Bakhouya; Jaafar Gaber; Tarek A. El-Ghazawi

High-performance reconfigurable computing involves acceleration of significant portions of an application using reconfigurable hardware. Mapping application task graphs onto reconfigurable hardware is therefore attracting growing attention. In this work, we approach the mapping problem by incorporating multiple architectural variants for each hardware task; the variants reflect tradeoffs between the logic resources consumed and the task execution throughput. We propose a mapping approach based on a genetic algorithm and show its effectiveness for random task graphs as well as an N-body simulation application, demonstrating improvements of up to 78.6 percent in execution time compared with choosing a fixed implementation variant for all tasks. We then validate our methodology through experiments on real hardware, an SRC-6 reconfigurable computer.
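A stripped-down sketch of the search idea, assuming a tiny invented variant library and the simplification that all chosen implementations must fit on the FPGA at once, looks roughly like this (it is not the paper's algorithm):

```python
# Genetic algorithm over architectural variants: each task has several
# (area, time) implementations; a chromosome picks one variant per task,
# and fitness penalizes mappings whose total area exceeds the FPGA.
import random

variants = [  # per task: list of (area, exec_time) tradeoffs (illustrative)
    [(2, 9.0), (4, 5.0), (7, 3.0)],
    [(1, 6.0), (3, 2.5)],
    [(2, 8.0), (5, 4.0), (8, 2.0)],
]
FPGA_AREA = 12

def fitness(chrom):
    area = sum(variants[t][g][0] for t, g in enumerate(chrom))
    time = sum(variants[t][g][1] for t, g in enumerate(chrom))
    return time + (1000 if area > FPGA_AREA else 0)   # infeasible => penalty

def mutate(chrom):
    chrom = list(chrom)
    t = random.randrange(len(chrom))
    chrom[t] = random.randrange(len(variants[t]))
    return tuple(chrom)

random.seed(0)
pop = [tuple(random.randrange(len(v)) for v in variants) for _ in range(20)]
for _ in range(100):
    pop.sort(key=fitness)                              # keep the fittest half,
    pop = pop[:10] + [mutate(random.choice(pop[:10]))  # refill with mutants
                      for _ in range(10)]
best = min(pop, key=fitness)
print("variant choice per task:", best, "fitness:", fitness(best))
```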


International Conference on Parallel and Distributed Systems | 2011

A Static Task Scheduling Framework for Independent Tasks Accelerated Using a Shared Graphics Processing Unit

Teng Li; Vikram K. Narayana; Tarek A. El-Ghazawi

The High Performance Computing (HPC) field is witnessing the increasing use of Graphics Processing Units (GPUs) as application accelerators, due to their massively data-parallel computing architectures and exceptional floating-point computational capabilities. The performance advantage of GPU-based acceleration is primarily derived from GPU computational kernels that operate on large amounts of data, consuming all of the available GPU resources. For applications that consist of several independent computational tasks that do not individually occupy the entire GPU, using the GPU sequentially, one task at a time, leads to performance inefficiencies. It is therefore important for the programmer to cluster small tasks together to share the GPU; however, the best performance cannot be achieved through ad hoc grouping and execution of these tasks. In this paper, we explore the problem of GPU task scheduling, allowing multiple tasks to efficiently share the GPU and execute on it in parallel. We analyze the factors affecting multi-tasking parallelism and performance, and then develop a multi-tasking execution model as a performance prediction approach. The model is validated by comparison against actual GPU-sharing execution scenarios. We then present a scheduling technique and algorithm based on the proposed model, followed by experimental verification of the proposed approach using an NVIDIA Fermi GPU computing node. Our results demonstrate significant performance improvements using the proposed scheduling approach, compared with sequential execution of the tasks under the conventional multi-tasking execution scenario.
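The intuition behind clustering can be sketched with a first-fit-decreasing heuristic standing in for the paper's scheduling algorithm; the task profiles, and the assumption that a group costs roughly its longest member's runtime, are illustrative.

```python
# Small kernels that each occupy only a fraction of the GPU are grouped so a
# group's combined occupancy stays under 100%; each group then costs roughly
# its longest member's runtime. All task profiles below are invented.
tasks = [  # (name, fraction of GPU resources needed, standalone time)
    ("a", 0.50, 4.0), ("b", 0.40, 3.0), ("c", 0.30, 5.0),
    ("d", 0.25, 2.0), ("e", 0.20, 1.0),
]
groups = []
for name, frac, t in sorted(tasks, key=lambda x: -x[1]):  # biggest first
    for g in groups:
        if g["frac"] + frac <= 1.0:        # fits alongside current members
            g["frac"] += frac
            g["members"].append(name)
            g["time"] = max(g["time"], t)
            break
    else:
        groups.append({"frac": frac, "time": t, "members": [name]})

shared = sum(g["time"] for g in groups)
serial = sum(t for _, _, t in tasks)
print(groups)
print(f"predicted shared-GPU time {shared} vs sequential {serial}")
```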


Computing Frontiers | 2010

Efficient cache design for solid-state drives

Miaoqing Huang; Olivier Serres; Vikram K. Narayana; Tarek A. El-Ghazawi; Gregory B. Newby

Solid-State Drives (SSDs) are data storage devices that use solid-state memory to store persistent data. Flash memory is the de facto nonvolatile technology used in most SSDs. It is well known that the write performance of flash-based SSDs is much lower than their read performance, because a flash page can be written only after it is erased. In this work, we present an SSD cache architecture designed to provide balanced read/write performance for flash memory. An efficient automatic updating technique is proposed to make the SSD more responsive by writing back stable but dirty flash pages, according to a predetermined set of policies, during the device's idle time. These automatic updating policies are also tested and compared. Simulation results demonstrate that both read and write performance improve significantly when the proposed cache with the automatic updating feature is incorporated into SSDs.
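A toy model of the idle-time write-back idea, with invented policy thresholds and a plain LRU cache standing in for the proposed architecture, might look like this:

```python
# Dirty pages that have been quiet for a while ("stable") are written back to
# flash during idle time, so later evictions do not stall on slow erase/program
# cycles. Capacity and threshold values are invented for illustration.
from collections import OrderedDict

class SSDCache:
    def __init__(self, capacity=4, stable_after=2):
        self.pages = OrderedDict()   # page -> (dirty, idle_ticks), LRU order
        self.capacity, self.stable_after = capacity, stable_after

    def write(self, page):
        self.pages.pop(page, None)
        self.pages[page] = (True, 0)        # most recently used, freshly dirty
        if len(self.pages) > self.capacity:
            victim, (dirty, _) = self.pages.popitem(last=False)
            if dirty:
                print(f"evict {victim}: forced slow flash write")

    def idle_tick(self):
        """Called when the device is idle: age pages, flush stable dirty ones."""
        for page, (dirty, idle) in list(self.pages.items()):
            idle += 1
            if dirty and idle >= self.stable_after:
                print(f"idle write-back of stable page {page}")
                dirty = False
            self.pages[page] = (dirty, idle)

cache = SSDCache()
for p in ("A", "B", "C"):
    cache.write(p)
cache.idle_tick(); cache.idle_tick()   # A-C flushed in the background
for p in ("D", "E", "F"):              # later evictions are now clean and cheap
    cache.write(p)
```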


Computing Frontiers | 2014

Symbiotic scheduling of concurrent GPU kernels for performance and energy optimizations

Teng Li; Vikram K. Narayana; Tarek A. El-Ghazawi

The incorporation of GPUs as co-processors has brought forth significant performance improvements for High-Performance Computing (HPC). Efficient utilization of GPU resources is thus an important consideration for computer scientists. In order to obtain the required performance while limiting energy consumption, researchers and vendors alike are seeking to apply traditional CPU approaches to the GPU computing domain. For instance, newer NVIDIA GPUs support concurrent execution of independent kernels as well as Dynamic Voltage and Frequency Scaling (DVFS). Amidst these developments, we are faced with new opportunities for efficiently scheduling GPU computational kernels under performance and energy constraints. In this paper, we carry out performance and energy optimizations geared towards the execution phases of concurrent kernels in GPU-based computing. When multiple GPU kernels are enqueued for concurrent execution, the sequence in which they are initiated can significantly affect the total execution time and the energy consumption. We attribute this behavior to the relative synergy among kernels launched in close proximity to each other. Accordingly, we define metrics for computing the extent to which kernels are symbiotic, by modeling their complementary resource requirements and execution characteristics. We then propose a symbiotic scheduling algorithm to obtain the best possible kernel launch sequence for concurrent execution. Experimental results on the NVIDIA K20 GPU demonstrate the efficacy of the proposed algorithm, showing near-optimal results within the solution space for both performance and energy consumption. Since our further experimental study of DVFS finds that increasing the GPU frequency generally improves both performance and energy savings, the proposed approach reduces the need for over-clocking and can be readily adopted by programmers with minimal programming effort and risk.
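One plausible way to operationalize such a symbiosis metric, with invented kernel profiles and a scoring function that simply rewards complementary compute/memory demands, is sketched below; the paper's actual metrics and algorithm differ.

```python
# Kernels with complementary demands (one compute-bound, one memory-bound)
# overlap better, so the launch order is chosen to place complementary
# kernels next to each other. Profiles and scoring are invented.
from itertools import permutations

kernels = {  # name -> (compute intensity, memory intensity), both in [0, 1]
    "fft":  (0.9, 0.3),
    "spmv": (0.2, 0.9),
    "gemm": (0.8, 0.4),
    "scan": (0.3, 0.8),
}

def symbiosis(a, b):
    """Higher when the two kernels stress different resources."""
    (ca, ma), (cb, mb) = kernels[a], kernels[b]
    return abs(ca - cb) + abs(ma - mb)

def sequence_score(order):
    return sum(symbiosis(x, y) for x, y in zip(order, order[1:]))

# Exhaustive search is fine at this scale; the paper's algorithm avoids it.
best = max(permutations(kernels), key=sequence_score)
print("best launch order:", best, "score:", round(sequence_score(best), 2))
```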


Reconfigurable Computing and FPGAs | 2011

An Architecture for Reconfigurable Multi-core Explorations

Olivier Serres; Vikram K. Narayana; Tarek A. El-Ghazawi

Multi-core systems are now the norm, and reconfigurable systems have shown substantial benefits over general-purpose ones. This paper presents a combination of the two: a fully featured reconfigurable multi-core processor based on the Leon3 processor. The platform provides cache coherency, runs a modern OS (GNU/Linux), and attaches a tightly coupled reconfigurable coprocessor unit to each core, allowing the SPARC instruction set to be extended for the running application. The multi-core reconfigurable processor architecture, including the coprocessor interface, the ICAP controller, and the Linux kernel driver, is presented. Experimental results characterize the platform, including area cost, memory contention, and reprogramming cost. Speedups of up to 100x are demonstrated on a cryptography test.


Field-Programmable Custom Computing Machines | 2009

Efficient Mapping of Hardware Tasks on Reconfigurable Computers Using Libraries of Architecture Variants

Miaoqing Huang; Vikram K. Narayana; Tarek A. El-Ghazawi

Scheduling and partitioning of task graphs on reconfigurable hardware needs to be carried out carefully in order to achieve the best possible performance. In this paper, we demonstrate that a significant improvement in total execution time is possible by incorporating a library of hardware task implementations that contains multiple architectural variants for each hardware task, reflecting tradeoffs between resource utilization and task execution throughput. We develop a genetic-algorithm-based mapping approach that considers both the task graph and the target platform, and we present results for an N-body simulation application using estimated resource utilization for the constituent tasks and actual architectural constraints from different reconfigurable platforms. The results demonstrate improvements of up to 85.3% in execution time, compared with choosing a fixed implementation variant for each task, while keeping the search time reasonable.
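Complementing the genetic-algorithm sketch above, a brute-force search over a tiny invented variant library shows why mixing variants can beat any single fixed choice (the constraint that all chosen implementations fit on the device at once is a simplification):

```python
# With a small library, exhaustive search over variant choices demonstrates
# the benefit of mixing variants. Areas and times are invented.
from itertools import product

library = [  # per task: (area, exec_time) variants; fast variants cost area
    [(2, 10.0), (5, 4.0)],
    [(3, 8.0),  (6, 3.0)],
    [(2, 6.0),  (4, 2.5)],
]
AREA = 12

def total_time(choice):
    if sum(library[t][v][0] for t, v in enumerate(choice)) > AREA:
        return float("inf")                      # mapping does not fit
    return sum(library[t][v][1] for t, v in enumerate(choice))

fixed = min(total_time((v,) * len(library)) for v in (0, 1))
mixed = min(total_time(c) for c in product(*[range(len(v)) for v in library]))
print(f"best fixed-variant time {fixed}, best mixed {mixed}, "
      f"improvement {100 * (fixed - mixed) / fixed:.1f}%")
```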

Collaboration


Dive into Vikram K. Narayana's collaborations.

Top Co-Authors

Tarek A. El-Ghazawi (George Washington University)
Volker J. Sorger (George Washington University)
Shuai Sun (George Washington University)
Teng Li (George Washington University)
Jiaxin Peng (George Washington University)
Armin Mehrabian (George Washington University)
Olivier Serres (George Washington University)
Ruoyu Zhang (George Washington University)