
Publication


Featured research published by Nagavijayalakshmi Vydyanathan.


ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming | 2009

Compiler-assisted dynamic scheduling for effective parallelization of loop nests on multicore processors

Muthu Manikandan Baskaran; Nagavijayalakshmi Vydyanathan; Uday Bondhugula; J. Ramanujam; Atanas Rountev; P. Sadayappan

Recent advances in polyhedral compilation technology have made it feasible to automatically transform affine sequential loop nests for tiled parallel execution on multi-core processors. However, for multi-statement input programs with statements of different dimensionalities, such as Cholesky or LU decomposition, the parallel tiled code generated by existing automatic parallelization approaches may suffer from significant load imbalance, resulting in poor scalability on multi-core systems. In this paper, we develop a completely automatic parallelization approach for transforming input affine sequential codes into efficient parallel codes that can be executed on a multi-core system in a load-balanced manner. In our approach, we employ a compile-time technique that enables dynamic extraction of inter-tile dependences at run-time, and dynamic scheduling of the parallel tiles on the processor cores for improved scalable execution. Our approach obviates the need for programmer intervention and re-writing of existing algorithms for efficient parallel execution on multi-cores. We demonstrate the usefulness of our approach through comparisons using linear algebra computations: LU and Cholesky decomposition.
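The run-time half of this approach can be illustrated with a small sequential sketch (hypothetical names; the paper's compile-time technique generates the dependence logic automatically and executes tiles on multiple cores): each tile carries a counter of unsatisfied inter-tile dependences, and completing a tile releases its successors into a ready queue.

```python
from collections import deque

def schedule_tiles(n):
    """Dynamically schedule an n x n tile space in which tile (i, j)
    depends on its left (i, j-1) and upper (i-1, j) neighbours, as in
    a wavefront arising from LU/Cholesky-style tiled codes."""
    # Remaining-dependence counters, extracted at run time.
    deps = {(i, j): (i > 0) + (j > 0) for i in range(n) for j in range(n)}
    ready = deque(t for t, d in deps.items() if d == 0)  # initially only (0, 0)
    order = []
    while ready:
        i, j = ready.popleft()
        order.append((i, j))          # "execute" the tile
        for succ in ((i + 1, j), (i, j + 1)):
            if succ in deps:          # releasing a tile may ready its successors
                deps[succ] -= 1
                if deps[succ] == 0:
                    ready.append(succ)
    return order

order = schedule_tiles(3)
```

In the paper's setting the ready queue is shared among cores, so load balance emerges from whichever core is free grabbing the next ready tile.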


IEEE Transactions on Parallel and Distributed Systems | 2009

An Integrated Approach to Locality-Conscious Processor Allocation and Scheduling of Mixed-Parallel Applications

Nagavijayalakshmi Vydyanathan; Sriram Krishnamoorthy; Gerald Sabin; Tahsin M. Kurç; P. Sadayappan; Joel H. Saltz

Complex parallel applications can often be modeled as directed acyclic graphs of coarse-grained application tasks with dependences. These applications exhibit both task and data parallelism, and combining these two (also called mixed parallelism) has been shown to be an effective model for their execution. In this paper, we present an algorithm to compute the appropriate mix of task and data parallelism required to minimize the parallel completion time (makespan) of these applications. In other words, our algorithm determines the set of tasks that should be run concurrently and the number of processors to be allocated to each task. The processor allocation and scheduling decisions are made in an integrated manner and are based on several factors such as the structure of the task graph, the runtime estimates and scalability characteristics of the tasks, and the intertask data communication volumes. A locality-conscious scheduling strategy is used to improve intertask data reuse. Evaluation through simulations and actual executions of task graphs derived from real applications and synthetic graphs shows that our algorithm consistently generates schedules with a lower makespan as compared to Critical Path Reduction (CPR) and Critical Path and Allocation (CPA), two previously proposed scheduling algorithms. Our algorithm also produces schedules that have a lower makespan than pure task- and data-parallel schedules. For task graphs with known optimal schedules or lower bounds on the makespan, our algorithm generates schedules that are closer to the optima than other scheduling approaches.
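The core allocation trade-off can be sketched in a few lines (an illustrative toy under an Amdahl-style runtime model, not the paper's algorithm): when two independent tasks run concurrently, the makespan is minimized by a processor split that balances their finish times against their scalability.

```python
def runtime(work, serial_frac, p):
    """Amdahl-style execution-time model for a data-parallel task on p processors."""
    return work * (serial_frac + (1 - serial_frac) / p)

def best_split(tasks, P):
    """For two tasks run concurrently (task parallelism), exhaustively pick
    the data-parallel processor split minimising the makespan, i.e. the
    slower task's finish time. Each task is (work, serial_frac)."""
    best = None
    for p1 in range(1, P):
        t1 = runtime(*tasks[0], p1)
        t2 = runtime(*tasks[1], P - p1)
        makespan = max(t1, t2)
        if best is None or makespan < best[0]:
            best = (makespan, p1, P - p1)
    return best

# A large scalable task and a smaller, less scalable one on 8 processors.
best = best_split([(100, 0.10), (50, 0.05)], 8)
```

The real algorithm makes this decision across a whole DAG, folding in precedence constraints, communication volumes, and data locality rather than just two tasks.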


IEEE International Conference on High Performance Computing, Data, and Analytics | 2009

Towards a robust, real-time face processing system using CUDA-enabled GPUs

Bharatkumar Sharma; Rahul Thota; Nagavijayalakshmi Vydyanathan; Amit A. Kale

Processing of human faces finds application in various domains such as law enforcement and surveillance, entertainment (interactive video games), information security, and smart cards. Several of these applications are interactive and require reliable, fast face processing. A generic face processing system may comprise face detection, recognition, tracking, and rendering. In this paper, we develop a GPU-accelerated, robust, real-time face processing system that performs face detection and tracking. Face detection is done by adapting the Viola and Jones algorithm, which is based on the AdaBoost learning system. For robust tracking of faces across real-life illumination conditions, we leverage the algorithm proposed by Thota et al., which combines the strengths of AdaBoost and an image-based parametric illumination model. We design and develop optimized parallel implementations of these algorithms on graphics processors using the Compute Unified Device Architecture (CUDA), a C-based programming model from NVIDIA. We evaluate our face processing system using both static image databases and live frames captured from a FireWire camera under realistic conditions. Our experimental results indicate that our parallel face detector and tracker achieve much higher detection speeds than existing work while maintaining accuracy. We also demonstrate that our tracking system is robust to extreme illumination conditions.
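The Viola-Jones detector's Haar-like features are box sums over an integral image, which is the natural target for GPU parallelization; a sequential sketch of that primitive (the paper's CUDA kernels parallelize it, e.g. via parallel prefix sums):

```python
def integral_image(img):
    """Summed-area table: ii[y][x] = sum of img[0..y][0..x]."""
    h, w = len(img), len(img[0])
    ii = [[0] * w for _ in range(h)]
    for y in range(h):
        row = 0
        for x in range(w):
            row += img[y][x]                       # running sum along the row
            ii[y][x] = row + (ii[y - 1][x] if y else 0)
    return ii

def box_sum(ii, x0, y0, x1, y1):
    """Sum over rectangle [x0..x1] x [y0..y1] in O(1) via four lookups,
    which is what makes evaluating thousands of Haar features cheap."""
    s = ii[y1][x1]
    if x0:
        s -= ii[y1][x0 - 1]
    if y0:
        s -= ii[y0 - 1][x1]
    if x0 and y0:
        s += ii[y0 - 1][x0 - 1]
    return s
```

On the GPU, each thread can then evaluate the cascade at a different sliding-window position, all reading the same integral image.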


Cluster Computing and the Grid | 2005

A hypergraph partitioning based approach for scheduling of tasks with batch-shared I/O

Gaurav Khanna; Nagavijayalakshmi Vydyanathan; Tahsin M. Kurç; Pete Wyckoff; Joel H. Saltz; P. Sadayappan

This paper proposes a novel, hypergraph partitioning based strategy to schedule multiple data analysis tasks with batch-shared I/O behavior. This strategy formulates the sharing of files among tasks as a hypergraph to minimize the I/O overheads due to transferring the same set of files multiple times, and employs a dynamic scheme for file transfers to reduce contention on the storage system. We experimentally evaluate the proposed approach using application emulators from two application domains: analysis of remotely sensed data and biomedical imaging.
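The objective being minimized can be sketched as follows (hypothetical names, simplified model): tasks are hypergraph vertices and each shared file is a net spanning the tasks that read it; a file must be transferred once per batch (partition part) that touches it, so the sharing overhead is the classic connectivity-minus-one cut metric.

```python
def cut_cost(task_files, file_sizes, parts):
    """Connectivity-based cost of a task partition: each file is staged
    once per distinct batch that needs it, so the redundant-transfer
    overhead is sum over files of size * (connectivity - 1)."""
    cost = 0
    for f, size in file_sizes.items():
        # The set of batches (parts) whose tasks read file f.
        spans = {parts[t] for t, files in task_files.items() if f in files}
        if spans:
            cost += size * (len(spans) - 1)
    return cost

# Three tasks sharing two files of different sizes (illustrative data).
task_files = {'t1': {'a', 'b'}, 't2': {'b'}, 't3': {'a'}}
file_sizes = {'a': 10, 'b': 5}
```

A hypergraph partitioner chooses `parts` to minimize exactly this cost subject to balance constraints, which here means grouping tasks that share large files into the same batch.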


International Conference on Parallel Processing | 2006

An Integrated Approach for Processor Allocation and Scheduling of Mixed-Parallel Applications

Nagavijayalakshmi Vydyanathan; Sriram Krishnamoorthy; Gerald Sabin; Tahsin M. Kurç; P. Sadayappan; Joel H. Saltz

Computationally complex applications can often be viewed as a collection of coarse-grained data-parallel tasks with precedence constraints. Researchers have shown that combining task and data parallelism (mixed parallelism) can be an effective approach for executing these applications, as compared to pure task or data parallelism. In this paper, we present an approach to determine the appropriate mix of task and data parallelism, i.e., the set of tasks that should be run concurrently and the number of processors to be allocated to each task. An iterative algorithm is proposed that couples processor allocation and scheduling of mixed-parallel applications on compute clusters so as to minimize the parallel completion time (makespan). Our algorithm iteratively reduces the makespan by increasing the degree of data parallelism of tasks on the critical path that have good scalability and a low degree of potential task parallelism. The approach employs a look-ahead technique to escape local minima and uses priority-based backfill scheduling to efficiently schedule the parallel tasks onto processors. Evaluation using benchmark task graphs derived from real applications as well as synthetic graphs shows that our algorithm consistently performs better than CPR and CPA, two previously proposed scheduling schemes, as well as pure task and data parallelism.


European Conference on Parallel Processing | 2007

Toward optimizing latency under throughput constraints for application workflows on clusters

Nagavijayalakshmi Vydyanathan; Tahsin M. Kurç; P. Sadayappan; Joel H. Saltz

In many application domains, it is desirable to meet some user-defined performance requirement while minimizing resource usage and optimizing additional performance parameters. For example, application workflows with real-time constraints may have strict throughput requirements and desire a low latency or response time. The structure of these workflows can be represented as directed acyclic graphs of coarse-grained application tasks with data dependences. In this paper, we develop a novel mapping and scheduling algorithm that minimizes the latency of workflows that act on a stream of input data, while satisfying throughput requirements. The algorithm employs pipelined parallelism and intelligent clustering and replication of tasks to meet throughput requirements. Latency is minimized by exploiting task parallelism and reducing communication overheads. Evaluation using synthetic benchmarks and application task graphs shows that our algorithm 1) consistently meets throughput requirements even when other existing schemes fail, 2) produces lower-latency schedules, and 3) results in lower resource usage.
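A toy model of the throughput/latency interplay (illustrative only, not the paper's algorithm): in a linear pipeline, throughput is limited by the slowest stage, so replicating a bottleneck stage divides its effective service time across replicas, while the end-to-end latency remains the sum of per-stage times because each item still traverses every stage once.

```python
import math

def meet_throughput(stage_times, period):
    """Choose replica counts so every pipeline stage sustains one item
    per `period` time units: a stage with service time t needs
    ceil(t / period) round-robin replicas. Returns the replica counts,
    the (unchanged) end-to-end latency, and the achieved period."""
    replicas = [max(1, math.ceil(t / period)) for t in stage_times]
    latency = sum(stage_times)  # each item flows through every stage once
    achieved_period = max(t / r for t, r in zip(stage_times, replicas))
    return replicas, latency, achieved_period

# A 3-stage pipeline with a bottleneck middle stage, target period 2.
replicas, latency, achieved = meet_throughput([2, 5, 1], 2)
```

The paper's algorithm additionally clusters tasks to cut communication and exploits task parallelism within stages, so its latency can drop below the naive stage-time sum.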


International Conference on Parallel Processing | 2008

A Duplication Based Algorithm for Optimizing Latency Under Throughput Constraints for Streaming Workflows

Nagavijayalakshmi Vydyanathan; Tahsin M. Kurç; P. Sadayappan; Joel H. Saltz

Scheduling, in many application domains, involves the optimization of multiple performance metrics. For example, application workflows with real-time constraints have strict throughput requirements and also desire a low latency or response time. In this paper, we present a novel algorithm for the scheduling of workflows that act on a stream of input data. Our algorithm focuses on two performance metrics, latency and throughput, and minimizes the latency of workflows while satisfying strict throughput requirements. We leverage pipelined, task, and data parallelism in a coordinated manner to meet these objectives, and investigate the benefit of task duplication in alleviating communication overheads in the pipelined schedule for different workflow characteristics. The proposed algorithm is designed for a realistic k-port communication model, where each processor can simultaneously communicate with at most k distinct processors. Evaluation using synthetic and application benchmarks shows that our algorithm consistently produces lower-latency schedules and meets throughput requirements, even when previously proposed schemes fail.


High Performance Distributed Computing | 2006

Task Scheduling and File Replication for Data-Intensive Jobs with Batch-shared I/O

Gaurav Khanna; Nagavijayalakshmi Vydyanathan; Tahsin M. Kurç; Sriram Krishnamoorthy; P. Sadayappan; Joel H. Saltz

This paper addresses the problem of efficient execution of a batch of data-intensive tasks with batch-shared I/O behavior, on coupled storage and compute clusters. Two scheduling schemes are proposed: 1) a 0-1 integer programming (IP) based approach, which couples task scheduling and data replication, and 2) a bi-level hypergraph partitioning based heuristic approach (BiPartition), which decouples task scheduling and data replication. The experimental results show that: 1) the IP scheme achieves the best batch execution time, but has significant scheduling overhead, thereby restricting its application to small-scale workloads, and 2) the BiPartition scheme is a better fit for larger workloads and systems: it has very low scheduling overhead and no more than 5-10% degradation in solution quality, when compared with the IP-based approach.
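The coupling of task placement and file staging that the 0-1 IP captures can be illustrated with a brute-force toy (hypothetical names; realistic instances need an IP solver or the BiPartition heuristic): co-locating tasks that share a file saves staging cost, but serializes their computation on one host.

```python
from itertools import product

def best_assignment(task_files, file_sizes, compute, n_hosts):
    """Exhaustively search task-to-host assignments (the decision the
    0-1 IP encodes) minimising makespan = per-host compute time plus
    the cost of staging each needed file once per host that uses it."""
    tasks = list(task_files)
    best = None
    for assign in product(range(n_hosts), repeat=len(tasks)):
        load = [0.0] * n_hosts
        staged = [set() for _ in range(n_hosts)]
        for t, h in zip(tasks, assign):
            load[h] += compute[t]          # compute serializes per host
            staged[h] |= task_files[t]     # a file is staged at most once per host
        for h in range(n_hosts):
            load[h] += sum(file_sizes[f] for f in staged[h])
        makespan = max(load)
        if best is None or makespan < best[0]:
            best = (makespan, dict(zip(tasks, assign)))
    return best

# Two tasks sharing one file: is the saved transfer worth serializing?
best = best_assignment({'t1': {'a'}, 't2': {'a'}}, {'a': 4},
                       {'t1': 3, 't2': 3}, 2)
```

Here staging the file twice (one copy per host) beats co-location, because the duplicated transfer is cheaper than serializing the two computations; with a larger file the balance can flip, which is exactly the trade-off the IP optimizes globally.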


Grid Computing | 2012

An evaluation of CUDA-enabled virtualization solutions

M S Vinaya; Nagavijayalakshmi Vydyanathan; Mrugesh Gajjar

Virtualization, as a technology that enables easy and effective resource sharing with a low cost and energy footprint, is becoming increasingly popular not only in enterprises but also in high performance computing. Applications with stringent performance needs often make use of graphics processors for accelerating their computations. Hence virtualization solutions that support GPU acceleration are gaining importance. This paper performs a detailed evaluation of three frameworks: rCUDA, gVirtuS and Xen, which support GPU acceleration through CUDA, within a virtual machine. We describe the architectures of these three solutions and compare and contrast them in terms of their fidelity, performance, multiplexing and interposition characteristics.


International Conference on Cluster Computing | 2006

Locality Conscious Processor Allocation and Scheduling for Mixed Parallel Applications

Nagavijayalakshmi Vydyanathan; Sriram Krishnamoorthy; Gerald Sabin; Tahsin M. Kurç; P. Sadayappan; Joel H. Saltz

Complex applications can often be viewed as a collection of coarse-grained data-parallel application components with precedence constraints. It has been shown that combining task and data parallelism (mixed parallelism) can be an effective execution paradigm for these applications. In this paper, we present an algorithm to compute the appropriate mix of task and data parallelism based on the scalability characteristics of the tasks as well as the intertask data communication costs, such that the parallel completion time (makespan) is minimized. The algorithm iteratively reduces the makespan by increasing the degree of data parallelism of tasks on the critical path that have good scalability and a low degree of potential task parallelism. Data communication costs along the critical path are minimized by exploiting parallel transfer mechanisms and use of a locality-conscious backfill scheduler. Evaluation using benchmark task graphs derived from real applications as well as synthetic graphs shows that our algorithm consistently performs better than previous scheduling schemes.

Collaboration


Dive into Nagavijayalakshmi Vydyanathan's collaborations.

Top Co-Authors


Joel H. Saltz

Ohio Supercomputer Center
