Vaibhav Saxena | Researchain

Publication

Featured research published by Vaibhav Saxena.

International Parallel and Distributed Processing Symposium | 2011

Multifrontal Factorization of Sparse SPD Matrices on GPUs

Thomas George; Vaibhav Saxena; Anshul Gupta; Amik Singh; Anamitra R. Choudhury

Solving large sparse linear systems is often the most computationally intensive component of many scientific computing applications. In the past, sparse multifrontal direct factorization has been shown to scale to thousands of processors on dedicated supercomputers resulting in a substantial reduction in computational time. In recent years, an alternative computing paradigm based on GPUs has gained prominence, primarily due to its affordability, power-efficiency, and the potential to achieve significant speedup relative to desktop performance on regular and structured parallel applications. However, sparse matrix factorization on GPUs has not been explored sufficiently due to the complexity involved in an efficient implementation and concerns of low GPU utilization. In this paper, we present an adaptive hybrid approach for accelerating sparse multifrontal factorization based on a judicious exploitation of the processing power of the host CPU and GPU. We present four different policies for distributing and scheduling the workload between the host CPU and the GPU, and propose a mechanism for a runtime selection of the appropriate policy for each step of sparse Cholesky factorization. This mechanism relies on auto-tuning based on modeling the best policy predictor as a parametric classifier. We estimate the classifier parameters from the available empirical computation time data such that the expected computation time is minimized. This approach is readily adaptable for using the current or an extended set of policies for different CPU-GPU combinations as well as for different combinations of dense kernels for both the CPU and the GPU.

European Conference on Parallel Processing | 2010

A Parallel GPU algorithm for mutual information based 3D nonrigid image registration

Vaibhav Saxena; Jonathan Rohrer; Leiguang Gong

Many applications in biomedical image analysis require alignment or fusion of images acquired with different devices or at different times. Image registration geometrically aligns images, allowing their fusion. Nonrigid techniques are usually required when the images contain anatomical structures of soft tissue. Nonrigid registration algorithms are very time consuming and can take hours to align a pair of 3D medical images on commodity workstation PCs. In this paper, we present a parallel design and implementation of 3D nonrigid image registration for Graphics Processing Units (GPUs). Existing GPU-based registration implementations are mainly limited to intra-modality registration problems. Our algorithm uses mutual information as the similarity metric and can process images of different modalities. The proposed design takes advantage of the highly parallel, multi-threaded architecture of the GPU, which contains a large number of processing cores. The paper presents optimization techniques that effectively utilize the high memory bandwidth provided by the GPU using on-chip shared memory and cooperative memory updates by multiple threads. Our results with the optimized GPU implementation showed an average performance of 2.46 microseconds per voxel and achieved a speedup factor of 28 over a CPU-based serial implementation. This improves the usability of nonrigid registration for some real-world clinical applications and enables new ones, especially within intra-operative scenarios, where strict timing constraints apply.
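
As an illustration of the similarity metric named in this abstract, the following is a minimal serial sketch of histogram-based mutual information between two intensity sequences. It is not the paper's GPU implementation; the function name `mutual_information`, the bin count, and the intensity range are assumptions chosen for the example.

```python
import math
from collections import Counter


def mutual_information(a, b, bins=8, max_value=256):
    """Estimate mutual information (in nats) of two equal-length intensity lists.

    Hypothetical helper for illustration only: intensities are assumed to lie
    in [0, max_value) and are quantized into `bins` equal-width bins.
    """
    if len(a) != len(b) or not a:
        raise ValueError("inputs must be non-empty and of equal length")
    width = max_value / bins
    # Quantize each intensity to a histogram bin index.
    qa = [min(int(x / width), bins - 1) for x in a]
    qb = [min(int(y / width), bins - 1) for y in b]
    n = len(qa)
    joint = Counter(zip(qa, qb))   # joint histogram counts
    count_a = Counter(qa)          # marginal histogram of the first image
    count_b = Counter(qb)          # marginal histogram of the second image
    mi = 0.0
    for (i, j), c in joint.items():
        p_ij = c / n
        # p_ij / (p_i * p_j) simplifies to c * n / (count_a[i] * count_b[j]).
        mi += p_ij * math.log(c * n / (count_a[i] * count_b[j]))
    return mi
```

For two identical images with two equally frequent intensity levels the estimate equals log 2, and for statistically independent intensities it is 0; a registration loop would search for the deformation that maximizes this value.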

IEEE International Conference on High Performance Computing, Data, and Analytics | 2010

Performance evaluation and optimization of random memory access on multicores with high productivity

Vaibhav Saxena; Yogish Sabharwal; Pramod Bhatotia

The slow progress in memory access latencies in comparison to CPU speeds has resulted in memory accesses dominating code performance. While architectural enhancements have benefited applications with data locality and sequential access, random memory access still remains a cause for concern. Several benchmarks have been proposed to evaluate the random memory access performance on multicore architectures. However, the performance evaluation models used by the existing benchmarks do not fully capture the varying types of random access behaviour arising in practical applications. In this paper, we propose a new model for evaluating the performance of random memory access that better captures the random access behaviour demonstrated by applications in practice. We use our model to evaluate the performance of two popular multicore architectures, the Cell and the GPU. We also suggest novel optimizations on these architectures that significantly boost the performance for random accesses in comparison to conventional architectures. Performance improvements on these architectures typically come at the cost of reduced productivity considering the extra programming effort involved. To address this problem, we propose libraries that incorporate these optimizations and provide innovatively designed programming interfaces that can be used by the applications to achieve good performance without loss of productivity.

IEEE International Conference on High Performance Computing, Data, and Analytics | 2012

A divide and conquer strategy for scaling weather simulations with multiple regions of interest

Preeti Malakar; Thomas George; Sameer Kumar; Rashmi Mittal; Vijay Natarajan; Yogish Sabharwal; Vaibhav Saxena; Sathish S. Vadhiyar

Accurate and timely prediction of weather phenomena, such as hurricanes and flash floods, requires high-fidelity, compute-intensive simulations of multiple finer regions of interest within a coarse simulation domain. Current weather applications execute these nested simulations sequentially using all the available processors, which is sub-optimal due to their sub-linear scalability. In this work, we present a strategy for parallel execution of multiple nested domain simulations based on partitioning the 2-D processor grid into disjoint rectangular regions associated with each domain. We propose a novel combination of performance prediction, processor allocation methods and topology-aware mapping of the regions on torus interconnects. Experiments on IBM Blue Gene systems using WRF show that the proposed strategies result in a performance improvement of up to 33% with topology-oblivious mapping and up to an additional 7% with topology-aware mapping over the default sequential strategy.

International Conference on Parallel Processing | 2012

Performance evaluation and optimization of nested high resolution weather simulations

Preeti Malakar; Vaibhav Saxena; Thomas George; Rashmi Mittal; Sameer Kumar; Abdul Ghani Naim; Saiful A. Husain

Weather models with high spatial and temporal resolutions are required for accurate prediction of meso-micro scale weather phenomena. Using these models for operational purposes requires forecasts with sufficient lead time, which in turn calls for large computational power. There exist many prior studies on the performance of weather models on single domain simulations with a uniform horizontal resolution. However, there has not been much work on high resolution nested domains that are essential for high-fidelity weather forecasts. In this paper, we focus on improving and analyzing the performance of nested domain simulations using WRF on IBM Blue Gene/P. We demonstrate a significant reduction (up to 29%) in runtime via a combination of compiler optimizations, mapping of process topology to the physical torus topology, overlapping communication with computation, and parallel communications along torus dimensions. We also conduct a detailed performance evaluation using four nested domain configurations to assess the benefits of the different optimizations as well as the scalability of different WRF operations. Our analysis indicates that the choice of nesting configuration is critical for good performance. To aid WRF practitioners in making this choice, we describe a performance modeling approach that can predict the total simulation time in terms of the domain and processor configurations with very high accuracy (error below 8%) using a regression-based model learned from empirical timing data.

IEEE International Conference on High Performance Computing, Data, and Analytics | 2008

Optimization of BLAS on the cell processor

Vaibhav Saxena; Prashant Agrawal; Yogish Sabharwal; Vijay K. Garg; Vimitha Kuruvilla; John A. Gunnels

The unique architecture of the heterogeneous multicore Cell processor offers great potential for high performance computing. It offers features such as high memory bandwidth using DMA, user-managed local stores and SIMD architecture. In this paper, we present strategies for leveraging these features to develop a high performance BLAS library. We propose techniques to partition and distribute data across SPEs for handling DMA efficiently. We show that suitable preprocessing of data leads to significant performance improvements when the data is unaligned. In addition, we use a combination of two kernels - a specialized high performance kernel for the more frequently occurring cases and a generic kernel for handling boundary cases - to obtain better performance. Using these techniques for double precision, we obtain up to 70-80% of peak performance for different memory bandwidth bound level 1 and 2 routines and up to 80-90% for computation bound level 3 routines.

IBM Journal of Research and Development | 2013

Enabling high-resolution forecasting of severe weather and flooding events in Rio de Janeiro

Lloyd A. Treinish; Anthony Paul Praino; James P. Cipriani; Ulisses T. Mello; Kiran Mantripragada; L. C. Villa Real; Paula Aida Sesini; Vaibhav Saxena; Thomas George; R. Mittal

Safe operation of many cities is affected by relative extremes in weather conditions. With precipitation events, local topography and weather influence water runoff and infiltration, which directly affect flooding. Hence, the availability of highly focused predictions has the potential to mitigate the impact of severe weather on a city. Often, such information is simply unavailable. The initial step to address this gap is the application of state-of-the-art weather models at an urban scale calibrated to address this mismatch. The generation of operational forecasts at such a scale for the Rio de Janeiro metropolitan area suggests a horizontal resolution of approximately 1 km and a vertical resolution in the lower boundary layer of tens of meters. Forecasting impacts from storm-driven flooding events requires the development of a coupled hydrological model that operates at a street scale with resolution of approximately 1 m, capturing local terrain effects and simulating surface flow and water accumulation, especially for overland flow and ponding depth. This coupled approach has enabled operational prediction of storm impacts on local infrastructure, as well as measurement of the model error associated with such forecasts.

IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing | 2015

Architecture Aware Resource Allocation for Structured Grid Applications: Flood Modelling Case

Vaibhav Saxena; Thomas George; Yogish Sabharwal; Lucas Correia Villa Real

Numerous problems in science and engineering involve discretizing the problem domain as a regular structured grid and make use of domain decomposition techniques to obtain solutions faster using high performance computing. However, load imbalance among the processing nodes can cause severe degradation in application performance. This problem is exacerbated when the computational workload is non-uniform and the processing nodes have varying computational capabilities. In this paper, we present novel local search algorithms for regular partitioning of a structured mesh to heterogeneous compute nodes in a distributed setting. The algorithms seek to assign larger workloads to processing nodes having higher computation capabilities while maintaining the regular structure of the mesh in order to achieve a better load balance. We also propose a distributed memory (MPI) parallelization architecture that can be used to achieve a parallel implementation of scientific modelling software requiring structured grids on heterogeneous processing resources involving CPUs and GPUs. Our implementation can make use of the available CPU cores and multiple GPUs of the underlying platform simultaneously. Empirical evaluation on real world flood modelling domains on a heterogeneous architecture comprising multicore CPUs and GPUs suggests that the proposed partitioning approach can provide a performance improvement of up to 8× over a naive uniform partitioning.
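
To make the idea of local search over a regular partition concrete, here is a deliberately simplified one-dimensional sketch: grid rows are split into contiguous strips, one per node, and cut positions are nudged by one row whenever that lowers the slowest node's estimated time. This is not the algorithm from the paper; the names `makespan` and `local_search_partition`, the per-row work values, and the relative node speeds are all assumptions made for illustration.

```python
def makespan(row_work, cuts, speeds):
    """Estimated time of the slowest node: strip work divided by node speed."""
    bounds = [0] + cuts + [len(row_work)]
    return max(
        sum(row_work[bounds[k]:bounds[k + 1]]) / speeds[k]
        for k in range(len(speeds))
    )


def local_search_partition(row_work, speeds):
    """Hill-climb over strip boundaries, starting from a uniform split.

    Hypothetical sketch: row_work[i] is the cost of grid row i and speeds[k]
    is the relative throughput of node k (both assumed known in advance).
    Returns (cuts, estimated_makespan).
    """
    n, p = len(row_work), len(speeds)
    if p < 1 or n < p:
        raise ValueError("need at least one row per node")
    cuts = [(k * n) // p for k in range(1, p)]  # naive uniform partition
    best = makespan(row_work, cuts, speeds)
    improved = True
    while improved:
        improved = False
        for i in range(len(cuts)):
            for step in (-1, 1):
                lo = cuts[i - 1] if i > 0 else 0
                hi = cuts[i + 1] if i + 1 < len(cuts) else n
                moved = cuts[i] + step
                if not lo < moved < hi:
                    continue  # every strip must keep at least one row
                candidate = cuts[:i] + [moved] + cuts[i + 1:]
                t = makespan(row_work, candidate, speeds)
                if t < best:  # accept only strictly improving moves
                    cuts, best, improved = candidate, t, True
    return cuts, best
```

With 12 rows of unit cost and two nodes of relative speed 1 and 3, the uniform split has an estimated makespan of 6.0, while the search ends at `([3], 3.0)`, giving the faster node three times as many rows. Like any hill-climbing scheme, this sketch can stop at a local optimum.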

IEEE International Conference on High Performance Computing, Data, and Analytics | 2013

Evaluation and enhancement of weather application performance on Blue Gene/Q

Gurbinder Gill; Vaibhav Saxena; Rashmi Mittal; Thomas George; Yogish Sabharwal; Lalit Dagar

Numerical weather prediction (NWP) models use mathematical models of the atmosphere to predict the weather. Ongoing efforts in the weather and climate community continuously try to improve the fidelity of weather models by employing higher order numerical methods suitable for solving model equations at high resolutions. In a realistic weather forecasting scenario, simulating and tracking multiple regions of interest (nests) at fine resolutions is important for understanding the interplay between multiple weather phenomena and for comprehensive predictions. These multiple regions of interest in a simulation can be significantly different in resolution and other modeling parameters. Currently, weather simulations involving these nested regions process them one after the other in a sequential fashion. There exists a lot of prior work on performance evaluation and optimization of weather models; however, most of this work is limited either to simulations involving a single domain or to multiple nests with the same resolution and model parameters, such as model physics options. In this paper, we evaluate and enhance the performance of the popular WRF model on the IBM Blue Gene/Q system. We consider nested simulations with multiple child domains and study how parameters such as physics options and simulation time steps for child domains affect the computational requirements. We also analyze how such configurations can benefit from parallel execution of the child domains rather than processing them sequentially. We demonstrate that it is important to allocate processors to nested child domains in proportion to the workload associated with them when executing them in parallel. This ensures that the time spent in the different nested simulations is nearly equal, and the nested domains reach the synchronization step with the parent simulation together. Our experimental evaluation using a simple heuristic for allocation of nodes shows that the performance of WRF simulations can be improved by up to 14% by parallel execution of sibling domains with different configurations of domain sizes, temporal resolutions and physics options.
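
The proportional allocation described in this abstract can be sketched as below. The abstract does not specify the heuristic used, so this is only one plausible scheme (largest-remainder rounding with at least one processor per domain); the function name `allocate_processors` and the workload estimates are assumptions.

```python
import math


def allocate_processors(workloads, total_procs):
    """Split total_procs among sibling domains in proportion to their workloads.

    Hypothetical helper: workloads[k] is an estimated cost for child domain k.
    Every domain receives at least one processor.
    """
    if not workloads or any(w <= 0 for w in workloads):
        raise ValueError("workloads must be non-empty and positive")
    if total_procs < len(workloads):
        raise ValueError("need at least one processor per domain")
    total = sum(workloads)
    ideal = [w / total * total_procs for w in workloads]  # fractional shares
    alloc = [max(1, math.floor(x)) for x in ideal]
    # Hand out leftover processors to the most under-served domains.
    while sum(alloc) < total_procs:
        i = max(range(len(alloc)), key=lambda k: ideal[k] - alloc[k])
        alloc[i] += 1
    # If the one-processor minimum overshot, reclaim from the most over-served.
    while sum(alloc) > total_procs:
        donors = [k for k in range(len(alloc)) if alloc[k] > 1]
        i = min(donors, key=lambda k: ideal[k] - alloc[k])
        alloc[i] -= 1
    return alloc
```

For example, `allocate_processors([3, 1], 8)` returns `[6, 2]`, so both domains have the same work per processor and would be expected to reach the synchronization step at about the same time under a linear-scaling assumption.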

Archive | 2011