Ashwin M. Aji
Virginia Tech
Publications
Featured research published by Ashwin M. Aji.
IEEE International Conference on High Performance Computing, Data and Analytics | 2012
Ashwin M. Aji; James Dinan; Darius Buntinas; Pavan Balaji; Wu-chun Feng; Keith R. Bisset; Rajeev Thakur
Data movement in high-performance computing systems accelerated by graphics processing units (GPUs) remains a challenging problem. Data communication in popular parallel programming models, such as the Message Passing Interface (MPI), is currently limited to the data stored in the CPU memory space. Auxiliary memory systems, such as GPU memory, are not integrated into such data movement frameworks, thus providing applications with no direct mechanism to perform end-to-end data movement. We introduce MPI-ACC, an integrated and extensible framework that allows end-to-end data movement in accelerator-based systems. MPI-ACC provides productivity and performance benefits by integrating support for auxiliary memory spaces into MPI. MPI-ACC's runtime system enables several key optimizations, including pipelining of data transfers and balancing of communication based on accelerator and node architecture. We demonstrate the extensible design of MPI-ACC by using the popular CUDA and OpenCL accelerator programming interfaces. We examine the impact of MPI-ACC on communication performance and evaluate application-level benefits on a large-scale epidemiology simulation.
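The pipelined-transfer optimization described above can be sketched as a two-stage producer/consumer pipeline: one thread stages chunks out of device memory while the main thread sends already-staged chunks over the network. This is an illustrative sketch, not MPI-ACC's actual runtime; `copy_d2h` and `net_send` are hypothetical callables standing in for a device-to-host copy and an MPI send.

```python
import threading
import queue

def pipelined_send(buf, chunk_size, copy_d2h, net_send):
    """Overlap device-to-host copies with network sends, chunk by chunk."""
    chunks = queue.Queue()

    def copier():
        for start in range(0, len(buf), chunk_size):
            # Stage 1: copy one chunk from device memory to a host staging buffer.
            chunks.put(copy_d2h(buf[start:start + chunk_size]))
        chunks.put(None)  # sentinel: no more chunks

    t = threading.Thread(target=copier)
    t.start()
    sent = []
    while (chunk := chunks.get()) is not None:
        # Stage 2: send the staged chunk while the next copy proceeds in parallel.
        sent.append(net_send(chunk))
    t.join()
    return sent
```

With identity functions for both stages, the result is simply the input split into chunks; in a real system the two stages would overlap a DMA copy with network transmission.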
Computing Frontiers | 2008
Ashwin M. Aji; Wu-chun Feng; Filip Blagojevic; Dimitrios S. Nikolopoulos
This paper presents and evaluates a model and a methodology for implementing parallel wavefront algorithms on the Cell Broadband Engine. Wavefront algorithms are vital in several application areas such as computational biology, particle physics, and systems of linear equations. The model uses blocked data decomposition with pipelined execution of blocks across the synergistic processing elements (SPEs) of the Cell. To evaluate the model, we implement the Smith-Waterman sequence alignment algorithm as a wavefront algorithm and present key optimization techniques that complement the vector processing capabilities of the SPE. Our results show perfect linear speedup for up to 16 SPEs on the QS20 dual-Cell blades, and our model shows that our implementation is highly scalable to more cores, if available. Furthermore, the accuracy of our model is within 3% of the measured values on average. Lastly, we also test our model in a throughput-oriented experimental setting, where we couple the model with scheduling techniques that exploit parallelism across the simultaneous execution of multiple sequence alignments. Using our model, we improved the throughput of realistic multisequence alignment workloads by up to 8% compared to FCFS (first-come, first-served), by trading off parallelism within alignments with parallelism across alignments.
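The blocked, pipelined execution rests on a standard wavefront property: if block (i, j) depends only on its north neighbor (i-1, j) and west neighbor (i, j-1), then all blocks on the same anti-diagonal are independent and can run concurrently, one per SPE. A minimal scheduling sketch under that assumption:

```python
def wavefront_schedule(rows, cols):
    """Group blocks of a 2D grid into anti-diagonal wavefronts.

    Block (i, j) depends on (i-1, j) and (i, j-1), so every block on
    anti-diagonal i + j = d is independent and can execute in parallel.
    """
    waves = []
    for d in range(rows + cols - 1):
        waves.append([(i, d - i) for i in range(rows) if 0 <= d - i < cols])
    return waves
```

Each inner list is one parallel step; the pipeline ramps up, runs at full width, then drains, which is why speedup approaches linear only when the grid is large relative to the number of SPEs.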
International Conference on Parallel and Distributed Systems | 2009
Shucai Xiao; Ashwin M. Aji; Wu-chun Feng
Graphics processing units (GPUs) have been widely used to accelerate algorithms that exhibit massive data parallelism or task parallelism. When such parallelism is not inherent in an algorithm, computational scientists resort to simply replicating the algorithm on every multiprocessor of an NVIDIA GPU, for example, to create such parallelism, resulting in embarrassingly parallel ensemble runs that deliver significant aggregate speed-up. However, the fundamental issue with such ensemble runs is that the problem size to achieve this speed-up is limited to the available shared memory and cache of a GPU multiprocessor. An example of the above is dynamic programming (DP), one of the Berkeley 13 dwarfs. All known DP implementations to date use the coarse-grained approach of embarrassingly parallel ensemble runs because a fine-grained parallelization on the GPU would require extensive communication between the multiprocessors of a GPU, which could easily cripple performance as communication between multiprocessors is not natively supported in a GPU. Consequently, we address the above by proposing a fine-grained parallelization of a single instance of the DP algorithm that is mapped to the GPU. Our parallelization incorporates a set of techniques aimed at substantially improving GPU performance: matrix re-alignment, coalesced memory access, tiling, and GPU (rather than CPU) synchronization. The specific DP algorithm that we parallelize is called Smith-Waterman (SWat), which is an optimal local-sequence alignment algorithm. We then use this SWat algorithm as a baseline to compare our GPU implementation, i.e., CUDA-SWat, to our implementation on the Cell Broadband Engine, i.e., Cell-SWat.
Computational Science and Engineering | 2010
Ashwin M. Aji; Liqing Zhang; Wu-chun Feng
Next-generation, high-throughput sequencers are now capable of producing hundreds of billions of short sequences (reads) in a single day. The task of accurately mapping the reads back to a reference genome is of particular importance because it is used in several other biological applications, e.g., genome re-sequencing, DNA methylation, and ChIP sequencing. On a personal computer (PC), the computationally intensive short-read mapping task currently requires several hours to execute while working on very large sets of reads and genomes. Accelerating this task requires parallel computing. Among the current parallel computing platforms, the graphics processing unit (GPU) provides massively parallel computational prowess that holds the promise of accelerating scientific applications at low cost. In this paper, we propose GPU-RMAP, a massively parallel version of the RMAP short-read mapping tool that is highly optimized for the NVIDIA family of GPUs. We then evaluate GPU-RMAP by mapping millions of synthetic and real reads of varying widths on the mosquito (Aedes aegypti) and human genomes. We also discuss the effects of various input parameters, such as read width, number of reads, and chromosome size, on the performance of GPU-RMAP. We then show that despite using the conventionally “slower” but GPU-compatible binary search algorithm, GPU-RMAP outperforms the sequential RMAP implementation, which uses the “faster” hashing technique on a PC. Our data-parallel GPU implementation results in impressive speedups of up to 14.5-times for the mapping kernel and up to 9.6-times for the overall program execution time over the sequential RMAP implementation on a traditional PC.
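The "binary search instead of hashing" trade-off mentioned above can be illustrated with a sorted seed index: a sorted array of (k-mer, position) pairs supports lookup by binary search, which, unlike a hash table, maps cleanly onto GPU threads. This is a hedged, scalar sketch with hypothetical function names, not GPU-RMAP's actual data structure.

```python
import bisect

def build_index(genome, k):
    """Sorted list of (k-mer, position) pairs over the reference genome."""
    return sorted((genome[i:i + k], i) for i in range(len(genome) - k + 1))

def map_read(index, read, k):
    """Binary-search the sorted index for all positions of the read's seed."""
    seed = read[:k]
    lo = bisect.bisect_left(index, (seed, -1))  # first entry with this k-mer
    hits = []
    while lo < len(index) and index[lo][0] == seed:
        hits.append(index[lo][1])
        lo += 1
    return hits
```

Each GPU thread would run one such O(log n) lookup for its read; the sorted-array layout avoids the pointer chasing and irregular memory access of a hash table.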
Computing Frontiers | 2011
Ashwin M. Aji; Mayank Daga; Wu-chun Feng
Current GPU tools and performance models provide some common architectural insights that guide programmers to write optimal code. We challenge and complement these performance models and tools by modeling and analyzing a lesser known, but very severe, performance pitfall, called Partition Camping, in NVIDIA GPUs. Partition Camping is caused by memory accesses that are skewed towards a subset of the available memory partitions, which may degrade the performance of GPU kernels by up to seven-fold. There is no existing tool that can detect the partition camping effect in GPU kernels. Unlike the traditional performance modeling approaches, we predict a performance range that bounds the partition camping effect in the GPU kernel. Our idea of predicting a performance range, instead of the exact performance, is more realistic due to the large performance variations induced by partition camping. We design and develop the prediction model by first characterizing the effects of partition camping with an indigenous suite of micro-benchmarks. We then apply rigorous statistical regression techniques over the micro-benchmark data to predict the performance bounds of real GPU kernels, with and without the partition camping effect. We test the accuracy of our performance model by analyzing three real applications with known memory access patterns and partition camping effects. Our results show that the geometric mean of errors in our performance range prediction model is within 12% of the actual execution times. We also develop and present a very easy-to-use spreadsheet-based tool called CampProf, which is a visual front-end to our performance range prediction model and can be used to gain insights into the degree of partition camping in GPU kernels. Lastly, we demonstrate how CampProf can be used to visually monitor the performance improvements in the kernels, as the partition camping effect is being removed.
International Conference on Distributed Computing Systems | 2013
Palden Lama; Yan Li; Ashwin M. Aji; Pavan Balaji; James Dinan; Shucai Xiao; Yunquan Zhang; Wu-chun Feng; Rajeev Thakur; Xiaobo Zhou
Power-hungry graphics processing unit (GPU) accelerators are ubiquitous in high performance computing data centers today. GPU virtualization frameworks introduce new opportunities for effective management of GPU resources by decoupling them from application execution. However, power management of GPU-enabled server clusters faces significant challenges. The underlying system infrastructure shows complex power consumption characteristics depending on the placement of GPU workloads across various compute nodes, power-phases, and cabinets in a data center. GPU resources need to be scheduled dynamically in the face of time-varying resource demand and peak power constraints. We propose and develop a power-aware virtual OpenCL (pVOCL) framework that controls the peak power consumption and improves the energy efficiency of the underlying server system through dynamic consolidation and power-phase topology-aware placement of GPU workloads. Experimental results show that pVOCL achieves significant energy savings compared to existing power management techniques for GPU-enabled server clusters, while incurring negligible impact on performance. It drives the system towards energy-efficient configurations by taking an optimal sequence of adaptation actions in a virtualized GPU environment, while keeping the power consumption below the peak power budget.
International Parallel and Distributed Processing Symposium | 2012
Keith R. Bisset; Ashwin M. Aji; Eric J. Bohm; Laxmikant V. Kalé; Tariq Kamal; Madhav V. Marathe; Jae-Seung Yeom
Preventing and controlling outbreaks of infectious diseases such as pandemic influenza is a top public health priority. EpiSimdemics is an implementation of a scalable parallel algorithm to simulate the spread of contagion, including disease, fear, and information, in large (10^8 individuals), realistic social contact networks using individual-based models. It also has a rich language for describing public policy and agent behavior. We describe CharmSimdemics and evaluate its performance on national-scale populations. Charm++ is a machine-independent parallel programming system, providing high-level mechanisms and strategies to facilitate the task of developing highly complex parallel applications. Our design includes mapping of application entities to tasks, leveraging the efficient and scalable communication, synchronization, and load balancing strategies of Charm++. Our experimental results on a 768-core system show that the Charm++ version achieves up to a 4-fold increase in performance when compared to the MPI version.
International Parallel and Distributed Processing Symposium | 2012
Feng Ji; Ashwin M. Aji; James Dinan; Darius Buntinas; Pavan Balaji; Wu-chun Feng; Xiaosong Ma
Current implementations of MPI are unaware of accelerator memory (i.e., GPU device memory) and require programmers to explicitly move data between memory spaces. This approach is inefficient, especially for intranode communication where it can result in several extra copy operations. In this work, we integrate GPU-awareness into a popular MPI runtime system and develop techniques to significantly reduce the cost of intranode communication involving one or more GPUs. Experimental results show an up to 2x increase in bandwidth, resulting in an average of 4.3% improvement to the total execution time of a halo exchange benchmark.
IEEE International Conference on High Performance Computing, Data and Analytics | 2012
Feng Ji; Ashwin M. Aji; James Dinan; Darius Buntinas; Pavan Balaji; Rajeev Thakur; Wu-chun Feng; Xiaosong Ma
Accelerator awareness has become a pressing issue in data movement models, such as MPI, because of the rapid deployment of systems that utilize accelerators. In our previous work, we developed techniques to enhance MPI with accelerator awareness, thus allowing applications to easily and efficiently communicate data between accelerator memories. In this paper, we extend this work with techniques to perform efficient data movement between accelerators within the same node using a DMA-assisted, peer-to-peer intranode communication technique that was recently introduced for NVIDIA GPUs. We present a detailed design of our new approach to intranode communication and evaluate its improvement to communication and application performance using micro-kernel benchmarks and a 2D stencil application kernel.
Bioinformatics and Bioengineering | 2008
Ashwin M. Aji; Wu-chun Feng
The Smith-Waterman algorithm is a dynamic programming method for determining optimal local alignments between nucleotide or protein sequences. However, it suffers from quadratic time and space complexity. As a result, many algorithmic and architectural enhancements have been proposed to solve this problem, but at the cost of reduced sensitivity in the algorithms or significant expense in hardware, respectively. This paper presents a highly efficient parallelization of the Smith-Waterman algorithm on the Cell Broadband Engine platform, a novel hybrid multicore architecture that drives the low-cost PlayStation 3 (PS3) game consoles as well as the IBM BladeCenter QS22, which currently powers the fastest supercomputer in the world, Roadrunner at Los Alamos National Laboratory. Through an innovative mapping of the optimal Smith-Waterman algorithm onto a cluster of PlayStation 3 nodes, our implementation delivers 21- to 55-fold speed-up over a high-end multicore architecture and up to 449-fold speed-up over the PowerPC processor in the PS3. Next, we evaluate the trade-offs between our Smith-Waterman implementation on the Cell and existing software and hardware implementations and show that our solution achieves the best performance-to-price ratio when aligning realistic sequence sizes and generating the actual alignment. Finally, we show that our low-cost solution on a PS3 cluster approaches the speed of BLAST while achieving ideal sensitivity. To quantify the relationship between the two algorithms in terms of speed and sensitivity, we formally define and quantify the sensitivity of homology search methods so that trade-offs between sequence-search solutions can be evaluated in a quantitative manner.
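For reference, the quadratic-time recurrence underlying all of the Smith-Waterman work above can be written in a few lines. This is the textbook scalar form with assumed scoring parameters, not the vectorized, parallel Cell implementation the paper describes.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Optimal local alignment score via the Smith-Waterman recurrence.

    H[i][j] = max(0,
                  H[i-1][j-1] + s(a[i-1], b[j-1]),  # match/mismatch
                  H[i-1][j]   + gap,                 # gap in b
                  H[i][j-1]   + gap)                 # gap in a
    Quadratic time and space, as the abstract notes.
    """
    H = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0, H[i - 1][j - 1] + s,
                          H[i - 1][j] + gap, H[i][j - 1] + gap)
            best = max(best, H[i][j])
    return best
```

Because H[i][j] depends on its north, west, and northwest neighbors, cells on the same anti-diagonal are independent, which is exactly the wavefront structure the Cell and GPU parallelizations above exploit.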