Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Aparna Chandramowlishwaran is active.

Publication


Featured research published by Aparna Chandramowlishwaran.


IEEE International Conference on High Performance Computing, Data, and Analytics | 2010

Petascale Direct Numerical Simulation of Blood Flow on 200K Cores and Heterogeneous Architectures

Abtin Rahimian; Ilya Lashuk; Shravan Veerapaneni; Aparna Chandramowlishwaran; Dhairya Malhotra; Logan Moon; Rahul S. Sampath; Aashay Shringarpure; Jeffrey S. Vetter; Richard W. Vuduc; Denis Zorin; George Biros

We present a fast, petaflop-scalable algorithm for Stokesian particulate flows. Our goal is the direct simulation of blood, which we model as a mixture of a Stokesian fluid (plasma) and red blood cells (RBCs). Directly simulating blood is a challenging multiscale, multiphysics problem. We report simulations with up to 200 million deformable RBCs. The largest simulation amounts to 90 billion unknowns in space. In terms of the number of cells, we improve on the state of the art by several orders of magnitude: the previous largest simulation, at the same physical fidelity as ours, resolved the flow of O(1,000-10,000) RBCs. Our approach has three distinct characteristics: (1) we faithfully represent the physics of RBCs by using nonlinear solid mechanics to capture the deformations of each cell; (2) we accurately resolve the long-range, N-body, hydrodynamic interactions between RBCs (which are caused by the surrounding plasma); and (3) we allow for the highly non-uniform distribution of RBCs in space. The new method has been implemented in the software library MOBO (for “Moving Boundaries”). We designed MOBO to support parallelism at all levels, including inter-node distributed memory parallelism, intra-node shared memory parallelism, data parallelism (vectorization), and fine-grained multithreading for GPUs. We have implemented and optimized the majority of the computation kernels on both Intel/AMD x86 and NVIDIA's Tesla/Fermi platforms, in both single- and double-precision floating point. Overall, the code has scaled on 256 CPU-GPUs on the TeraGrid's Lincoln cluster and on 200,000 AMD cores of Oak Ridge National Laboratory's Jaguar PF system. In our largest simulation, we achieved 0.7 Petaflop/s of sustained performance on Jaguar.
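
For reference, the "Stokesian fluid" modeling the plasma refers to incompressible, zero-Reynolds-number flow; a minimal statement of the governing Stokes equations (general background, not the paper's specific boundary-integral formulation) is

    -\mu\,\Delta\mathbf{u} + \nabla p = \mathbf{f}, \qquad \nabla\cdot\mathbf{u} = 0,

where u is the fluid velocity, p the pressure, mu the viscosity, and f the force density exerted on the plasma by the deforming RBC membranes.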


International Parallel and Distributed Processing Symposium | 2010

Optimizing and tuning the fast multipole method for state-of-the-art multicore architectures

Aparna Chandramowlishwaran; Samuel Williams; Leonid Oliker; Ilya Lashuk; George Biros; Richard W. Vuduc

This work presents the first extensive study of single-node performance optimization, tuning, and analysis of the fast multipole method (FMM) on modern multi-core systems. We consider single- and double-precision with numerous performance enhancements, including low-level tuning, numerical approximation, data structure transformations, OpenMP parallelization, and algorithmic tuning. Among our numerous findings, we show that optimization and parallelization can improve double-precision performance by 25× on Intels quad-core Nehalem, 9.4× on AMDs quad-core Barcelona, and 37.6× on Suns Victoria Falls (dual-sockets on all systems). We also compare our single-precision version against our prior state-of-the-art GPU-based code and show, surprisingly, that the most advanced multicore architecture (Nehalem) reaches parity in both performance and power efficiency with NVIDIAs most advanced GPU architecture.
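
As an illustration of the kind of kernel such single-node tuning targets, the sketch below shows a minimal OpenMP-parallelized near-field (particle-to-particle) evaluation for a 1/r potential. The structure-of-arrays layout and simple loop nest are illustrative assumptions, not the authors' tuned code, which additionally uses blocking, vectorization, and per-box interaction lists.

    #include <cmath>
    #include <cstddef>
    #include <vector>

    // Minimal near-field (P2P) evaluation for a 1/r potential.
    // phi[i] accumulates the contribution of every other particle j.
    void p2p_direct(const std::vector<double>& x, const std::vector<double>& y,
                    const std::vector<double>& z, const std::vector<double>& q,
                    std::vector<double>& phi) {
        const std::size_t n = x.size();
        #pragma omp parallel for schedule(static)
        for (std::size_t i = 0; i < n; ++i) {
            double acc = 0.0;
            for (std::size_t j = 0; j < n; ++j) {
                if (j == i) continue;
                const double dx = x[i] - x[j];
                const double dy = y[i] - y[j];
                const double dz = z[i] - z[j];
                acc += q[j] / std::sqrt(dx * dx + dy * dy + dz * dz);
            }
            phi[i] = acc;
        }
    }

In a full FMM this direct evaluation is applied only to neighboring boxes, while well-separated interactions go through the tree phases; the paper's optimizations span both.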


International Parallel and Distributed Processing Symposium | 2010

Performance evaluation of concurrent collections on high-performance multicore computing systems

Aparna Chandramowlishwaran; Kathleen Knobe; Richard W. Vuduc

This paper is the first extensive performance study of a recently proposed parallel programming model, called Concurrent Collections (CnC). In CnC, the programmer expresses her computation in terms of application-specific operations, partially-ordered by semantic scheduling constraints. The CnC model is well-suited to expressing asynchronous-parallel algorithms, so we evaluate CnC using two dense linear algebra algorithms in this style for execution on state-of-the-art multicore systems: (i) a recently proposed asynchronous-parallel Cholesky factorization algorithm, (ii) a novel and non-trivial “higher-level” partly-asynchronous generalized eigensolver for dense symmetric matrices. Given a well-tuned sequential BLAS, our implementations match or exceed competing multithreaded vendor-tuned codes by up to 2.6×. Our evaluation compares with alternative models, including ScaLAPACK with a shared memory MPI, OpenMP, Cilk++, and PLASMA 2.0, on Intel Harpertown, Nehalem, and AMD Barcelona systems. Looking forward, we identify new opportunities to improve the CnC language and runtime scheduling and execution.
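
To make the task partial order concrete, the sketch below is a conventional blocked (tiled) right-looking Cholesky factorization written as plain sequential C++; it is not the authors' CnC code, and the naive tile kernels stand in for tuned BLAS/LAPACK routines. Each kernel call corresponds to a CnC-style step, and the tiles it reads and writes induce exactly the semantic ordering constraints a CnC graph would declare.

    #include <cmath>
    #include <vector>

    using Tile = std::vector<double>;   // one b-by-b tile, row-major

    // POTRF: in-place Cholesky of a diagonal tile (lower triangle).
    void potrf_tile(Tile& L, int b) {
        for (int j = 0; j < b; ++j) {
            double d = L[j * b + j];
            for (int k = 0; k < j; ++k) d -= L[j * b + k] * L[j * b + k];
            d = std::sqrt(d);
            L[j * b + j] = d;
            for (int i = j + 1; i < b; ++i) {
                double s = L[i * b + j];
                for (int k = 0; k < j; ++k) s -= L[i * b + k] * L[j * b + k];
                L[i * b + j] = s / d;
            }
        }
    }

    // TRSM: A := A * L^{-T}, for an off-diagonal tile in the current panel.
    void trsm_tile(const Tile& L, Tile& A, int b) {
        for (int j = 0; j < b; ++j)
            for (int i = 0; i < b; ++i) {
                double s = A[i * b + j];
                for (int k = 0; k < j; ++k) s -= A[i * b + k] * L[j * b + k];
                A[i * b + j] = s / L[j * b + j];
            }
    }

    // SYRK: C := C - A * A^T (lower triangle of a diagonal tile).
    void syrk_tile(const Tile& A, Tile& C, int b) {
        for (int i = 0; i < b; ++i)
            for (int j = 0; j <= i; ++j)
                for (int k = 0; k < b; ++k)
                    C[i * b + j] -= A[i * b + k] * A[j * b + k];
    }

    // GEMM: C := C - A * B^T, for trailing off-diagonal tiles.
    void gemm_tile(const Tile& A, const Tile& B, Tile& C, int b) {
        for (int i = 0; i < b; ++i)
            for (int j = 0; j < b; ++j)
                for (int k = 0; k < b; ++k)
                    C[i * b + j] -= A[i * b + k] * B[j * b + k];
    }

    // Right-looking tiled Cholesky over an nt-by-nt grid of tiles (lower part).
    // The read-after-write relations between these calls are the partial order
    // that an asynchronous-parallel runtime exploits.
    void cholesky_tiled(std::vector<Tile>& T, int nt, int b) {
        auto tile = [&](int i, int j) -> Tile& { return T[i * nt + j]; };
        for (int k = 0; k < nt; ++k) {
            potrf_tile(tile(k, k), b);
            for (int i = k + 1; i < nt; ++i)
                trsm_tile(tile(k, k), tile(i, k), b);
            for (int i = k + 1; i < nt; ++i) {
                syrk_tile(tile(i, k), tile(i, i), b);
                for (int j = k + 1; j < i; ++j)
                    gemm_tile(tile(i, k), tile(j, k), tile(i, j), b);
            }
        }
    }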


Workshop on Declarative Aspects of Multicore Programming | 2009

Declarative aspects of memory management in the concurrent collections parallel programming model

Zoran Budimlic; Aparna Chandramowlishwaran; Kathleen Knobe; Geoff Lowney; Vivek Sarkar; Leo Treggiari

Concurrent Collections (CnC) is a declarative parallel language that allows the application developer to express their parallel application as a collection of high-level computations called steps that communicate via single-assignment data structures called items. A CnC program is specified in two levels. At the bottom level, an existing imperative language implements the computations within the individual computation steps. At the top level, CnC describes the relationships (ordering constraints) among the steps. The memory management mechanism of the existing imperative language manages data whose lifetime is within a computation step. A key limitation in the use of CnC for long-running programs is the lack of memory management and garbage collection for data items with lifetimes that are longer than a single computation step. Although the goal here is the same as that of classical garbage collection, the nature of the problem, and therefore the nature of the solution, is distinct. The focus of this paper is the memory management problem for these data items in CnC. We introduce a new declarative slicing annotation for CnC that can be transformed into a reference counting procedure for memory management. Preliminary experimental results obtained from a Cholesky example show that our memory management approach can result in space reductions for CnC data items of up to 28× relative to the baseline case of standard CnC without memory management.
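
The sketch below illustrates the general idea behind turning such a declarative annotation into reference counting: if the number of steps that will ever read an item is known up front, the runtime can free the item's storage after its last read. The class and its interface are illustrative assumptions, not the CnC runtime's actual data structure.

    #include <atomic>
    #include <cassert>
    #include <memory>
    #include <utility>

    // Single-assignment item slot with a declared reader count. The payload
    // is released after the last expected reader has consumed the value,
    // turning garbage collection into a constant-time decrement.
    template <typename T>
    class CountedItem {
    public:
        explicit CountedItem(int expected_readers) : remaining_(expected_readers) {}

        // put(): single assignment of the item's value.
        void put(T value) {
            assert(!payload_ && "item may only be written once");
            payload_ = std::make_unique<T>(std::move(value));
        }

        // get(): copy the value out and decrement the reader count;
        // the final declared reader reclaims the storage.
        T get() {
            assert(payload_ && "get() before put()");
            T result = *payload_;
            if (remaining_.fetch_sub(1) == 1)
                payload_.reset();      // last reader: free the item
            return result;
        }

    private:
        std::atomic<int> remaining_;
        std::unique_ptr<T> payload_;
    };

In a real runtime the expected reader count would be derived from the slicing annotation rather than passed explicitly, and get/put would synchronize with step scheduling; the point here is only the mechanism.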


International Conference on Parallel Processing | 2008

On the Design of Fast Pseudo-Random Number Generators for the Cell Broadband Engine and an Application to Risk Analysis

David A. Bader; Aparna Chandramowlishwaran; Virat Agarwal

Numerical simulations in computational physics, biology, and finance often require the use of high-quality and efficient parallel random number generators. We design and optimize several parallel pseudo-random number generators on the Cell Broadband Engine, with minimal correlation between the parallel streams: the linear congruential generator (LCG) with a 64-bit prime addend and the Mersenne Twister (MT) algorithm. As compared with current Intel and AMD microprocessors, our Cell/B.E. LCG and MT implementations achieve speedups of 33× and 29×, respectively. We also explore two normalization techniques, the Gaussian averaging method and the Box-Muller method (polar and Cartesian variants), that transform uniform random numbers to a Gaussian distribution. Using these fast generators, we develop a parallel implementation of Value-at-Risk, a commonly used model for risk assessment in financial markets. To our knowledge, we have designed and implemented the fastest parallel pseudo-random number generators on the Cell/B.E.
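
For orientation, the sketch below shows the two building blocks in scalar form: a 64-bit LCG and the polar (Marsaglia) variant of the Box-Muller transform that maps uniform variates to Gaussians. The multiplier and addend are Knuth's well-known MMIX constants, used here only for illustration; they are not the paper's tuned prime-addend parameters, and none of the Cell/B.E. SPE/SIMD optimizations are shown.

    #include <cmath>
    #include <cstdint>

    // 64-bit linear congruential generator: x_{n+1} = a*x_n + c (mod 2^64).
    struct Lcg64 {
        std::uint64_t state;
        explicit Lcg64(std::uint64_t seed) : state(seed) {}
        std::uint64_t next() {
            state = 6364136223846793005ULL * state + 1442695040888963407ULL;
            return state;
        }
        // Uniform double in [0, 1), using the top 53 bits.
        double next_uniform() {
            return (next() >> 11) * (1.0 / 9007199254740992.0);
        }
    };

    // Polar (Marsaglia) Box-Muller: turns pairs of uniforms into pairs of
    // independent standard normal variates, rejecting points outside the
    // unit disk.
    void box_muller_polar(Lcg64& rng, double& z0, double& z1) {
        double u, v, s;
        do {
            u = 2.0 * rng.next_uniform() - 1.0;
            v = 2.0 * rng.next_uniform() - 1.0;
            s = u * u + v * v;
        } while (s >= 1.0 || s == 0.0);
        const double m = std::sqrt(-2.0 * std::log(s) / s);
        z0 = u * m;
        z1 = v * m;
    }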


Architectural Support for Programming Languages and Operating Systems | 2014

A CPU-GPU Hybrid Implementation and Model-Driven Scheduling of the Fast Multipole Method

JeeWhan Choi; Aparna Chandramowlishwaran; Kamesh Madduri; Richard W. Vuduc

This paper presents an optimized CPU-GPU hybrid implementation and a GPU performance model for the kernel-independent fast multipole method (FMM). We implement an optimized kernel-independent FMM for GPUs, and combine it with our previous CPU implementation to create a hybrid CPU+GPU FMM kernel. When compared to another highly optimized GPU implementation, our implementation achieves as much as a 1.9× speedup. We then extend our previous lower bound analyses of FMM for CPUs to include GPUs. This yields a model for predicting the execution times of the different phases of FMM. Using this information, we estimate the execution times of a set of static hybrid schedules on a given system, which allows us to automatically choose the schedule that yields the best performance. In the best case, we achieve a speedup of 1.5× compared to our GPU-only implementation, despite the large difference in computational power between CPUs and GPUs. We comment on one consequence of having such performance models, which is to enable speculative predictions about FMM scalability on future systems.
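
The sketch below shows the general shape of model-driven static scheduling: given modeled per-phase execution times on the CPU and the GPU, enumerate candidate assignments and pick the one with the smallest predicted makespan. The phase names, the hypothetical modeled costs, and the assumption that the two devices overlap freely on disjoint phases are illustrative simplifications, not the paper's exact model (which also accounts for inter-phase dependencies and data movement).

    #include <algorithm>
    #include <array>
    #include <cstddef>
    #include <cstdio>
    #include <string>

    // Modeled execution time (seconds) of one FMM phase on each device.
    struct PhaseCost {
        std::string name;
        double cpu_time;
        double gpu_time;
    };

    // Evaluate one static schedule: bit i of 'mask' assigns phase i to the
    // GPU (1) or CPU (0). Predicted makespan is the larger device total.
    template <std::size_t N>
    double makespan(const std::array<PhaseCost, N>& phases, unsigned mask) {
        double cpu = 0.0, gpu = 0.0;
        for (std::size_t i = 0; i < N; ++i) {
            if ((mask >> i) & 1u) gpu += phases[i].gpu_time;
            else                  cpu += phases[i].cpu_time;
        }
        return std::max(cpu, gpu);
    }

    int main() {
        // Hypothetical modeled costs, for illustration only.
        std::array<PhaseCost, 4> phases = {{
            {"upward (P2M/M2M)",    0.8, 0.3},
            {"far-field (M2L)",     2.5, 0.9},
            {"downward (L2L/L2P)",  0.7, 0.4},
            {"near-field (P2P)",    3.0, 0.6},
        }};
        unsigned best = 0;
        double best_t = makespan(phases, 0u);
        for (unsigned mask = 1; mask < (1u << phases.size()); ++mask) {
            const double t = makespan(phases, mask);
            if (t < best_t) { best_t = t; best = mask; }
        }
        std::printf("best predicted makespan: %.2f s (assignment mask 0x%x)\n",
                    best_t, best);
        return 0;
    }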


ACM Symposium on Parallel Algorithms and Architectures | 2012

Brief announcement: towards a communication optimal fast multipole method and its implications at exascale

Aparna Chandramowlishwaran; JeeWhan Choi; Kamesh Madduri; Richard W. Vuduc

This paper presents the first in-depth models for compute and memory costs of the kernel-independent Fast Multipole Method (KIFMM). The Fast Multipole Method (FMM) has asymptotically linear time complexity with a guaranteed approximation accuracy, making it an attractive candidate for a wide variety of particle system simulations on future exascale systems. This paper reports on three key advances. First, we present lower bounds on cache complexity for key phases of the FMM and use these bounds to derive analytical performance models. Second, using these models, we present results for choosing the optimal algorithmic tuning parameter. Lastly, we use these performance models to make predictions about the FMM's scalability on possible exascale system configurations, based on current technology trends. Looking forward to exascale, we suggest that the FMM, though highly compute-bound on today's systems, could in fact become memory-bound by 2020.
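
As a rough indication of the form such analytical models take: if a phase performs W floating-point operations and moves Q words of data, its execution time is bounded below by both the compute term and the traffic term,

    T_{\mathrm{phase}} \;\ge\; \max\!\left( \frac{W}{F_{\mathrm{peak}}},\ \frac{Q}{B} \right),

where F_peak is the peak floating-point rate and B the sustained memory (or network) bandwidth. This is a generic roofline-style bound, not the paper's derived KIFMM-specific model, but it conveys the mechanism behind the prediction: if technology trends grow F_peak faster than B, a phase that is compute-bound today can become bandwidth-bound as the machine balance shifts.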


ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming | 2010

Applying the concurrent collections programming model to asynchronous parallel dense linear algebra

Aparna Chandramowlishwaran; Kathleen Knobe; Richard W. Vuduc

This poster is a case study on the application of a novel programming model, called Concurrent Collections (CnC), to the implementation of an asynchronous-parallel algorithm for computing the Cholesky factorization of dense matrices. In CnC, the programmer expresses her computation in terms of application-specific operations, partially-ordered by semantic scheduling constraints. We demonstrate the performance potential of CnC in this poster by showing that our Cholesky implementation nearly matches or exceeds competing vendor-tuned codes and alternative programming models. We conclude that the CnC model is well-suited for expressing asynchronous-parallel algorithms on emerging multicore systems.


International Parallel and Distributed Processing Symposium | 2012

Courses in High-performance Computing for Scientists and Engineers

Richard W. Vuduc; Kenneth Czechowski; Aparna Chandramowlishwaran; Jee Whan Choi

This paper reports our experiences in reimplementing an entry-level graduate course in high-performance parallel computing aimed at physical scientists and engineers. These experiences have directly informed a significant redesign of a junior/senior undergraduate course, Introduction to High-Performance Computing (CS 4225 at Georgia Tech), which we are implementing for the current Spring 2012 semester. Based on feedback from the graduate version, the redesign of the undergraduate course emphasizes peer instruction and hands-on activities during the traditional lecture periods, as well as significant time for end-to-end projects. This paper summarizes our anecdotal findings from the graduate version's exit surveys and briefly outlines our plans for the undergraduate course.


International Parallel and Distributed Processing Symposium | 2012

Communication-Optimal Parallel N-body Solvers

Aparna Chandramowlishwaran; Richard W. Vuduc

We present new analysis, algorithmic techniques, and implementations of the Fast Multipole Method (FMM) for solving N-body problems. Our research specifically addresses two key challenges. The first challenge is how to engineer fast code for today's platforms. We present the first in-depth study of multicore optimizations and tuning for FMM, along with a systematic approach for transforming a conventionally parallelized FMM into a highly-tuned one. We introduce novel optimizations that significantly improve the within-node scalability of the FMM, thereby enabling high performance on multicore and manycore systems. The second challenge is how to understand scalability on future systems. We present a new algorithmic complexity analysis of the FMM that considers both intra- and inter-node communication costs. This analysis yields the surprising prediction that although the FMM is largely compute-bound today, and therefore highly scalable on current systems, the trajectory of processor architecture designs, if there are no significant changes, could cause it to become communication-bound as early as the year 2020. This prediction suggests the utility of our analysis approach, which directly relates algorithmic and architectural characteristics, for enabling a new kind of high-level algorithm-architecture co-design.

Collaboration


Dive into Aparna Chandramowlishwaran's collaborations.

Top Co-Authors

Richard W. Vuduc (Georgia Institute of Technology)
Aashay Shringarpure (Georgia Institute of Technology)
George Biros (University of Texas at Austin)
Ilya Lashuk (Georgia Institute of Technology)
Rahul S. Sampath (Oak Ridge National Laboratory)
JeeWhan Choi (Georgia Institute of Technology)