Network


External collaborations at the country level.

Hotspot


Research topics where Dhairya Malhotra is active.

Publications


Featured research published by Dhairya Malhotra.


IEEE International Conference on High Performance Computing, Data, and Analytics | 2010

Petascale Direct Numerical Simulation of Blood Flow on 200K Cores and Heterogeneous Architectures

Abtin Rahimian; Ilya Lashuk; Shravan Veerapaneni; Aparna Chandramowlishwaran; Dhairya Malhotra; Logan Moon; Rahul S. Sampath; Aashay Shringarpure; Jeffrey S. Vetter; Richard W. Vuduc; Denis Zorin; George Biros

We present a fast, petaflop-scalable algorithm for Stokesian particulate flows. Our goal is the direct simulation of blood, which we model as a mixture of a Stokesian fluid (plasma) and red blood cells (RBCs). Directly simulating blood is a challenging multiscale, multiphysics problem. We report simulations with up to 200 million deformable RBCs. The largest simulation amounts to 90 billion unknowns in space. In terms of the number of cells, we improve the state of the art by several orders of magnitude: the previous largest simulation, at the same physical fidelity as ours, resolved the flow of O(1,000-10,000) RBCs. Our approach has three distinct characteristics: (1) we faithfully represent the physics of RBCs by using nonlinear solid mechanics to capture the deformations of each cell; (2) we accurately resolve the long-range, N-body, hydrodynamic interactions between RBCs (which are caused by the surrounding plasma); and (3) we allow for the highly non-uniform distribution of RBCs in space. The new method has been implemented in the software library MOBO (for “Moving Boundaries”). We designed MOBO to support parallelism at all levels, including inter-node distributed memory parallelism, intra-node shared memory parallelism, data parallelism (vectorization), and fine-grained multithreading for GPUs. We have implemented and optimized the majority of the computation kernels on both Intel/AMD x86 and NVIDIA Tesla/Fermi platforms, in single and double floating-point precision. Overall, the code has scaled to 256 CPU-GPUs on TeraGrid's Lincoln cluster and to 200,000 AMD cores of Oak Ridge National Laboratory's Jaguar PF system. In our largest simulation, we achieved 0.7 petaflop/s of sustained performance on Jaguar.
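
For context, the governing model behind such particulate flow solvers is the Stokes system for the plasma, with cell-cell hydrodynamic coupling mediated by its free-space Green's function (the Stokeslet). The equations below only restate these standard relations in my own notation; the paper's actual boundary-integral formulation for deformable RBC membranes is considerably more involved.

    \[
      -\mu\,\Delta\mathbf{u}(\mathbf{x}) + \nabla p(\mathbf{x}) = \mathbf{f}(\mathbf{x}),
      \qquad \nabla\cdot\mathbf{u} = 0,
    \]
    \[
      G_{ij}(\mathbf{r}) = \frac{1}{8\pi\mu}\left(\frac{\delta_{ij}}{\lVert\mathbf{r}\rVert}
        + \frac{r_i r_j}{\lVert\mathbf{r}\rVert^{3}}\right),
      \qquad
      u_i(\mathbf{x}) \approx \sum_{k} G_{ij}(\mathbf{x}-\mathbf{y}_k)\,f_j(\mathbf{y}_k),
    \]

with summation over the repeated index j implied. The sum over source points y_k is the long-range N-body interaction that the paper resolves with fast summation rather than direct evaluation.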


SIAM Journal on Scientific Computing | 2016

FFT, FMM, or Multigrid? A Comparative Study of State-of-the-Art Poisson Solvers for Uniform and Nonuniform Grids in the Unit Cube

Amir Gholami; Dhairya Malhotra; Hari Sundar; George Biros

From molecular dynamics and quantum chemistry to plasma physics and computational astrophysics, Poisson solvers in the unit cube are used in many applications in computational science and engineering. In this work, we benchmark and discuss the performance of the scalable methods for the Poisson problem that are used widely in practice: the fast Fourier transform (FFT), the fast multipole method (FMM), geometric multigrid (GMG), and algebraic multigrid (AMG). Our focus is on solvers supporting high-order, highly nonuniform discretizations. To allow comparisons with standard libraries, we also compare adaptive solvers with solvers specialized for problems on regular grids, that is, FFT and regular-stencil multigrid, since both are very popular algorithms for several practical applications. For multigrid, we use the finite element variant of a high-performance geometric multigrid (HPGMG) benchmark. In total we compare five different codes, three of which were developed in our group. Our FFT, GMG, and...
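
To make the comparison concrete, the FFT baseline in such studies exploits the fact that the Laplacian is diagonal in the Fourier basis. A minimal statement of that spectral solve for a zero-mean right-hand side on the periodic unit cube (my notation, not taken from the paper) is

    \[
      -\Delta u = f \ \text{in } [0,1]^3 \ \text{(periodic)},
      \qquad
      \hat{u}(\mathbf{k}) = \frac{\hat{f}(\mathbf{k})}{4\pi^2\lVert\mathbf{k}\rVert^2}
      \ \ (\mathbf{k}\neq\mathbf{0}),
      \qquad
      \hat{u}(\mathbf{0}) = 0,
    \]

so the cost is essentially two 3-D FFTs plus a pointwise scaling. This is why FFT is hard to beat on uniform grids, while FMM and multigrid are the natural candidates for highly nonuniform discretizations.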


ACM Transactions on Mathematical Software | 2016

Algorithm 967: A Distributed-Memory Fast Multipole Method for Volume Potentials

Dhairya Malhotra; George Biros

The solution of a constant-coefficient elliptic partial differential equation (PDE) can be computed using an integral transform: a convolution with the fundamental solution of the PDE, also known as a volume potential. We present a Fast Multipole Method (FMM) for computing volume potentials and use them to construct spatially adaptive solvers for the Poisson, Stokes, and low-frequency Helmholtz problems. Conventional N-body methods apply to discrete particle interactions; with volume potentials, one replaces the sums with volume integrals. Particle N-body methods can be used to accelerate such integrals, but it is more efficient to develop a special FMM. In this article, we discuss the efficient implementation of such an FMM. We use high-order piecewise Chebyshev polynomials and an octree data structure to represent the input and output fields, enabling spectrally accurate approximation of the near field, and we use the Kernel Independent FMM (KIFMM) for the far-field approximation. For distributed-memory parallelism, we use space-filling curves, locally essential trees, and a hypercube-like communication scheme developed previously in our group. We present new near- and far-interaction traversals that optimize cache usage. Also, unlike particle N-body codes, we need a 2:1 balanced tree to allow for precomputations, and we present a fast scheme for 2:1 balancing. Finally, we use vectorization, including the AVX instruction set on the Intel Sandy Bridge architecture, to get better than 50% of peak floating-point performance. We use task parallelism to employ the Xeon Phi on the Stampede platform at the Texas Advanced Computing Center (TACC). We achieve about 600 GFLOP/s of double-precision performance on a single node. Our largest run on Stampede took 3.5 s on 16K cores for a problem with 18e+9 unknowns for a highly nonuniform particle distribution (corresponding to an effective resolution exceeding 3e+23 unknowns, since we used 23 levels in our octree).
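
One ingredient mentioned above, the space-filling-curve ordering used to partition octree nodes across processes, can be illustrated with a small self-contained sketch. The Morton (Z-order) key below is a generic construction in C++, not code from the paper's library: it interleaves the bits of an octant's integer coordinates at a given tree depth, so that sorting octants by key yields a locality-preserving linear order.

    #include <cstdint>
    #include <cstdio>

    // Interleave the low `depth` bits of (x, y, z) into a Morton (Z-order) key.
    // Octants at a given octree depth have integer coordinates in [0, 2^depth);
    // sorting by this key gives the space-filling-curve order used to partition
    // octants across processes (illustrative sketch only).
    uint64_t morton_key(uint32_t x, uint32_t y, uint32_t z, int depth) {
        uint64_t key = 0;
        for (int b = 0; b < depth; ++b) {
            key |= (uint64_t)((x >> b) & 1u) << (3 * b + 0);
            key |= (uint64_t)((y >> b) & 1u) << (3 * b + 1);
            key |= (uint64_t)((z >> b) & 1u) << (3 * b + 2);
        }
        return key;
    }

    int main() {
        // Two face-adjacent octants at depth 3 receive nearby keys.
        std::printf("%llu %llu\n",
                    (unsigned long long)morton_key(1, 2, 3, 3),
                    (unsigned long long)morton_key(2, 2, 3, 3));
        return 0;
    }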


IEEE International Conference on High Performance Computing, Data, and Analytics | 2013

Algorithms for high-throughput disk-to-disk sorting

Hari Sundar; Dhairya Malhotra; Karl W. Schulz

In this paper, we present a new out-of-core sort algorithm, designed for problems that are too large to fit into the aggregate RAM available on modern supercomputers. We analyze the performance, including the cost of I/O, and demonstrate the fastest (to the best of our knowledge) reported throughput using the canonical sort benchmark on a general-purpose, production HPC resource running Lustre. By clever use of available storage and a formulation of asynchronous data transfer mechanisms, we are able to almost completely hide the computation (sorting) behind the I/O latency. This latency hiding enables us to achieve execution times, including the additional temporary I/O required, comparable to those of a large sort problem (5 TB) run as a single in-RAM sort, while using one tenth the amount of RAM. In our largest run, sorting 100 TB of records using 1,792 hosts, we achieved an end-to-end throughput of 1.24 TB/min with our general-purpose sorter, improving on the current Daytona record holder by 65%.
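
The latency-hiding idea described above can be sketched generically: overlap reading the next chunk of records with sorting the current one. The C++ fragment below is a minimal, illustrative run-generation loop for a generic external sort; the binary file layout, the 64-bit record type, and the omitted k-way merge phase are all assumptions of this sketch, not details of the authors' sorter.

    #include <algorithm>
    #include <cstdint>
    #include <fstream>
    #include <future>
    #include <string>
    #include <vector>

    // Read up to `max_records` 64-bit records starting at a given record offset.
    static std::vector<uint64_t> read_chunk(const std::string& path,
                                            std::size_t offset_records,
                                            std::size_t max_records) {
        std::ifstream in(path, std::ios::binary);
        in.seekg(static_cast<std::streamoff>(offset_records * sizeof(uint64_t)));
        std::vector<uint64_t> buf(max_records);
        in.read(reinterpret_cast<char*>(buf.data()),
                static_cast<std::streamsize>(max_records * sizeof(uint64_t)));
        buf.resize(static_cast<std::size_t>(in.gcount()) / sizeof(uint64_t));
        return buf;
    }

    // Run-generation phase of a generic external sort: sort chunk i while chunk
    // i+1 is read in the background, so the sorting work hides behind I/O latency.
    // The sorted runs written here would later be merged (k-way merge omitted).
    void make_sorted_runs(const std::string& input, std::size_t records_per_chunk,
                          std::size_t total_records) {
        std::size_t offset = 0;
        auto pending = std::async(std::launch::async, read_chunk, input, offset,
                                  std::min(records_per_chunk, total_records));
        int run_id = 0;
        while (offset < total_records) {
            std::vector<uint64_t> chunk = pending.get();   // wait for the current chunk
            offset += records_per_chunk;
            if (offset < total_records)                    // prefetch the next chunk
                pending = std::async(std::launch::async, read_chunk, input, offset,
                                     std::min(records_per_chunk, total_records - offset));
            std::sort(chunk.begin(), chunk.end());         // overlaps with the prefetch
            std::ofstream out("run_" + std::to_string(run_id++) + ".bin",
                              std::ios::binary);
            out.write(reinterpret_cast<const char*>(chunk.data()),
                      static_cast<std::streamsize>(chunk.size() * sizeof(uint64_t)));
        }
    }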


IEEE International Conference on High Performance Computing, Data, and Analytics | 2014

A volume integral equation Stokes solver for problems with variable coefficients

Dhairya Malhotra; Amir Gholami; George Biros

We present a novel numerical scheme for solving the Stokes equation with variable coefficients in the unit box. Our scheme is based on a volume integral equation formulation. Compared to finite element methods, our formulation decouples the velocity and pressure, generates velocity fields that are, by construction, divergence-free to high accuracy, and has performance that does not depend on the order of the basis used for discretization. In addition, we employ a novel adaptive fast multipole method for volume integrals to obtain a scheme that is algorithmically optimal. Our scheme supports non-uniform discretizations and is spectrally accurate. To increase per-node performance, we have integrated our code with both NVIDIA and Intel accelerators. In our largest scalability test, we solved a problem with 20 billion unknowns, using a 14th-order approximation for the velocity, on 2048 nodes of the Stampede system at the Texas Advanced Computing Center. We achieved 0.656 petaflop/s for the overall code (23% efficiency) and one petaflop/s for the volume integrals (33% efficiency). As an application example, we simulate Stokes flow in a porous medium with a highly complex pore structure, using a penalty formulation to enforce the no-slip condition.
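
For reference, the variable-coefficient Stokes problem being solved has the general form (standard notation, not reproduced from the paper)

    \[
      -\nabla\cdot\bigl(\mu(\mathbf{x})\,(\nabla\mathbf{u} + \nabla\mathbf{u}^{T})\bigr)
      + \nabla p = \mathbf{f}
      \quad \text{in } [0,1]^3,
      \qquad
      \nabla\cdot\mathbf{u} = 0,
    \]

with a spatially varying viscosity. Volume integral equation formulations of this type typically split the operator into a constant-coefficient Stokes part plus a variable-coefficient perturbation and convolve with the constant-coefficient kernel; that convolution is the step the adaptive volume FMM accelerates.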


IEEE International Conference on High Performance Computing, Data, and Analytics | 2016

A parallel arbitrary-order accurate AMR algorithm for the scalar advection-diffusion equation

Arash Bakhtiari; Dhairya Malhotra; Amir Raoofy; Miriam Mehl; Hans-Joachim Bungartz; George Biros

We present a numerical method for solving the scalar advection-diffusion equation using adaptive mesh refinement. Our solver has four unique characteristics: (1) it supports arbitrary-order accuracy in space; (2) it allows different discretizations for the velocity and the advected scalar quantity; (3) it combines the method of characteristics with an integral equation formulation; and (4) it supports shared and distributed memory architectures. In particular, our solver is based on a second-order accurate, unconditionally stable, semi-Lagrangian scheme combined with a spatially adaptive Chebyshev octree for discretization. We study the convergence, single-node performance, strong scaling, and weak scaling of our scheme for several challenging flows that cannot be resolved efficiently without using high-order accurate discretizations. For example, we consider problems for which switching from a 4th-order to a 14th-order approximation results in a two-order-of-magnitude speedup for a computation in which we keep the target accuracy in the solution fixed. For our largest run, we solve a problem with one billion unknowns on a tree with maximum depth equal to 10, using 14th-order elements on 16,384 x86 cores of the Stampede system at the Texas Advanced Computing Center.
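
The method-of-characteristics (semi-Lagrangian) step referenced above can be illustrated in one dimension: trace each grid point backward along the velocity over one time step and interpolate the old solution at the departure point. The C++ sketch below is a deliberately simple periodic 1-D version with linear interpolation, far from the paper's high-order adaptive Chebyshev-octree scheme, but it shows why the advection update is unconditionally stable.

    #include <cmath>
    #include <cstddef>
    #include <vector>

    // One semi-Lagrangian advection step on a periodic 1-D grid:
    //   u_new(x_i) = u_old(x_i - vel(x_i) * dt), via linear interpolation.
    // The update stays stable for any dt because values are gathered along
    // characteristics instead of being pushed through a CFL-limited stencil.
    std::vector<double> semi_lagrangian_step(const std::vector<double>& u,
                                             const std::vector<double>& vel,
                                             double dt, double dx) {
        const long n = static_cast<long>(u.size());
        std::vector<double> u_new(u.size());
        for (long i = 0; i < n; ++i) {
            // Departure point of the characteristic that lands on grid node i.
            double s = (i * dx - vel[i] * dt) / dx;
            double base = std::floor(s);
            double w = s - base;                 // interpolation weight in [0, 1)
            long j0 = ((static_cast<long>(base) % n) + n) % n;
            long j1 = (j0 + 1) % n;
            u_new[i] = (1.0 - w) * u[j0] + w * u[j1];
        }
        return u_new;
    }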


International Conference on Parallel and Distributed Systems | 2014

Performance analysis of HPC applications with irregular tree data structures

Ahmed Khawaja; Jiajun Wang; Andreas Gerstlauer; Lizy Kurian John; Dhairya Malhotra; George Biros

Numerical methods based on adaptive mesh refinement (AMR) with octree data structures are an important class of HPC applications, in particular for the solution of partial differential equations. Much effort goes into the implementation of efficient versions of these types of programs, where the emphasis is often on increasing multi-node performance when utilizing GPUs and coprocessors. By contrast, our analysis aims to characterize these workloads on traditional CPUs, as we believe that single-threaded intra-node performance of critical kernels is still a key factor for achieving performance at scale. Irregular workloads such as AMR methods, however, exhibit especially severe underutilization of general-purpose processors. In this paper, we analyze the single-core performance of two state-of-the-art, highly scalable adaptive mesh refinement codes, one based on the Fast Multipole Method (FMM) and one based on the Finite Element Method (FEM), when running on an x86 CPU. We examined both scalar and vectorized implementations to identify performance bottlenecks. We demonstrate that vectorization can provide a significant benefit in achieving high performance. The greatest bottleneck to peak performance is the high fraction of non-floating-point instructions in the kernels.
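
As a minimal illustration of the scalar-versus-vectorized comparison discussed above (not code from either of the studied applications), the C++ fragment below implements an axpy-style kernel twice: once scalar, once with AVX intrinsics processing eight single-precision values per instruction.

    #include <immintrin.h>
    #include <cstddef>

    // Scalar reference: y[i] += a * x[i].
    void axpy_scalar(float a, const float* x, float* y, std::size_t n) {
        for (std::size_t i = 0; i < n; ++i) y[i] += a * x[i];
    }

    // AVX version: 8 floats per iteration, with a scalar tail for the remainder.
    // Illustrative only; real kernels also need FMA, unrolling, and careful data
    // layout to approach peak floating-point throughput.
    void axpy_avx(float a, const float* x, float* y, std::size_t n) {
        const __m256 va = _mm256_set1_ps(a);
        std::size_t i = 0;
        for (; i + 8 <= n; i += 8) {
            __m256 vx = _mm256_loadu_ps(x + i);
            __m256 vy = _mm256_loadu_ps(y + i);
            vy = _mm256_add_ps(vy, _mm256_mul_ps(va, vx));
            _mm256_storeu_ps(y + i, vy);
        }
        for (; i < n; ++i) y[i] += a * x[i];   // scalar tail
    }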


International Conference on Supercomputing | 2013

HykSort: a new variant of hypercube quicksort on distributed memory architectures

Hari Sundar; Dhairya Malhotra; George Biros


Communications in Computational Physics | 2015

PVFMM: A Parallel Kernel Independent FMM for Particle and Volume Potentials

Dhairya Malhotra; George Biros


arXiv: Distributed, Parallel, and Cluster Computing | 2015

AccFFT: A library for distributed-memory FFT on CPU and GPU architectures

Amir Gholami; Judith Hill; Dhairya Malhotra; George Biros

Collaboration


An overview of Dhairya Malhotra's collaborations.

Top Co-Authors

George Biros (University of Texas at Austin)
Amir Gholami (University of Texas at Austin)
Aashay Shringarpure (Georgia Institute of Technology)
Abtin Rahimian (Courant Institute of Mathematical Sciences)
Ahmed Khawaja (University of Texas at Austin)
Andreas Gerstlauer (University of Texas at Austin)
Ilya Lashuk (Georgia Institute of Technology)