Sudip K. Seal
Oak Ridge National Laboratory
Publication
Featured research published by Sudip K. Seal.
simulation tools and techniques for communications, networks and system | 2010
Brandon G. Aaby; Kalyan S. Perumalla; Sudip K. Seal
An effective latency-hiding mechanism is presented in the parallelization of agent-based model simulations (ABMS) with millions of agents. The mechanism is designed to accommodate the hierarchical organization as well as heterogeneity of current state-of-the-art parallel computing platforms. We use it to explore the computation vs. communication trade-off continuum available with the deep computational and memory hierarchies of extant platforms and present a novel analytical model of the trade-off. We describe our implementation and report preliminary performance results on two distinct parallel platforms suitable for ABMS: CUDA threads on multiple, networked graphics processing units (GPUs), and pthreads on multi-core processors. Message Passing Interface (MPI) is used for inter-GPU as well as inter-socket communication on a cluster of multiple GPUs and multi-core processors. Results indicate the benefits of our latency-hiding scheme, which delivers improvements of over 100-fold in runtime for certain benchmark ABMS application scenarios with several million agents. This speed improvement is obtained on our system that is already two to three orders of magnitude faster on one GPU than an equivalent CPU-based execution in a popular simulator in Java. Thus, the overall execution of our current work is over four orders of magnitude faster when executed on multiple GPUs.
Simulation | 2012
Kalyan S. Perumalla; Sudip K. Seal
In complex phenomena such as epidemiological outbreaks, the intensity of inherent feedback effects and the significant role of transients in the dynamics make simulation the only effective method for proactive, reactive or post facto analysis. The spatial scale, runtime speed, and behavioral detail needed in detailed simulations of epidemic outbreaks cannot be supported by sequential or small-scale parallel execution, making it necessary to use large-scale parallel processing. Here, an optimistic parallel execution of a new discrete event formulation of a reaction–diffusion simulation model of epidemic propagation is presented to facilitate a dramatic increase in the fidelity and speed by which epidemiological simulations can be performed. Rollback support needed during optimistic parallel execution is achieved by combining reverse computation with a small amount of incremental state saving. Parallel speedup of over 5,500 and other runtime performance metrics of the system are observed with weak-scaling execution on a small (8,192-core) Blue Gene/P system, while scalability with a weak-scaling speedup of over 10,000 is demonstrated on 65,536 cores of a large Cray XT5 system. Scenarios representing large population sizes, with mobility and detailed state evolution modeled at the level of each individual, exceeding several hundreds of millions of individuals in the largest cases, are successfully exercised to verify model scalability.
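The rollback idea described in this abstract, reverse computation combined with a small amount of incremental state saving, can be illustrated with a minimal sketch. Everything here is hypothetical and for illustration only: the `Region` class, the `forward_infect` and `reverse_infect` handlers, and the `last_update` field are assumptions and not part of the authors' model or simulator. Counter updates are undone by applying the inverse arithmetic, while the one destructive overwrite is undone from a saved copy.

```python
class Region:
    """Hypothetical per-location state for a toy epidemic model."""

    def __init__(self, susceptible, infected):
        self.susceptible = susceptible
        self.infected = infected
        self.last_update = 0.0


def forward_infect(region, now):
    """Process one infection event and return the data needed to undo it."""
    # The timestamp is overwritten (destructive), so its old value is saved:
    # this is the incremental state saving part.
    saved_last_update = region.last_update
    # The counters change by +1 / -1 (invertible), so nothing is saved:
    # they are restored by reverse computation.
    region.susceptible -= 1
    region.infected += 1
    region.last_update = now
    return saved_last_update


def reverse_infect(region, saved_last_update):
    """Undo forward_infect by inverting each step in the opposite order."""
    region.last_update = saved_last_update
    region.infected -= 1
    region.susceptible += 1


# Optimistically process three events, then roll all of them back.
region = Region(susceptible=100, infected=5)
undo_log = [forward_infect(region, t) for t in (1.0, 2.5, 4.0)]
state_after_forward = (region.susceptible, region.infected, region.last_update)
for saved in reversed(undo_log):
    reverse_infect(region, saved)
state_after_rollback = (region.susceptible, region.infected, region.last_update)
```

In this sketch only one floating-point value is saved per event; the rest of the state is recovered by computation, which is the memory saving the technique aims for.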
IEEE Transactions on Antennas and Propagation | 2011
Vikram R. Melapudi; B. Shanker; Sudip K. Seal; Srinivas Aluru
The development of the multilevel fast multipole algorithm (MLFMA) and its multiscale variants has enabled the use of integral equation (IE) based solvers to compute scattering from complicated structures. Development of scalable parallel algorithms, to extend the reach of these solvers, has been a topic of intense research for about a decade. In this paper, we present a new algorithm for the parallel implementation of an IE solver that is augmented with a wideband MLFMA and is scalable on a large number of processors. The wideband MLFMA employed here, to handle multiscale problems, is a hybrid combination of the accelerated Cartesian expansion (ACE) and the classical MLFMA. The salient feature of the presented parallel algorithm is that it is implicitly load balanced and exhibits higher performance. This is achieved by developing a strategy to partition the MLFMA tree, and hence the associated computations, in a self-similar fashion among the parallel processors. As detailed in the paper, the algorithm employs both spatial and direction partitioning approaches in a flexible manner to ensure scalable performance. A plethora of results is presented here to exhibit the scalability of this algorithm on 512 or more processors.
international conference on parallel processing | 2006
Srikanta Tirthapura; Sudip K. Seal; Srinivas Aluru
Spacefilling curves (SFCs) are widely used for parallel domain decomposition in scientific computing applications. The proximity preserving properties of SFCs are expected to keep most accesses local in applications that require efficient access to spatial neighborhoods. While experimental results are used to confirm this behavior, a rigorous mathematical analysis of SFCs turns out to be rather hard and rarely attempted. In this paper, we analyze SFC based parallel domain decomposition for a uniform random spatial distribution in three dimensions. Let n denote the expected number of points and P denote the number of processors. We show that the expected distance along an SFC to a nearest neighbor is O(n^(2/3)). We then consider the problem of answering nearest neighbor and spherical region queries for each point. For P = n^α (0 < α ≤ 1) processors, we show that the total number of remote accesses grows as O(n^(3/4 + α/4)). This analysis shows that the expected number of total remote accesses is sublinear for any sublinear number of processors. We view the analysis presented here as a step towards the goal of understanding the utility of SFCs in scientific applications and the analysis of more complex spatial distributions.
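The setting analyzed in this abstract can be tried out empirically with a small sketch. The abstract does not say which space-filling curve is analyzed, so the Morton (Z-order) curve is used here purely as an assumption; the function names, the grid resolution `bits`, and the brute-force neighbor search are likewise illustrative choices. Points are sorted along the curve, cut into P contiguous chunks, and the number of points whose nearest neighbor lives on another processor is counted.

```python
import random


def morton3d(x, y, z, bits=10):
    """Interleave the low `bits` bits of integer cell coordinates x, y, z."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (3 * i)
        key |= ((y >> i) & 1) << (3 * i + 1)
        key |= ((z >> i) & 1) << (3 * i + 2)
    return key


def decompose(points, num_procs, bits=10):
    """Order points along the curve and assign contiguous chunks to processors.

    Returns a list `owner` where owner[i] is the processor holding points[i].
    """
    scale = (1 << bits) - 1
    keys = [morton3d(int(px * scale), int(py * scale), int(pz * scale), bits)
            for px, py, pz in points]
    order = sorted(range(len(points)), key=keys.__getitem__)
    owner = [0] * len(points)
    for rank, idx in enumerate(order):
        owner[idx] = rank * num_procs // len(points)
    return owner


def remote_nearest_neighbor_count(points, owner):
    """Count points whose nearest neighbor is owned by a different processor.

    Brute force, O(n^2); adequate only for small illustrative inputs.
    """
    remote = 0
    for i, p in enumerate(points):
        best_j, best_d = None, float("inf")
        for j, q in enumerate(points):
            if i == j:
                continue
            d = sum((a - b) ** 2 for a, b in zip(p, q))
            if d < best_d:
                best_j, best_d = j, d
        if owner[best_j] != owner[i]:
            remote += 1
    return remote


# Uniform random points in the unit cube, as in the analyzed distribution.
rng = random.Random(0)
points = [(rng.random(), rng.random(), rng.random()) for _ in range(500)]
owner = decompose(points, num_procs=8)
remote = remote_nearest_neighbor_count(points, owner)
```

Repeating the experiment for growing n with P = n^α would give an empirical curve to set against the stated asymptotic bound, although a sample this small says nothing about asymptotics.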
Information Processing Letters | 2005
Sudip K. Seal; Srikanth Komarina; Srinivas Aluru
Microarrays are used for measuring expression levels of thousands of genes simultaneously. Clustering algorithms are used on gene expression data to find co-regulated genes. An often used clustering strategy is the Pearson correlation coefficient based hierarchical clustering algorithm presented in [Proc. Nat. Acad. Sci. 95 (25) (1998) 14863-14868], which takes O(N^3) time. We note that this run time can be reduced to O(N^2) by applying known hierarchical clustering algorithms [Proc. 9th Annual ACM-SIAM Symposium on Discrete Algorithms, 1998, pp. 619-628] to this problem. In this paper, we present an algorithm which runs in O(N log N) time using a geometrical reduction and show that it is optimal.
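The abstract does not spell out its geometrical reduction. One standard link between Pearson correlation and geometry, shown here only as a plausible illustration and not necessarily the reduction used in the paper, is that for vectors that are centered and scaled to unit length, the squared Euclidean distance d^2 and the Pearson coefficient r satisfy d^2 = 2(1 - r). Maximizing correlation is then the same as minimizing Euclidean distance. The helper names below are hypothetical.

```python
import math


def standardize(v):
    """Center v and scale it to unit Euclidean length.

    Assumes v is not constant; a constant vector has zero norm after centering.
    """
    mean = sum(v) / len(v)
    centered = [x - mean for x in v]
    norm = math.sqrt(sum(c * c for c in centered))
    return [c / norm for c in centered]


def pearson(u, v):
    """Pearson correlation as the dot product of standardized vectors."""
    su, sv = standardize(u), standardize(v)
    return sum(a * b for a, b in zip(su, sv))


def squared_distance(u, v):
    """Squared Euclidean distance between the standardized vectors."""
    su, sv = standardize(u), standardize(v)
    return sum((a - b) ** 2 for a, b in zip(su, sv))


# Two made-up expression profiles, used only to check the identity numerically.
u = [1.0, 2.0, 3.0, 4.0, 5.0]
v = [2.0, 1.0, 4.0, 3.0, 6.0]
identity_gap = abs(squared_distance(u, v) - 2.0 * (1.0 - pearson(u, v)))
```

Under this identity, the most correlated pair of profiles is also the closest pair of points, so closest-pair style geometric methods become applicable.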
workshop on parallel and distributed simulation | 2009
Kalyan S. Perumalla; Brandon G. Aaby; Srikanth B. Yoginath; Sudip K. Seal
A methodology and its associated algorithms are presented for mapping a novel, field-based vehicular mobility model onto a graphics processing unit (GPU) computational platform for simulating mobility in large-scale road networks. Of particular focus is the achievement of real-time execution, on desktop platforms, of vehicular mobility on road networks comprising millions of nodes and links, and multi-million counts of simultaneously active vehicles. The methodology is realized in a system called GARFIELD, whose implementation details and performance study are described. The runtime characteristics of a prototype implementation are presented that show real-time performance in simulations of networks at the scale of a few states of the US road networks.
workshop on parallel and distributed simulation | 2010
Kalyan S. Perumalla; Sudip K. Seal
The spatial scale, runtime speed, and behavioral detail of epidemic outbreak simulations altogether require the use of large-scale parallel processing. Here, an optimistic parallel discrete event execution of a reaction-diffusion simulation model of epidemic outbreaks is presented, with an implementation using the μsik simulator. Rollback support is achieved with the development of a novel reversible model that combines reverse computation with a small amount of incremental state saving. Parallel speedup and other runtime performance metrics of the system are tested on a small (8,192-core) Blue Gene/P system, while scalability is demonstrated on 65,536 cores of a large Cray XT5 system. Scenarios representing large population sizes (up to several hundred million individuals in the largest case) are exercised.
Plasma Physics and Controlled Fusion | 2015
A. Wingen; N.M. Ferraro; M.W. Shafer; E.A. Unterberg; John M. Canik; Todd Evans; D. L. Hillis; S.P. Hirshman; Sudip K. Seal; Philip B. Snyder; A.C. Sontag
Calculations of the plasma response to applied non-axisymmetric fields in several DIII-D discharges show that predicted displacements depend strongly on the edge current density. This result is found using both a linear two-fluid-MHD model (M3D-C1) and a nonlinear ideal-MHD model (VMEC). Furthermore, it is observed that the probability of a discharge being edge localized mode (ELM)-suppressed is most closely related to the edge current density, as opposed to the pressure gradient. It is found that discharges with a stronger kink response are closer to the peeling–ballooning stability limit in ELITE simulations and eventually cross into the unstable region, causing ELMs to reappear. Thus for effective ELM suppression, the RMP has to prevent the plasma from generating a large kink response, associated with ELM instability. Experimental observations are in agreement with the finding; discharges which have a strong kink response in the MHD simulations show ELMs or ELM mitigation during the RMP phase of the experiment, while discharges with a small kink response in the MHD simulations are fully ELM suppressed in the experiment by the applied resonant magnetic perturbation. The results are cross-checked against modeled 3D ideal MHD equilibria using the VMEC code. The procedure of constructing optimal 3D equilibria for diverted H-mode discharges using VMEC is presented. Kink displacements in VMEC are found to scale with the edge current density, similar to M3D-C1, but the displacements are smaller. A direct correlation in the flux surface displacements to the bootstrap current is shown.
Journal of Parallel and Distributed Computing | 2013
Sudip K. Seal; Kalyan S. Perumalla; S.P. Hirshman
Direct solvers based on prefix computation and cyclic reduction algorithms exploit the special structure of tridiagonal systems of equations to deliver better parallel performance compared to those designed for more general systems of equations. This performance advantage is even more pronounced for block tridiagonal systems. In this paper, we re-examine the performances of these two algorithms taking the effects of block size into account. Depending on the block size, the parameter space spanned by the number of block rows, size of the blocks and the processor count is shown to favor one or the other of the two algorithms. A critical block size that separates these two regions is shown to emerge and its dependence both on problem dependent parameters and on machine-specific constants is established. Empirical verification of these analytical findings is carried out on up to 2048 cores of a Cray XT4 system.
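For readers unfamiliar with cyclic reduction, a minimal serial sketch for the scalar case follows. It is a simplification under stated assumptions: the paper concerns block tridiagonal systems on parallel machines, whereas this version uses scalar entries, runs sequentially, performs no pivoting, and assumes a well-conditioned (for example, diagonally dominant) system. The function names and the coefficient layout (`a` sub-diagonal with `a[0] = 0`, `b` diagonal, `c` super-diagonal with `c[-1] = 0`) are illustrative choices. Each level eliminates the even-indexed unknowns, solves the half-size system for the odd-indexed ones, and then back-substitutes.

```python
def cyclic_reduction(a, b, c, d):
    """Solve a[i]*x[i-1] + b[i]*x[i] + c[i]*x[i+1] = d[i] for x.

    Requires a[0] == 0 and c[-1] == 0. Scalar, serial, no pivoting.
    """
    n = len(b)
    if n == 1:
        return [d[0] / b[0]]

    # Reduction: fold each odd equation's even neighbors into it, which
    # yields a tridiagonal system over the odd-indexed unknowns only.
    ra, rb, rc, rd = [], [], [], []
    for i in range(1, n, 2):
        has_next = i + 1 < n
        alpha = a[i] / b[i - 1]
        gamma = c[i] / b[i + 1] if has_next else 0.0
        ra.append(-alpha * a[i - 1])
        rb.append(b[i] - alpha * c[i - 1] - (gamma * a[i + 1] if has_next else 0.0))
        rc.append(-gamma * c[i + 1] if has_next else 0.0)
        rd.append(d[i] - alpha * d[i - 1] - (gamma * d[i + 1] if has_next else 0.0))

    x = [0.0] * n
    x[1::2] = cyclic_reduction(ra, rb, rc, rd)

    # Back-substitution: each even unknown depends only on odd neighbors.
    for i in range(0, n, 2):
        left = a[i] * x[i - 1] if i > 0 else 0.0
        right = c[i] * x[i + 1] if i + 1 < n else 0.0
        x[i] = (d[i] - left - right) / b[i]
    return x


def tridiagonal_matvec(a, b, c, x):
    """Multiply the tridiagonal matrix by x (used only to build test data)."""
    n = len(b)
    return [(a[i] * x[i - 1] if i > 0 else 0.0)
            + b[i] * x[i]
            + (c[i] * x[i + 1] if i + 1 < n else 0.0)
            for i in range(n)]


# Small diagonally dominant example with a known solution.
n = 9
a = [0.0] + [-1.0] * (n - 1)
b = [4.0] * n
c = [-1.0] * (n - 1) + [0.0]
x_true = [float(i + 1) for i in range(n)]
x = cyclic_reduction(a, b, c, tridiagonal_matvec(a, b, c, x_true))
```

Within one level, the reduced equations are independent of each other, as are the back-substitutions, which is what makes the method attractive for parallel execution. In the block case each division becomes a block solve, so the block size enters the cost of every step.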
ACM Transactions on Modeling and Computer Simulation | 2011
Sudip K. Seal; Kalyan S. Perumalla
Radio signal strength estimation is essential in many applications, including the design of military radio communications and industrial wireless installations. For scenarios with large or richly featured geographical volumes, parallel processing is required to meet the memory and computation time demands. Here, we present a scalable and efficient parallel execution of the sequential model for radio signal propagation recently developed by Nutaro et al. [2008]. Starting with that model, we (a) provide a vector-based reformulation that has significantly lower computational overhead for event handling, (b) develop a parallel decomposition approach that is amenable to reversibility with minimal computational overheads, (c) present a framework for transparently mapping the conservative time-stepped model into an optimistic parallel discrete event execution, (d) present a new reversible method, along with its analysis and implementation, for inverting the vector-based event model to be executed in an optimistic parallel style of execution, and (e) present performance results from implementation on Cray XT platforms. We demonstrate scalability, with the largest runs tested on up to 127,500 cores of a Cray XT5, enabling simulation of larger scenarios and with faster execution than reported before on the radio propagation model. This also represents the first successful demonstration of the ability to efficiently map a conservative time-stepped model to an optimistic discrete-event execution.