Publication


Featured research published by Padma Raghavan.


International Parallel and Distributed Processing Symposium | 2005

Reducing power with performance constraints for parallel sparse applications

Guangyu Chen; Konrad Malkowski; Mahmut T. Kandemir; Padma Raghavan

Sparse and irregular computations constitute a large fraction of applications in the data-intensive scientific domain. While every effort is made to balance the computational workload in such computations across parallel processors, achieving sustained near machine-peak performance with close-to-ideal load balanced computation-to-processor mapping is inherently difficult. As a result, most of the time, the loads assigned to parallel processors can exhibit significant variations. While there have been numerous past efforts that study this imbalance from the performance viewpoint, to our knowledge, no prior study has considered exploiting the imbalance for reducing power consumption during execution. Power consumption in large-scale clusters of workstations is becoming a critical issue as noted by several recent research papers from both industry and academia. Focusing on sparse matrix computations in which underlying parallel computations and data dependencies can be represented by trees, this paper proposes schemes that save power through voltage/frequency scaling. Our goal is to reduce overall energy consumption by scaling the voltages/frequencies of those processors that are not in the critical path; i.e., our approach is oriented towards saving power without incurring performance penalties.
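
The key mechanism here is converting load imbalance into energy savings: a processor whose subtree finishes before the critical path has slack, so it can run at a proportionally lower voltage/frequency without delaying the result. Below is a minimal sketch of that slack computation for a tree of tasks with known work estimates; the example tree, the work values, and the simple proportional-scaling rule are illustrative assumptions, not the paper's scheme.

```python
# Illustrative sketch: per-subtree frequency scaling from slack in a task tree.
def build_children(parent_of):
    children = {}
    for node, parent in parent_of.items():
        if parent is not None:
            children.setdefault(parent, []).append(node)
    return children

def subtree_critical_path(node, children, work):
    kids = children.get(node, [])
    return work[node] + (max(subtree_critical_path(k, children, work) for k in kids)
                         if kids else 0.0)

def frequency_scales(parent_of, work, root):
    """Map each node to a relative frequency in (0, 1] that preserves the critical path."""
    children = build_children(parent_of)
    scales = {}

    def visit(node, budget):
        cp = subtree_critical_path(node, children, work)
        scales[node] = min(1.0, cp / budget) if budget > 0 else 1.0
        for k in children.get(node, []):
            visit(k, budget - work[node])

    visit(root, subtree_critical_path(root, children, work))
    return scales

# Example elimination-tree-like workload: child -> parent, root has parent None.
parent_of = {"a": "c", "b": "c", "c": None}
work = {"a": 4.0, "b": 1.0, "c": 2.0}
print(frequency_scales(parent_of, work, "c"))  # the subtree under "b" has slack -> scale < 1
```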


SIAM Journal on Matrix Analysis and Applications | 1995

A Cartesian Parallel Nested Dissection Algorithm

Michael T. Heath; Padma Raghavan

This paper is concerned with the distributed parallel computation of an ordering for a symmetric positive definite sparse matrix. The purpose of the ordering is to limit fill and enhance concurrency in the subsequent Cholesky factorization of the matrix. A geometric approach to nested dissection is used, based on a given Cartesian embedding of the graph of the matrix in Euclidean space. The resulting algorithm can be implemented efficiently on massively parallel, distributed memory computers. One unusual feature of the distributed algorithm is that its effectiveness does not depend on data locality, which matters in this context because an appropriate partitioning of the problem is not known until after the ordering has been determined. The ordering algorithm is the first component in a suite of scalable parallel algorithms currently under development for solving large sparse linear systems on massively parallel computers.
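
As a rough illustration of the geometric idea, the sketch below orders vertices by recursive coordinate bisection: split along the axis of largest spread and order the vertices near the median cutting plane last, as a separator. Using coordinates alone (rather than the matrix graph), the fixed `band` width, and the stopping size are simplifying assumptions for illustration only.

```python
import numpy as np

def geometric_nd_order(coords, band=0.5, min_size=8):
    """Nested-dissection-style ordering from vertex coordinates only.

    Recursively bisects along the axis of largest spread; vertices within
    `band` of the median cutting plane act as a coordinate-based separator
    and are ordered last (after both halves).
    """
    def recurse(idx):
        if len(idx) <= min_size:
            return list(idx)
        pts = coords[idx]
        axis = int(np.argmax(pts.max(axis=0) - pts.min(axis=0)))
        median = np.median(pts[:, axis])
        near_cut = np.abs(pts[:, axis] - median) <= band
        left = idx[(pts[:, axis] < median) & ~near_cut]
        right = idx[(pts[:, axis] > median) & ~near_cut]
        sep = idx[near_cut]
        if len(left) == 0 or len(right) == 0:   # degenerate split: stop recursing
            return list(idx)
        return recurse(left) + recurse(right) + list(sep)

    return recurse(np.arange(len(coords)))

# Example: vertices of an 8 x 8 grid with unit spacing; the grid lines nearest the
# median planes are ordered last at each level, as a separator would be.
xs, ys = np.meshgrid(np.arange(8.0), np.arange(8.0))
order = geometric_nd_order(np.column_stack([xs.ravel(), ys.ravel()]))
print(order[:8], order[-8:])
```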


SIAM Journal on Scientific and Statistical Computing | 1989

Distributed orthogonal factorization: Givens and Householder algorithms

Alex Pothen; Padma Raghavan

Several algorithms for orthogonal factorization on distributed memory multiprocessors are designed and implemented. Two of the algorithms employ Householder transformations, a third is based on Givens rotations, and a fourth hybrid algorithm uses Householder transformations and Givens rotations in different phases. The arithmetic and communication complexities of the algorithms are analyzed. The analyses show that the sequential arithmetic terms play a more important role than the communication terms in determining the running times and efficiencies of these algorithms. The hybrid algorithm is the fastest algorithm overall, since its arithmetic cost is lower than that of the Householder algorithms and its communication cost does not increase with the column length of the matrix. The observed execution times of the implementations on an iPSC-286 agree quite well with the complexity analyses. It is also shown that the efficiencies can be approximated using only the arithmetic costs of the algorithms.
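
For reference, a compact sequential Householder QR in NumPy is sketched below; it shows only the local reflector arithmetic, not the distributed column/row mapping, the Givens variant, or the hybrid scheme analyzed in the paper.

```python
import numpy as np

def householder_qr(A):
    """Sequential Householder QR: returns Q (m x m) and R (m x n) with A = Q R."""
    m, n = A.shape
    R = A.astype(float).copy()
    Q = np.eye(m)
    for k in range(min(m - 1, n)):
        x = R[k:, k]
        v = x.copy()
        v[0] += (np.sign(x[0]) if x[0] != 0 else 1.0) * np.linalg.norm(x)
        norm_v = np.linalg.norm(v)
        if norm_v == 0:
            continue                      # column already zero below the diagonal
        v /= norm_v
        # Apply the reflector H = I - 2 v v^T to the trailing block and accumulate into Q.
        R[k:, k:] -= 2.0 * np.outer(v, v @ R[k:, k:])
        Q[:, k:] -= 2.0 * np.outer(Q[:, k:] @ v, v)
    return Q, R

A = np.random.default_rng(0).standard_normal((6, 4))
Q, R = householder_qr(A)
print(np.allclose(Q @ R, A), np.allclose(R[4:], 0), np.allclose(np.triu(R[:4]), R[:4]))
```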


Book | 2006

Parallel processing for scientific computing

Michael A. Heroux; Padma Raghavan; Horst D. Simon

List of Figures
List of Tables
Preface
1. Frontiers of Scientific Computing: An Overview
Part I. Performance Modeling, Analysis, and Optimization
2. Performance Analysis: From Art to Science
3. Approaches to Architecture-Aware Parallel Scientific Computation
4. Achieving High Performance on the BlueGene/L Supercomputer
5. Performance Evaluation and Modeling of Ultra-Scale Systems
Part II. Parallel Algorithms and Enabling Technologies
6. Partitioning and Load Balancing
7. Combinatorial Parallel and Scientific Computing
8. Parallel Adaptive Mesh Refinement
9. Parallel Sparse Solvers, Preconditioners, and Their Applications
10. A Survey of Parallelization Techniques for Multigrid Solvers
11. Fault Tolerance in Large-Scale Scientific Computing
Part III. Tools and Frameworks for Parallel Applications
12. Parallel Tools and Environments: A Survey
13. Parallel Linear Algebra Software
14. High-Performance Component Software Systems
15. Integrating Component-Based Scientific Computing Software
Part IV. Applications of Parallel Computing
16. Parallel Algorithms for PDE-Constrained Optimization
17. Massively Parallel Mixed-Integer Programming
18. Parallel Methods and Software for Multicomponent Simulations
19. Parallel Computational Biology
20. Opportunities and Challenges for Parallel Computing in Science and Engineering
Index


International Conference on Supercomputing | 2011

Characterizing the impact of soft errors on iterative methods in scientific computing

Manu Shantharam; Sowmyalatha Srinivasmurthy; Padma Raghavan

The increase in on-chip transistor count facilitates higher performance, but at the expense of higher susceptibility to soft errors. In this paper, we characterize the challenges posed by soft errors for large-scale applications representative of workloads on supercomputing systems. Such applications are typically based on the computational solution of partial differential equation models using either explicit or implicit methods. In both cases, the execution time is typically dominated by the time spent in the underlying sparse matrix-vector multiplication kernel (SpMV, t ← A·y). We provide a theoretical analysis of the impact of a single soft error through its propagation by a sequence of sparse matrix-vector multiplication operations. Our analysis indicates that a single soft error in some ith component of the vector y can corrupt the entire resultant vector in a relatively short sequence of SpMV operations. Additionally, the propagation pattern corresponds to the sparsity structure of the coefficient matrix A, and the magnitude of the error grows non-linearly as (||A_i*||_2)^k after k SpMV operations, where ||A_i*||_2 is the 2-norm of the ith row of A. We corroborate this analysis with empirical observations on a model heat equation using an explicit method and on well-known sparse matrix systems (matrices from a test suite) for the implicit method using iterative solvers such as CG, PCG, and SOR. Our results indicate that explicit schemes will suffer from soft-error-induced numerical instabilities, exacerbating the intrinsic stability issues of such methods that impose constraints on relative time and space step sizes. For implicit schemes, linear solver performance with the widely used CG and PCG schemes degrades by a factor as high as 200x, whereas a stationary scheme such as SOR is inherently soft error resilient. Our results thus indicate the need for new approaches to achieve soft error resiliency in such methods and a critical evaluation of the trade-offs among multiple metrics, including performance, reliability, and energy.
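
The propagation argument can be illustrated in a few lines of NumPy/SciPy: flip one bit in a single component of y and compare repeated SpMV against a clean run. The 1-D stencil matrix, the choice of flipped bit, and the step count below are illustrative choices, not the paper's test problems or fault model.

```python
import numpy as np
import scipy.sparse as sp

# Illustrative sketch: a single bit flip in one component of y, propagated by
# repeated SpMV (y <- A y) on a 1-D Laplacian-like stencil.
n = 64
A = sp.diags([np.full(n - 1, -1.0), np.full(n, 2.0), np.full(n - 1, -1.0)],
             offsets=[-1, 0, 1], format="csr")

rng = np.random.default_rng(1)
y_clean = rng.random(n)
y_faulty = y_clean.copy()

# Flip bit 62 (high exponent bit) of component i, a worst-case single-bit upset.
i = n // 2
view = y_faulty[i:i + 1].view(np.uint64)
view ^= np.uint64(1) << np.uint64(62)

for k in range(1, 11):
    y_clean = A @ y_clean
    y_faulty = A @ y_faulty
    err = np.abs(y_faulty - y_clean)
    print(f"step {k:2d}: corrupted components = {(err > 0).sum():3d}, "
          f"max |error| = {err.max():.3e}")
```

The number of corrupted components grows with the matrix's sparsity pattern (one extra neighbor per step for this tridiagonal stencil), in line with the analysis above.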


International Parallel and Distributed Processing Symposium | 2008

A helper thread based EDP reduction scheme for adapting application execution in CMPs

Yang Ding; Mahmut T. Kandemir; Padma Raghavan; Mary Jane Irwin

With the shift in the architecture domain toward chip multiprocessors (CMPs) and the shift in the application domain toward increasingly data-intensive workloads, issues such as performance, energy efficiency, and CPU availability are becoming increasingly critical. CPU availability can change dynamically for several reasons, such as thermal overload, an increase in transient errors, or operating system scheduling. An important question in this context is how to adapt, in a CMP, the execution of a given application to CPU availability changes at runtime. This paper studies this problem, targeting the energy-delay product (EDP) as the main metric to optimize. We first observe that, in adapting the application execution to varying CPU availability, one needs to consider the number of CPUs to use, the number of application threads to accommodate, and the voltage/frequency levels to employ (if the CMP has this capability). We then propose to use helper threads to adapt the application execution to CPU availability changes, with the goal of minimizing the EDP. The helper thread runs in parallel with the application execution threads and tries to determine the ideal number of CPUs, threads, and voltage/frequency levels to employ at any given point in the execution. We illustrate this idea using two applications (Fast Fourier Transform and MultiGrid) under different execution scenarios. The results collected through our experiments are very promising and indicate that significant EDP reductions are possible using helper threads. For example, we achieved up to 66.3% and 83.3% savings in EDP when adjusting all the parameters properly in the FFT and MG applications, respectively.
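
A bare-bones sketch of the helper-thread idea follows. The analytic model in measure_energy_and_delay is a hypothetical placeholder, as is the configuration space; a real implementation would read hardware counters and apply the chosen CPU count, thread count, and voltage/frequency level through the runtime and operating system.

```python
import itertools
import threading
import time

def measure_energy_and_delay(cpus, threads, freq):
    """Hypothetical placeholder: a real system would use hardware counters / timers."""
    delay = 100.0 / (min(cpus, threads) * freq)       # more parallelism & higher freq -> faster
    energy = cpus * (freq ** 2) * delay               # dynamic power ~ cpus * freq^2
    return energy, delay

def edp(config):
    energy, delay = measure_energy_and_delay(*config)
    return energy * delay

class HelperThread(threading.Thread):
    """Periodically re-selects (cpus, threads, freq) minimizing EDP as availability changes."""
    def __init__(self, available_cpus, freq_levels, interval=0.5):
        super().__init__(daemon=True)
        self.available_cpus = available_cpus            # may be updated at runtime
        self.freq_levels = freq_levels
        self.interval = interval
        self.best = None
        self._stop = threading.Event()

    def run(self):
        while not self._stop.is_set():
            candidates = itertools.product(
                range(1, self.available_cpus + 1),      # CPUs to use
                range(1, 2 * self.available_cpus + 1),  # application threads
                self.freq_levels)                       # voltage/frequency level
            self.best = min(candidates, key=edp)        # a real system would apply this choice
            self._stop.wait(self.interval)

    def stop(self):
        self._stop.set()

helper = HelperThread(available_cpus=4, freq_levels=[0.8, 1.0, 1.2])
helper.start()
time.sleep(1.0)
print("chosen (cpus, threads, freq):", helper.best)
helper.stop()
```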


SIAM Journal on Matrix Analysis and Applications | 1999

Performance of Greedy Ordering Heuristics for Sparse Cholesky Factorization

Esmond G. Ng; Padma Raghavan

Greedy algorithms for ordering sparse matrices for Cholesky factorization can be based on different metrics. Minimum degree, a popular and effective greedy ordering scheme, minimizes the number of nonzero entries in the rank-1 update (degree) at each step of the factorization. Alternatively, minimum deficiency minimizes the number of nonzero entries introduced (deficiency) at each step of the factorization. In this paper we develop two new heuristics: modified minimum deficiency (MMDF) and modified multiple minimum degree (MMMD). The former uses a metric similar to deficiency while the latter uses a degree-like metric. Our experiments reveal that on the average, MMDF orderings result in 21% fewer operations to factor than minimum degree; MMMD orderings result in 15% fewer operations to factor than minimum degree. MMMD requires on the average 7--13% more time than minimum degree, while MMDF requires on the average 33--34% more time than minimum degree.
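
For context, the sketch below is a textbook greedy minimum-degree ordering: repeatedly eliminate the vertex of smallest current degree and add the fill edges among its neighbors. It omits the multiple-elimination and quotient-graph machinery of practical MMD codes as well as the MMDF/MMMD metrics introduced in the paper.

```python
import heapq

def minimum_degree_order(adj):
    """Greedy minimum-degree ordering of a symmetric sparsity graph.

    adj: dict mapping vertex -> set of neighbors. Eliminating a vertex connects
    its remaining neighbors pairwise, mirroring the fill created by symbolic
    Cholesky elimination.
    """
    adj = {v: set(nbrs) for v, nbrs in adj.items()}
    heap = [(len(nbrs), v) for v, nbrs in adj.items()]
    heapq.heapify(heap)
    eliminated, order = set(), []
    while heap:
        deg, v = heapq.heappop(heap)
        if v in eliminated or deg != len(adj[v]):
            continue                      # stale heap entry
        order.append(v)
        eliminated.add(v)
        nbrs = adj.pop(v)
        for u in nbrs:
            adj[u].discard(v)
        for u in nbrs:                    # connect the eliminated vertex's neighbors (fill)
            adj[u] |= nbrs - {u}
        for u in nbrs:
            heapq.heappush(heap, (len(adj[u]), u))
    return order

# Example: 5-vertex "arrow" graph; vertex 0 is connected to everyone.
adj = {0: {1, 2, 3, 4}, 1: {0, 2}, 2: {0, 1}, 3: {0, 4}, 4: {0, 3}}
print(minimum_degree_order(adj))          # low-degree vertices are eliminated before the hub
```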


IEEE International Conference on High Performance Computing, Data, and Analytics | 2012

NUMA-aware graph mining techniques for performance and energy efficiency

Michael R. Frasca; Kamesh Madduri; Padma Raghavan

We investigate dynamic methods to improve the power and performance profiles of large irregular applications on modern multi-core systems. In this context, we study a large sparse graph application, Betweenness Centrality, and focus on memory behavior as core count scales. We introduce new techniques to efficiently map the computational demands onto non-uniform memory architectures (NUMA). Our dynamic design adapts to hardware topology and dramatically improves both energy and performance. These gains are more significant at higher core counts. We implement a scheme for adaptive data layout, which reorganizes the graph after observing parallel access patterns, and a dynamic task scheduler that encourages shared data between neighboring cores. We measure performance and energy consumption on a modern multi-core machine and observe that mean execution time is reduced by 51.2% and energy is reduced by 52.4%.
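
A toy sketch of the adaptive-data-layout half of the idea: after observing which thread touches which vertices, permute the adjacency matrix so each thread's vertices become contiguous, letting the runtime place each contiguous block on the memory of the NUMA node running that thread. The synthetic graph and the randomly generated "observed" ownership below are stand-ins; the paper's scheduler and layout policies are more involved.

```python
import numpy as np
import scipy.sparse as sp

def relayout_by_owner(A_csr, owner):
    """Permute a CSR adjacency matrix so rows owned by the same thread become contiguous."""
    perm = np.argsort(owner, kind="stable")   # group vertices by owning thread
    return A_csr[perm][:, perm], perm

n, n_threads = 16, 4
rng = np.random.default_rng(0)
A = sp.random(n, n, density=0.2, random_state=0, format="csr")
A = ((A + A.T) > 0).astype(np.int8).tocsr()   # symmetric 0/1 adjacency

observed_owner = rng.integers(0, n_threads, size=n)   # stand-in for a profiled access pattern
A2, perm = relayout_by_owner(A, observed_owner)
print("owners after relayout:", observed_owner[perm])
```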


International Parallel and Distributed Processing Symposium | 2010

Analyzing the soft error resilience of linear solvers on multicore multiprocessors

Konrad Malkowski; Padma Raghavan; Mahmut T. Kandemir

As chip transistor densities continue to increase, soft errors (bit flips) are becoming a significant concern in networked multiprocessors with multicore nodes. Large cache structures in multicore processors are especially susceptible to soft errors as they occupy a significant portion of the chip area. In this paper, we consider the impact of soft errors in caches on the resilience and energy efficiency of sparse linear solvers. In particular, we focus on two widely used sparse iterative solvers, namely Conjugate Gradient (CG) and Generalized Minimum Residual (GMRES). We propose two adaptive schemes, (i) a Write Eviction Hybrid ECC (WEH-ECC) scheme for the L1 cache and (ii) a Prefetcher Based Adaptive ECC (PBA-ECC) scheme for the L2 cache, and evaluate the energy and reliability trade-offs they bring in the context of the GMRES and CG solvers. Our evaluations indicate that WEH-ECC reduces CG and GMRES soft error vulnerability by a factor of 18 to 220 in the L1 cache, relative to an unprotected L1 cache, and reduces energy consumption by 16%, relative to a cache with strong protection. The PBA-ECC scheme reduces CG and GMRES soft error vulnerability by a factor of 9 × 10^3 to 8.6 × 10^9, relative to an unprotected L2 cache, and reduces energy consumption by 8.5%, relative to a cache with strong ECC protection. Our energy overheads over unprotected L1 and L2 caches are 5% and 14%, respectively.


Information Processing and Management | 2001

Level search schemes for information filtering and retrieval

Xiaoyan Zhang; Michael W. Berry; Padma Raghavan

Latent semantic indexing (LSI) has been demonstrated to outperform lexical matching in information retrieval. However, the enormous cost associated with the singular value decomposition (SVD) of the large term-by-document matrix becomes a barrier for its application to scalable information retrieval. This work shows that information filtering using level search techniques can reduce the SVD computation cost for LSI. For each query, level search extracts a much smaller subset of the original term-by-document matrix, containing on average 27% of the original non-zero entries. When LSI is applied to such subsets, the average precision can degrade by as much as 23% due to level search filtering. However, for some document collections an increase in precision has also been observed. Level search can be enhanced further with a pruning scheme that deletes, from the query-specific submatrix, terms connected to only one document. Such pruning has achieved a 65% reduction (on average) in the number of non-zeros with a precision loss of 5% for most collections.
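
The filtering step can be sketched as a breadth-first level search on the bipartite term-document graph starting from the query's terms, followed by a truncated SVD of the extracted submatrix. The synthetic matrix, the single level, and the rank below are illustrative choices, not the collections or parameters used in the paper.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import svds

def level_search(A_csr, query_terms, levels=1):
    """Return (term_indices, doc_indices) reached from the query terms by BFS levels."""
    A_csc = A_csr.tocsc()
    terms, docs = set(query_terms), set()
    frontier_terms = set(query_terms)
    for _ in range(levels):
        new_docs = set()
        for t in frontier_terms:                       # terms -> documents containing them
            new_docs.update(A_csr.indices[A_csr.indptr[t]:A_csr.indptr[t + 1]])
        new_docs -= docs
        docs |= new_docs
        frontier_terms = set()
        for d in new_docs:                             # documents -> their other terms
            frontier_terms.update(A_csc.indices[A_csc.indptr[d]:A_csc.indptr[d + 1]])
        frontier_terms -= terms
        terms |= frontier_terms
    return np.sort(list(terms)), np.sort(list(docs))

A = sp.random(500, 200, density=0.02, random_state=2, format="csr")   # term-by-document matrix
t_idx, d_idx = level_search(A, query_terms=[0, 1, 2])
sub = A[t_idx][:, d_idx]
print(f"kept {sub.nnz} of {A.nnz} nonzeros ({sub.shape[0]} terms x {sub.shape[1]} docs)")

k = min(10, min(sub.shape) - 1)
if k >= 1:
    U, s, Vt = svds(sub, k=k)                          # LSI on the query-specific submatrix
    print("leading singular values:", np.round(np.sort(s)[::-1][:5], 3))
```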

Collaboration


An overview of Padma Raghavan's frequent collaborators and their affiliations.

Top Co-Authors

Mahmut T. Kandemir, Pennsylvania State University
Mary Jane Irwin, Pennsylvania State University
Konrad Malkowski, Pennsylvania State University
Keita Teranishi, Sandia National Laboratories
Esmond G. Ng, Lawrence Berkeley National Laboratory
Manu Shantharam, Pennsylvania State University
Sanjukta Bhowmick, University of Nebraska Omaha
Anirban Chatterjee, Pennsylvania State University
Ingyu Lee, Pennsylvania State University
Joshua Dennis Booth, Sandia National Laboratories