Publication


Featured research published by Manu Shantharam.


International Conference on Supercomputing | 2011

Characterizing the impact of soft errors on iterative methods in scientific computing

Manu Shantharam; Sowmyalatha Srinivasmurthy; Padma Raghavan

The increase in on-chip transistor count facilitates achieving higher performance, but at the expense of higher susceptibility to soft errors. In this paper, we characterize the challenges posed by soft errors for large-scale applications representative of workloads on supercomputing systems. Such applications are typically based on the computational solution of partial differential equation models using either explicit or implicit methods. In both cases, the execution time of such applications is typically dominated by the time spent in their underlying sparse matrix vector multiplication kernel (SpMV, t ← A·y). We provide a theoretical analysis of the impact of a single soft error through its propagation by a sequence of sparse matrix vector multiplication operations. Our analysis indicates that a single soft error in some ith component of the vector y can corrupt the entire resultant vector in a relatively short sequence of SpMV operations. Additionally, the propagation pattern corresponds to the sparsity structure of the coefficient matrix A, and the magnitude of the error grows non-linearly as (||Ai*||2)^k after k SpMV operations, where ||Ai*||2 is the 2-norm of the ith row of A. We corroborate this analysis with empirical observations on a model heat equation using an explicit method and well-known sparse matrix systems (matrices from a test suite) for the implicit method using iterative solvers such as CG, PCG, and SOR. Our results indicate that explicit schemes will suffer from soft-error-induced numerical instabilities, thus exacerbating the intrinsic stability issues of such methods, which impose constraints on relative time and space step sizes. For implicit schemes, linear solver performance through the widely used CG and PCG schemes degrades by a factor as high as 200x, whereas a stationary scheme such as SOR is inherently soft error resilient. Our results thus indicate the need for new approaches to achieve soft error resiliency in such methods and a critical evaluation of the tradeoffs among multiple metrics, including performance, reliability, and energy.
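
As a rough illustration of the propagation pattern described above (not the paper's experimental setup; the matrix, error magnitude, and step count are illustrative choices), the sketch below injects a single perturbation into one component of y and tracks how it spreads through repeated SpMV operations:

```python
# Minimal sketch: a single soft error in y[i] spreads along the sparsity
# pattern of A under repeated SpMV, and its magnitude grows with each step.
import numpy as np
import scipy.sparse as sp

n = 100
# 1-D Laplacian stencil: a simple PDE-like sparse matrix.
A = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n), format="csr")

y_clean = np.ones(n)
y_fault = y_clean.copy()
i = n // 2
y_fault[i] += 1e-8          # model a single soft error in component i

for k in range(1, 11):
    y_clean = A @ y_clean
    y_fault = A @ y_fault
    err = y_fault - y_clean
    corrupted = np.count_nonzero(err)   # spreads along A's sparsity pattern
    print(f"step {k}: {corrupted} corrupted entries, "
          f"max |error| = {np.abs(err).max():.3e}")
```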


International Conference on Parallel Processing | 2012

Adapting Sparse Triangular Solution to GPUs

Brad Suchoski; Caleb Severn; Manu Shantharam; Padma Raghavan

High performance computing systems are increasingly incorporating hybrid CPU/GPU nodes to accelerate the rate at which floating point calculations can be performed for scientific applications. Currently, a key challenge is adapting scientific applications to such systems when the underlying computations are sparse, such as sparse linear solvers for the simulation of partial differential equation models using semi-implicit methods. A key bottleneck is then sparse triangular solution for solvers such as preconditioned conjugate gradients (PCG). We show that sparse triangular solution can be effectively mapped to GPUs by extracting very large degrees of fine-grained parallelism using graph coloring. We develop simple performance models to predict these effects at the intersection of data and hardware attributes, and we evaluate our scheme on an NVIDIA Tesla M2090 GPU relative to the level-set scheme developed at NVIDIA. Our results indicate that our approach significantly enhances the available fine-grained parallelism to speed up PCG iteration time compared to the NVIDIA scheme, by a factor with a geometric mean of 5.41 on a single GPU and speedups as high as 63 in some cases.
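
For context, the level-set analysis used as the baseline here can be sketched as follows; this is a simplified illustration with a made-up toy matrix, not the paper's coloring-based scheme:

```python
# Sketch of level-set scheduling for a forward triangular solve: rows in
# the same level have no dependencies on each other and can run in parallel.
import numpy as np
import scipy.sparse as sp

def level_sets(L):
    """Group rows of a sparse lower-triangular matrix into levels."""
    L = L.tocsr()
    n = L.shape[0]
    level = np.zeros(n, dtype=int)
    for i in range(n):
        for j in L.indices[L.indptr[i]:L.indptr[i + 1]]:
            if j < i:                      # dependency on an earlier unknown
                level[i] = max(level[i], level[j] + 1)
    return level

# Toy example: random strictly-lower pattern plus a unit diagonal.
L = sp.csr_matrix(np.tril(np.random.rand(8, 8) > 0.6, -1) + np.eye(8))
lv = level_sets(L)
for k in range(lv.max() + 1):
    print(f"level {k}: rows {np.where(lv == k)[0].tolist()}")
```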


Parallel Processing Letters | 2013

Speedup-Aware Co-Schedules for Efficient Workload Management

Manu Shantharam; Youngtae Youn; Padma Raghavan

Many HPC systems run workloads comprising scientific simulations where the problem size is often fixed while model parameters are changed for each run. The speedup profile of such applications can often be determined using data from previous runs. Many applications exhibit sublinear speedups and thus only incremental improvements in execution time when larger numbers of processors are used. In this paper, we consider a workload of applications queued for execution and co-schedules that can exploit speedup profiles. More specifically, we seek co-schedules to reduce workload completion time (i.e., makespan), and total energy consumed by the system. Additionally, we explore how co-schedules may reduce average application turnaround time and thus benefit the user. We propose speedup-aware processor partitioning (SAPP) co-scheduling schemes, including optimal and greedy variants. We show that our greedy co-schedules reduce system energy, workload makespan and average application turnaround time when compared to a base scheme that runs each application on all available processors. Our experiments demonstrate that on a system with 128 processors, on average, our SAPP-greedy co-schedules can reduce makespan and turnaround time by 20% and decrease total energy consumed by 40%.
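
A minimal sketch of the greedy idea, assuming Amdahl-style speedup profiles with made-up works and serial fractions (the paper's SAPP schemes, pairing rules, and objectives are more involved):

```python
# Hedged sketch of speedup-aware greedy processor partitioning: hand each
# processor to the application whose execution time drops the most.
def amdahl_time(work, serial_frac, p):
    """Execution time under an Amdahl-style speedup profile."""
    return work * (serial_frac + (1.0 - serial_frac) / p)

def greedy_partition(apps, total_procs):
    """Start each app at one processor, then assign the rest greedily."""
    alloc = {name: 1 for name, _, _ in apps}
    def gain(app):
        name, work, f = app
        p = alloc[name]
        return amdahl_time(work, f, p) - amdahl_time(work, f, p + 1)
    for _ in range(total_procs - len(apps)):
        best = max(apps, key=gain)
        alloc[best[0]] += 1
    return alloc

# Illustrative workload: (name, work, serial fraction).
apps = [("A", 100.0, 0.05), ("B", 80.0, 0.40), ("C", 60.0, 0.10)]
print(greedy_partition(apps, 16))  # sublinear apps receive fewer processors
```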


IEEE International Conference on High Performance Computing, Data, and Analytics | 2011

Exploiting dense substructures for fast sparse matrix vector multiplication

Manu Shantharam; Anirban Chatterjee; Padma Raghavan

The execution time of many scientific computing applications is dominated by the time spent in performing sparse matrix vector multiplication (SMV; y ← A · x). We consider improving the performance of SMV on multicores by exploiting the dense substructures that are inherently present in many sparse matrices derived from partial differential equation models. First, we identify indistinguishable vertices, i.e., vertices with the same adjacency structure, in a graph representation of the sparse matrix (A) and group them into a supernode. Next, we identify effectively dense blocks within the matrix by grouping rows and columns in each supernode. Finally, by using a suitable data structure for this representation of the matrix, we reduce the number of load operations during SMV while exactly preserving the original sparsity structure of A. In addition, we use ordering techniques to enhance locality in accesses to the vector, x, to yield an SMV kernel that exploits the effectively dense substructures in the matrix. We evaluate our scheme on Intel Nehalem and AMD Shanghai processors. We observe that for larger matrices on the Intel Nehalem processor, our method improves performance on average by 37.35% compared with the traditional compressed sparse row scheme (a blocked compressed form improves performance on average by 30.27%). Benefits of our new format are similar for the AMD processor. More importantly, if we pick for each matrix the best among our method and the blocked compressed scheme, the average performance improvements increase to 40.85%. Additional results indicate that the best performing scheme varies depending on the matrix and the system. We therefore propose an effective density measure that could be used for method selection, thus adding to the variety of options for an auto-tuned optimized SMV kernel that can exploit sparse matrix properties and hardware attributes for high performance.
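
The first step, grouping indistinguishable vertices, can be sketched as follows; the paper's full scheme additionally builds a compact blocked data structure and applies locality-enhancing reordering, which this illustration omits:

```python
# Sketch of supernode detection: rows of A with identical sparsity
# patterns ("indistinguishable vertices") are grouped together.
from collections import defaultdict
import numpy as np
import scipy.sparse as sp

def find_supernodes(A):
    A = A.tocsr()
    groups = defaultdict(list)
    for i in range(A.shape[0]):
        pattern = tuple(A.indices[A.indptr[i]:A.indptr[i + 1]])
        groups[pattern].append(i)
    return list(groups.values())

A = sp.csr_matrix(np.array([[1, 1, 0, 0],
                            [1, 1, 0, 0],
                            [0, 0, 1, 1],
                            [0, 0, 1, 1]], dtype=float))
print(find_supernodes(A))   # [[0, 1], [2, 3]]
```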


European Conference on Parallel Processing | 2009

Hybrid Techniques for Fast Multicore Simulation

Manu Shantharam; Padma Raghavan; Mahmut T. Kandemir

One of the challenges in the design of multicore architectures concerns the fast evaluation of hardware design tradeoffs using simulation techniques. Simulation tools for multicore architectures tend to have long execution times that grow linearly with the number of cores simulated. In this paper, we present two hybrid techniques for fast and accurate multicore simulation. Our first method, the Monte Carlo Co-Simulation (MCCS) scheme, considers application phases and, within each phase, interleaves a Monte Carlo modeling scheme with a traditional simulator such as Simics. Our second method, the Curve Fitting Based Simulation (CFBS) scheme, is tailored to evaluate the behavior of applications with multiple iterations, such as scientific applications that have consistent cycles-per-instruction (CPI) behavior within a subroutine over different iterations. In our CFBS method, we represent the CPI profile of a subroutine as a signature using curve fitting and represent the entire application execution as a set of signatures to predict performance metrics. Our results indicate that MCCS can reduce simulation time by as much as a factor of 2.37, with a speedup of 1.77 on average compared to Simics. We also observe that CFBS can reduce simulation time by as much as a factor of 13.6, with a speedup of 6.24 on average. The observed average relative errors in CPI compared to Simics are 32% for MCCS and significantly lower, at 2%, for CFBS.
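
The curve-fitting idea behind CFBS can be illustrated roughly as follows, with a synthetic CPI profile and an arbitrary polynomial degree standing in for the paper's signatures:

```python
# Illustrative sketch: fit a subroutine's CPI profile once, then replay the
# fitted signature to predict later iterations instead of simulating them.
import numpy as np

t = np.linspace(0.0, 1.0, 50)    # normalized position within the subroutine
cpi_observed = 1.2 + 0.3 * np.sin(2 * np.pi * t) + 0.02 * np.random.randn(50)

signature = np.polyfit(t, cpi_observed, deg=5)    # compact CPI signature
cpi_predicted = np.polyval(signature, t)          # reused for later iterations

rel_err = np.abs(cpi_predicted - cpi_observed).mean() / cpi_observed.mean()
print(f"mean relative CPI error: {rel_err:.2%}")
```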


International Journal of High Performance Computing Applications | 2018

Co-scheduling Amdahl applications on cache-partitioned systems

Guillaume Aupy; Anne Benoit; Sicheng Dai; Loïc Pottier; Padma Raghavan; Yves Robert; Manu Shantharam

Cache-partitioned architectures allow subsections of the shared last-level cache (LLC) to be exclusively reserved for some applications. This technique dramatically limits interactions between applications that are concurrently executing on a multicore machine. Consider n applications that execute concurrently, with the objective to minimize the makespan, defined as the maximum completion time of the n applications. Key scheduling questions are as follows: (i) which proportion of cache and (ii) how many processors should be given to each application? In this article, we provide answers to (i) and (ii) for Amdahl applications. Even though the problem is shown to be NP-complete, we give key elements to determine the subset of applications that should share the LLC (while remaining ones only use their smaller private cache). Building upon these results, we design efficient heuristics for Amdahl applications. Extensive simulations demonstrate the usefulness of co-scheduling when our efficient cache partitioning strategies are deployed.
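
As a hedged illustration of scheduling questions (i) and (ii), the sketch below computes completion times and the makespan for given processor and cache allocations; the cache-penalty term is a made-up stand-in, not the paper's model:

```python
# Illustrative makespan computation for co-scheduled Amdahl applications.
# The cache factor is an assumed penalty that shrinks as an application's
# LLC share grows; the paper's actual execution-time model differs.
def completion_time(work, serial_frac, procs, cache_frac, penalty=0.5):
    amdahl = serial_frac + (1.0 - serial_frac) / procs
    cache = 1.0 + penalty * (1.0 - cache_frac)   # fewer misses w/ more cache
    return work * amdahl * cache

apps = [  # (work, serial fraction) -- illustrative values
    (100.0, 0.05),
    (60.0, 0.20),
]
procs = [12, 4]          # processors per application
cache = [0.7, 0.3]       # LLC fractions, summing to at most 1

times = [completion_time(w, s, p, x)
         for (w, s), p, x in zip(apps, procs, cache)]
print("completion times:", times, "makespan:", max(times))
```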


International Parallel and Distributed Processing Symposium | 2017

Co-Scheduling Algorithms for Cache-Partitioned Systems

Guillaume Aupy; Anne Benoit; Loïc Pottier; Padma Raghavan; Yves Robert; Manu Shantharam

Cache-partitioned architectures allow subsections of the shared last-level cache (LLC) to be exclusively reserved for some applications. This technique dramatically limits interactions between applications that are concurrently executing on a multicore machine. Consider n applications that execute concurrently, with the objective to minimize the makespan, defined as the maximum completion time of the n applications. Key scheduling questions are: (i) which proportion of cache and (ii) how many processors should be given to each application? Here, we assign rational numbers of processors to each application, since they can be shared across applications through multi-threading. In this paper, we provide answers to (i) and (ii) for perfectly parallel applications. Even though the problem is shown to be NP-complete, we give key elements to determine the subset of applications that should share the LLC (while remaining ones only use their smaller private cache). Building upon these results, we design efficient heuristics for general applications. Extensive simulations demonstrate the usefulness of co-scheduling when our efficient cache partitioning strategies are deployed.
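
One building block here can be shown directly: for perfectly parallel applications sharing p processors with rational shares allowed, the makespan-minimizing allocation makes all applications finish together. The sketch below omits the cache dimension studied in the paper:

```python
# For perfectly parallel apps with T_i = w_i / p_i and rational processor
# shares, giving p_i proportional to w_i equalizes all finish times, which
# minimizes the makespan (cache effects ignored in this sketch).
def rational_allocation(works, total_procs):
    total = sum(works)
    return [total_procs * w / total for w in works]

works = [120.0, 60.0, 20.0]
alloc = rational_allocation(works, 10.0)
print("shares:", alloc)
print("finish times:", [w / p for w, p in zip(works, alloc)])  # all equal
```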


IEEE International Conference on High Performance Computing, Data, and Analytics | 2015

Tuning tasks, granularity, and scratchpad size for energy efficiency

Pietro Cicotti; Manu Shantharam; Laura Carrington

Co-design involves jointly adapting hardware and software with an optimization goal in mind. Such a process is considered necessary to overcome the performance and energy efficiency challenges in designing Exascale systems and applications, and to ensure that both scientists and system architects understand the tradeoffs and implications of their design choices. In this paper we evaluate energy efficiency on an Exascale strawman architecture for two proxy applications from DoE co-design centers: CoMD and HPGMG. The applications were rewritten for the Open Community Runtime (OCR), a programming model and runtime system for Exascale research. For the evaluation we used a functional simulator developed by Intel. Specifically, we investigated code variants and system configurations to explore co-design tradeoffs and gain insight into the interplay between application behavior and memory configurations. We observed that in CoMD, using force symmetry reduces energy consumption as effectively as it reduces the amount of computation needed, even though it requires atomic operations or scheduling support. Reducing the granularity of the tasks incurs a significant overhead that outweighs the potential benefit of increased locality. For HPGMG, we found that changing the size of the scratchpad made no significant difference in energy consumption, suggesting that the code is for the most part not taking advantage of the local memory; finer blocking should be explored to evaluate the balance between greater locality and the overhead introduced.
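
The force-symmetry tradeoff discussed for CoMD can be sketched with a toy Lennard-Jones kernel (unit parameters, unrelated to CoMD's actual code): each pairwise force is computed once and applied to both particles, and the two opposing writes are what would require atomic operations or scheduling support in a parallel runtime:

```python
# Sketch of force symmetry: visit each pair once and use Newton's third
# law, halving the pair computations at the cost of two writes per pair.
import numpy as np

def lj_forces_symmetric(pos):
    """Naive O(n^2 / 2) Lennard-Jones forces using force symmetry."""
    n = len(pos)
    forces = np.zeros_like(pos)
    for i in range(n):
        for j in range(i + 1, n):          # each pair visited once
            r = pos[j] - pos[i]
            d2 = np.dot(r, r)
            f = (24.0 / d2**4) * (2.0 / d2**3 - 1.0) * r  # LJ, eps = sigma = 1
            forces[i] -= f                  # action ...
            forces[j] += f                  # ... and reaction
    return forces

pos = np.random.rand(8, 3) * 3.0
print(lj_forces_symmetric(pos))
```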


International Conference on Supercomputing | 2012

Fault tolerant preconditioned conjugate gradient for sparse linear system solution

Manu Shantharam; Sowmyalatha Srinivasmurthy; Padma Raghavan


Journal of Scheduling | 2016

Co-scheduling algorithms for high-throughput workload execution

Guillaume Aupy; Manu Shantharam; Anne Benoit; Yves Robert; Padma Raghavan

Collaboration


Dive into Manu Shantharam's collaborations.

Top Co-Authors

Padma Raghavan (Pennsylvania State University)
Anne Benoit (École normale supérieure de Lyon)
Guillaume Aupy (École normale supérieure de Lyon)
Laura Carrington (San Diego Supercomputer Center)
Pietro Cicotti (University of California)
Loïc Pottier (École normale supérieure de Lyon)
Yves Robert (École normale supérieure de Lyon)
Anirban Chatterjee (Pennsylvania State University)