Sameer Shende
University of Oregon
Network
Latest external collaboration on country level. Dive into details by clicking on the dots.
Publication
Featured researches published by Sameer Shende.
ieee international conference on high performance computing data and analytics | 2006
Sameer Shende; Allen D. Malony
The ability of performance technology to keep pace with the growing complexity of parallel and distributed systems depends on robust performance frameworks that can at once provide system-specific performance capabilities and support high-level performance problem solving. Flexibility and portability in empirical methods and processes are influenced primarily by the strategies available for instrmentation and measurement, and how effectively they are integrated and composed. This paper presents the TAU (Tuning and Analysis Utilities) parallel performance sytem and describe how it addresses diverse requirements for performance observation and analysis.
Computational Science & Discovery | 2009
J.H. Chen; Alok N. Choudhary; B.R. de Supinski; M. DeVries; Evatt R. Hawkes; Scott Klasky; Wei-keng Liao; Kwan-Liu Ma; John M. Mellor-Crummey; N Podhorszki; Ramanan Sankaran; Sameer Shende; Chun Sang Yoo
Computational science is paramount to the understanding of underlying processes in internal combustion engines of the future that will utilize non-petroleum-based alternative fuels, including carbon-neutral biofuels, and burn in new combustion regimes that will attain high efficiency while minimizing emissions of particulates and nitrogen oxides. Next-generation engines will likely operate at higher pressures, with greater amounts of dilution and utilize alternative fuels that exhibit a wide range of chemical and physical properties. Therefore, there is a significant role for high-fidelity simulations, direct numerical simulations (DNS), specifically designed to capture key turbulence-chemistry interactions in these relatively uncharted combustion regimes, and in particular, that can discriminate the effects of differences in fuel properties. In DNS, all of the relevant turbulence and flame scales are resolved numerically using high-order accurate numerical algorithms. As a consequence terascale DNS are computationally intensive, require massive amounts of computing power and generate tens of terabytes of data. Recent results from terascale DNS of turbulent flames are presented here, illustrating its role in elucidating flame stabilization mechanisms in a lifted turbulent hydrogen/air jet flame in a hot air coflow, and the flame structure of a fuel-lean turbulent premixed jet flame. Computing at this scale requires close collaborations between computer and combustion scientists to provide optimized scaleable algorithms and software for terascale simulations, efficient collective parallel I/O, tools for volume visualization of multiscale, multivariate data and automating the combustion workflow. The enabling computer science, applied to combustion science, is also required in many other terascale physics and engineering simulations. In particular, performance monitoring is used to identify the performance of key kernels in the DNS code, S3D and especially memory intensive loops in the code. Through the careful application of loop transformations, data reuse in cache is exploited thereby reducing memory bandwidth needs, and hence, improving S3Ds nodal performance. To enhance collective parallel I/O in S3D, an MPI-I/O caching design is used to construct a two-stage write-behind method for improving the performance of write-only operations. The simulations generate tens of terabytes of data requiring analysis. Interactive exploration of the simulation data is enabled by multivariate time-varying volume visualization. The visualization highlights spatial and temporal correlations between multiple reactive scalar fields using an intuitive user interface based on parallel coordinates and time histogram. Finally, an automated combustion workflow is designed using Kepler to manage large-scale data movement, data morphing, and archival and to provide a graphical display of run-time diagnostics.
ieee international conference on high performance computing data and analytics | 2006
Benjamin A. Allan; Robert C. Armstrong; David E. Bernholdt; Felipe Bertrand; Kenneth Chiu; Tamara L. Dahlgren; Kostadin Damevski; Wael R. Elwasif; Thomas Epperly; Madhusudhan Govindaraju; Daniel S. Katz; James Arthur Kohl; Manoj Kumar Krishnan; Gary Kumfert; J. Walter Larson; Sophia Lefantzi; Michael J. Lewis; Allen D. Malony; Lois C. Mclnnes; Jarek Nieplocha; Boyana Norris; Steven G. Parker; Jaideep Ray; Sameer Shende; Theresa L. Windus; Shujia Zhou
The Common Component Architecture (CCA) provides a means for software developers to manage the complexity of large-scale scientific simulations and to move toward a plug-and-play environment for high-performance coputing. In the scientific computing context, component models also promote collaboration using independently developed software, thereby allowing particular individals or groups to focus on the aspects of greatest interest to them. The CCA supports parallel and distributed coputing as well as local high-performance connections between components in a language-independent manner. The design places minimal requirements on components and thus facilitates the integration of existing code into the CCA environment. The CCA model imposes minimal ovehead to minimize the impact on application performance. The focus on high performance distinguishes the CCA from most other component models. The CCA is being applied within an increasing range of disciplines, including cobustion research, global climate simulation, and computtional chemistry.
Parallel Tools Workshop | 2012
Andreas Knüpfer; Christian Rössel; Dieter an Mey; Scott Biersdorff; Kai Diethelm; Dominic Eschweiler; Markus Geimer; Michael Gerndt; Daniel Lorenz; Allen D. Malony; Wolfgang E. Nagel; Yury Oleynik; Peter Philippen; Pavel Saviankou; Dirk Schmidl; Sameer Shende; Ronny Tschüter; Michael Wagner; Bert Wesarg; Felix Wolf
This paper gives an overview about the Score-P performance measurement infrastructure which is being jointly developed by leading HPC performance tools groups. It motivates the advantages of the joint undertaking from both the developer and the user perspectives, and presents the design and components of the newly developed Score-P performance measurement infrastructure. Furthermore, it contains first evaluation results in comparison with existing performance tools and presents an outlook to the long-term cooperative development of the new system.
european conference on parallel processing | 2003
Robert Bell; Allen D. Malony; Sameer Shende
This paper presents the design, implementation, and application of ParaProf, a portable, extensible, and scalable tool for parallel performance profile analysis. ParaProf attempts to offer “best of breed” capabilities to performance analysts – those inherited from a rich history of single processor profilers and those being pioneered in parallel tools research. We present ParaProf as a parallel profile analysis framework that can be retargeted and extended as required. ParaProf’s design and operation is discussed, and its novel support for large-scale parallel analysis demonstrated with a 512-processor application profile generated using the TAU performance system.
The Journal of Supercomputing | 2002
Bernd Mohr; Allen D. Malony; Sameer Shende; Felix Wolf
This paper proposes a performance tools interface for OpenMP, similar in spirit to the MPI profiling interface in its intent to define a clear and portable API that makes OpenMP execution events visible to runtime performance tools. We present our design using a source-level instrumentation approach based on OpenMP directive rewriting. Rules to instrument each directive and their combination are applied to generate calls to the interface consistent with directive semantics and to pass context information (e.g., source code locations) in a portable and efficient way. Our proposed OpenMP performance API further allows user functions and arbitrary code regions to be marked and performance measurement to be controlled using new OpenMP directives. To prototype the proposed OpenMP performance interface, we have developed compatible performance libraries for the Expert automatic event trace analyzer [17, 18] and the TAU performance analysis framework [13]. The directive instrumentation transformations we define are implemented in a source-to-source translation tool called OPARI. Application examples are presented for both Expert and TAU to show the OpenMP performance interface and OPARI instrumentation tool in operation. When used together with the MPI profiling interface (as the examples also demonstrate), our proposed approach provides a portable and robust solution to performance analysis of OpenMP and mixed-mode (OpenMP+MPI) applications.
measurement and modeling of computer systems | 1998
Sameer Shende; Allen D. Malony; Janice E. Cuny; Peter H. Beckman; Steve Karmesin; Kathleen Lindlan
1. Abstract Performance measurement of parallel, objectoriented (00) programs requires the development of instrumentation and analysis techniques beyond those used for more traditional languages. Performance events must be redefined for the conceptual 00 programming model, and those events must be instrumented and tracked in the context of 00 language abstractions, compilation methods, and runtime execution dynamics. In this paper, we focus on the profiling and tracing of C++ applications that have been written using a rich parallel programming framework for highperformance, scientific computing. We address issues of class-based profiling, instrumentation of templates, runtime function identification, and polymorphic (type-based) profiling. Our solutions are implemented in the TAU portable profiling package which also provides support for profiling groups and userlevel timers. We demonstrate TAU’s C++ profiling capabilities for real parallel applications, built from components of the ACTS toolkit. Future directions include work on runtime performance data access, dynamic instrumentation, and higher-level performance data analysis and visualization that relates object semantics with performance execution behavior.
conference on high performance computing (supercomputing) | 2000
Kathleen Lindlan; Janice E. Cuny; Allen D. Malony; Sameer Shende; Reid Rivenburgh; Craig Edward Rasmussen; Bernd Mohr
The developers of high-performance scientific applications often work in complex computing environments that place heavy demands on program analysis tools. The developers need tools that interoperate, are portable across machine architectures, and provide source-level feedback. In this paper, we describe a tool framework, the Program Database Toolkit (PDT), that supports the development of program analysis tools meeting these requirements. PDT uses compile-time information to create a complete database of high-level program information that is structured for well-defined and uniform access by tools and applications. PDT’s current applications make heavy use of advanced features of C++, in particular, templates. We describe the toolkit, focussing on its most important contribution -- its handling of templates -- as well as its use in existing applications.
Distributed and parallel systems | 2000
Allen D. Malony; Sameer Shende
The ability of performance technology to keep pace with the growing complexity of parallel and distributed systems will depend on robust performance frameworks that can at once provide system-specific performance capabilities and support high-level performance problem solving. The TAU system is offered as an example framework that meets these requirements. With a flexible, modular instrumentation and measurement system, and an open performance data and analysis environment, TAU can target a range of complex performance scenarios. Examples are given showing the diversity of TAU application.
international conference on parallel processing | 2011
Allen D. Malony; Scott Biersdorff; Sameer Shende; Heike Jagode; Stanimire Tomov; Guido Juckeland; Robert Dietrich; Duncan Poole; Christopher Lamb
The power of GPUs is giving rise to heterogeneous parallel computing, with new demands on programming environments, runtime systems, and tools to deliver high-performing applications. This paper studies the problems associated with performance measurement of heterogeneous machines with GPUs. A heterogeneous computation model and alternative host-GPU measurement approaches are discussed to set the stage for reporting new capabilities for heterogeneous parallel performance measurement in three leading HPC tools: PAPI, Vampir, and the TAU Performance System. Our work leverages the new CUPTI tool support in NVIDIAs CUDA device library. Heterogeneous benchmarks from the SHOC suite are used to demonstrate the measurement methods and tool support.