Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Sadaf R. Alam is active.

Publications


Featured research published by Sadaf R. Alam.


IEEE International Symposium on Workload Characterization | 2006

Characterization of Scientific Workloads on Systems with Multi-Core Processors

Sadaf R. Alam; Richard Frederick Barrett; Jeffery A. Kuehn; Philip C. Roth; Jeffrey S. Vetter

Multi-core processors are planned for virtually all next-generation HPC systems. In a preliminary evaluation of AMD Opteron Dual-Core processor systems, we investigated the scaling behavior of a set of micro-benchmarks, kernels, and applications. In addition, we evaluated a number of processor affinity techniques for managing memory placement on these multi-core systems. We discovered that an appropriate selection of MPI task and memory placement schemes can result in over 25% performance improvement for key scientific calculations. We collected detailed performance data for several large-scale scientific applications, and analyses of the application performance results confirmed our micro-benchmark and scaling results.
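
The MPI task and memory placement study above relies on binding each rank to a specific core so that, with first-touch allocation, its memory lands on the local NUMA node. As a minimal illustrative sketch only (not the paper's actual harness), pinning the calling process to one core on Linux might look like the following; the core-selection rule is an assumption for the example.

```c
/* Minimal sketch: pin the current process to a single core on Linux.
 * Illustrative only; the core-numbering scheme is an assumption. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int core = (argc > 1) ? atoi(argv[1]) : 0;   /* e.g. one core per MPI rank */

    cpu_set_t mask;
    CPU_ZERO(&mask);
    CPU_SET(core, &mask);

    /* Pid 0 means "the calling process". Memory touched afterwards is
     * allocated on the NUMA node local to this core (first-touch policy). */
    if (sched_setaffinity(0, sizeof(mask), &mask) != 0) {
        perror("sched_setaffinity");
        return 1;
    }
    printf("pinned to core %d\n", core);
    /* ... run the scientific kernel here ... */
    return 0;
}
```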


IEEE International Conference on High Performance Computing, Data and Analytics | 2008

Early evaluation of IBM BlueGene/P

Sadaf R. Alam; Richard Frederick Barrett; Michael H. Bast; Mark R. Fahey; Jeffery A. Kuehn; Collin McCurdy; James H. Rogers; Philip C. Roth; Ramanan Sankaran; Jeffrey S. Vetter; Patrick H. Worley; Weikuan Yu

BlueGene/P (BG/P) is the second-generation BlueGene architecture from IBM, succeeding BlueGene/L (BG/L). BG/P is a system-on-a-chip (SoC) design that uses four PowerPC 450 cores operating at 850 MHz with a double-precision, dual-pipe floating point unit per core. These chips are connected with multiple interconnection networks including a 3-D torus, a global collective network, and a global barrier network. The design is intended to provide a highly scalable, physically dense system with relatively low power requirements per flop. In this paper, we report on our examination of BG/P, presented in the context of a set of important scientific applications, and as compared to other major large-scale supercomputers in use today. Our investigation confirms that BG/P has good scalability with an expected lower performance per processor when compared to the Cray XT4's Opteron. We also find that BG/P uses very low power per floating point operation for certain kernels, yet it has less of a power advantage when considering science-driven metrics for mission applications.


IEEE Computer | 2007

Using FPGA Devices to Accelerate Biomolecular Simulations

Sadaf R. Alam; Pratul K. Agarwal; Melissa C. Smith; Jeffrey S. Vetter; David E. Caliga

A field-programmable gate array implementation of a molecular dynamics simulation method reduces the microprocessor time-to-solution by a factor of three while using only high-level languages. The application speedup on FPGA devices increases with the problem size. The authors use a performance model to analyze the potential of simulating large-scale biological systems faster than many cluster-based supercomputing platforms.
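
The observation that FPGA speedup grows with problem size can be reasoned about with a simple offload model: a fixed per-call host-to-FPGA transfer cost is amortized as the accelerated kernel's share of the work grows. The sketch below is illustrative only; the kernel speedup, transfer cost, and timing values are assumed numbers, not measurements or the model from the paper.

```c
/* Illustrative offload-speedup model (assumed parameters, not paper data). */
#include <stdio.h>

/* t_rest: time not accelerated; t_cpu: host time for the offloaded kernel;
 * s: kernel speedup on the FPGA; t_xfer: fixed per-call transfer cost. */
static double overall_speedup(double t_rest, double t_cpu, double s, double t_xfer)
{
    double before = t_rest + t_cpu;
    double after  = t_rest + t_cpu / s + t_xfer;
    return before / after;
}

int main(void)
{
    /* As the problem grows, kernel time grows while the transfer cost stays
     * roughly fixed, so the overall speedup rises toward its asymptotic limit. */
    for (double t_cpu = 1.0; t_cpu <= 1000.0; t_cpu *= 10.0)
        printf("kernel time = %8.1f  overall speedup = %.2f\n", t_cpu,
               overall_speedup(0.1 * t_cpu, t_cpu, 10.0, 5.0));
    return 0;
}
```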


Conference on High Performance Computing (Supercomputing) | 2007

Cray XT4: an early evaluation for petascale scientific simulation

Sadaf R. Alam; Jeffery A. Kuehn; Richard Frederick Barrett; Jeffrey M. Larkin; Mark R. Fahey; Ramanan Sankaran; Patrick H. Worley

The scientific simulation capabilities of next-generation high-end computing technology will depend on striking a balance among memory, processor, I/O, and local and global network performance across the breadth of the scientific simulation space. The Cray XT4 combines commodity AMD dual-core Opteron processor technology with the second generation of Cray's custom communication accelerator in a system design whose balance is claimed to be driven by the demands of scientific simulation. This paper presents an evaluation of the Cray XT4 using micro-benchmarks to develop a controlled understanding of individual system components, providing the context for analyzing and comprehending the performance of several petascale-ready applications. Results gathered from several strategic application domains are compared with observations on the previous-generation Cray XT3 and other high-end computing systems, demonstrating performance improvements across a wide variety of application benchmark problems.
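
Micro-benchmark evaluations of this kind typically start from small probes of a single system component, for example a ping-pong test of point-to-point interconnect bandwidth between two ranks. The following is a minimal sketch of such a probe, not the benchmark suite used in the paper; message size and iteration count are arbitrary example values.

```c
/* Minimal MPI ping-pong bandwidth sketch (run with exactly two ranks).
 * Illustrative only; not the paper's benchmark suite. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int nbytes = 1 << 20;   /* 1 MiB message */
    const int iters  = 100;
    char *buf = malloc(nbytes);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++) {
        if (rank == 0) {
            MPI_Send(buf, nbytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, nbytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, nbytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, nbytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();

    if (rank == 0)   /* 2 messages per iteration cross the link */
        printf("bandwidth: %.1f MB/s\n", 2.0 * iters * nbytes / (t1 - t0) / 1e6);

    free(buf);
    MPI_Finalize();
    return 0;
}
```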


International Parallel and Distributed Processing Symposium | 2006

Early evaluation of the Cray XT3

Jeffrey S. Vetter; Sadaf R. Alam; Thomas H. Dunigan; Mark R. Fahey; Philip C. Roth; Patrick H. Worley

Oak Ridge National Laboratory recently received delivery of a 5,294-processor Cray XT3. The XT3 is Cray's third-generation massively parallel processing system. The system builds on a single-processor node - built around the AMD Opteron - and uses a custom chip - called SeaStar - to provide interprocessor communication. In addition, the system uses a lightweight operating system on the compute nodes. This paper describes our initial experiences with the system, including micro-benchmark, kernel, and application benchmark results. In particular, we provide performance results for strategic Department of Energy application areas including climate and fusion. We demonstrate experiments on the installed system, scaling applications up to 4,096 processors.


International Parallel and Distributed Processing Symposium | 2006

A framework to develop symbolic performance models of parallel applications

Sadaf R. Alam; Jeffrey S. Vetter

Performance and workload modeling has numerous uses at every stage of the high-end computing lifecycle: design, integration, procurement, installation, and tuning. Despite the tremendous usefulness of performance models, their construction remains largely a manual, complex, and time-consuming exercise. We propose a new approach to model construction, called modeling assertions (MA), which borrows advantages from both empirical and analytical modeling techniques. This strategy has many advantages over traditional methods: incremental construction of realistic performance models, straightforward model validation against empirical data, and intuitive error bounding on individual model terms. We demonstrate this new technique on the NAS parallel CG and SP benchmarks by constructing high-fidelity models for the floating-point operation cost, memory requirements, and MPI message volume. These models are driven by a small number of key input parameters, thereby allowing efficient design space exploration of future problem sizes and architectures.
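
To make the idea of a symbolic model concrete, here is a hedged sketch in the spirit of the paper: a resource requirement expressed as a closed-form function of a few input parameters, with the prediction checked against a measured value and flagged if the error exceeds a bound. The expression, parameter values, and tolerance below are illustrative assumptions, not the paper's actual NAS CG model or the MA framework's API.

```c
/* Sketch of a symbolic performance model: a closed-form prediction in terms
 * of input parameters, validated against a measurement. Illustrative only;
 * the expression is an assumed placeholder, not the paper's CG model.
 * Build with -lm. */
#include <math.h>
#include <stdio.h>

/* Assumed model: per-iteration MPI message volume as a function of problem
 * size n, average nonzeros per row nz, and processor count p. */
static double predicted_msg_volume(double n, double nz, double p)
{
    return 8.0 * nz * n / sqrt(p);   /* bytes per iteration (assumption) */
}

int main(void)
{
    double n = 75000, nz = 13, p = 64;
    double predicted = predicted_msg_volume(n, nz, p);
    double measured  = 1.02e6;       /* would come from instrumentation */

    double rel_err = fabs(predicted - measured) / measured;
    printf("predicted = %.3g bytes, measured = %.3g bytes, error = %.1f%%\n",
           predicted, measured, 100.0 * rel_err);

    /* Modeling-assertion style check: flag model terms that drift too far
     * from empirical data so they can be refined. */
    if (rel_err > 0.10)
        printf("assertion violated: model term needs refinement\n");
    return 0;
}
```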


IEEE International Conference on High Performance Computing, Data and Analytics | 2008

An Evaluation of the Oak Ridge National Laboratory Cray XT3

Sadaf R. Alam; Richard Frederick Barrett; Mark R. Fahey; Jeffery A. Kuehn; O. E. Bronson Messer; Richard Tran Mills; Philip C. Roth; Jeffrey S. Vetter; Patrick H. Worley

In 2005, Oak Ridge National Laboratory (ORNL) received delivery of a 5,294-processor Cray XT3. The XT3 is Cray's third-generation massively parallel processing system. The ORNL system uses a single-processor node built around the AMD Opteron and uses a custom chip, called SeaStar, for interprocessor communication. The system uses a lightweight operating system called Catamount on its compute nodes. This paper provides a performance evaluation of the Cray XT3, including measurements for micro-benchmark, kernel, and application benchmarks. In particular, we provide performance results for strategic Department of Energy application areas including climate, biology, astrophysics, combustion, and fusion. Our results, on up to 4,096 processors, demonstrate that the Cray XT3 provides competitive processor performance, high interconnect bandwidth, and high parallel efficiency on a diverse application workload, typical of the DOE Office of Science.


Advances in Computers | 2008

DARPA's HPCS Program: History, Models, Tools, Languages

Jack J. Dongarra; Robert Graybill; William Harrod; Robert F. Lucas; Ewing L. Lusk; Piotr Luszczek; Janice McMahon; Allan Snavely; Jeffrey S. Vetter; Katherine A. Yelick; Sadaf R. Alam; Roy L. Campbell; Laura Carrington; Tzu-Yi Chen; Omid Khalili; Jeremy S. Meredith; Mustafa M. Tikir

The historical context regarding the origin of the DARPA High Productivity Computing Systems (HPCS) program is important for understanding why federal government agencies launched this new, long-term high-performance computing program and renewed their commitment to leadership computing in support of national security, large science, and space requirements at the start of the 21st century. In this chapter, we provide an overview of the context for this work as well as the various procedures undertaken to evaluate the effectiveness of this activity, including modeling the proposed performance of the new machines, evaluating the proposed architectures, understanding the languages used to program these machines, and understanding programmer productivity issues, in order to better prepare for the introduction of these machines in the 2011–2015 timeframe.


International Parallel and Distributed Processing Symposium | 2007

Analysis of a Computational Biology Simulation Technique on Emerging Processing Architectures

Jeremy S. Meredith; Sadaf R. Alam; Jeffrey S. Vetter

Multi-paradigm, multi-threaded, and multi-core computing devices available today provide several orders of magnitude performance improvement over mainstream microprocessors. These devices include the STI Cell Broadband Engine, graphics processing units (GPUs), and the Cray massively multithreaded processors - available in desktop computing systems as well as proposed for supercomputing platforms. The main challenge in utilizing these powerful devices is their unique programming paradigms. GPUs and the Cell systems require code developers to manage code and data explicitly, while the Cray multithreaded architecture requires them to generate a very large number of threads or independent tasks concurrently. In this paper, we explain strategies for optimizing a molecular dynamics (MD) calculation that is used in biomolecular simulations on three devices: the Cell, a GPU, and the MTA-2. We show that the Cray MTA-2 system requires minimal code modification and does not outperform the microprocessor runs, but it demonstrates improved workload scaling behavior over the microprocessor implementation. On the other hand, substantial porting and optimization efforts on the Cell and the GPU systems result in 5x and 6x improvements, respectively, over a 2.2 GHz Opteron system.
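
The part of an MD calculation that dominates runtime, and the natural target for offloading to the Cell or a GPU, is the pairwise nonbonded force loop. A stripped-down reference version is sketched below; it is illustrative only, not the paper's production code, and omits the neighbor lists, cutoffs, and accelerator-specific data layouts a real implementation would use. On the accelerators, it is this inner loop body that gets vectorized or streamed.

```c
/* Stripped-down pairwise nonbonded (Lennard-Jones style) force kernel.
 * Illustrative reference only. Assumes fx, fy, fz are zero-initialized
 * by the caller. */
#include <stddef.h>

void nonbonded_forces(size_t n, const double *x, const double *y, const double *z,
                      double *fx, double *fy, double *fz,
                      double epsilon, double sigma)
{
    double s2 = sigma * sigma;
    double s6 = s2 * s2 * s2;

    for (size_t i = 0; i < n; i++) {
        for (size_t j = i + 1; j < n; j++) {
            double dx = x[i] - x[j], dy = y[i] - y[j], dz = z[i] - z[j];
            double r2 = dx * dx + dy * dy + dz * dz;
            double inv_r2 = 1.0 / r2;
            double inv_r6 = inv_r2 * inv_r2 * inv_r2;

            /* Force magnitude per unit distance:
             * f = 24*eps*(2*(sigma/r)^12 - (sigma/r)^6) / r^2 */
            double f = 24.0 * epsilon * inv_r2 * s6 * inv_r6
                       * (2.0 * s6 * inv_r6 - 1.0);

            fx[i] += f * dx;  fy[i] += f * dy;  fz[i] += f * dz;
            fx[j] -= f * dx;  fy[j] -= f * dy;  fz[j] -= f * dz;
        }
    }
}
```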


ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming | 2006

Performance characterization of molecular dynamics techniques for biomolecular simulations

Sadaf R. Alam; Jeffrey S. Vetter; Pratul K. Agarwal; Al Geist

Large-scale simulations and computational modeling using molecular dynamics (MD) continue to make significant impacts in the field of biology. It is well known that simulating biological events at native time and length scales requires computing power several orders of magnitude beyond today's commonly available systems. Supercomputers, such as the IBM Blue Gene/L and Cray XT3, will soon make tens to hundreds of teraFLOP/s of computing power available by utilizing thousands of processors. The popular algorithms and MD applications, however, were not initially designed to run on thousands of processors. In this paper, we present detailed investigations of the performance issues that are crucial for improving the scalability of MD-related algorithms and applications on massively parallel processing (MPP) architectures. Due to the varying characteristics of biological input problems, we study two prototypical biological complexes that use the MD algorithm: one with an explicit solvent and one with an implicit solvent. In particular, we study the AMBER application, which supports a variety of these types of input problems. For the explicit solvent problem, we focused on the particle mesh Ewald (PME) method for calculating the electrostatic energy, and for the implicit solvent model, we targeted the Generalized Born (GB) calculation. We uncovered and subsequently modified a limitation in AMBER that restricted scaling beyond 128 processors. We collected performance data for experiments on up to 2,048 Blue Gene/L and XT3 processors and identified that the scaling is largely limited by the underlying algorithmic characteristics as well as by the implementation of the algorithms. Furthermore, we found that the input problem size of a biological system is constrained by the memory available per node. In conclusion, our results indicate that MD codes can significantly benefit from the current generation of architectures with relatively modest optimization efforts. Nevertheless, the key to enabling scientific breakthroughs lies in exploiting the full potential of these new architectures.

Collaboration


Dive into Sadaf R. Alam's collaborations.

Top Co-Authors

Jeffrey S. Vetter, Oak Ridge National Laboratory
Pratul K. Agarwal, Oak Ridge National Laboratory
Jeffery A. Kuehn, Oak Ridge National Laboratory
Mark R. Fahey, Oak Ridge National Laboratory
Patrick H. Worley, Oak Ridge National Laboratory
Philip C. Roth, Oak Ridge National Laboratory
Scott S. Hampton, Oak Ridge National Laboratory
Stephen W. Poole, Oak Ridge National Laboratory