Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Haoqiang Jin is active.

Publication


Featured research published by Haoqiang Jin.


Parallel Computing | 2011

High performance computing using MPI and OpenMP on multi-core parallel systems

Haoqiang Jin; Dennis C. Jespersen; Piyush Mehrotra; Rupak Biswas; Lei Huang; Barbara M. Chapman

The rapidly increasing number of cores in modern microprocessors is pushing current high-performance computing (HPC) systems into the petascale and exascale era. The hybrid nature of these systems - distributed memory across nodes and shared memory with non-uniform memory access within each node - poses a challenge to application developers. In this paper, we study a hybrid approach to programming such systems: a combination of two traditional programming models, MPI and OpenMP. We present the performance of standard benchmarks from the multi-zone NAS Parallel Benchmarks and two full applications using this approach on several multi-core based systems, including an SGI Altix 4700, an IBM p575+ and an SGI Altix ICE 8200EX. We also present new data locality extensions to OpenMP to better match the hierarchical memory structure of multi-core architectures.
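The hybrid approach can be illustrated with a minimal sketch, assuming a trivial placeholder computation (this is not code from the paper): MPI ranks cover the distributed-memory level across nodes, while an OpenMP parallel region uses the shared memory within each node.

    /* Minimal hybrid MPI+OpenMP sketch (illustrative only, not the paper's code).
     * MPI distributes work across nodes; OpenMP threads share memory within a node.
     * Build, for example, with: mpicc -fopenmp hybrid.c -o hybrid */
    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    #define N 1000000   /* placeholder problem size */

    static double a[N];

    int main(int argc, char **argv)
    {
        int rank, nprocs;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        /* Each MPI rank owns a contiguous chunk; OpenMP threads split it further. */
        int chunk = N / nprocs;
        int lo = rank * chunk;
        int hi = (rank == nprocs - 1) ? N : lo + chunk;

        double local_sum = 0.0;
        #pragma omp parallel for reduction(+:local_sum)
        for (int i = lo; i < hi; i++) {
            a[i] = 0.5 * i;          /* placeholder computation */
            local_sum += a[i];
        }

        /* Combine the per-node partial results with MPI. */
        double global_sum = 0.0;
        MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("sum = %g (%d ranks x %d threads)\n",
                   global_sum, nprocs, omp_get_max_threads());

        MPI_Finalize();
        return 0;
    }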


Scientific Cloud Computing | 2012

Performance evaluation of Amazon EC2 for NASA HPC applications

Piyush Mehrotra; Jahed Djomehri; Steve Heistand; Robert Hood; Haoqiang Jin; Arthur Lazanoff; Subhash Saini; Rupak Biswas

Cloud computing environments are now widely available and are being increasingly utilized for technical computing. They are also being touted for high-performance computing (HPC) applications in science and engineering. For example, Amazon EC2 offers a specialized Cluster Compute instance to run HPC applications. In this paper, we compare the performance characteristics of Amazon EC2 HPC instances to those of NASA's Pleiades supercomputer, an SGI ICE cluster. For this study, we utilized the HPCC kernels and the NAS Parallel Benchmarks, along with four full-scale applications from the repertoire of codes used by NASA scientists and engineers. We compare the total runtime of these codes for varying numbers of cores. We also break out the computation and communication times for a subset of these applications to explore the effect of interconnect differences between the two systems. In general, the single-node performance of the two platforms is equivalent. However, when scaling to larger core counts, the performance of the EC2 HPC instances generally lags that of Pleiades for most of the codes, due to the weaker network performance of the former. In addition to analyzing application performance, we also briefly touch upon the overhead due to virtualization and the usability of cloud environments such as Amazon EC2.


International Parallel and Distributed Processing Symposium | 2004

Performance characteristics of the multi-zone NAS parallel benchmarks

Haoqiang Jin; R.F. Van der Wijngaart

We describe a new suite of computational benchmarks that models applications featuring multiple levels of parallelism. Such parallelism is often available in realistic flow computations on systems of meshes, but had not previously been captured in benchmarks. The new suite, named NPB (NAS Parallel Benchmarks) Multi-Zone, is extended from the NPB suite and involves solving the application benchmarks LU, BT and SP on collections of loosely coupled discretization meshes. The solutions on the meshes are updated independently, but after each time step they exchange boundary value information. This strategy provides relatively easily exploitable coarse-grain parallelism between meshes. Three reference implementations are available: one serial, one hybrid using the Message Passing Interface (MPI) and OpenMP, and another hybrid using a shared-memory multilevel programming model (SMP+OpenMP). We examine the effectiveness of hybrid parallelization paradigms in these implementations on three different parallel computers. We also use an empirical formula to investigate the performance characteristics of the hybrid parallel codes.
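The zone-level strategy can be sketched as follows; this is a toy illustration with placeholder data and helper names, not the NPB-MZ source. Each zone is advanced independently within a time step, and boundary values are exchanged between neighboring zones afterwards.

    /* Toy sketch of the multi-zone update strategy (placeholder names and data,
     * not actual NPB-MZ code). */
    #include <stdio.h>

    #define NZONES 4
    #define NCELLS 16

    typedef struct {
        double cells[NCELLS];    /* placeholder for a zone's mesh data */
    } Zone;

    /* Placeholder per-zone solver step (stands in for an LU, BT or SP update). */
    static void advance_zone(Zone *z)
    {
        for (int i = 0; i < NCELLS; i++)
            z->cells[i] += 1.0;
    }

    /* Placeholder boundary exchange between adjacent zones; in the hybrid
     * implementations this is where MPI messages would be sent. */
    static void exchange_boundaries(Zone *zones, int nzones)
    {
        for (int z = 0; z + 1 < nzones; z++) {
            double left  = zones[z].cells[NCELLS - 1];
            double right = zones[z + 1].cells[0];
            zones[z + 1].cells[0] = left;
            zones[z].cells[NCELLS - 1] = right;
        }
    }

    int main(void)
    {
        Zone zones[NZONES] = {0};

        for (int step = 0; step < 10; step++) {
            /* Coarse-grain parallelism: zones are updated independently, so
             * OpenMP threads (or MPI ranks) can be assigned across zones. */
            #pragma omp parallel for
            for (int z = 0; z < NZONES; z++)
                advance_zone(&zones[z]);

            /* After each time step, zones exchange boundary values. */
            exchange_boundaries(zones, NZONES);
        }

        printf("zone 0 edge value after 10 steps: %g\n", zones[0].cells[NCELLS - 1]);
        return 0;
    }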


IEEE International Conference on High Performance Computing, Data, and Analytics | 2000

Automatic Generation of OpenMP Directives and Its Application to Computational Fluid Dynamics Codes

Haoqiang Jin; Michael Frumkin; Jerry C. Yan

The shared-memory programming model is a very effective way to achieve parallelism on shared-memory parallel computers. With the great progress made in hardware and software technologies, the performance of parallel programs using compiler directives has improved substantially. The introduction of OpenMP directives, the industry standard for shared-memory programming, has minimized the issue of portability. In this study, we have extended CAPTools, a computer-aided parallelization toolkit, to automatically generate OpenMP-based parallel programs with nominal user assistance. We outline techniques used in the implementation of the tool and discuss its application to the NAS Parallel Benchmarks and several computational fluid dynamics codes. This work demonstrates the great potential of using the tool to quickly port parallel programs and achieve good performance that exceeds that of some commercial tools.
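To give a sense of what directive-based output from such a tool looks like, here is a hand-written example (not actual CAPTools output) of a serial relaxation-style loop nest annotated with an OpenMP directive; the array names and stencil are placeholders.

    /* Hand-written illustration of directive-based parallelization (not actual
     * CAPTools output): the outer loop carries no dependence, so it can be
     * distributed across OpenMP threads. */
    #include <stdio.h>

    #define NX 256
    #define NY 256

    static double u[NX][NY], unew[NX][NY];

    int main(void)
    {
        /* Loop indices are private to each thread; each row of unew is written
         * by exactly one thread, so no synchronization is needed in the loop. */
        #pragma omp parallel for
        for (int i = 1; i < NX - 1; i++)
            for (int j = 1; j < NY - 1; j++)
                unew[i][j] = 0.25 * (u[i - 1][j] + u[i + 1][j] +
                                     u[i][j - 1] + u[i][j + 1]);

        printf("unew[1][1] = %g\n", unew[1][1]);
        return 0;
    }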


Conference on High Performance Computing (Supercomputing) | 2005

An Application-Based Performance Characterization of the Columbia Supercluster

Rupak Biswas; M. Jahed Djomehri; Robert Hood; Haoqiang Jin; Cetin Kiris; Subhash Saini

Columbia is a 10,240-processor supercluster consisting of 20 Altix nodes with 512 processors each, and is currently ranked as one of the fastest computers in the world. In this paper, we present the performance characteristics of Columbia obtained on up to four computing nodes interconnected via the InfiniBand and/or NUMAlink4 communication fabrics. We evaluate floating-point performance, memory bandwidth, message-passing communication speeds, and compilers using a subset of the HPC Challenge benchmarks and some of the NAS Parallel Benchmarks, including the multi-zone versions. We present detailed performance results for three scientific applications of interest to NASA: one from molecular dynamics and two from computational fluid dynamics. Our results show that both the NUMAlink4 and InfiniBand interconnects hold promise for multi-node application scaling to at least 2048 processors.


International Parallel and Distributed Processing Symposium | 2010

Performance impact of resource contention in multicore systems

Robert Hood; Haoqiang Jin; Piyush Mehrotra; Johnny Chang; M. Jahed Djomehri; Sharad Gavali; Dennis C. Jespersen; Kenichi Taylor; Rupak Biswas

Resource sharing in commodity multicore processors can have a significant impact on the performance of production applications. In this paper we use a differential performance analysis methodology to quantify the costs of contention for resources in the memory hierarchy of several multicore processors used in high-end computers. In particular, by comparing runs that bind MPI processes to cores in different patterns, we can isolate the effects of resource sharing. We use this methodology to measure how such sharing affects the performance of four applications of interest to NASA: OVERFLOW, MITgcm, Cart3D, and NCC. We also use a subset of the HPCC benchmarks and hardware counter data to help interpret and validate our findings. We conduct our study on high-end computing platforms that use four different quad-core microprocessors: Intel Clovertown, Intel Harpertown, AMD Barcelona, and Intel Nehalem-EP. The results help further our understanding of the requirements these codes place on their production environments and of each computer's ability to deliver performance.
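The core of the differential methodology is controlling which cores each process runs on, so that a given resource (a shared cache, a socket's memory controller) is either shared between processes or not. Below is a minimal Linux illustration of pinning a process to a chosen core; it is a generic sketch, not the paper's tooling or binding patterns.

    /* Minimal Linux example of binding the current process to one core
     * (a generic illustration of core pinning, not the paper's tooling).
     * Build with: gcc pin.c -o pin */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>

    int main(void)
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(0, &set);    /* pin to core 0; varying the core ids used by a
                                pair of processes makes them share, or not
                                share, a cache or socket */

        if (sched_setaffinity(0, sizeof(set), &set) != 0) {
            perror("sched_setaffinity");
            return 1;
        }

        printf("running on core %d\n", sched_getcpu());
        /* ... the measured application work would run here ... */
        return 0;
    }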


IEEE International Conference on High Performance Computing, Data, and Analytics | 2008

Scientific application-based performance comparison of SGI Altix 4700, IBM POWER5+, and SGI ICE 8200 supercomputers

Subhash Saini; Dale Talcott; Dennis C. Jespersen; M. Jahed Djomehri; Haoqiang Jin; Rupak Biswas

The suitability of next-generation high-performance computing systems for petascale simulations will depend on various performance factors attributable to processor, memory, local and global network, and input/output characteristics. In this paper, we evaluate the performance of the new dual-core SGI Altix 4700, quad-core SGI Altix ICE 8200, and dual-core IBM POWER5+ systems. To measure performance, we used micro-benchmarks from the High Performance Computing Challenge (HPCC) suite, the NAS Parallel Benchmarks (NPB), and four real-world applications: three from computational fluid dynamics (CFD) and one from climate modeling. We used the micro-benchmarks to develop a controlled understanding of individual system components, then analyzed and interpreted the performance of the NPBs and applications. We also explored the hybrid programming model (MPI+OpenMP) using the multi-zone NPBs and the CFD application OVERFLOW-2. Achievable application performance is compared across the systems. For the ICE platform, we also investigated the effect of memory bandwidth on performance by testing 1, 2, 4, and 8 cores per node.


International Parallel and Distributed Processing Symposium | 2003

Performance analysis of multilevel parallel applications on shared memory architectures

Gabriele Jost; Haoqiang Jin; Jesús Labarta; Judit Gimenez; Jordi Caubet

In this paper we describe how to apply powerful performance analysis techniques to understand the behavior of multilevel parallel applications. We use the Paraver/OMPItrace performance analysis system for our study. This system consists of two major components: the OMPItrace dynamic instrumentation mechanism, which allows the tracing of processes and threads, and the Paraver graphical user interface for inspecting and analyzing the generated traces. We apply the system to conduct a detailed comparative study of a benchmark code implemented in five different programming paradigms applicable to shared-memory computer architectures.


Scientific Programming | 2010

Enabling locality-aware computations in OpenMP

Lei Huang; Haoqiang Jin; Liqi Yi; Barbara M. Chapman

Locality of computation is key to obtaining high performance on a broad variety of parallel architectures and applications. It is moreover an essential component of strategies for energy-efficient computing. OpenMP is a widely available industry standard for shared memory programming. With the pervasive deployment of multi-core computers and the steady growth in core count, a productive programming model such as OpenMP is increasingly expected to play an important role in adapting applications to this new hardware. However, OpenMP does not provide the programmer with explicit means to program for locality; rather, it presents the user with a “flat” memory model. In this paper, we discuss the need for explicit programmer control of locality within the context of OpenMP and present some ideas on how this might be accomplished. We describe potential extensions to OpenMP that would enable the user to manage a program's data layout and to align tasks and data in order to minimize the cost of data accesses. We give examples showing the intended use of the proposed features, describe our current implementation and present some experimental results. Our hope is that this work will lead to efforts that help OpenMP become a major player on emerging multi- and many-core architectures.
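For context, the flat memory model of standard OpenMP means that locality on NUMA systems is today usually managed only indirectly, most commonly through first-touch page placement. The sketch below shows that common workaround under plain OpenMP; it does not use the extensions proposed in the paper, and the size and computation are placeholders.

    /* First-touch sketch under standard OpenMP (a common NUMA workaround,
     * not the locality extensions proposed in the paper). Pages are placed
     * on the NUMA node of the thread that first writes them. */
    #include <omp.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define N 10000000L    /* placeholder size */

    int main(void)
    {
        double *a = malloc(N * sizeof *a);
        if (!a) return 1;

        /* First touch: each thread initializes (and thereby places) its chunk. */
        #pragma omp parallel for schedule(static)
        for (long i = 0; i < N; i++)
            a[i] = 1.0;

        /* Compute phase with the same static schedule, so threads mostly
         * access memory on their own NUMA node. */
        double sum = 0.0;
        #pragma omp parallel for schedule(static) reduction(+:sum)
        for (long i = 0; i < N; i++)
            sum += a[i];

        printf("sum = %g (threads = %d)\n", sum, omp_get_max_threads());
        free(a);
        return 0;
    }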


IEEE International Conference on High Performance Computing, Data, and Analytics | 2011

The impact of hyper-threading on processor resource utilization in production applications

Subhash Saini; Haoqiang Jin; Robert Hood; David Barker; Piyush Mehrotra; Rupak Biswas

Intel provides Hyper-Threading (HT) in processors based on its Pentium and Nehalem micro-architectures, such as the Westmere-EP. HT enables two threads to execute on each core in order to hide latencies related to data access. These two threads can execute simultaneously, filling unused stages in the functional unit pipelines. To aid better understanding of HT-related issues, we collect Performance Monitoring Unit (PMU) data (instructions retired, unhalted core cycles, L2 and L3 cache hits and misses, vector and scalar floating-point operations, etc.). We then use the PMU data to calculate a new efficiency metric in order to quantify processor resource utilization and to compare that utilization between single-threading (ST) and HT modes. We also study the performance gain measured via unhalted core cycles, the code's efficiency in using the processor's vector units, and the impact of HT mode on shared resources such as the L2 and L3 caches. Results from four full-scale, production-quality scientific applications from computational fluid dynamics (CFD) used by NASA scientists indicate that HT generally improves processor resource utilization efficiency, but does not necessarily translate into overall application performance gain.
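As a simple, hedged illustration of the kind of arithmetic such PMU data supports (not the efficiency metric actually defined in the paper), one can compare instructions retired per unhalted core cycle between ST and HT runs; the counter values below are made-up placeholders.

    /* Generic comparison of instructions retired per unhalted core cycle in
     * ST vs HT mode. The counter values are made-up placeholders, and this
     * ratio is only an illustration, not the paper's efficiency metric. */
    #include <stdio.h>

    int main(void)
    {
        /* Placeholder aggregated PMU counts for the same workload. */
        double st_instr = 4.0e12, st_cycles = 2.5e12;   /* single-threading run */
        double ht_instr = 4.0e12, ht_cycles = 3.0e12;   /* hyper-threading run  */

        double st_ipc = st_instr / st_cycles;
        double ht_ipc = ht_instr / ht_cycles;

        printf("ST: %.2f instructions per unhalted cycle\n", st_ipc);
        printf("HT: %.2f instructions per unhalted cycle\n", ht_ipc);
        printf("HT/ST ratio: %.2f\n", ht_ipc / st_ipc);
        return 0;
    }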

Collaboration


Dive into Haoqiang Jin's collaborations.
