Network


Latest external collaboration at the country level.

Hotspot


Dive into the research topics where Johnny Chang is active.

Publication


Featured research published by Johnny Chang.


International Parallel and Distributed Processing Symposium | 2010

Performance impact of resource contention in multicore systems

Robert Hood; Haoqiang Jin; Piyush Mehrotra; Johnny Chang; M. Jahed Djomehri; Sharad Gavali; Dennis C. Jespersen; Kenichi Taylor; Rupak Biswas

Resource sharing in commodity multicore processors can have a significant impact on the performance of production applications. In this paper we use a differential performance analysis methodology to quantify the costs of contention for resources in the memory hierarchy of several multicore processors used in high-end computers. In particular, by comparing runs that bind MPI processes to cores in different patterns, we can isolate the effects of resource sharing. We use this methodology to measure how such sharing affects the performance of four applications of interest to NASA: OVERFLOW, MITgcm, Cart3D, and NCC. We also use a subset of the HPCC benchmarks and hardware counter data to help interpret and validate our findings. We conduct our study on high-end computing platforms that use four different quad-core microprocessors: Intel Clovertown, Intel Harpertown, AMD Barcelona, and Intel Nehalem-EP. The results help further our understanding of the requirements these codes place on their production environments and of each computer's ability to deliver performance.
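
The core-binding comparison described in the abstract can be illustrated with a small MPI probe. The sketch below is our illustration, not code from the paper; the file name report_binding.c is assumed. It simply prints the host and core each rank lands on, so that packed versus spread binding patterns can be verified before timing runs.

/* report_binding.c: each MPI rank reports the core it is currently running
 * on. Assumes a Linux system and an MPI implementation such as MPICH or
 * Open MPI. */
#define _GNU_SOURCE
#include <mpi.h>
#include <sched.h>      /* sched_getcpu() */
#include <stdio.h>
#include <unistd.h>     /* gethostname() */

int main(int argc, char **argv)
{
    int rank, size;
    char host[256];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    gethostname(host, sizeof(host));

    /* sched_getcpu() returns the core this rank is executing on right now. */
    printf("rank %d of %d on %s, core %d\n", rank, size, host, sched_getcpu());

    MPI_Finalize();
    return 0;
}

With Open MPI, for example, launching this as mpiexec --bind-to core -n 8 ./report_binding shows where the ranks are placed; changing the binding options reproduces the different placement patterns being compared.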


International Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems | 2013

Performance Evaluation of the Intel Sandy Bridge Based NASA Pleiades Using Scientific and Engineering Applications

Subhash Saini; Johnny Chang; Haoqiang Jin

We present a performance evaluation of Pleiades based on the Intel Xeon E5-2670 processor, a fourth-generation eight-core Sandy Bridge architecture, and compare it with the previous, third-generation Nehalem architecture. Several architectural features have been incorporated in Sandy Bridge: (a) four memory channels as opposed to three in Nehalem; (b) memory speed increased from 1333 MHz to 1600 MHz; (c) a ring interconnect that links the on-chip L3 cache with the cores, system agent, memory controller, and the QPI agent and I/O controller, to increase scalability; (d) a new AVX unit with wider, 256-bit vector registers; (e) integration of PCI Express 3.0 controllers into the on-chip I/O subsystem; (f) the new Turbo Boost version 2.0, which can raise the processor frequency from its 2.6 GHz base up to 3.2 GHz; and (g) a QPI link rate increased from 6.4 to 8 GT/s, with two QPI links to the second socket. We critically evaluate these new features using several low-level benchmarks and four full-scale scientific and engineering applications.
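
As a small illustration of feature (d): the 256-bit AVX registers hold four doubles per vector operation. The toy example below is ours, not from the paper; it assumes an AVX-capable compiler (e.g. gcc -mavx) and simply adds two vectors of four doubles.

/* Four doubles (4 x 64 bit = 256 bit) per AVX register. */
#include <immintrin.h>
#include <stdio.h>

int main(void)
{
    __m256d a = _mm256_set_pd(4.0, 3.0, 2.0, 1.0);
    __m256d b = _mm256_set_pd(40.0, 30.0, 20.0, 10.0);
    __m256d c = _mm256_add_pd(a, b);   /* one instruction, four additions */

    double out[4];
    _mm256_storeu_pd(out, c);
    printf("%f %f %f %f\n", out[0], out[1], out[2], out[3]);
    return 0;
}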


IEEE International Conference on High Performance Computing, Data, and Analytics | 2012

An Application-based Performance Evaluation of NASA's Nebula Cloud Computing Platform

Subhash Saini; Steve Heistand; Haoqiang Jin; Johnny Chang; Robert Hood; Piyush Mehrotra; Rupak Biswas

The high performance computing (HPC) community has shown tremendous interest in exploring cloud computing because of its potential benefits. In this paper, we examine the feasibility, performance, and scalability of production-quality scientific and engineering applications of interest to NASA on NASA's cloud computing platform, called Nebula, hosted at Ames Research Center. This work represents a comprehensive evaluation of Nebula using NUTTCP, HPCC, NPB, I/O, and MPI function benchmarks as well as four applications representative of the NASA HPC workload. Specifically, we compare Nebula performance on some of these benchmarks and applications to that of NASA's Pleiades supercomputer, a traditional HPC system. We also investigate the impact of virtIO and jumbo frames on interconnect performance. Overall, the results indicate that on Nebula (i) virtIO and jumbo frames improve network bandwidth by a factor of 5x, (ii) there is a significant virtualization-layer overhead of about 10% to 25%, (iii) write performance is lower by a factor of 25x, (iv) latency for short MPI messages is very high, and (v) overall performance is 15% to 48% lower than that of Pleiades for NASA HPC applications. We also comment on the usability of the cloud platform.
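
One practical detail behind finding (i) is whether a guest's network interface is actually running with a jumbo-frame MTU. The sketch below is our illustration, not part of the study's benchmarks; it assumes a Linux guest and an interface named eth0, and reads the MTU with the standard SIOCGIFMTU ioctl (roughly 9000 for jumbo frames versus the conventional 1500-byte Ethernet default).

#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <net/if.h>
#include <unistd.h>

int main(void)
{
    struct ifreq ifr;
    int fd = socket(AF_INET, SOCK_DGRAM, 0);   /* any socket works for the query */
    if (fd < 0) { perror("socket"); return 1; }

    memset(&ifr, 0, sizeof(ifr));
    strncpy(ifr.ifr_name, "eth0", IFNAMSIZ - 1);   /* interface name is an assumption */

    if (ioctl(fd, SIOCGIFMTU, &ifr) < 0) { perror("SIOCGIFMTU"); close(fd); return 1; }
    printf("eth0 MTU = %d\n", ifr.ifr_mtu);

    close(fd);
    return 0;
}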


International Parallel and Distributed Processing Symposium | 2015

Early Multi-node Performance Evaluation of a Knights Corner (KNC) Based NASA Supercomputer

Subhash Saini; Haoqiang Jin; Dennis C. Jespersen; Samson Cheung; M. Jahed Djomehri; Johnny Chang; Robert Hood

We have conducted a performance evaluation of a dual-rail Fourteen Data Rate (FDR) InfiniBand (IB) connected cluster, where each node has two Intel Xeon E5-2670 (Sandy Bridge) processors and two Intel Xeon Phi coprocessors. The Xeon Phi, based on the Many Integrated Core (MIC) architecture, is of the Knights Corner (KNC) generation. We used several types of benchmarks for the study. We ran the MPI and multi-zone versions of the NAS Parallel Benchmarks (NPB), both original and optimized for the Xeon Phi. Among the full-scale benchmarks, we ran two versions of WRF, including one optimized for the MIC, using a 12 km Continental U.S. (CONUS) data set. We also used original and optimized versions of OVERFLOW, run with four different datasets to understand scaling in symmetric mode and related load-balancing issues. We present performance for the four modes of using the host + MIC combination: native host, native MIC, offload, and symmetric. We also discuss the optimization techniques used in adapting two of the NPBs for offload mode, as well as WRF and OVERFLOW. WRF 3.4 optimized for MIC runs 47% faster than the original NCAR WRF 3.4. The optimized version of OVERFLOW runs 18% faster on the host, and the load-balancing strategy used improves performance on the MIC by 5% to 36%, depending on the data size. In addition, we discuss issues related to offload mode and load balancing in symmetric mode.
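
Of the four usage modes, offload mode is the one that requires explicit source-level markup. The fragment below is a minimal sketch of the Intel compiler's offload pragma for KNC, not code from the paper's optimized NPB, WRF, or OVERFLOW versions; it assumes the Intel C compiler and a node with at least one coprocessor (mic:0).

#include <stdio.h>

#define N 1024

int main(void)
{
    double a[N], b[N], c[N];
    for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2.0 * i; }

    /* The marked region runs on the coprocessor; a and b are copied in and
     * c is copied back when the region finishes. */
    #pragma offload target(mic:0) in(a, b) out(c)
    {
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            c[i] = a[i] + b[i];
    }

    printf("c[N-1] = %f\n", c[N - 1]);
    return 0;
}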


High Performance Computing and Communications | 2016

Performance Evaluation of an Intel Haswell- and Ivy Bridge-Based Supercomputer Using Scientific and Engineering Applications

Subhash Saini; Robert T. Hood; Johnny Chang; John Baron

We present a performance evaluation, conducted on a production supercomputer, of the Intel Xeon Processor E5-2680v3, a twelve-core implementation of the fourth-generation Haswell architecture, and compare it with the Intel Xeon Processor E5-2680v2, an Ivy Bridge implementation of the third-generation Sandy Bridge architecture. Several new architectural features have been incorporated in Haswell, including improvements at all levels of the memory hierarchy as well as improvements to vector instructions and power management. We critically evaluate these new features of Haswell and compare them with Ivy Bridge using several low-level benchmarks, including a subset of HPCC and HPCG, and four full-scale scientific and engineering applications. We also present a model that predicts the performance of HPCG and Cart3D to within 5%, and of Overflow to within 10%.
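
The paper's prediction model is not reproduced here. As a generic point of reference only, a roofline-style bound takes the predicted time as the larger of a compute-bound and a memory-bandwidth-bound estimate; the numbers in the sketch below are placeholders, not measured Haswell or Ivy Bridge values.

#include <stdio.h>

int main(void)
{
    /* Placeholder machine parameters (per node), NOT measured values. */
    double peak_gflops = 500.0;   /* double-precision peak, GFLOP/s */
    double peak_bw_gbs = 60.0;    /* sustained memory bandwidth, GB/s */

    /* Placeholder kernel characteristics. */
    double gflops = 120.0;        /* total GFLOPs executed */
    double gbytes = 300.0;        /* total data moved from memory, GB */

    double t_compute = gflops / peak_gflops;
    double t_memory  = gbytes / peak_bw_gbs;
    double t_pred    = (t_compute > t_memory) ? t_compute : t_memory;

    printf("compute bound: %.2f s, memory bound: %.2f s, predicted: %.2f s\n",
           t_compute, t_memory, t_pred);
    return 0;
}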


IEEE International Conference on High Performance Computing, Data, and Analytics | 2008

Benchmarking the Columbia Supercluster

Robert T. Hood; Rupak Biswas; Johnny Chang; M. Jahed Djomehri; Haoqiang Jin

Columbia, NASA's 10,240-processor supercluster, has been ranked as one of the fastest computers in the world since November 2004. In this paper we examine the performance characteristics of its production subclusters, which are typically configurations ranging in size from 512 to 2048 processors. We evaluate floating-point performance, memory bandwidth, and message-passing communication speeds using a subset of the HPC Challenge benchmarks, the NAS Parallel Benchmarks, and a computational fluid dynamics application. Our experimental results quantify the performance improvement resulting from changes in interconnect bandwidth, processor speed, and cache size across the different types of SGI Altix 3700s that constitute Columbia. We also report on experiments that investigate the performance impact of processors sharing a path to memory. Finally, our tests of the different interconnect fabrics available indicate substantial promise for scaling applications to run on configurations of more than 512 CPUs.
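
The memory-bandwidth side of such measurements is commonly probed with a STREAM-triad-style kernel. The sketch below is our illustration, not the HPCC or NPB code used in the paper; running one copy on each processor that shares a path to memory, versus one per memory path, exposes the sharing effect the abstract mentions.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1 << 26)   /* ~67M doubles per array, about 512 MB each */

int main(void)
{
    double *a = malloc(N * sizeof(double));
    double *b = malloc(N * sizeof(double));
    double *c = malloc(N * sizeof(double));
    if (!a || !b || !c) { fprintf(stderr, "allocation failed\n"); return 1; }

    for (long i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < N; i++)
        a[i] = b[i] + 3.0 * c[i];          /* triad: three arrays streamed */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    double gb  = 3.0 * N * sizeof(double) / 1e9;
    printf("triad: %.2f GB in %.3f s -> %.2f GB/s (a[0]=%f)\n",
           gb, sec, gb / sec, a[0]);

    free(a); free(b); free(c);
    return 0;
}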


IEEE International Conference on High Performance Computing, Data, and Analytics | 2012

I/O performance characterization of Lustre and NASA applications on Pleiades

Subhash Saini; Jason Rappleye; Johnny Chang; David Barker; Piyush Mehrotra; Rupak Biswas


Archive | 2009

Characterizing Application Performance Sensitivity to Resource Contention in Multicore Architectures

Haoqiang Jin; Robert T. Hood; Johnny Chang; Jahed Djomehri; Dennis C. Jespersen; Kenichi Taylor; Rupak Biswas; Piyush Mehrotra


Computational Methods in Science and Technology | 2006

A Scalability Study of Columbia Using the NAS Parallel Benchmarks

Subhash Saini; Johnny Chang; Robert Hood; Haoqiang Jin


Archive | 2002

Using Modules with MPICH-G2 (and "Loose Ends")

Johnny Chang; William W. Thigpen

Collaboration


Dive into Johnny Chang's collaboration.
