Jenwei Hsieh | Researchain

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Jenwei Hsieh is active.

Explore More

Publication

Featured researches published by Jenwei Hsieh.

conference on high performance computing (supercomputing) | 2000

Architectural and Performance Evaluation of GigaNet and Myrinet Interconnects on Clusters of Small-Scale SMP Servers

Jenwei Hsieh; Tau Leng; Victor Mashayekhi; Reza Rooholamini

GigaNet and Myrinet are two of the leading interconnects for clusters of commodity computer systems. Both provide memory-protected user-level network interface access, and deliver low-latency and high-bandwidth communication to applications. GigaNet is a connection-oriented interconnect based on a hardware implementation of Virtual Interface (VI) Architecture and Asynchronous Transfer Mode (ATM) technologies. Myrinet is a connection-less interconnect which leverages packet switching technologies from experimental Massively Parallel Processors (MPP) networks. This paper investigates their architectural differences and evaluates their performance on two commodity clusters based on two generations of Symmetric Multiple Processors (SMP) servers. The performance measurements reported here suggest that the implementation of Message Passing Interface (MPI) significantly affects the cluster performance. Although MPICH-GM over Myrinet demonstrates lower latency with small messages, the polling-driven implementation of MPICH-GM often leads to tight synchronization between communication processes and higher CPU overhead.

international parallel and distributed processing symposium | 2005

Performance implications of virtualization and hyper-threading on high energy physics applications in a grid environment

Laura Gilbert; Jeff Tseng; Rhys A. Newman; Saeed Iqbal; Ronald Pepper; Onur Celebioglu; Jenwei Hsieh; Mark Cobban

The simulations used in the field of high energy physics are compute intensive and exhibit a high level of data parallelism. These features make such simulations ideal candidates for grid computing. We are taking as an example the GEANT4 detector simulation used for physics studies within the ATLAS experiment at CERN. One key issue in grid computing is that of network and system security, which can potentially inhibit the wide spread use of such simulations. Virtualization provides a feasible solution because it allows the creation of virtual compute nodes in both local and remote compute clusters, thus providing an insulating layer which can play an important role in satisfying the security concerns of all parties involved. However, it has performance implications. This study provides quantitative estimates of the virtualization and hyper-threading overhead for GEANT on commodity clusters. Results show that virtualization has less than 15% run-time overhead, and that the best run time (with the non-SMP licence of ESX VMware) is achieved by using one virtual machine per CPU. We also observe that hyper-threading does not provide an advantage in this application. Finally, the effect of virtualization on run-time, throughput, mean response time and utilization is estimated using simulations.

international parallel and distributed processing symposium | 2004

The performance impact of computational efficiency on HPC clusters with hyper-threading technology

Onur Celebioglu; Amina Saify; Tau Leng; Jenwei Hsieh; Victor Mashayekhi; Reza Rooholamini

Summary form only given. The effect of Intel/spl reg/ hyper-threading (HT) technology on a systems performance varies according to the characteristics of the application running on the system and the configuration of the system. High performance computing (HPC) clusters introduce additional variables, such as the math libraries used in solving linear algebra equations that can affect the performance of scientific and engineering applications in particular. We study the effect of HT on MPI-based applications by varying the math library. We configure an Intel-based HPC cluster and used the High Performance Linpack (HPL) benchmark to study the performance characteristics without changing hardware or communication parameters, but by linking the application with different math libraries. What we have found is that, even though the application or hardware parameters remain the same, HT may help or hinder the overall performance of the cluster depending on the computational efficiency of the mathematical functions.

international parallel and distributed processing symposium | 2002

Performance impact of process mapping on small-scale SMP clusters -a case study using high performance linpack

Tau Leng; Rizwan Ali; Jenwei Hsieh; Victor Mashayekhi; Reza Rooholamini

Typically, a High Performance Computing (HPC) cluster loosely couples multiple Symmetric Multi-Processor (SMP) platforms into a single processing complex. Each SMP uses shared memory for its processors to communicate, whereas communication across SMPs goes through the intra-cluster interconnect. By analyzing the communication pattern of processes, it is possible to arrive at a mapping of processes to processors to ensure optimal communication paths for critical traffic. This critical traffic refers to the communication pattern of the program, which can be characterized by either frequency or size (or both) of the messages. To find an ideal mapping, it is imperative to understand the communication characteristics of the SMP memory system, intra-cluster interconnection, and the Message Passing Interface (MPI) program running on a cluster.Our approach is to study the ideal mapping for two classes of interconnects: 1) standard, high-volume Ethernet interconnects, and 2) proprietary, low-latency high-bandwidth interconnects. In the first installment of our work presented in this paper, we have focused on the ideal mapping for the first class.We configured a 16-node dual-processor cluster, interconnected with Fast Ethernet, Gigabit Ethernet, Giganet, and Myrinet. We used High Performance Linpack (HPL) benchmark to demonstrate that re-mapping of processes to processors (or changing the order of processors used) can affect the overall performance. The mappings are based on the HPL program analysis obtained from running a MPI profiling tool. Our results suggest that the performance of HPL using Fast Ethernet as the interconnect can be improved from 10% to 50% depending on the process mapping and the problem size. Conversely, an ad hoc mapping can adversely affect the cluster performance.

international conference on cluster computing | 2000

Impact of level 2 cache and memory subsystem on the scalability of clusters of small-scale SMP servers

Jenwei Hsieh; Tau Leng; Victor Mashayekhi; Reza Rooholamini

This paper presents a performance study of two commodity clusters built from two models of Dell PowerEdge servers. Both clusters have eight servers interconnected by GigaNet for fast message passing and by Fast Ethernet for Network File System (NFS) traffic. The two server models are different in processors, level 2 (L2) cache, speed of front-side bus (FSB), chipsets and memory subsystem. They represent generic servers from two generations of Intel-based architecture. In this study, we use well-known benchmark programs to understand how they perform for computation-intensive applications. We first study their performance in stand-alone environment to unveil the performance characteristic of a compute node. We further explore their aggregated performance when they are used in a cluster environment. We are particularly interested in their scalability, per-processor performance degradation due to memory contention and inter-process communications and the correlation between results from different benchmark programs. We found that L2 cache and memory subsystem have significant impact on computation-intensive parallel applications such as the NAS Parallel Benchmark (NPB) programs. For configurations with a large number of processors (or multiple processors per compute node), some of NPB programs perform better on platform with larger global L2 cache, even though the platform has slower processors, FSB and memory components.

international parallel and distributed processing symposium | 2006

Implications of virtualization on grids for high energy physics applications

Laura Gilbert; Jeff Tseng; Rhys A. Newman; Saeed Iqbal; Ronald Pepper; Onur Celebioglu; Jenwei Hsieh; Victor Mashayekhi; Mark Cobban

The simulations used in the field of high energy physics are compute intensive and exhibit a high level of data parallelism. These features make such simulations ideal candidates for Grid computing. We are taking as an example the GEANT4 detector simulation used for physics studies within the ATLAS experiment at CERN. One key issue in Grid computing is that of network and system security, which can potentially inhibit the widespread use of such simulations. Virtualization provides a feasible solution because it allows the creation of virtual compute nodes in both local and remote compute clusters, thus providing an insulating layer which can play an important role in satisfying the security concerns of all parties involved. However, it has performance implications. This study provides quantitative estimates of the virtualization and hyper-threading overhead for GEANT on commodity clusters. Results show that virtualization has less than 15% run time overhead, and that the best run time (with the non-SMP license of ESX VMware) is achieved by using one virtual machine per CPU. We also observe that hyper-threading does not provide an advantage in this application. Finally, the effect of virtualization on run time, throughput, mean response time and utilization is estimated using simulations.

international symposium on parallel and distributed processing and applications | 2004

Evaluating performance of BLAST on intel xeon and itanium2 processors

Ramesh Radhakrishnan; Rizwan Ali; Garima Kochhar; Kalyana Chadalavada; Ramesh Rajagopalan; Jenwei Hsieh; Onur Celebioglu

High-performance computing has increasingly adopted the use of clustered Intel architecture-based servers. This increase in adoption has been largely fueled by a number of technological enhancements in the Intel architecture-based servers, primarily due to substantial improvement in the Intel processor and memory technology over the past few years. This paper compares the performance characteristics of three Dell PowerEdge (PE) servers that are based on three different Intel processor technologies. They are the PE1750 which is an IA-32 based Xeon system, PE1850 which uses the new 90nm technology Xeon processor at faster frequencies and the PE3250 which is an Itanium2 based system. BLAST (Basic Local Alignment Search Tool), a high performance computing application used in the field of biological research, is used as the workload for this study. The aim is to understand the performance impact of the different features associated with each processor/platform technology when running the BLAST workload.

Archive | 2005