Zhigang Huo | Researchain

Publication

Featured research published by Zhigang Huo.

international conference on cluster computing | 2010

Exploiting Data Deduplication to Accelerate Live Virtual Machine Migration

Xiang Zhang; Zhigang Huo; Jie Ma; Dan Meng

As one of the key characteristics of virtualization, live virtual machine (VM) migration provides great benefits for load balancing, power management, fault tolerance and other system maintenance issues in modern clusters and data centers. Although Pre-Copy is a widely used migration algorithm, it transfers a lot of duplicated memory image data from source to destination, which results in longer migration time and downtime. This paper proposes a novel VM migration approach, named Migration with Data Deduplication (MDD), which introduces data deduplication into migration. MDD utilizes the self-similarity of the run-time memory image, uses hash-based fingerprints to find identical and similar memory pages, and employs Run Length Encoding (RLE) to eliminate redundant memory data during migration. Experiments demonstrate that, compared with Xen's default Pre-Copy migration algorithm, MDD reduces total data transferred during migration by 56.60%, total migration time by 34.93%, and downtime by 26.16% on average.
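The three ingredients named in this abstract (fingerprints for identical pages, fingerprints for similar pages, and RLE over the redundant remainder) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the 4096-byte page size, the use of SHA-1, the choice of hashing the first 64 bytes as a similarity fingerprint, and the names `encode_pages` and `decode_pages` are all assumptions made for illustration.

```python
import hashlib

PAGE_SIZE = 4096  # assumed page size in bytes


def rle_encode(data: bytes) -> list:
    """Run-length encode bytes as [value, run_length] pairs."""
    runs = []
    for b in data:
        if runs and runs[-1][0] == b:
            runs[-1][1] += 1
        else:
            runs.append([b, 1])
    return runs


def rle_decode(runs: list) -> bytes:
    """Invert rle_encode."""
    return b"".join(bytes([value]) * count for value, count in runs)


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length buffers."""
    return bytes(x ^ y for x, y in zip(a, b))


def encode_pages(pages):
    """Turn a list of pages into 'full', 'ref', or 'delta' records."""
    exact = {}    # full-page fingerprint -> index of first page with that content
    similar = {}  # prefix fingerprint -> index of a candidate base page
    out = []
    for i, page in enumerate(pages):
        full_fp = hashlib.sha1(page).digest()
        # Hypothetical similarity heuristic: pages sharing a 64-byte prefix
        # are treated as candidates for delta encoding.
        part_fp = hashlib.sha1(page[:64]).digest()
        if full_fp in exact:
            # Identical page already sent: send only a reference.
            out.append(("ref", exact[full_fp]))
        elif part_fp in similar:
            # Similar page: the XOR against the base is mostly zeros,
            # which RLE compresses well.
            base = similar[part_fp]
            out.append(("delta", base, rle_encode(xor_bytes(page, pages[base]))))
        else:
            out.append(("full", page))
        exact.setdefault(full_fp, i)
        similar.setdefault(part_fp, i)
    return out


def decode_pages(encoded):
    """Rebuild the original pages on the destination side."""
    pages = []
    for item in encoded:
        if item[0] == "ref":
            pages.append(pages[item[1]])
        elif item[0] == "delta":
            pages.append(xor_bytes(pages[item[1]], rle_decode(item[2])))
        else:
            pages.append(item[1])
    return pages
```

In this sketch a page that differs from an earlier one in a single byte is sent as three RLE runs instead of 4096 bytes, and a page identical to an earlier one is sent as a single index.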

ieee international conference on high performance computing data and analytics | 2005

High performance Sockets over kernel level virtual interface architecture

Zhigang Huo; Yansong Yu; Ninghui Sun

The Sockets application programming interface is the de facto standard in network programming. Sockets emulation over high performance networks has been pursued by many researchers. Most projects in this area favor user level communication, but this approach has resulted in some compatibility problems. This paper reexamines the tradeoff between user level and kernel level communication, then discusses the design and implementation of Sockvia, a kernel level Sockets emulation system based on virtual interface architecture. Sockvia emulates Sockets streaming semantics and achieves full compatibility with Sockets over TCP/IP. Through performance optimization methods such as lightweight flow control and private buffers, Sockvia performs very well compared with Sockets over GM-IP or SGM. The half round-trip latency of Sockvia is below 12 us and the peak bandwidth is over 240 MBytes/s. The results of real-world application tests are also presented.

Journal of Computer Science and Technology | 2011

Dawning Nebulae: A PetaFLOPS Supercomputer with a Heterogeneous Structure

Ninghui Sun; Jing Xing; Zhigang Huo; Guangming Tan; Jin Xiong; Bo Li; Can Ma

Dawning Nebulae is a heterogeneous system composed of 9280 multi-core x86 CPUs and 4640 NVIDIA Fermi GPUs. With a Linpack performance of 1.271 petaFLOPS, it was ranked second in the TOP500 List released in June 2010. In this paper, key issues in the system design of Dawning Nebulae are introduced. System tuning methodologies aimed at a petaFLOPS Linpack result are presented, including algorithmic optimization and communication improvement. The design of its file I/O subsystem, including HVFS and the underlying DCFS3, is also described. Performance evaluations show that the Linpack efficiency of each node reaches 69.89%, and that 1024-node aggregate read and write bandwidths exceed 100 GB/s and 70 GB/s respectively. The success of Dawning Nebulae has demonstrated the viability of a CPU/GPU heterogeneous structure for future supercomputer designs.

grid and cooperative computing | 2003

Grid Gateway: Message-Passing between Separated Cluster Interconnects

Wei Cui; Jie Ma; Zhigang Huo

Geographically distributed computing requires high-performance clusters to be integrated to solve problems in a computational Grid. Because each cluster interconnect is isolated, its low-level communication protocol cannot exchange messages directly with those of other clusters. This paper presents a plug-in, Grid Gateway, which enables separated low-level communication protocols to communicate with each other. Grid Gateway can be used in many inter-cluster network topologies. It has some dynamic features, such as support for a multi-gateway mechanism to enhance communication performance. Grid Gateway allows low-level communication protocols to take part in high-performance Grid computing. It is therefore expected to support the implementation of Grid-enabled tools on top of it, such as Grid-enabled MPI. This paper describes its architecture and implementation, and presents some design issues.

networking architecture and storages | 2009

Early Experiences with Write-Write Design of NFS over RDMA

Bo Li; Panyong Zhang; Zhigang Huo; Dan Meng

The Network File System (NFS) protocol, as the de facto standard for sharing files in a distributed environment, has deployed InfiniBand as the underlying transport of sunRPC, namely NFS over RDMA. In the current Read-Write design of NFS over RDMA, NFS write performance is limited because it does not fully utilize the features of InfiniBand. In this paper, we take on the challenge of enhancing the write performance of NFS. We propose and evaluate a new design of sunRPC over RDMA, namely the Write-Write design. To guarantee the security of our design, we propose an HCA-based memory protection extension of InfiniBand. Evaluations show that our Write-Write design increases the kernel-to-kernel RPC bandwidth by 15%~27%. In real disk tests, our Write-Write design gains 15%~22% in multi-client benchmarks compared with the Read-Write design.

international conference on cluster computing | 2010

Virtualizing Modern High-Speed Interconnection Networks with Performance and Scalability

Bo Li; Zhigang Huo; Panyong Zhang; Dan Meng

As one of the most important enabling technologies of cloud computing, virtualization brings to HPC good manageability, online system maintenance, performance isolation and fault isolation. Furthermore, previous studies on VMM-bypass I/O that virtualizes OS-bypass networks (e.g. InfiniBand) relieved the worry of performance degradation that comes with virtualization. In this paper, we address the scalability challenges imposed upon OS-bypass networks under virtualized environments. The eXtended Reliable Connection (XRC) transport, proposed in modern high-speed interconnection networks to address the scalability problem in large scale applications, does not work in virtualized environments. To solve the problem, we propose a VM-proof XRC design to eliminate the scalability gap between virtualized and native environments. Prototype evaluation shows that, with our VM-proof XRC design, the virtualization of modern high-speed interconnection networks can achieve the same raw performance and scalability as a native non-virtualized environment. The connection memory scalability shows a potential 16-fold improvement on virtualized clusters composed of 16-core nodes.

parallel and distributed computing: applications and technologies | 2011

Optimizing MPI Alltoall Communication of Large Messages in Multicore Clusters

Qiang Li; Zhigang Huo; Ninghui Sun

MPI All-to-all communication is widely used in many high performance computing (HPC) applications. In All-to-all communication, each process sends a distinct message to all other participating processes. In multicore clusters, processes within a node simultaneously contend for the same network resource of the node during All-to-all communication. Moreover, many small synchronization messages are required in All-to-all communication of large messages. With contention, their latency is orders of magnitude larger than without contention. As a result, the synchronization overhead increases significantly and accounts for a large proportion of the whole latency of All-to-all communication. In this paper, we analyse the considerable overhead of synchronization messages. Based on the analysis, an optimization is presented to reduce the number of synchronization messages from 3N to 2√N. Evaluations on a 240-core cluster show that performance is improved by an almost constant ratio, which is mainly determined by message size and is independent of system scale. The performance of All-to-all communication is improved by 25% for 32K and 64K byte messages. For an FFT application, performance is improved by 20%.
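To get a feel for the size of the reduction from 3N to 2√N synchronization messages, the two counts can be compared numerically. The sketch below is only arithmetic on the formulas quoted in the abstract. The abstract does not define N, so treating it as a process count is an assumption, and the function name `sync_message_counts` is hypothetical.

```python
import math


def sync_message_counts(n: int) -> tuple:
    """Return (baseline, optimized) synchronization message counts for a given N.

    baseline  = 3N   (count quoted for the original scheme)
    optimized = 2*sqrt(N)  (count quoted for the optimized scheme)
    """
    baseline = 3 * n
    optimized = 2 * math.sqrt(n)
    return baseline, optimized


if __name__ == "__main__":
    for n in (16, 64, 256):
        baseline, optimized = sync_message_counts(n)
        print(f"N={n}: {baseline} vs {optimized:.0f} "
              f"(ratio {baseline / optimized:.1f}x)")
```

The ratio between the two counts is 1.5√N, so the saving in synchronization messages grows with N.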

international conference on cluster computing | 2009

DCR: A fully transparent checkpoint/restart framework for distributed systems

Can Ma; Zhigang Huo; Jingnan Cai; Dan Meng

Checkpoint/restart has been widely used in computing systems for fault tolerance, job scheduling and system maintenance purposes. However, a lack of transparency has hindered the adoption of many implementations. In this paper, we present a fully transparent parallel checkpoint/restart framework, DCR, which takes advantage of kernel-level checkpointing and TCP session preservation. DCR is fully transparent to application programmers and users. No source code modifications, recompilations, or system call interceptions are required. Because of the simplicity of its design and the dominance of TCP/IP in parallel applications, DCR can be readily deployed on computers of widely varying scale, from single-CPU computers to large-scale clusters. A new on-demand blocking checkpoint protocol, which makes use of the reliability mechanism of TCP, is proposed to eliminate global synchronization. We have demonstrated the effectiveness and efficiency of DCR with multiple MPICH2 applications running on Dawning 5000A.

ieee international conference on high performance computing, data, and analytics | 2009

Multiple Virtual Lanes-aware MPI collective communication in multi-core clusters

Bo Li; Zhigang Huo; Panyong Zhang; Dan Meng

The widespread adoption of multi-core processors in the supercomputing arena results in multiple processes in one node competing for the limited resources of the network interface. This is especially true for collective communication in MPI. InfiniBand, as a prevailing high speed network, provides fine-grained Quality of Service (QoS) through the Virtual Lanes (VLs) mechanism. In this paper, we study the possibility of enhancing the performance of MPI collective communication by using multiple Virtual Lanes. The utilization of multiple VLs may equalize the priorities of simultaneous send requests, accelerate the transmission of small messages and increase the utilization of network and memory bandwidth. These benefits speed up MPI collective communication. Factors that affect the utilization of multiple VLs are discussed as well. Evaluations show that Alltoall, Reduce, Allreduce and Reduce_scatter operations benefit from our multiple Virtual Lanes aware design with about 10%~20% performance enhancement. Application evaluations show that our design increases Fast Fourier Transform performance by 11% on a 1024-core cluster.

international parallel and distributed processing symposium | 2005

Impact of page size on communication performance

Xiaocheng Zhou; Zhigang Huo; Ninghui Sun; Yingchao Zhou

In this paper, the impact of page size on communication performance is studied. In the interconnection communication of a cluster system, the address translation table (ATT) is usually used by the network interface card (NIC) to translate virtual addresses to physical addresses. The ATT is located in the memory of the NIC and can in a way be seen as the translation look-aside buffer (TLB) used by the NIC processor. The page size of the operating system affects not only the compulsory and capacity miss rates, but also the hit time and the miss penalty of the ATT in some implementations. With a large page size, we can get a lower ATT miss rate, shorter hit time and a smaller miss penalty, which improves communication performance. To test the impact of the page size, a Linux module based on the AMD Opteron™ processor is implemented to allocate both normal pages and super pages, and the address translation mechanism in Myrinet GM is extended to support either normal pages or super pages. With super pages, the latency of the Ping-pong test can be reduced by 4.3 us and the bandwidth can improve by 55.3 MB/s in some cases. The Linpack test results of the 11 TFLOPS Dawning 4000A show that the Linpack efficiency can be increased from 0.66% to 2.86% for different numbers of processors.
