Network


Latest external collaborations at the country level.

Hotspot


Research topics in which Ching-Hsiang Chu is active.

Publication


Featured research published by Ching-Hsiang Chu.


EURASIP Journal on Wireless Communications and Networking | 2011

Improving SCTP Performance by Jitter-Based Congestion Control over Wired-Wireless Networks

Jyh-Ming Chen; Ching-Hsiang Chu; Eric Hsiao-Kuang Wu; Meng-Feng Tsai; Jian-Ren Wang

With the advances of wireless communication technologies, wireless networks have gradually become the most widely adopted communication networks in the new-generation Internet. Computing and mobile devices may be equipped with multiple wired and/or wireless network interfaces. The Stream Control Transmission Protocol (SCTP) has been proposed for reliable data transport, and its multihoming feature uses network interfaces effectively to improve performance and reliability. However, like TCP, SCTP suffers unnecessary performance degradation over wired-wireless heterogeneous networks. The main reason is that the original congestion control scheme of SCTP cannot differentiate loss events, so SCTP reduces the congestion window inappropriately. To solve this problem and improve performance, we propose a jitter-based congestion control scheme with end-to-end semantics over wired-wireless networks. In addition, we address the ineffective jitter-ratio problem, which may cause the original jitter-based congestion control scheme to misjudge congestion loss as wireless loss. An available-bandwidth estimation scheme is integrated into our congestion control mechanism to stabilize the bottleneck. Simulation experiments reveal that our scheme (JSCTP) effectively improves performance over wired-wireless networks.
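
The heart of such a scheme is loss differentiation: reduce the congestion window only when delay jitter suggests queue build-up, and treat other losses as wireless losses. The C sketch below illustrates that decision logic only; the EWMA weight, the jitter-ratio threshold, and the back-off rules are illustrative assumptions, not the values used in JSCTP.

    /*
     * Minimal sketch (not the authors' JSCTP implementation) of jitter-based
     * loss differentiation: when a loss is detected while delay jitter is
     * high relative to the RTT, queues are probably building up, so the loss
     * is treated as congestion; otherwise it is attributed to the wireless
     * channel and the congestion window is reduced only gently.
     */
    #include <stdbool.h>

    typedef struct {
        double srtt;      /* smoothed round-trip time (seconds)  */
        double jitter;    /* smoothed variation of one-way delay */
        double cwnd;      /* congestion window (bytes)           */
        double mtu;       /* path MTU (bytes)                    */
    } cc_state_t;

    /* Fold a new one-way-delay difference into the smoothed jitter (EWMA),
     * called once per acknowledged packet. */
    static void update_jitter(cc_state_t *s, double owd_prev, double owd_cur)
    {
        double sample = owd_cur > owd_prev ? owd_cur - owd_prev
                                           : owd_prev - owd_cur;
        s->jitter = 0.875 * s->jitter + 0.125 * sample;  /* assumed weight */
    }

    /* Classify a loss event: true = congestion loss, false = wireless loss. */
    static bool loss_is_congestion(const cc_state_t *s)
    {
        const double JITTER_RATIO_THRESHOLD = 0.05;      /* assumed value  */
        return (s->jitter / s->srtt) > JITTER_RATIO_THRESHOLD;
    }

    /* React to a loss the way a jitter-aware SCTP sender might. */
    static void on_loss(cc_state_t *s)
    {
        if (loss_is_congestion(s))
            s->cwnd /= 2;              /* classic multiplicative decrease   */
        else if (s->cwnd > 4 * s->mtu)
            s->cwnd -= s->mtu;         /* gentle back-off for wireless loss */
    }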


Proceedings of the 6th ACM Workshop on Wireless Multimedia Networking and Computing | 2011

A Novel Congestion Control Mechanism on TFRC for Streaming Applications over Wired-Wireless Networks

Yu-Chen Huang; Ching-Hsiang Chu; Eric Hsiao-Kuang Wu

TCP-Friendly Rate Control (TFRC) has been widely adopted for advanced streaming applications over wired-wireless networks. TFRC applies an equation-based rate control scheme to provide a smooth sending rate and perceptual quality in streaming applications. However, TFRC tends to malfunction in wireless environments when a packet loss event is caused by poor channel quality rather than network congestion. Therefore, TFRC cannot provide a high quality of service for streaming applications over wired-wireless networks. In this paper, we propose a novel one-way delay-jitter-based TFRC with end-to-end semantics over wired-wireless networks. This scheme not only provides a smooth sending rate and TCP-friendly characteristics like standard TFRC, but also considerably increases throughput by accurately estimating the available bandwidth in wired-wireless networks under bursty background traffic. Simulation results show that our scheme improves performance without intrusiveness issues, even when the background traffic over wired-wireless networks is bursty.
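
For context, standard TFRC derives its sending rate X from the TCP throughput equation (RFC 5348); a jitter-based scheme like the one above changes how the loss event rate p is assessed under wireless loss, not the equation itself:

    X = \frac{s}{R\sqrt{2bp/3} + t_{RTO}\left(3\sqrt{3bp/8}\right)p\left(1 + 32p^{2}\right)}

where s is the segment size, R the round-trip time, p the loss event rate, b the number of packets acknowledged per ACK, and t_RTO the retransmission timeout (commonly set to 4R).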


Cluster Computing and the Grid | 2016

CUDA Kernel Based Collective Reduction Operations on Large-scale GPU Clusters

Ching-Hsiang Chu; Khaled Hamidouche; Akshay Venkatesh; Ammar Ahmad Awan; Dhabaleswar K. Panda

Accelerators like NVIDIA GPUs have changed the landscape of current HPC clusters to a great extent. The massive heterogeneous parallelism offered by these accelerators has led to GPU-aware MPI libraries that are widely used for writing distributed parallel scientific applications. Compute-oriented collective operations like MPI_Reduce perform computation on data in addition to the usual communication performed by collectives. Historically, these collectives, due to their compute requirements, have been implemented on the CPU (host) only. However, with the advent of GPU technologies, it has become important for MPI libraries to provide better designs for their GPU (device) based versions. In this paper, we tackle the above challenges and provide designs and implementations for the most commonly used compute-oriented collectives - MPI_Reduce, MPI_Allreduce, and MPI_Scan - for GPU clusters. We propose extensions to state-of-the-art algorithms to fully take advantage of GPU capabilities like GPUDirect RDMA (GDR) and CUDA compute kernels to perform these operations efficiently. With our new designs, we report reduced execution time for all compute-based collectives on up to 96 GPUs. Experimental results show an improvement of 50% for small messages and 85% for large messages using MPI_Reduce. For MPI_Allreduce and MPI_Scan, we report a more than 40% reduction in time for large messages. Furthermore, analytical models are developed and evaluated to understand and predict the performance of the proposed designs for extremely large-scale GPU clusters.
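
From the application's point of view, a CUDA-aware MPI library such as MVAPICH2-GDR accepts device pointers directly in collectives; the designs above concern how the library implements the reduction internally (host staging, GDR, or CUDA kernels). A minimal usage sketch, assuming a CUDA-aware MPI build; error checking is omitted:

    /* Minimal sketch: calling MPI_Reduce directly on GPU buffers, assuming a
     * CUDA-aware MPI library (e.g. MVAPICH2-GDR). Error checking omitted. */
    #include <mpi.h>
    #include <cuda_runtime.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const size_t n = 1 << 20;            /* 1M floats per rank */
        float *d_send, *d_recv;
        cudaMalloc((void **)&d_send, n * sizeof(float));
        cudaMalloc((void **)&d_recv, n * sizeof(float));
        /* ... fill d_send with a CUDA kernel or cudaMemcpy ... */

        /* Device pointers are handed straight to the collective; the library
         * decides between host staging, GPUDirect RDMA, and CUDA reduction
         * kernels internally. */
        MPI_Reduce(d_send, d_recv, (int)n, MPI_FLOAT, MPI_SUM, 0, MPI_COMM_WORLD);

        cudaFree(d_send);
        cudaFree(d_recv);
        MPI_Finalize();
        return 0;
    }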


IEEE Systems Journal | 2018

Distributed Topology Control for Energy-Efficient and Reliable Wireless Communications

Min-Te Sun; Ching-Hsiang Chu; Eric Hsiao-Kuang Wu; Chi-Sen Hsiao; Andy An-Kai Jeng

A dense wireless ad hoc network with a high average node degree offers strong connectivity for packet routing, but at the same time increases the probability of interference between nodes and results in rapid depletion of node energy. To remedy this issue, topology control mechanisms aim at optimizing the transmission power of nodes in a wireless ad hoc network to maintain the connectivity of the network, reduce energy wastage, and increase network throughput. In this paper, we propose a partially localized topology control algorithm, namely articulation points based topology control (APTC), which effectively reduces power consumption in a wireless ad hoc network with low communication overhead. Unlike existing topology control protocols, APTC designates articulation points as initiators and builds a tree of minimum spanning trees to achieve power savings while maintaining network connectivity. The simulation results demonstrate the superiority of APTC over existing topology control algorithms in terms of power consumption and communication overhead.
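
APTC is built around articulation points, i.e. nodes whose removal disconnects the network. The paper identifies them in a partially localized, distributed manner; purely as a centralized point of reference (not the APTC protocol), the classic DFS-based detection looks like this:

    /* Centralized articulation-point detection via DFS low-link values.
     * Reference sketch only; APTC identifies APs in a distributed way. */
    #include <stdbool.h>
    #include <string.h>

    #define MAX_NODES 128

    static int  adj[MAX_NODES][MAX_NODES];   /* adjacency matrix */
    static int  n_nodes;
    static int  disc[MAX_NODES], low[MAX_NODES], timer;
    static bool visited[MAX_NODES], is_ap[MAX_NODES];

    static void dfs(int u, int parent)
    {
        visited[u] = true;
        disc[u] = low[u] = ++timer;
        int children = 0;

        for (int v = 0; v < n_nodes; v++) {
            if (!adj[u][v]) continue;
            if (!visited[v]) {
                children++;
                dfs(v, u);
                if (low[v] < low[u]) low[u] = low[v];
                /* Non-root u is an AP if some child cannot reach above u. */
                if (parent != -1 && low[v] >= disc[u]) is_ap[u] = true;
            } else if (v != parent && disc[v] < low[u]) {
                low[u] = disc[v];           /* back edge */
            }
        }
        /* The DFS root is an AP iff it has more than one DFS child. */
        if (parent == -1 && children > 1) is_ap[u] = true;
    }

    void find_articulation_points(void)
    {
        timer = 0;
        memset(visited, 0, sizeof visited);
        memset(is_ap, 0, sizeof is_ap);
        for (int u = 0; u < n_nodes; u++)
            if (!visited[u]) dfs(u, -1);
    }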


International Conference on Parallel Processing | 2017

Efficient and Scalable Multi-Source Streaming Broadcast on GPU Clusters for Deep Learning

Ching-Hsiang Chu; Xiaoyi Lu; Ammar Ahmad Awan; Hari Subramoni; Jahanzeb Maqbool Hashmi; Bracy Elton; Dhabaleswar K. Panda

Broadcast operations (e.g., MPI_Bcast) have been widely used in deep learning applications to exchange a large amount of data among multiple graphics processing units (GPUs). Recent studies have shown that leveraging the InfiniBand hardware-based multicast (IB-MCAST) protocol can enhance the scalability of GPU-based broadcast operations. However, these initial designs with IB-MCAST are not optimized for multi-source broadcast operations with large messages, which is the common communication scenario for deep learning applications. In this paper, we first model existing broadcast schemes and analyze their performance bottlenecks on GPU clusters. Then, we propose a novel broadcast design based on message streaming to better exploit IB-MCAST and NVIDIA GPUDirect RDMA (GDR) technology for efficient large-message transfers. The proposed design provides high overlap among multi-source broadcast operations. Experimental results show up to a 68% reduction in latency compared to state-of-the-art solutions in a benchmark-level evaluation. The proposed design also shows near-constant latency for a single broadcast operation as the system grows. Furthermore, it yields up to a 24% performance improvement in the popular deep learning framework Microsoft CNTK, which uses multi-source broadcast operations; notably, the performance gains are achieved without modifications to applications. Our model validation shows that the proposed analytical model and experimental results match within a 10% range. Our model also predicts that the proposed design outperforms existing schemes for multi-source broadcast scenarios with increasing numbers of broadcast sources in large-scale GPU clusters.
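
The essence of message streaming is to split a large broadcast payload into chunks and keep several chunk broadcasts in flight so the multicast path stays busy. The sketch below expresses that pipeline with standard non-blocking MPI_Ibcast calls on a CUDA-aware MPI library; the chunk size and pipeline depth are illustrative assumptions, and the actual design drives IB-MCAST and GDR inside the library rather than at this level.

    /* Simplified streaming broadcast: chunk a large (GPU) buffer and keep a
     * small pipeline of non-blocking broadcasts in flight. Illustrative only. */
    #include <mpi.h>

    #define CHUNK_ELEMS  (1 << 18)   /* assumed 256K floats per chunk      */
    #define PIPE_DEPTH   4           /* assumed number of in-flight chunks */

    void streaming_bcast(float *buf, long total_elems, int root, MPI_Comm comm)
    {
        MPI_Request reqs[PIPE_DEPTH];
        int in_flight = 0;
        long offset = 0;

        while (offset < total_elems || in_flight > 0) {
            /* Issue new chunk broadcasts while the pipeline has room. */
            while (offset < total_elems && in_flight < PIPE_DEPTH) {
                long count = total_elems - offset;
                if (count > CHUNK_ELEMS) count = CHUNK_ELEMS;
                MPI_Ibcast(buf + offset, (int)count, MPI_FLOAT, root, comm,
                           &reqs[in_flight++]);
                offset += count;
            }
            /* Retire the oldest chunk, freeing a pipeline slot. */
            MPI_Wait(&reqs[0], MPI_STATUS_IGNORE);
            for (int i = 1; i < in_flight; i++) reqs[i - 1] = reqs[i];
            in_flight--;
        }
    }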


Symposium on Computer Architecture and High Performance Computing | 2016

Designing High Performance Heterogeneous Broadcast for Streaming Applications on GPU Clusters

Ching-Hsiang Chu; Khaled Hamidouche; Hari Subramoni; Akshay Venkatesh; Bracy Elton; Dhabaleswar K. Panda

High-performance streaming applications are beginning to leverage the compute power offered by graphics processing units (GPUs) and the high network throughput offered by high-performance interconnects such as InfiniBand (IB) to boost their performance and scalability. These applications rely heavily on broadcast operations to move data, stored in host memory, from a single source (typically live) to multiple GPU-based computing sites. While homogeneous broadcast designs take advantage of the IB hardware multicast feature to boost their performance, their heterogeneous counterparts require explicit data movement between host and GPU, which significantly hurts overall performance. There is a dearth of efficient heterogeneous broadcast designs for streaming applications, especially on emerging multi-GPU configurations. In this work, we propose novel techniques to fully and conjointly take advantage of NVIDIA GPUDirect RDMA (GDR), CUDA inter-process communication (IPC), and IB hardware multicast features to design high-performance heterogeneous broadcast operations for modern multi-GPU systems. We propose intra-node, topology-aware schemes that maximize the performance benefits while minimizing the utilization of valuable PCIe resources. Further, we optimize the communication pipeline by overlapping the GDR + IB hardware multicast operations with CUDA IPC operations. Compared to existing solutions, our designs show up to a 3X improvement in the latency of a heterogeneous broadcast operation. Our designs also show up to a 67% improvement in the execution time of a streaming benchmark on a GPU-dense Cray CS-Storm system with 88 GPUs.
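
Within a node, the CUDA IPC part of such a design lets peer processes map a single GPU buffer instead of receiving their own copies over PCIe. A minimal sketch of that mechanism follows; the handle exchange over MPI and the ownership of the buffer by node-local rank 0 are illustrative choices, and error checking is omitted:

    /* Minimal CUDA IPC sketch: one process exports a device buffer, peers on
     * the same node map it without an extra host copy. */
    #include <mpi.h>
    #include <cuda_runtime.h>

    void share_device_buffer(int node_rank, size_t bytes, MPI_Comm node_comm)
    {
        cudaIpcMemHandle_t handle;
        void *d_buf = NULL;

        if (node_rank == 0) {
            /* Owner: allocate and export the buffer. */
            cudaMalloc(&d_buf, bytes);
            cudaIpcGetMemHandle(&handle, d_buf);
        }
        /* Distribute the opaque handle to the other processes on this node. */
        MPI_Bcast(&handle, sizeof(handle), MPI_BYTE, 0, node_comm);

        if (node_rank != 0) {
            /* Peer: map the owner's buffer into this address space. */
            cudaIpcOpenMemHandle(&d_buf, handle, cudaIpcMemLazyEnablePeerAccess);
            /* ... read the broadcast payload directly from d_buf ... */
            cudaIpcCloseMemHandle(d_buf);
        }
        /* Owner keeps d_buf alive for subsequent broadcast rounds. */
    }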


IEEE Sensors Journal | 2016

Efficient Articulation Point Collaborative Exploration for Reliable Communications in Wireless Sensor Networks

Min-Te Sun; Ching-Hsiang Chu; Eric Hsiao-Kuang Wu; Chi-Sen Hsiao

Maintaining connectivity among nodes is one of the critical issues for reliable communications in wireless sensor networks (WSNs). Unfortunately, the mobility and failure of nodes in WSNs have made this issue complicated and challenging. To prevent the network from partitioning, prior works focused on identifying articulation points (APs) in a WSN. However, these approaches either produce low AP identification accuracy or introduce expensive communication overhead. In this paper, we propose a novel algorithm, namely AP collaborative exploration (ACE). The basic idea of ACE is for nodes to identify loops by locally broadcasting exploration packets. By examining the exploration packets, each node can efficiently and accurately determine whether it is an AP without much communication overhead. Simulation results using both real and synthesized data sets illustrate that ACE not only achieves AP identification accuracy comparable to the state of the art but also incurs significantly (50%) lower communication overhead. Consequently, the proposed ACE is energy-efficient and saves up to 160% of energy compared with existing schemes.
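
The decision each node makes can be pictured as a local connectivity test: if, in the node's local view of the topology, its neighbors stay mutually reachable once the node itself is removed, the node is not an articulation point. The sketch below performs that test on an explicit local adjacency matrix; it is only an approximation of the idea, since ACE reaches its decision by exchanging exploration packets rather than by building a local graph, and the 2-hop neighborhood bound is an assumption.

    /* Local AP self-test on a node's (assumed 2-hop) view of the topology. */
    #include <stdbool.h>

    #define MAX_LOCAL 64   /* assumed bound on the local neighborhood size */

    /* adj: adjacency matrix of the local view, self: this node's index,
     * nbr/n_nbr: its 1-hop neighbors. Returns true if removing `self`
     * disconnects its neighbors within the local view. */
    bool locally_looks_like_ap(bool adj[MAX_LOCAL][MAX_LOCAL], int n_local,
                               int self, const int nbr[], int n_nbr)
    {
        if (n_nbr < 2) return false;             /* a leaf cannot be an AP */

        bool reached[MAX_LOCAL] = { false };
        int queue[MAX_LOCAL], head = 0, tail = 0;

        /* BFS from one neighbor, never passing through `self`. */
        queue[tail++] = nbr[0];
        reached[nbr[0]] = true;
        while (head < tail) {
            int u = queue[head++];
            for (int v = 0; v < n_local; v++)
                if (adj[u][v] && v != self && !reached[v]) {
                    reached[v] = true;
                    queue[tail++] = v;
                }
        }

        /* If some neighbor was not reached, removing `self` splits the
         * local view, so the node reports itself as an articulation point. */
        for (int i = 1; i < n_nbr; i++)
            if (!reached[nbr[i]]) return true;
        return false;
    }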


International Conference on Communications | 2014

Measurement of long-distance Wi-Fi connections: An empirical study

Ching-Hsiang Chu; You Ming Chen; Yu Te Huang; Roberto Carvalho; Chiun Chieh Hsu; Ling Jyh Chen

Long-distance Wi-Fi technology has shown promise in several network applications that cannot utilize conventional technologies effectively. Although the network performance of Wi-Fi technology is affected by a number of environmental factors, there is a dearth of long-term, continuous, and systematic studies on the correlations between those factors and the technology's performance. In this study, we deployed a long-distance Wi-Fi testbed on our campus and conducted a one-year experiment. Comprehensive analysis of the measurement results shows that rainfall is the major weather attribute that affects the network performance of long-distance Wi-Fi links. In addition, the performance is highly correlated with human activities in the immediate vicinity. The results also demonstrate that it is possible to infer people's daily routines on campus by exploiting the long-term measurement data.


arXiv: Distributed, Parallel, and Cluster Computing | 2018

Optimized Broadcast for Deep Learning Workloads on Dense-GPU InfiniBand Clusters: MPI or NCCL?

Ammar Ahmad Awan; Ching-Hsiang Chu; Hari Subramoni; Dhabaleswar K. Panda

Traditionally, MPI runtimes have been designed for clusters with a large number of nodes. However, with the advent of MPI+CUDA applications and dense multi-GPU systems, it has become important to design efficient communication schemes. This, coupled with the new application workloads brought forward by deep learning frameworks like Caffe and Microsoft CNTK, poses additional design constraints due to the very large GPU-buffer messages communicated during the training phase. In this context, special-purpose libraries like NCCL have been proposed. In this paper, we propose a pipelined chain (ring) design for the MPI_Bcast collective operation along with an enhanced collective tuning framework in MVAPICH2-GDR that enables efficient intra-/inter-node multi-GPU communication. We present an in-depth performance landscape for the proposed MPI_Bcast schemes along with a comparative analysis of NCCL Broadcast and NCCL-based MPI_Bcast. The proposed designs for MVAPICH2-GDR enable up to 14X and 16.6X improvements, compared to NCCL-based solutions, for intra- and inter-node broadcast latency, respectively. In addition, the proposed designs provide up to a 7% improvement over NCCL-based solutions for data-parallel training of the VGG network on 128 GPUs using Microsoft CNTK. The proposed solutions outperform the recently introduced NCCL2 library for small and medium message sizes and offer comparable or better performance for very large message sizes.
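
For reference, the NCCL path that the MPI_Bcast designs are compared against follows the pattern sketched below: one process per GPU, with the NCCL unique id bootstrapped over MPI (a common pattern). The GPU-per-node count is an assumption and error checking is omitted.

    /* Rough sketch of an NCCL broadcast with one process per GPU. */
    #include <mpi.h>
    #include <nccl.h>
    #include <cuda_runtime.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank, nranks;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nranks);

        /* Rank 0 creates the NCCL id; everyone else receives it over MPI. */
        ncclUniqueId id;
        if (rank == 0) ncclGetUniqueId(&id);
        MPI_Bcast(&id, sizeof(id), MPI_BYTE, 0, MPI_COMM_WORLD);

        /* Assumes 8 GPUs per node and ranks packed by node; real code would
         * derive the node-local rank instead. */
        cudaSetDevice(rank % 8);
        ncclComm_t comm;
        ncclCommInitRank(&comm, nranks, id, rank);

        const size_t n = 1 << 24;             /* e.g. one layer's gradients */
        float *d_buf;
        cudaMalloc((void **)&d_buf, n * sizeof(float));

        cudaStream_t stream;
        cudaStreamCreate(&stream);
        ncclBcast(d_buf, n, ncclFloat, 0, comm, stream);   /* in-place bcast */
        cudaStreamSynchronize(stream);

        cudaFree(d_buf);
        ncclCommDestroy(comm);
        MPI_Finalize();
        return 0;
    }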


OpenSHMEM and Related Technologies: Experiences, Implementations, and Technologies (Revised Selected Papers of the Second OpenSHMEM Workshop), Volume 9397 | 2015

A Case for Non-blocking Collectives in OpenSHMEM: Design, Implementation, and Performance Evaluation using MVAPICH2-X

Ammar Ahmad Awan; Khaled Hamidouche; Ching-Hsiang Chu; Dhabaleswar K. Panda

An ever-increasing push for performance in the HPC arena has led to a multitude of hybrid architectures in both software and hardware for HPC systems. The Partitioned Global Address Space (PGAS) programming model has gained a lot of attention over the last couple of years. The main advantage of the PGAS model is the ease of programming provided by the abstraction of a single memory space across the nodes of a cluster. OpenSHMEM implementations currently implement the OpenSHMEM 1.2 specification, which provides interfaces for one-sided, atomic, and collective operations. However, the recent trend in the HPC arena in general, and the Message Passing Interface (MPI) community in particular, is to use non-blocking collective (NBC) communication to efficiently overlap computation with communication and save precious CPU cycles. This work is inspired by encouraging performance numbers for the NBC implementations of various MPI libraries. As the OpenSHMEM community has been discussing the use of non-blocking communication, in this paper we propose an NBC interface for OpenSHMEM and present its design, implementation, and performance evaluation. We discuss the NBC interface, which has been modeled along the lines of the MPI NBC interface and requires minimal changes to the function signatures. We have designed and implemented this interface using the Unified Communication Runtime in MVAPICH2-X. In addition, we propose OpenSHMEM NBC benchmarks as an extension to the OpenSHMEM benchmarks available in the widely used OMB suite. Our performance evaluation shows that the proposed NBC implementation provides up to 96 percent overlap for different collectives with little NBC overhead.
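
Since the proposed OpenSHMEM NBC interface is modeled on MPI's non-blocking collectives, the overlap pattern it enables is the familiar one sketched below, shown here with the standard MPI_Iallreduce rather than the paper's OpenSHMEM signatures (which are not reproduced):

    /* Overlap pattern enabled by non-blocking collectives, shown with MPI's
     * standard MPI_Iallreduce as an analogue of the proposed interface. */
    #include <mpi.h>

    void reduce_with_overlap(double *local, double *global, int n, MPI_Comm comm)
    {
        MPI_Request req;

        /* Start the collective; it progresses in the background. */
        MPI_Iallreduce(local, global, n, MPI_DOUBLE, MPI_SUM, comm, &req);

        /* ... independent computation that touches neither `local` nor
         *     `global`, hiding the communication time ... */

        /* Block only when the reduced result is actually needed. */
        MPI_Wait(&req, MPI_STATUS_IGNORE);
    }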

Collaboration


Dive into Ching-Hsiang Chu's collaborations.

Top Co-Authors

Bracy Elton (Dynamics Research Corporation)
Min-Te Sun (National Central University)
Xiaoyi Lu (Ohio State University)
Jyh-Ming Chen (National Central University)