Xuan-Yi Lin | Researchain

Publication

Featured research published by Xuan-Yi Lin.

International Parallel and Distributed Processing Symposium | 2004

A multiple LID routing scheme for fat-tree-based InfiniBand networks

Xuan-Yi Lin; Yeh-Ching Chung; Tai-Yi Huang

In a cluster system, the performance of the interconnection network greatly affects the aggregate computing power of the interconnected processing nodes. The network architecture, the interconnection topology, and the routing scheme are the three key elements that dominate the performance of an interconnection network. The InfiniBand architecture (IBA) is a new industry-standard architecture. It defines a high-bandwidth, low-latency message switching network that is well suited to constructing high-speed interconnection networks for cluster systems. Fat-trees are widely adopted as interconnection topologies because of their favorable properties. We propose an m-port n-tree approach to construct fat-tree-based InfiniBand networks. For the resulting networks, we propose an efficient multiple LID (MLID) routing scheme. The routing scheme comprises a processing node addressing scheme, a path selection scheme, and a forwarding table assignment scheme. To evaluate its performance, we have developed a software simulator for InfiniBand networks. The simulation results show that the proposed routing scheme runs well on the constructed fat-tree-based InfiniBand networks and efficiently utilizes the bandwidth and the multiple paths that the fat-tree topology offers under the InfiniBand architecture.
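To give a feel for how an m-port n-tree scales, here is a minimal sketch of its element counts. The abstract does not state these formulas; they are derived under the assumption that root switches use all m ports downward while every other switch uses m/2 ports down and m/2 ports up. The function name `m_port_n_tree_size` is illustrative only.

```python
def m_port_n_tree_size(m: int, n: int) -> dict:
    """Element counts for an m-port n-tree, under the assumptions stated above."""
    if m < 2 or m % 2 != 0 or n < 1:
        raise ValueError("m must be an even integer >= 2 and n must be >= 1")
    half = m // 2
    # Each leaf switch attaches m/2 processing nodes.
    nodes = 2 * half ** n
    # Root switches use all m ports downward, so there are half as many of them
    # as there are switches on each of the other n - 1 levels.
    root_switches = half ** (n - 1)
    other_switches = 2 * (n - 1) * half ** (n - 1)
    return {
        "nodes": nodes,
        "switches": root_switches + other_switches,
        "root_switches": root_switches,
    }


small = m_port_n_tree_size(4, 2)
large = m_port_n_tree_size(8, 3)
```

With 4-port switches and two levels, this gives 8 processing nodes on 6 switches; with 8-port switches and three levels, 128 nodes on 80 switches.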

The Journal of Supercomputing | 2007

Hardware supported multicast in fat-tree-based InfiniBand networks

Jiazheng Zhou; Xuan-Yi Lin; Yeh-Ching Chung

Multicast is a very commonly used operation in parallel applications, and it can also be used to implement many collective communication operations. Its performance therefore affects both. Using the hardware-supported multicast of the InfiniBand Architecture (IBA), we propose a cyclic multicast scheme for fat-tree-based (m-port n-tree) InfiniBand networks. The basic idea of the proposed scheme is to find, for each switch, the union of the output ports used on the paths between the source processing node and each destination processing node in a multicast group. Based on these union sets and the path selection scheme, the forwarding table for a given multicast group can be constructed. We implement the proposed multicast scheme, along with the OpenSM multicast scheme and the unicast scheme, on an m-port n-tree InfiniBand network simulator. Several one-to-many, many-to-many, many-to-all, and all-to-many multicast cases are simulated. The simulation results show that the proposed multicast scheme outperforms the unicast scheme in all simulated cases. For the one-to-many case, the cyclic multicast scheme performs the same as the OpenSM multicast scheme. For the many-to-many and all-to-many cases, the cyclic multicast scheme outperforms the OpenSM multicast scheme. For the many-to-all case, the cyclic multicast scheme performs slightly better than the OpenSM multicast scheme.
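The union-of-output-ports idea can be sketched as follows. This is a simplified illustration, not the authors' implementation: the unicast paths are taken as given input (the paper's path selection scheme is not reproduced), and the switch names, port numbers, and the function `multicast_forwarding_table` are hypothetical.

```python
from collections import defaultdict


def multicast_forwarding_table(paths):
    """Build a per-switch multicast entry from unicast paths.

    paths: iterable of paths, one per destination; each path is a list of
    (switch, out_port) hops from the source to that destination.
    Returns {switch: sorted list of output ports}, i.e. the union of the
    output ports used at each switch across all paths.
    """
    table = defaultdict(set)
    for path in paths:
        for switch, out_port in path:
            table[switch].add(out_port)
    return {switch: sorted(ports) for switch, ports in table.items()}


# Hypothetical group: the source sits on leaf0; A and B sit on leaf1, C on leaf0.
paths = [
    [("leaf0", 2), ("root0", 1), ("leaf1", 0)],  # source -> A
    [("leaf0", 2), ("root0", 1), ("leaf1", 1)],  # source -> B
    [("leaf0", 1)],                              # source -> C
]
table = multicast_forwarding_table(paths)
```

In this example the packet is replicated at `leaf0` (ports 1 and 2) and again at `leaf1` (ports 0 and 1), while `root0` forwards a single copy on port 1.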

Network Computing and Applications | 2006

A Tree-Turn Model for Irregular Networks

Jiazheng Zhou; Xuan-Yi Lin; Yeh-Ching Chung

In this paper, we propose a general turn model, the Tree-turn model, for irregular topologies. In the Tree-turn model, links are classified as either tree or cross links, and six directions are associated with the channels of the links. Among the turns between these six directions, we prohibit some so that an efficient deadlock-free routing algorithm, Tree-turn routing, can be derived. Tree-turn routing is constructed in three phases. First, build a coordinated tree for the given topology. Second, construct a communication graph from the topology and the corresponding coordinated tree. Third, set up the forwarding table by using the all-pairs shortest path algorithm, subject to the prohibited turns derived from the Tree-turn model and the directions of the channels in the communication graph. To evaluate its performance, we implement the Tree-turn routing algorithm, along with the up*/down* routing algorithm and the L-turn routing algorithm, on a software simulator. The simulation results show that Tree-turn routing outperforms the other two routing algorithms in all test cases.
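The third phase, shortest paths that respect prohibited turns, can be illustrated with a breadth-first search over (node, incoming direction) states. This is only a sketch under loud assumptions: it uses two hypothetical direction labels, "up" and "down", and a single prohibited turn borrowed from the up*/down* convention, not the six directions or the actual prohibited turns of the Tree-turn model; and it runs one search per source and destination pair instead of an all-pairs algorithm.

```python
from collections import deque


def shortest_legal_path(channels, prohibited_turns, src, dst):
    """Shortest path from src to dst that never takes a prohibited turn.

    channels: {node: [(neighbor, direction), ...]} (directed channels)
    prohibited_turns: set of (incoming_direction, outgoing_direction)
    Returns a list of nodes, or None if no legal path exists.
    """
    start = (src, None)  # no incoming direction at the source
    parent = {start: None}
    queue = deque([start])
    while queue:
        state = queue.popleft()
        node, came_by = state
        if node == dst:
            path = []
            while state is not None:
                path.append(state[0])
                state = parent[state]
            return path[::-1]
        for neighbor, direction in channels.get(node, []):
            if came_by is not None and (came_by, direction) in prohibited_turns:
                continue  # this turn is not allowed
            nxt = (neighbor, direction)
            if nxt not in parent:
                parent[nxt] = state
                queue.append(nxt)
    return None


# Hypothetical four-node topology: r is above a and b, which are both above c.
channels = {
    "a": [("c", "down"), ("r", "up")],
    "b": [("c", "down"), ("r", "up")],
    "c": [("a", "up"), ("b", "up")],
    "r": [("a", "down"), ("b", "down")],
}
legal = shortest_legal_path(channels, {("down", "up")}, "a", "b")
unrestricted = shortest_legal_path(channels, set(), "a", "b")
```

Tracking the incoming direction in the search state is what lets the same node be revisited when it is reached through a different, possibly legal, channel.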

Future Generation Computer Systems | 2014

Master-worker model for MapReduce paradigm on the TILE64 many-core platform

Xuan-Yi Lin; Yeh-Ching Chung

MapReduce is a popular programming paradigm for processing big data. It uses the master–worker model, which is widely used on distributed and loosely coupled systems such as clusters, to solve large problems with task parallelism. Given the ubiquity of many-core architectures in recent years and the foreseeable future, the many-core platform will be one of the main computing platforms for executing MapReduce programs. It is therefore essential to optimize MapReduce programs on many-core platforms. Optimizing parallel programs for a many-core platform is a multifaceted problem in which both system and architectural factors should be taken into account. In this paper, we look into the problem by constructing a master–worker model for the MapReduce paradigm on the TILE64 many-core platform. We investigate master share and worker share schemes for implementing a MapReduce library on the TILE64. The theoretical analysis shows that the worker share scheme is inherently better for implementing a MapReduce library on the TILE64 many-core platform.
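For readers unfamiliar with the pattern, a generic master–worker MapReduce word count might look like the sketch below. It is not the paper's library and does not model the master share or worker share schemes or any TILE64 API; Python threads merely stand in for worker cores, and all function names are illustrative.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def map_task(chunk):
    """Worker side: count the words in one chunk of text."""
    return Counter(chunk.split())


def reduce_task(partials):
    """Merge the per-chunk counts into one result."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


def word_count(chunks, workers=4):
    """Master side: hand chunks to workers, then reduce their results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(map_task, chunks))
    return reduce_task(partials)


counts = word_count(["a b a", "b c"])
```

Here the master performs the reduce step itself; where the intermediate data lives and which side moves it are the kinds of design questions the paper's two schemes address.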

MTPP'10 Proceedings of the Second Russia-Taiwan conference on Methods and tools of parallel programming multicomputers | 2010

Parallelization of motion JPEG decoder on TILE64 many-core platform

Xuan-Yi Lin; Chung-Yu Huang; Pei-Man Yang; Tai-Wen Lung; Shau-Yin Tseng; Yeh-Ching Chung

The ubiquity of many-core architectures challenges software developers to build scalable software. To parallelize data-intensive applications on a many-core platform, one has to consider both the hardware architecture and the software characteristics when writing parallel code. In this paper, we take a Motion JPEG decoder as an example data-intensive application and the TILE64 as an example many-core platform. We parallelize the decoder with two different strategies and observe their impact on program performance and scalability. We design two algorithms, READ and WRITE, which differ in the direction of data movement between processor cores. Experimental results show that the READ algorithm outperforms the WRITE algorithm by 217% when decoding 1080P video on the TILE64 platform. This indicates that the arrangement of data flows in a data-intensive parallel program can have a large impact on program performance and scalability on a many-core platform.

Network Computing and Applications | 2005

Multicast in Fat-Tree-Based InfiniBand Networks

Jiazheng Zhou; Xuan-Yi Lin; Chun-Hsien Wu; Yeh-Ching Chung

Multicast is a very commonly used operation in parallel applications. Using the hardware-supported multicast of the InfiniBand architecture (IBA), we propose a cyclic multicast scheme for fat-tree-based (m-port n-tree) InfiniBand networks. The basic idea of the proposed scheme is to find, for each switch, the union of the output ports used on the paths between the source processing node and each destination processing node in a multicast group. Based on these union sets and the path selection scheme, the forwarding table for a given multicast group can be constructed. We implement the proposed multicast scheme, along with the OpenSM multicast scheme and the unicast scheme, on an m-port n-tree InfiniBand network simulator. The simulation results show that the proposed multicast scheme outperforms the unicast scheme in all simulated cases. For the many-to-many and all-to-many cases, the cyclic multicast scheme outperforms the OpenSM multicast scheme. For the many-to-all case, the cyclic multicast scheme performs slightly better than the OpenSM multicast scheme.

International Conference on Parallel Processing | 2011

An Efficient Programming Paradigm for Shared-Memory Master-Worker Video Decoding on TILE64 Many-Core Platform

Xuan-Yi Lin; Kuan-Chou Lai; Shau-Yin Tseng; Kuan-Ching Li; Yeh-Ching Chung

The ubiquity of many-core architectures makes it challenging to build scalable application software and dramatically changes the way applications are traditionally developed. Optimizing programs for many-core platforms is a multifaceted problem in which system and architectural factors should be taken into consideration. In this paper, we attack the problem from the perspective of the programming paradigm. We propose a hybrid producer-write plus consumer-read shared-memory programming paradigm for implementing a master-worker video decoder on the TILE64 many-core platform. To evaluate the scalability and performance benefits of different programming paradigms, a Motion JPEG decoder is parallelized using a master-worker structure and implemented with combinations of consumer-read programming and producer-write programming. Experimental results show that the proposed implementation achieves competitive speedup, scales well with the number of available cores, and delivers up to a fourfold performance improvement over other implementations when decoding a 1080P video.

The Journal of Supercomputing | 2013

Efficient programming paradigm for video streaming processing on TILE64 platform

Xuan-Yi Lin; Kuan-Chou Lai; Kuan-Ching Li; Yeh-Ching Chung

Unprecedented advances in computer hardware and networking technologies have made many-core computing affordable and readily available within a few years. Nonetheless, building scalable parallel software remains a challenge for programmers. Optimizing parallel programs for a many-core platform is a multifaceted problem in which system and architectural factors should be taken into account. In this paper, we tackle this problem by implementing parallel programs with the different available programming paradigms and evaluating application behavior on the TILE64 many-core platform. Specifically, we investigate a hybrid producer-write plus consumer-read shared-memory programming paradigm for implementing a master–worker video decoder and encoder on that platform. Experimental results show that the proposed implementation achieves competitive speedup, scales well with the number of available cores, and delivers up to a fourfold performance improvement over other implementations when decoding a sample 1080P video.

Parallel Computing Technologies | 2007

SCRF: a hybrid register file architecture

Jer-Yu Hsu; Yan-Zu Wu; Xuan-Yi Lin; Yeh-Ching Chung

In VLIW processor design, the clustered architecture has become a popular solution for better hardware efficiency, but inter-cluster communication (ICC) adds execution-cycle overhead. In this paper, we propose a shared cluster register file (SCRF) architecture and an SCRF register allocation algorithm to reduce the ICC overhead. The SCRF architecture is a hybrid register file (RF) organization composed of a shared RF (SRF) and clustered RFs (CRFs). By placing frequently used variables that need ICCs in the SRF, we can reduce the number of data transfers between clusters and thus the ICC overhead. The SCRF register allocation algorithm, a heuristic based on graph coloring, exploits this architectural feature to reduce ICCs and balance spill code. To evaluate the performance of the proposed architecture and the SCRF register allocation algorithm, the frequently used two-cluster architecture is simulated on Trimaran with and without the SRF scheme. The simulation results show that the SCRF architecture performs better than the clustered RF architecture for all test programs in all measured metrics.
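As background for "a heuristic based on graph coloring", the sketch below shows plain greedy coloring of an interference graph with spilling. It is a generic textbook-style illustration, not the SCRF algorithm: it makes no distinction between shared and clustered register files and does not model ICC cost. The function name and the example graph are hypothetical.

```python
def allocate_registers(interference, num_registers):
    """Greedy graph-coloring register allocation.

    interference: {variable: set of variables that are live at the same time}
    Returns (assignment, spilled): a {variable: register index} mapping and
    the list of variables that could not be given a register.
    """
    # Visit the most constrained variables first; break ties by name.
    order = sorted(interference, key=lambda v: (-len(interference[v]), v))
    assignment, spilled = {}, []
    for var in order:
        taken = {assignment[n] for n in interference[var] if n in assignment}
        free = [r for r in range(num_registers) if r not in taken]
        if free:
            assignment[var] = free[0]
        else:
            spilled.append(var)  # no register left: spill to memory
    return assignment, spilled


# Hypothetical interference graph: a, b, c are mutually live; d overlaps only a.
graph = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a"},
}
two_regs = allocate_registers(graph, 2)
three_regs = allocate_registers(graph, 3)
```

With two registers one of the three mutually interfering variables must be spilled; with three registers every variable gets a register.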

Lecture Notes in Computer Science | 2007