Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Xuejun An is active.

Publication


Featured research published by Xuejun An.


Journal of Computer Science and Technology | 2014

An Intra-Server Interconnect Fabric for Heterogeneous Computing

Zheng Cao; Xiaoli Liu; Qiang Li; Xiaobing Liu; Zhan Wang; Xuejun An

With the increasing diversity of application needs and computing units, servers with heterogeneous processors are becoming widespread. However, the conventional SMP/ccNUMA server architecture introduces a communication bottleneck between heterogeneous processors and uses them only as coprocessors, which limits the efficiency and flexibility of heterogeneous computing. To solve this problem, this paper proposes an intra-server interconnect fabric that supports both intra-server peer-to-peer interconnection and I/O resource sharing among heterogeneous processors. By connecting processors and I/O devices with the proposed fabric, heterogeneous processors can communicate directly with each other and run in stand-alone mode with shared intra-server resources. We design the fabric by extending the de facto system I/O bus protocol PCIe (Peripheral Component Interconnect Express) and implement it in a single chip, cZodiac. By making full use of PCIe's original advantages, the interconnection and I/O sharing mechanisms are lightweight and efficient. Evaluations carried out on both an FPGA (Field Programmable Gate Array) prototype and a cycle-accurate simulator demonstrate that our design is feasible and scalable. In addition, the design suits not only heterogeneous servers but also high-density servers.


International Conference on Cluster Computing | 2011

Design of HPC Node with Heterogeneous Processors

Zheng Cao; Hongwei Tang; Qiang Li; Bo Li; Fei Chen; Kai Wang; Xuejun An; Ninghui Sun

Heterogeneous computing is becoming an important technology trend in HPC, where more and more heterogeneous processors are used. However, in traditional node architectures, heterogeneous processors are always used as coprocessors. Such usage increases the communication latency between heterogeneous processors and prevents the node from achieving high density. To improve communication efficiency between heterogeneous processors, this paper proposes a new node architecture named HeteNode. In HeteNode, general-purpose processors and heterogeneous processors are interconnected directly by a system controller and play the same role in both communication and computation. A HeteNode prototype containing nine processors in a 1U chassis has been built. Evaluation on the prototype shows a minimum intra-node latency of 580 ns and a minimum inter-node latency of 1.78 μs between heterogeneous processors. In addition, NPB benchmarks show good scalability on HeteNode.


Networking, Architecture, and Storage | 2009

Gemini NI: An Integration of Two Network Interfaces

Kai Wang; Xiaomin Li; Xuejun An; Ninghui Sun

As the TOP500 list shows, the performance of high performance computers (HPCs) is increasing rapidly. This remarkable growth should be largely attributed to the development of their communication systems, because HPCs could not scale to such sizes without excellent communication systems. As an important member of the communication system, the network interface (NI) plays a significant role. Since the network interface lies on the critical path of the communication system, it can easily become a bottleneck if it cannot provide low-latency, high-bandwidth communication. The Gemini NI presented in this paper has good performance potential in both latency and bandwidth. It provides a remote load/store (RLS) mechanism that delivers ultra-low latency. Furthermore, it has two HyperTransport (HT) interfaces connected to two processors or symmetric multi-processors (SMP), and four proprietary switch interfaces connected to the switches, which largely increases the throughput of the Gemini NI. Inside the Gemini NI, almost all components of a network interface are duplicated, and resource sharing within the NI provides great flexibility for scheduling. An FPGA prototype of the Gemini NI has been implemented, and the preliminary results prove the validity of our design.


International Conference on Parallel and Distributed Systems | 2014

Building a large-scale direct network with low-radix routers

Yong Su; Zheng Cao; Zhiguo Fan; Zhan Wang; Xiaoli Liu; Xiaobing Liu; Li Qiang; Xuejun An; Ninghui Sun

Communication locality is an important characteristic of parallel applications, and a great deal of research shows that exploiting it benefits most applications. Targeting communication locality, we present a hierarchical direct network topology to accelerate neighbor communication. Combining a mesh topology with a complete-graph topology, it can optimize local communication and build a large-scale network with low-radix routers. Analyzing the hierarchical topology, we find that it offers high cost-effectiveness and excellent scalability. We also design two minimal-path routing algorithms and compare them with the Mesh, Dragonfly, and PERCS topologies. The results show that the saturated throughput of the hierarchical topology is nearly 40% under uniform random traffic and 70% under a local communication model with 4K nodes. This indicates high scalability for applications with local communication and cost efficiency under uniform random traffic.
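As a rough illustration of how a two-level design of this kind (a complete graph within each group, a mesh between groups) reaches large scale with low-radix routers, the sketch below counts reachable nodes for an assumed port split. The parameters and the radix budget are illustrative assumptions, not figures from the paper.

```python
# Hypothetical sizing sketch for a two-level hierarchical direct network:
# lower level = complete graph of routers within a group, upper level = a
# 2-D mesh of groups. The port split below is an assumption for illustration.

def network_size(radix, nodes_per_router, mesh_dim):
    # Each router spends nodes_per_router ports on terminals and 4 ports
    # on the 2-D mesh; the remaining ports form the intra-group clique.
    group_ports = radix - nodes_per_router - 4
    routers_per_group = group_ports + 1   # complete graph on g routers needs g-1 ports
    groups = mesh_dim * mesh_dim          # mesh_dim x mesh_dim mesh of groups
    return routers_per_group * nodes_per_router * groups

# With radix-16 routers, 4 nodes per router, and an 8x8 mesh of groups,
# a low-radix router still reaches thousands of nodes.
print(network_size(16, 4, 8))  # → 2304
```

The point of the exercise is only that the clique level multiplies scale without raising router radix, which is the property the topology exploits.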


Computing Frontiers | 2014

Accelerating synchronization communications for high-density blade enclosure

Zheng Cao; Fei Chen; Xuejun An; Qiang Li

The high-density blade server provides an attractive solution to the rapidly increasing demand for computing. The degree of parallelism inside a blade enclosure today reaches hundreds of cores, at which scale it is necessary to accelerate synchronization operations. To accelerate intra-enclosure synchronization, this paper proposes a single-chip design, SyncRouter, on the midplane of the blade enclosure. The SyncRouter architecture resembles a microprocessor whose memory system has a tag-value structure; we call this memory system Shared Synchronization Memory (SSM). Both the architecture and the usage of the proposed SyncRouter are introduced in detail. We also build a blade enclosure with the SyncRouter implemented in a Xilinx XC6LX365T FPGA and evaluate it with both micro-benchmarks and application benchmarks. The latency of one ssm_put/ssm_get pair and the minimum latency of ssm_barrier are 0.62 μs, and the minimum latency of ssm_reduce is 0.81 μs. On the 2D Wavefront and LU benchmarks, using the fine-grained synchronization primitives ssm_put/ssm_get outperforms using ssm_barrier by 20% on average.
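A minimal software model may help convey the tag-value semantics behind ssm_put/ssm_get: a put fills a tagged cell, and a get blocks until that tag is full, then consumes it. The real SSM lives in the SyncRouter chip; the class and method shapes below are assumptions for illustration only.

```python
# Toy software model of tag-value ("full/empty") Shared Synchronization
# Memory. Only the blocking semantics are modeled; the hardware details
# in the paper are not reproduced here.
import threading

class SharedSyncMemory:
    def __init__(self):
        self._cells = {}                      # tag -> value ("full" cells)
        self._cond = threading.Condition()

    def ssm_put(self, tag, value):
        # Writing a tag marks it full and wakes any waiting reader.
        with self._cond:
            self._cells[tag] = value
            self._cond.notify_all()

    def ssm_get(self, tag):
        # Reading blocks until the tag is full, then consumes it; this is
        # what makes the pair a fine-grained producer/consumer primitive.
        with self._cond:
            while tag not in self._cells:
                self._cond.wait()
            return self._cells.pop(tag)

ssm = SharedSyncMemory()
threading.Thread(target=lambda: ssm.ssm_put(42, "ready")).start()
print(ssm.ssm_get(42))  # → ready
```

Fine-grained primitives like this let a consumer wait on exactly the datum it needs instead of on a whole barrier, which is consistent with the 20% speedup the abstract reports for ssm_put/ssm_get over ssm_barrier.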


Parallel and Distributed Computing, Applications and Technologies | 2013

cHPP controller: A High Performance Hyper-node Hardware Accelerator

Yong Su; Feilong Liu; Zheng Cao; Zhan Wang; Xiaoli Liu; Xuejun An; Ninghui Sun

The high-density blade server provides an attractive solution to the rapidly increasing demand for computing. The degree of parallelism inside a blade enclosure nowadays reaches hundreds of cores, at which scale it is necessary to accelerate communication inside the enclosure. However, commercial products seldom optimize it in hardware. We propose a hyper-node controller that provides a low-overhead, high-performance interconnection based on PCIe, supporting a global address space, user-level communication, and efficient communication primitives. Efficient sharing of I/O resources is another goal of the design. A prototype of the hyper-node controller is implemented in FPGA. Test results show a lowest latency of only 1.242 μs and a highest bandwidth of 3.19 GB/s, which is almost 99.7% of the theoretical peak bandwidth.


International Conference on Computer Communications and Networks | 2013

Accelerating Allreduce Operation: A Switch-Based Solution

Nongda Hu; Dawei Wang; Zheng Cao; Xuejun An; Ninghui Sun

Collective operations such as allreduce are widely regarded as critical limiting factors in achieving high performance in massively parallel applications. Conventional host-based implementations, which introduce a large number of point-to-point communications, are less efficient in large-scale systems. To address this issue, we propose a switch-chip design that accelerates collective operations, especially allreduce. The major advantage of the proposed solution is its high scalability, since expensive point-to-point communications are avoided. Two kinds of allreduce operations, block-allreduce and burst-allreduce, are implemented for short and long messages, respectively. We evaluated the proposed design with both a cycle-accurate simulator and an FPGA prototype system. The experimental results show that the switch-based allreduce implementation is efficient and scalable, especially in large-scale systems. In the prototype, our switch-based implementation significantly outperforms the host-based one, with a 16x improvement in MPI time on 16 nodes. Furthermore, the simulation shows that, when scaling from 2 to 4096 nodes, the switch-based allreduce latency increases only slightly, by less than 2 μs.
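A back-of-the-envelope message count shows why moving the reduction into the switch scales better than host-based point-to-point schemes. The sketch below compares a recursive-doubling host-based allreduce with an in-network combining tree; the cost model is an illustrative assumption, not the paper's analysis.

```python
# Message-count comparison: host-based recursive doubling needs log2(N)
# rounds of pairwise exchanges, while a combining switch needs only one
# contribution up and one result down per node. Assumed model, for intuition.
import math

def host_based_msgs(n):
    # Recursive doubling: every one of n nodes sends one message per round.
    return n * int(math.log2(n))

def switch_based_msgs(n):
    # Each node sends its operand once and receives the final result once.
    return 2 * n

for n in (16, 4096):
    print(n, host_based_msgs(n), switch_based_msgs(n))
```

Under this model the host-based message count grows as N log N while the switch-based count stays linear, which matches the abstract's observation that avoiding point-to-point exchanges is what makes the switch-based design scale.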


Parallel and Distributed Computing: Applications and Technologies | 2010

HPP Controller: A System Controller Dedicated for Message Passing

Kai Wang; Fei Chen; Zheng Cao; Xuejun An; Ninghui Sun

The traditional system controller in symmetric multi-processors (SMP) controls the memory, so it suits the shared-memory programming model. With the emergence of processors that integrate memory controllers, the system controller seems less important than before. However, the system controller resides at the center of a computer system, acting as an artery that directly connects the processors and the high-speed I/O devices, so making full use of its position can undoubtedly yield performance gains. The message-passing programming model now dominates high performance computing (HPC), yet the system controller contributes little to it. This paper therefore presents a system controller, the HPP controller, dedicated to the message-passing programming model. The HPP controller connects to several processors simultaneously, and communication between these processors uses message passing. The controller embeds powerful DMA engines that provide flexible and sufficient message-passing capability. Two key techniques, support for arbitrary byte alignment and virtualization of the DMA engine, are introduced in detail. Preliminary results on an FPGA prototype show that the HPP controller has ultra-low hardware latency and relatively high bandwidth. Moreover, NPB results show that it provides high efficiency for the message-passing programming model.


High Performance Computing and Communications | 2010

Adding an Expressway to Accelerate the Neighborhood Communication

Kai Wang; Fei Chen; Zheng Cao; Xuejun An; Ninghui Sun

The blade system is very popular in high performance computing. In a blade system, the fundamental element is the blade, which contains symmetric multi-processors (SMP). About ten blades constitute a blade box, several blade boxes constitute a cabinet, and some cabinets constitute a blade system. The blades in a blade box are neighbors because the distances between them are relatively short, so programmers try to place tightly related processes into the same blade box. However, hardware seldom offers any optimization to accelerate communication within a blade box. We therefore present a single-chip design called the hyper-node controller, which provides ultra-low latency and high bandwidth, resembling an expressway between neighbors. Using the hyper-node controller, all the nodes in a blade box can act as a single hyper node. The additional controller is thus a useful supplement that efficiently enhances communication in a blade box and, ultimately, the entire blade system. An FPGA prototype of the hyper-node controller has been implemented; it can connect five blades simultaneously. In a preliminary performance evaluation, the latency for an 8-byte payload between two blades is less than 1 μs, and 1.33 GB/s, nearly 94% of the peak effective bandwidth, is obtained when transferring messages with a payload of only 256 bytes.


Frontiers of Computer Science in China | 2010

HPP controller: a system controller for high performance computing

Fei Chen; Zheng Cao; Kai Wang; Xuejun An; Ninghui Sun

This paper introduces the design of the hyper parallel processing (HPP) controller, a system controller used in heterogeneous high performance computing systems. It connects several heterogeneous processors via HyperTransport (HT) interfaces, a commercial InfiniBand HCA card via a PCI Express interface, and a customized global synchronization network via a self-defined high-speed interface. To accelerate intra-node communication and synchronization, a global address space is supported and dedicated hardware is integrated into the HPP controller to enable intra-node memory sharing and shared I/O resources. On a prototype system with the HPP controller, evaluation results show that the proposed design achieves high communication efficiency and significant acceleration of synchronization operations.

Collaboration


Dive into Xuejun An's collaborations.

Top Co-Authors

Zheng Cao (Chinese Academy of Sciences)
Ninghui Sun (Chinese Academy of Sciences)
Kai Wang (Chinese Academy of Sciences)
Fei Chen (Chinese Academy of Sciences)
Zhan Wang (Chinese Academy of Sciences)
Xiaoli Liu (Chinese Academy of Sciences)
Yong Su (Chinese Academy of Sciences)
Dawei Wang (Illinois Institute of Technology)
Qiang Li (Chinese Academy of Sciences)
Xiaomin Li (Chinese Academy of Sciences)