
Publication


Featured research published by Xiufeng Sui.


Architectural Support for Programming Languages and Operating Systems | 2015

Supporting Differentiated Services in Computers via Programmable Architecture for Resourcing-on-Demand (PARD)

Jiuyue Ma; Xiufeng Sui; Ninghui Sun; Yupeng Li; Zihao Yu; Bowen Huang; Tianni Xu; Zhicheng Yao; Yun Chen; Haibin Wang; Lixin Zhang; Yungang Bao

This paper presents PARD, a programmable architecture for resourcing-on-demand that provides a new programming interface to convey an application's high-level information, such as quality-of-service requirements, to the hardware. PARD enables new functionalities like fully hardware-supported virtualization and differentiated services in computers. PARD is inspired by the observation that a computer is inherently a network in which hardware components communicate via packets (e.g., over the NoC or PCIe). We apply principles of software-defined networking to this intra-computer network and address three major challenges. First, to deal with the semantic gap between high-level applications and underlying hardware packets, PARD attaches a high-level semantic tag (e.g., a virtual machine or thread ID) to each memory-access, I/O, or interrupt packet. Second, to make hardware components more manageable, PARD implements programmable control planes that can be integrated into various shared resources (e.g., cache, DRAM, and I/O devices) and can differentially process packets according to tag-based rules. Third, to facilitate programming, PARD abstracts all control planes as a device file tree to provide a uniform programming interface via which users create and apply tag-based rules. Full-system simulation results show that by co-locating latency-critical memcached applications with other workloads, PARD can improve a four-core computer's CPU utilization by up to a factor of four without significantly increasing tail latency. FPGA emulation based on a preliminary RTL implementation demonstrates that the cache control plane introduces no extra latency and that the memory control plane can reduce queueing delay for high-priority memory-access requests by up to a factor of 5.6.
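The tag-and-rule mechanism the abstract describes can be sketched in miniature. This is a hedged illustration only: the class and field names (`Packet`, `ControlPlane`, `install_rule`) are assumptions for exposition, not PARD's actual RTL or programming interface.

```python
# Sketch: a PARD-style control plane that differentially processes
# packets according to tag-based rules. Illustrative names throughout.
from dataclasses import dataclass, field

@dataclass
class Packet:
    tag: int                 # high-level semantic tag, e.g. a VM or thread ID
    payload: bytes = b""

class ControlPlane:
    """Programmable control plane embedded in a shared resource
    (cache, DRAM controller, I/O device)."""
    def __init__(self):
        self.rules = {}                      # tag -> treatment parameters
        self.default = {"priority": 0}       # unmatched tags get best-effort

    def install_rule(self, tag, **params):
        # In PARD's model, users would apply such rules via the
        # device-file-tree interface; here it is a plain method call.
        self.rules[tag] = params

    def classify(self, pkt):
        # Look up the packet's tag to decide how the resource treats it.
        return self.rules.get(pkt.tag, self.default)

cp = ControlPlane()
cp.install_rule(7, priority=3, cache_ways=8)   # latency-critical VM 7
hi = cp.classify(Packet(tag=7))
lo = cp.classify(Packet(tag=2))
print(hi["priority"], lo["priority"])  # 3 0
```

The point of the sketch is the separation the paper emphasizes: packets carry only a tag, and all policy lives in rule tables that software can reprogram without touching the data path.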


High Performance Computing and Communications | 2013

Rethinking Virtual Machine Interference in the Era of Cloud Applications

Tianni Xu; Xiufeng Sui; Zhicheng Yao; Jiuyue Ma; Yungang Bao; Lixin Zhang

Data centers are increasingly employing virtualization as a means to ensure performance isolation for latency-sensitive applications while allowing co-location of multiple applications. Previous research has shown that virtualization could offer excellent resource isolation. However, whether virtualization can mitigate interference among micro-architectural resources has not been well studied. This paper presents an in-depth analysis of the performance isolation effect of virtualization technology on various micro-architectural resources (i.e., L1 D-Cache, L2 Cache, last level cache (LLC), hardware prefetchers, and Non-Uniform Memory Access) by mapping the CloudSuite benchmarks to different sockets, different cores of one chip, and different threads of one core. For each resource, we investigate the correlation between performance variations and contention by changing VM mapping policies according to different application characteristics. Our experiments show that virtualization has rather limited micro-architectural isolation effects. Specifically, LLC interference can degrade applications' performance by as much as 28%. When it comes to intra-core resources, applications' performance degradation can be as much as 27%. Additionally, we outline several opportunities to improve performance by reducing interference from misbehaving VMs.


Journal of Computer Science and Technology | 2015

Exploring Heterogeneous NoC Design Space in Heterogeneous GPU-CPU Architectures

Juan Fang; Zhenyu Leng; Sitong Liu; Zhicheng Yao; Xiufeng Sui

Computer architecture is transitioning from the multicore era into the heterogeneous era, in which heterogeneous architectures use on-chip networks to access shared resources, and how a network is configured will likely have a significant impact on overall performance and power consumption. Recently, the heterogeneous network-on-chip (NoC) has been proposed not only to achieve performance comparable to that of NoCs with buffered routers but also to reduce buffer cost and energy consumption. However, heterogeneous NoC design for heterogeneous GPU-CPU architectures has not been studied in depth. This paper first evaluates the performance and power consumption of a variety of static hot-potato based heterogeneous NoCs with different buffered and bufferless router placements, which is helpful for exploring the design space of heterogeneous GPU-CPU interconnection. Then it proposes Unidirectional Flow Control (UFC), a simple credit-based flow control mechanism for heterogeneous NoCs in GPU-CPU architectures to control network congestion. UFC can guarantee that there are always unoccupied entries in buffered routers to receive flits coming from adjacent bufferless routers. Our evaluations show that, compared to hot-potato routing, UFC improves performance by an average of 14.1% with energy increased by an average of only 5.3%.
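The UFC invariant, that a flit is only launched toward a buffered router when a free buffer entry is guaranteed, can be sketched as a small credit check. The class and method names below are illustrative assumptions, not the paper's design.

```python
# Sketch: UFC-style credit-based flow control between a bufferless router
# and a buffered neighbor. Names are illustrative, not from the paper.

class BufferedRouter:
    def __init__(self, depth):
        self.depth = depth           # number of buffer entries
        self.buf = []

    def credits(self):
        # Free entries, advertised back to adjacent bufferless routers.
        return self.depth - len(self.buf)

    def accept(self, flit):
        # UFC invariant: a flit is never sent without a free entry waiting.
        assert self.credits() > 0, "flit launched without a credit"
        self.buf.append(flit)

class BufferlessRouter:
    def forward(self, flit, downstream):
        # Launch toward the buffered neighbor only when an entry is
        # guaranteed free; otherwise fall back to hot-potato deflection.
        if downstream.credits() > 0:
            downstream.accept(flit)
            return "delivered"
        return "deflected"

br = BufferedRouter(depth=2)
blr = BufferlessRouter()
results = [blr.forward(f, br) for f in ("a", "b", "c")]
print(results)  # ['delivered', 'delivered', 'deflected']
```

Once the two credits are consumed, the third flit is deflected instead of overrunning the buffer, which is the congestion-control behavior the abstract attributes to UFC.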


IEEE Transactions on Computers | 2016

WBSP: A Novel Synchronization Mechanism for Architecture Parallel Simulation

Junmin Wu; Xiaodong Zhu; Tao Li; Xiufeng Sui

Parallelization is an efficient approach to accelerate multi-core, multi-processor, and cluster architecture simulators. Nevertheless, frequent synchronization can significantly hinder the performance of a parallel simulator. A common practice for alleviating synchronization cost is to relax synchronization using lengthened synchronous steps. However, as a side effect, simulation accuracy deteriorates considerably. By analyzing the various factors contributing to causality error in lax synchronization, we observe that a coherent speed across all nodes is critical to achieving high accuracy. To this end, we propose wall-clock based synchronization (WBSP), a novel mechanism that uses wall-clock time to maintain a coherent running speed across the different nodes by periodically synchronizing simulated clocks with the wall clock within each lax step. Our proposed method results in only a modest precision loss while achieving performance close to lax synchronization. We implement WBSP in a many-core parallel simulator and a cluster parallel simulator. Experimental results show that at a scale of 32 host threads, it improves the performance of the many-core simulator by 4.3× on average with less than a 5.5 percent accuracy loss compared to the conservative mechanism. On the cluster simulator with 64 nodes, our proposed scheme achieves an 8.3× speedup compared to the conservative mechanism while yielding only a 1.7 percent accuracy loss. Meanwhile, WBSP outperforms the recently proposed adaptive mechanism on simulations that exhibit heavy traffic.
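The core WBSP idea, pacing each node's simulated clock against the wall clock so a fast node cannot run ahead of slower ones, can be sketched for a single node. This is a minimal illustration under assumed parameter names, not the simulator's implementation.

```python
# Sketch: WBSP-style pacing of one node's simulated clock against the
# wall clock. Parameter names are illustrative assumptions.
import time

def wbsp_step(sim_cycles, cycles_per_second, sync_period):
    """Advance the simulated clock, re-aligning with the wall clock every
    sync_period cycles so simulated time never outruns scaled wall time."""
    start = time.monotonic()
    simulated = 0
    while simulated < sim_cycles:
        simulated += sync_period
        # Wall-clock time this node *should* have consumed so far.
        target = simulated / cycles_per_second
        elapsed = time.monotonic() - start
        if elapsed < target:
            # A fast node stalls until the wall clock catches up;
            # slow nodes skip the sleep and simply keep simulating.
            time.sleep(target - elapsed)
    return time.monotonic() - start

# 10k simulated cycles at a nominal 1 MHz should take ~10 ms of wall time.
elapsed = wbsp_step(sim_cycles=10_000, cycles_per_second=1_000_000,
                    sync_period=1_000)
print(f"wall time: {elapsed:.3f}s")
```

Because every node paces itself against the same wall clock, their simulated clocks stay coherent without exchanging synchronization messages at every step, which is how WBSP keeps causality error low inside long lax steps.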


International Workshop on Quality of Service | 2014

QBLESS: A case for QoS-aware bufferless NoCs

Zhicheng Yao; Xiufeng Sui; Tianni Xu; Jiuyue Ma; Juan Fang; Sally A. McKee; Binzhang Fu; Yungang Bao

Datacenters consolidate diverse applications to improve utilization. However, when multiple applications are co-located on such platforms, contention for shared resources like Networks-on-Chip (NoCs) can degrade the performance of latency-critical online services (high-priority applications). Recently proposed bufferless NoCs have the advantages of requiring less area and power, but they pose challenges in quality-of-service (QoS) support, which usually relies on buffer-based virtual channels (VCs). We propose QBLESS, a QoS-aware bufferless NoC scheme for datacenters. QBLESS consists of two components: a routing mechanism (QBLESS-R) that can substantially reduce flit deflection for high-priority applications, and a congestion-control mechanism (QBLESS-CC) that guarantees performance for high-priority applications and improves overall system throughput. We use trace-driven simulation to model a 64-core system, finding that when compared to BLESS, a previous state-of-the-art bufferless NoC design, QBLESS improves performance of high-priority applications by an average of 33.2%.
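The way priority-aware arbitration reduces deflection for high-priority flits can be sketched as follows. This is a simplified assumption for illustration, not QBLESS-R's actual algorithm: flits are ranked by priority and then age, winners claim their productive output port, and losers are deflected to whatever ports remain.

```python
# Sketch: priority-aware port arbitration in a bufferless router,
# in the spirit of QBLESS-R. Simplified assumption, not the paper's design.

def arbitrate(flits, productive_ports, num_ports):
    """Assign each flit an output port. High-priority (then older) flits
    choose first, so they are deflected far less often."""
    order = sorted(range(len(flits)),
                   key=lambda i: (-flits[i]["prio"], -flits[i]["age"]))
    free = set(range(num_ports))
    assigned = {}
    for i in order:
        want = productive_ports[i]          # port toward the destination
        if want in free:
            assigned[i] = want              # productive hop
        else:
            assigned[i] = min(free)         # deflected to a remaining port
        free.discard(assigned[i])
    return assigned

# Two flits both want port 0; the high-priority one wins, the other deflects.
flits = [{"prio": 0, "age": 9}, {"prio": 2, "age": 1}]
out = arbitrate(flits, productive_ports=[0, 0], num_ports=4)
print(out)  # {1: 0, 0: 1}
```

In a bufferless NoC every flit must leave the router each cycle, so arbitration order is the only lever for QoS; ranking by priority is what channels deflections toward best-effort traffic.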


Computing Frontiers | 2013

DCNSim: a unified and cross-layer computer architecture simulation framework for data center network research

Nongda Hu; Binzhang Fu; Xiufeng Sui; Long Li; Tao Li; Lixin Zhang

Within today's large-scale data centers, inter-node communication is often the major bottleneck. This has recently spurred data center network (DCN) research. Since building a real data center is cost prohibitive, most DCN studies rely on simulations. Unfortunately, state-of-the-art network simulators have limited support for real-world applications, which prevents researchers from first-hand investigation. To address this issue, we developed a unified and cross-layer simulation framework, named DCNSim. By leveraging two widely deployed simulators, DCNSim introduces computer architecture solutions into DCN research. With DCNSim, one can run packet-level network simulation driven by commercial applications while varying computer and network parameters, such as CPU frequency, memory access latency, network topology, and protocols. With extensive validation, we show that DCNSim can accurately capture performance trends caused by changing computer and network parameters. Finally, we argue through several case studies that future DCN research should consider computer architecture factors.


International Middleware Conference | 2016

Understanding the Behavior of Spark Workloads from Linux Kernel Parameters Perspective

Li Wang; Tianni Xu; Jing Wang; Weigong Zhang; Xiufeng Sui; Yungang Bao

Although a number of innovative computer systems with high-capacity memory have been built, the design principles behind an operating system kernel have remained unchanged for decades. We argue that kernel parameters are a special interface of the operating system and must be factored into the operation and maintenance of datacenters. To shed some light on the effectiveness of tuning the Linux parameters of the virtual memory subsystem when running Spark workloads, we evaluate the benchmarks in a simple standalone deploy mode. Our performance results reveal that some of the Linux memory parameters must be carefully set to efficiently support these processing workloads. We hope this work yields insights for datacenter system operators.


The Scientific World Journal | 2014

Ephedrine QoS: An Antidote to Slow, Congested, Bufferless NoCs

Juan Fang; Zhicheng Yao; Xiufeng Sui; Yungang Bao

Datacenters consolidate diverse applications to improve utilization. However, when multiple applications are co-located on such platforms, contention for shared resources like networks-on-chip (NoCs) can degrade the performance of latency-critical online services (high-priority applications). Recently proposed bufferless NoCs (Nychis et al.) have the advantages of requiring less area and power, but they pose challenges in quality-of-service (QoS) support, which usually relies on buffer-based virtual channels (VCs). We propose QBLESS, a QoS-aware bufferless NoC scheme for datacenters. QBLESS consists of two components: a routing mechanism (QBLESS-R) that can substantially reduce flit deflection for high-priority applications, and a congestion-control mechanism (QBLESS-CC) that guarantees performance for high-priority applications and improves overall system throughput. We use trace-driven simulation to model a 64-core system, finding that, when compared to BLESS, a previous state-of-the-art bufferless NoC design, QBLESS improves performance of high-priority applications by an average of 33.2% and reduces network hops by an average of 42.8%.


International Symposium on Performance Analysis of Systems and Software | 2013

Understanding the implications of virtual machine management on processor microarchitecture design

Xiufeng Sui; Tao Sun; Tao Li; Lixin Zhang

Cloud computing has demonstrated tremendous capability in a wide spectrum of online services. Virtualization provides an efficient solution for utilizing modern multicore processor systems while affording significant flexibility. The growing popularity of virtualized datacenters motivates a deeper understanding of the interactions between virtual machine management and the micro-architectural behavior of the privileged domain. We argue that these behaviors must be factored into the design of processor microarchitecture in virtualized datacenters. In this work, we use performance counters on modern servers to study the micro-architectural execution characteristics of the privileged domain while performing various VM management operations. Our study shows that today's state-of-the-art processors still have room for further optimization when executing virtualized cloud workloads, particularly in the organization of last level caches and the on-chip cache coherence protocol. Specifically, our analysis shows that: shared caches could be partitioned to eliminate interference between the privileged domain and guest domains; the cache coherence protocol could support a high degree of data sharing in the privileged domain; and the cache capacity or CPU utilization occupied by the privileged domain could be effectively managed when performing management workflows to achieve high system throughput.


Archive | 2012

Solid state disk and access method

Xiufeng Sui; Long Li; Lixin Zhang

Collaboration


Dive into Xiufeng Sui's collaboration.

Top Co-Authors

Yungang Bao, Chinese Academy of Sciences
Zhicheng Yao, Chinese Academy of Sciences
Lixin Zhang, Chinese Academy of Sciences
Jiuyue Ma, Chinese Academy of Sciences
Juan Fang, Chinese Academy of Sciences
Tianni Xu, Chinese Academy of Sciences
Tao Li, University of Florida
Binzhang Fu, Chinese Academy of Sciences
Long Li, Chinese Academy of Sciences
Sitong Liu, Beijing University of Technology