Is this you? Create Your Porfile

Sheng Ma

National University of Defense Technology

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Sheng Ma is active.

Explore More

Publication

Featured researches published by Sheng Ma.

international symposium on computer architecture | 2011

DBAR: an efficient routing algorithm to support multiple concurrent applications in networks-on-chip

Sheng Ma; Natalie D. Enright Jerger; Zhiying Wang

With the emergence of many-core architectures, it is quite likely that multiple applications will run concurrently on a system. Existing locally and globally adaptive routing algorithms largely overlook issues associated with workload consolidation. The shortsightedness of locally adaptive routing algorithms limits performance due to poor network congestion avoidance. Globally adaptive routing algorithms attack this issue by introducing a congestion propagation network to obtain network status information beyond neighboring nodes. However, they may suffer from intra- and inter-application interference during output port selection for consolidated workloads, coupling the behavior of otherwise independent applications and negatively affecting performance. To address these two issues, we propose Destination-Based Adaptive Routing (DBAR). We design a novel low-cost congestion propagation network that leverages both local and non-local network information for more accurate congestion estimates. Thus, DBAR offers effective adaptivity for congestion beyond neighboring nodes. More importantly, by integrating the destination into the selection function, DBAR mitigates intra- and inter-application interference and offers dynamic isolation among regions. Experimental results show that DBAR can offer better performance than the best baseline algorithm for all measured configurations; it is well suited for workload consolidation. The wiring overhead of DBAR is low and DBAR provides improvement in the energy-delay product for medium and high injection rates.

high performance computer architecture | 2012

Whole packet forwarding: Efficient design of fully adaptive routing algorithms for networks-on-chip

Sheng Ma; Natalie D. Enright Jerger; Zhiying Wang

Routing algorithms for networks-on-chip (NoCs) typically only have a small number of virtual channels (VCs) at their disposal. Limited VCs pose several challenges to the design of fully adaptive routing algorithms. First, fully adaptive routing algorithms based on previous deadlock-avoidance theories require a conservative VC re-allocation scheme: a VC can only be re-allocated when it is empty, which limits performance. We propose a novel VC re-allocation scheme, whole packet forwarding (WPF), which allows a non-empty VC to be re-allocated. WPF leverages the observation that the majority of packets in NoCs are short. We prove that WPF does not induce deadlock if the routing algorithm is deadlock-free using conservative VC re-allocation. WPF is an important extension of previous deadlock-avoidance theories. Second, to efficiently utilize WPF in VC-limited networks, we design a novel fully adaptive routing algorithm which maintains packet adaptivity without significant hardware cost. Compared with conservative VC re-allocation, WPF achieves an average 88.9% saturation throughput improvement in synthetic traffic patterns and an average 21.3% and maximal 37.8% speedup for PARSEC applications with heavy network loads. Our design also offers higher performance than several partially adaptive and deterministic routing algorithms.

high performance computer architecture | 2012

Supporting efficient collective communication in NoCs

Sheng Ma; Natalie D. Enright Jerger; Zhiying Wang

Across many architectures and parallel programming paradigms, collective communication plays a key role in performance and correctness. Hardware support is necessary to prevent important collective communication from becoming a system bottleneck. Support for multicast communication in Networks-on-Chip (NoCs) has achieved substantial throughput improvements and power savings. In this paper, we explore support for reduction or many-to-one communication operations. As a case study, we focus on acknowledgement messages (ACK) that must be collected in a directory protocol before a cache line may be upgraded to or installed in the modified state. This paper makes two primary contributions: an efficient framework to support the reduction of ACK packets and a novel Balanced, Adaptive Multicast (BAM) routing algorithm. The proposed message combination framework complements several multicast algorithms. By combining ACK packets during transmission, this framework not only reduces packet latency by 14.1% for low-to-medium network loads, but also improves the network saturation throughput by 9.6% with little overhead. The balanced buffer resource configuration of BAM improves the saturation throughput by an additional 13.8%. For the PARSEC benchmarks, our design offers an average speedup of 12.7% and a maximal speedup of 16.8%.

IEEE Transactions on Computers | 2012

Low-Cost Binary128 Floating-Point FMA Unit Design with SIMD Support

Libo Huang; Sheng Ma; Li Shen; Zhiying Wang; Nong Xiao

Binary64 arithmetic is rapidly becoming inadequate to cope with todays large-scale computations due to an accumulation of errors. Therefore, binary128 arithmetic is now required to increase the accuracy and reliability of these computations. At the same time, an obvious trend emerging in modern processors is to extend their instruction sets by allowing single instruction multiple data (SIMD) execution, which can significantly accelerate the data-parallel applications. To address the combined demands mentioned above, this paper presents the architecture of a low-cost binary128 floating-point fused multiply add (FMA) unit with SIMD support. The proposed FMA design can execute a binary128 FMA every other cycle with a latency of four cycles, or two binary64 FMAs fully pipelined with a latency of three cycles, or four binary32 FMAs fully pipelined with a latency of three cycles. We use two binary64 FMA units to support binary128 FMA which requires much less hardware than a fully pipelined binary128 FMA. The presented binary128 FMA design uses both segmentation and iteration hardware vectorization methods to trade off performance, such as throughput and latency, against area and power. Compared with a standard binary128 FMA implementation, the proposed FMA design has 30 percent less area and 29 percent less dynamic power dissipation.

IEEE Transactions on Computers | 2015

Leaving One Slot Empty: Flit Bubble Flow Control for Torus Cache-Coherent NoCs

Sheng Ma; Zhiying Wang; Zonglin Liu Liu; Natalie D. Enright Jerger

Short and long packets co-exist in cache-coherent NoCs. Existing designs for torus networks do not efficiently handle variable-size packets. For deadlock free operations, a design uses two VCs, which negatively affects the router frequency. Some optimizations use one VC. Yet, they regard all packets as maximum-length packets, inefficiently utilizing the precious buffers. We propose flit bubble flow control (FBFC), which maintains one free flit-size buffer slot to avoid deadlock. FBFC uses one VC, and does not treat short packets as long ones. It achieves both high frequency and efficient buffer utilization. FBFC performs 92.8 and 34.2 percent better than LBS and CBS for synthetic traffic in a 4 × 4 torus. The gains increase in larger networks; they are 107.2 and 40.1 percent in an 8 × 8 torus. FBFC achieves an average 13.0 percent speedup over LBS for PARSEC workloads. Our results also show that FBFC is more power efficient than LBS and CBS, and a torus with FBFC is more power efficient than a mesh.

IEEE Transactions on Parallel and Distributed Systems | 2014

Novel Flow Control for Fully Adaptive Routing in Cache-coherent NoCs

Sheng Ma; Zhiying Wang; Natalie D. Enright Jerger; Li Shen; Nong Xiao

Routing algorithms for cache-coherent NoCs only have limited VCs at their disposal, which poses challenges to the design of routing algorithms. Existing fully adaptive routing algorithms apply conservative VC re-allocation: only empty VCs can be re-allocated, which limits performance. We propose two novel flow control designs. First, whole packet forwarding (WPF) re-allocates a nonempty VC if the VC has enough free buffers for an entire packet. WPF does not induce deadlock if the routing algorithm is deadlock-free using conservative VC re-allocation. It is an important extension to several deadlock avoidance theories. Second, we extend Duatos theory to apply aggressive VC re-allocation on escape VCs without deadlock. Finally, we propose a design which maintains maximal routing flexibility with low hardware cost. For synthetic traffic, our design performs averagely 88.9 percent better than existing fully adaptive routing. Our design is superior to partially adaptive and deterministic routing.

high-performance computer architecture | 2010

SIF: Overcoming the limitations of SIMD devices via implicit permutation

Libo Huang; Li Shen; Zhiying Wang; Wei Shi; Nong Xiao; Sheng Ma

SIMD devices have gained widespread acceptance in modern microprocessor designs for their superior performance for multimedia applications. However, there are three remaining limitations to the efficient utilization of SIMD devices in general-purpose computer systems: memory alignment, data reorganization and control flow. This paper presents SIF, an efficient SIMD interface framework that addresses these three shortcomings without modifying existing ISA. It is designed around a permutation vector register file (PVRF) and it adds new extended instructions to set internal permutation state in SIMD datapath rather than putting the permutation state setting bits in every instruction. The implicit permutation capability provided by PVRF results in zero overhead, which frees the handling of three limitations by using permutation instructions. To further reduce the state setting instructions in SIMD datapath, a technique that moves the workloads from SIMD pipeline into scalar pipeline is also introduced. With the help of proposed compilation algorithm, SIF can efficiently transform regular SIMD codes into SIF codes which make it easily integrated in all existing SIMD devices. We implemented these techniques in a vectorizing compiler and experimental results show that most of the permutation overhead instructions can be eliminated and distinct performance speedup can be achieved, which is 37% higher than current SIMD techniques on average.

international conference on asic | 2009

DM-SIMD: A new SIMD predication mechanism for exploiting superword level parallelism

Libo Huang; Li Shen; Sheng Ma; Nong Xiao; Zhiying Wang

Predication mechanism is a promising architectural feature for exploiting superword level parallelism (SLP) in presence of control flow. However, for the sake of binary compatibility, current SIMD extension only supports partial predicated execution such as select method which has performance and safety problems. In this paper, we present a new SIMD predication mechanism, data masked SIMD (DM-SIMD), capable of supporting full predication without touching existing ISA. DM-SIMD avoids the high encoding overhead of traditional full predication, and eliminates safety problem raised by partial predication as well. The cornerstone of this mechanism is the “state change” idea which adds new instructions to set internal state in SIMD datapath rather than putting the VM setting bits in every SIMD instruction. To effectively use DM-SIMD facilities for SIMD code generation, the compilation strategies are also proposed. We implemented these techniques in a vectorizing compiler and experiments were conducted on various kinds of applications. The results show that performance speedup, about 20% higher than current SIMD extensions, can be achieved.

IEEE Transactions on Computers | 2014

Holistic Routing Algorithm Design to Support Workload Consolidation in NoCs

Sheng Ma; Natalie D. Enright Jerger; Zhiying Wang; Mingche Lai; Libo Huang

To provide efficient, high-performance routing algorithms, a holistic approach should be taken. The key aspects of routing algorithm design include adaptivity, path selection strategy, VC allocation, isolation, and hardware implementation cost; these design aspects are not independent. The key contribution of this work lies in the design of a novel selection strategy, Destination-Based Selection Strategy (DBSS), which targets interference that can arise in many-core systems running consolidation workloads. In the process of this design, we holistically consider all aspects to ensure an efficient design. Existing routing algorithms largely overlook issues associated with workload consolidation. Locally adaptive algorithms do not consider enough status information to avoid network congestion. Globally adaptive routing algorithms attack this issue by utilizing network status beyond neighboring nodes. However, they may suffer from interference, coupling the behavior of otherwise independent applications. To address these issues, DBSS leverages both local and nonlocal network status to provide more effective adaptivity. More importantly, by integrating the destination into the selection procedure, DBSS mitigates interference and offers dynamic isolation among applications. Results show that DBSS offers better performance than the best baseline selection strategy and improves the energy-delay product for medium and high injection rates; it is well suited for workload consolidation.

Microprocessors and Microsystems | 2011

A practical low-latency router architecture with wing channel for on-chip network

Mingche Lai; Lei Gao; Sheng Ma; Xiao Nong; Zhiying Wang

With increasing number of cores, the communication latency of Network-on-Chip becomes a dominant problem due to complex operations per node. In this paper, we try to reduce communication latency by proposing single-cycle router architecture with wing channel, which forwards the incoming packets to free ports immediately with the inspection of switch allocation results. Also, the incoming packets granted with wing channel can fill in the time-slots of crossbar switch and reduce the contentions with subsequent ones, thereby pushing throughput effectively. We design the proposed router using 65nm CMOS process, and the results show that it supports different routing schemes and outperforms express virtual channel, prediction and Kumars single-cycle ones in terms of latency and throughput. When compared to the speculative router, it provides 45.7% latency reduction and 14.0% throughput improvement. Moreover, we show that the proposed design incurs a modest area overhead of 8.1% but the power consumption is saved by 7.8% due to less arbitration activities.

Explore More