Libo Huang

National University of Defense Technology

Publication

Featured research published by Libo Huang.

IEEE Transactions on Computers | 2012

Low-Cost Binary128 Floating-Point FMA Unit Design with SIMD Support

Libo Huang; Sheng Ma; Li Shen; Zhiying Wang; Nong Xiao

Binary64 arithmetic is rapidly becoming inadequate for today's large-scale computations because of accumulated errors. Binary128 arithmetic is therefore needed to increase the accuracy and reliability of these computations. At the same time, modern processors increasingly extend their instruction sets with single instruction multiple data (SIMD) execution, which can significantly accelerate data-parallel applications. To address both demands, this paper presents the architecture of a low-cost binary128 floating-point fused multiply-add (FMA) unit with SIMD support. The proposed FMA design can execute a binary128 FMA every other cycle with a latency of four cycles, or two binary64 FMAs fully pipelined with a latency of three cycles, or four binary32 FMAs fully pipelined with a latency of three cycles. We use two binary64 FMA units to support binary128 FMA, which requires much less hardware than a fully pipelined binary128 FMA. The presented binary128 FMA design uses both segmentation and iteration hardware vectorization methods to trade off performance, such as throughput and latency, against area and power. Compared with a standard binary128 FMA implementation, the proposed FMA design has 30 percent less area and 29 percent less dynamic power dissipation.
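The throughput and latency figures quoted in the abstract can be turned into a rough cycle count. The following is a minimal sketch, not the paper's model: it assumes independent FMAs, one issue slot, and that a result is available `latency` cycles after issue. The names `fma_cycles`, `MODES`, and `cycles_for` are hypothetical.

```python
from math import ceil

def fma_cycles(n_ops, lanes, latency, issue_interval):
    """Cycles until the last of n_ops independent FMAs completes.

    Assumed model: `lanes` FMAs issue together, a new group can issue every
    `issue_interval` cycles, and each group finishes `latency` cycles later.
    """
    if n_ops == 0:
        return 0
    issues = ceil(n_ops / lanes)
    return latency + (issues - 1) * issue_interval

# Parameters taken from the abstract; the dictionary layout is illustrative.
MODES = {
    "binary128": dict(lanes=1, latency=4, issue_interval=2),  # every other cycle
    "binary64": dict(lanes=2, latency=3, issue_interval=1),   # fully pipelined
    "binary32": dict(lanes=4, latency=3, issue_interval=1),   # fully pipelined
}

def cycles_for(mode, n_ops):
    return fma_cycles(n_ops, **MODES[mode])
```

Under these assumptions, eight independent operations take 18 cycles in binary128 mode, 6 in binary64 mode, and 4 in binary32 mode, which shows the throughput that the iterated binary128 path gives up in exchange for area.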

high-performance computer architecture | 2010

SIF: Overcoming the limitations of SIMD devices via implicit permutation

Libo Huang; Li Shen; Zhiying Wang; Wei Shi; Nong Xiao; Sheng Ma

SIMD devices have gained widespread acceptance in modern microprocessor designs for their superior performance in multimedia applications. However, three limitations still hinder the efficient use of SIMD devices in general-purpose computer systems: memory alignment, data reorganization and control flow. This paper presents SIF, an efficient SIMD interface framework that addresses these three shortcomings without modifying the existing ISA. It is designed around a permutation vector register file (PVRF), and it adds new extended instructions that set internal permutation state in the SIMD datapath rather than putting permutation state setting bits in every instruction. The implicit permutation capability provided by the PVRF incurs zero overhead, so the three limitations can be handled without explicit permutation instructions. To further reduce the number of state setting instructions in the SIMD datapath, a technique that moves workloads from the SIMD pipeline into the scalar pipeline is also introduced. With the help of the proposed compilation algorithm, SIF can efficiently transform regular SIMD code into SIF code, which makes it easy to integrate into all existing SIMD devices. We implemented these techniques in a vectorizing compiler, and experimental results show that most of the permutation overhead instructions can be eliminated and that a clear speedup is achieved, on average 37% higher than with current SIMD techniques.

IET Computers and Digital Techniques | 2009

Optimal subgraph covering for customisable VLIW processors

Yashuai Lu; Li Shen; Libo Huang; Zhiying Wang; Nong Xiao

It is increasingly common to see single-issue general purpose processors (GPPs) combined with extensible VLIW processors in embedded system designs. Compared with GPPs, extensible VLIW processors can exploit instruction-level parallelism, and they are more suitable for computation-intensive tasks. Moreover, they allow instruction-set extensions (ISEs) to be customised for an application domain. Previous work shows that automated extension generation can greatly improve both the performance and the design efficiency of instruction-set extensible processors. One of the key steps of automated extension generation is subgraph selection. Since this problem is at least NP-hard, most previous works rely on greedy approaches to address it, whereas an optimal subgraph mapping methodology that customises ISEs for multi-issue/VLIW extensible processors is presented here. Several effective pruning techniques are proposed to keep the methodology tractable, and the optimal method performs 41.02% better than the greedy method on average. Besides the optimal subgraph covering methodology, several techniques are also proposed to reduce the area burden that ISEs impose on the processor.
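The gap between greedy and optimal selection can be illustrated on a toy problem. The sketch below is not the paper's algorithm: it assumes a simplified setting in which each candidate subgraph is a set of dataflow-graph nodes with a fixed gain, selected candidates must not share nodes, and the only pruning rule is an upper bound on the remaining gain. The function names and the example candidates are hypothetical.

```python
def greedy_cover(candidates):
    """Pick candidates in order of decreasing gain, skipping overlaps."""
    used, total = set(), 0
    for nodes, gain in sorted(candidates, key=lambda c: -c[1]):
        if not (nodes & used):
            used |= nodes
            total += gain
    return total

def optimal_cover(candidates):
    """Exhaustive search with a simple upper-bound pruning rule."""
    cands = sorted(candidates, key=lambda c: -c[1])
    # suffix[i] = total gain of cands[i:], an optimistic bound on what is left.
    suffix = [0] * (len(cands) + 1)
    for i in range(len(cands) - 1, -1, -1):
        suffix[i] = suffix[i + 1] + cands[i][1]
    best = 0

    def search(i, used, total):
        nonlocal best
        best = max(best, total)
        # Prune: even taking every remaining candidate cannot beat `best`.
        if i == len(cands) or total + suffix[i] <= best:
            return
        nodes, gain = cands[i]
        if not (nodes & used):
            search(i + 1, used | nodes, total + gain)  # take candidate i
        search(i + 1, used, total)                     # skip candidate i

    search(0, frozenset(), 0)
    return best

# Hypothetical candidates: (nodes covered, cycles saved).
CANDIDATES = [
    (frozenset({2, 3}), 5),
    (frozenset({1, 2}), 4),
    (frozenset({3, 4}), 4),
]
```

On `CANDIDATES`, the greedy pass commits to the gain-5 subgraph and blocks both others, for a total of 5, while the exhaustive search finds the two gain-4 subgraphs, for a total of 8.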

application specific systems architectures and processors | 2012

Accelerating NoC-Based MPI Primitives via Communication Architecture Customization

Libo Huang; Zhiying Wang; Nong Xiao

Current NoCs are designed without consideration of programming models, which makes exploiting parallelism a great challenge. In this paper, we present a NoC design that takes into account the well-known parallel programming model, message passing interface (MPI), to boost applications by exploiting all hardware features available in NoC-based multicore architectures. Conventional MPI functions are normally implemented in software because of their size and complexity, resulting in large communication latencies. We propose a new hardware implementation of basic MPI primitives. The premise is that all other MPI functions can be efficiently built upon these three MPI primitives. Our design includes two important hardware features: a customized NoC design that incorporates virtual buses (VB) into NoCs, and an optimized MPI unit (MU) that efficiently executes MPI-related transactions. Extensive experimental results demonstrate that the proposed designs effectively boost the performance of MPI primitives.

great lakes symposium on vlsi | 2012

An optimized multicore cache coherence design for exploiting communication locality

Libo Huang; Zhiying Wang; Nong Xiao

Supporting cache coherence in current multicore processors still faces scalability and performance problems. This paper presents an optimized cache coherence design targeting NoC-based multicore processors. It tries to combine the best characteristics of both snooping and directory-based protocols. Based on the observation of network traffic locality, we design a cache coherence protocol that handles local and remote accesses separately. At the first level, snooping is performed within a cache group; at the second level, coarse directories provide the caches with information about which processors must be involved in first-level snooping. To support efficient coherence broadcasting, we also propose a low-latency, broadcast-enabled underlying NoC design. It incorporates lightweight buses into NoCs, over which the snooping protocol can be performed in a broadcast fashion. Extensive experimental results demonstrate that the proposed coherence design achieves its low-complexity and high-performance goals.
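The two-level idea can be sketched as a small model that decides which cores have to snoop a write. This is a minimal illustration under stated assumptions, not the paper's protocol: cores are grouped by consecutive index, the coarse directory records only which groups hold a copy of an address, and the class and method names are hypothetical.

```python
class TwoLevelCoherence:
    def __init__(self, n_cores, group_size):
        self.n_cores = n_cores
        self.group_size = group_size
        # Coarse directory: address -> set of group ids that may hold a copy.
        self.sharer_groups = {}

    def group_of(self, core):
        return core // self.group_size

    def record_read(self, core, addr):
        self.sharer_groups.setdefault(addr, set()).add(self.group_of(core))

    def snoop_targets(self, writer, addr):
        """Cores that snoop a write: the writer's group plus sharer groups."""
        groups = self.sharer_groups.get(addr, set()) | {self.group_of(writer)}
        return sorted(
            core
            for g in groups
            for core in range(g * self.group_size, (g + 1) * self.group_size)
            if core != writer
        )

    def broadcast_targets(self, writer):
        """Baseline: a flat snooping protocol reaches every other core."""
        return [core for core in range(self.n_cores) if core != writer]
```

With 16 cores in groups of 4 and copies held by cores 1 and 9, a write from core 0 snoops only groups 0 and 2 (7 cores) instead of all 15 other cores, which is the kind of saving that locality makes possible.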

international conference on asic | 2009

DM-SIMD: A new SIMD predication mechanism for exploiting superword level parallelism

Libo Huang; Li Shen; Sheng Ma; Nong Xiao; Zhiying Wang

Predication is a promising architectural feature for exploiting superword level parallelism (SLP) in the presence of control flow. However, for the sake of binary compatibility, current SIMD extensions support only partially predicated execution, such as the select method, which has performance and safety problems. In this paper, we present a new SIMD predication mechanism, data masked SIMD (DM-SIMD), capable of supporting full predication without modifying the existing ISA. DM-SIMD avoids the high encoding overhead of traditional full predication, and it also eliminates the safety problem raised by partial predication. The cornerstone of this mechanism is the “state change” idea, which adds new instructions to set internal state in the SIMD datapath rather than putting the VM setting bits in every SIMD instruction. Compilation strategies are also proposed to use the DM-SIMD facilities effectively for SIMD code generation. We implemented these techniques in a vectorizing compiler, and experiments were conducted on various kinds of applications. The results show that a performance speedup about 20% higher than that of current SIMD extensions can be achieved.
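The safety problem of the select method, and the "state change" alternative, can be illustrated with a toy lane model. This is a sketch under assumptions, not the DM-SIMD design: lanes are Python list elements, a division by zero stands in for any faulting operation, and `select_divide` and `MaskedDatapath` are hypothetical names.

```python
def select_divide(mask, a, b, old):
    """Partial predication: compute every lane, then blend with the mask."""
    quotients = [x / y for x, y in zip(a, b)]  # faults even on masked-off lanes
    return [q if m else o for m, q, o in zip(mask, quotients, old)]

class MaskedDatapath:
    """Full predication with the mask held as internal datapath state."""

    def __init__(self, width):
        self.mask = [True] * width

    def set_mask(self, mask):
        # Stands in for a separate state-setting instruction, so later
        # operations need no mask bits in their own encoding.
        self.mask = list(mask)

    def divide(self, a, b, old):
        # Masked-off lanes are never computed, so they cannot fault.
        return [x / y if m else o for m, x, y, o in zip(self.mask, a, b, old)]

a = [8.0, 6.0, 4.0, 2.0]
b = [2.0, 0.0, 4.0, 0.0]
old = [0.0] * 4
mask = [y != 0.0 for y in b]  # the guard of `if (b[i] != 0) c[i] = a[i] / b[i]`

datapath = MaskedDatapath(width=4)
datapath.set_mask(mask)
result = datapath.divide(a, b, old)
```

Here `result` is `[4.0, 0.0, 1.0, 0.0]`, while `select_divide(mask, a, b, old)` raises `ZeroDivisionError` because it evaluates the lanes that the guard was meant to protect.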

complex, intelligent and software intensive systems | 2008

Memory System Design for a Multi-core Processor

Jianjun Guo; Mingche Lai; Zhengyuan Pang; Libo Huang; Fangyuan Chen; Kui Dai; Zhiying Wang

Multi-core processors have recently become a hot research area. In multi-core processors, and especially in many-core processors, caches incur a high cost to maintain consistency between different copies of data. A hybrid memory architecture is proposed for a specific multi-core processor, which uses a cache for instructions and local storage for data. This paper focuses on the design and optimization of the proposed memory architecture. The L1 instruction cache, local data storage, DMA engine, L2 cache and MMU are designed and optimized. The L2 cache replacement strategy is studied to reduce the total miss cost.

IEEE Transactions on Computers | 2014

Holistic Routing Algorithm Design to Support Workload Consolidation in NoCs

Sheng Ma; Natalie D. Enright Jerger; Zhiying Wang; Mingche Lai; Libo Huang

To provide efficient, high-performance routing algorithms, a holistic approach should be taken. The key aspects of routing algorithm design include adaptivity, path selection strategy, VC allocation, isolation, and hardware implementation cost; these design aspects are not independent. The key contribution of this work lies in the design of a novel selection strategy, Destination-Based Selection Strategy (DBSS), which targets interference that can arise in many-core systems running consolidation workloads. In the process of this design, we holistically consider all aspects to ensure an efficient design. Existing routing algorithms largely overlook issues associated with workload consolidation. Locally adaptive algorithms do not consider enough status information to avoid network congestion. Globally adaptive routing algorithms attack this issue by utilizing network status beyond neighboring nodes. However, they may suffer from interference, coupling the behavior of otherwise independent applications. To address these issues, DBSS leverages both local and nonlocal network status to provide more effective adaptivity. More importantly, by integrating the destination into the selection procedure, DBSS mitigates interference and offers dynamic isolation among applications. Results show that DBSS offers better performance than the best baseline selection strategy and improves the energy-delay product for medium and high injection rates; it is well suited for workload consolidation.

parallel computing | 2013

Efficient multimedia coprocessor with enhanced SIMD engines for exploiting ILP and DLP

Libo Huang; Nong Xiao; Zhiying Wang; Yongwen Wang; Mingche Lai

Multimedia applications have become increasingly important in daily computing. These applications are composed of heterogeneous regions of code that mix data-level parallelism (DLP) and instruction-level parallelism (ILP). A standard solution for a multimedia coprocessor is to integrate single-instruction multiple-data (SIMD) engines into architectures that exploit ILP at compile time, such as very long instruction word (VLIW) and transport triggered architecture (TTA). However, the ILP regions fail to scale with the increased vector length needed to achieve high performance in the DLP regions. Furthermore, the register-to-register nature of SIMD instructions limits current SIMD engines in handling memory alignment, data reorganization, and control flow. Many supporting instructions, such as data permutations, address generations, and loop branches, are required to aid the execution of the real SIMD computation instructions. To mitigate these problems, we propose optimized SIMD engines that combine VLIW or TTA processing with unified scalar and long vector computation as well as efficient SIMD hardware for real computation. Our new architecture is based on TTA and is called multimedia coprocessor (MCP). This architecture includes the following features: (1) a simple coprocessor structure with 8-way TTA, (2) cost-effective SIMD hardware capable of performing floating-point operations, (3) long vector capabilities built upon existing SIMD hardware and a single register file and processor data path for both scalar operands and vector elements, and (4) an optimized SIMD architecture that addresses the SIMD limitations. Our experimental evaluations show that MCP can outperform conventional SIMD techniques by an average of 39% and 12% in performance for multimedia kernels and applications, respectively.

international symposium on circuits and systems | 2010

Permutation optimization for SIMD devices

Libo Huang; Li Shen; Zhiying Wang

Single-instruction-multiple-data (SIMD) devices have been widely incorporated into baseline instruction level parallelism (ILP) processors to enable more efficient data level parallelism (DLP) support. This paper addresses the unsolved problem of permuting the SIMD elements packed in registers to achieve maximum parallel performance. An implicit data permutation (IDP) mechanism is proposed for handling various permutation operations without performance overhead. The IDP mechanism can be implemented in various ways. One way is to extend the baseline processor with a permutation vector register file (PVRF) and associated new extended instructions. The PVRF allows data to be accessed using a permutation pattern in addition to the existing row pattern. This method is described in detail, and experimental results show that a clear speedup is achieved, on average 47% higher than with current SIMD techniques.
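The contrast between row access and pattern access can be sketched with a toy register file. This is an illustration under assumptions, not the PVRF hardware: the only permutation pattern modelled is a column (transposed) read, the pattern is held as internal state set by a separate call, and the class and method names are hypothetical.

```python
class PermutationVectorRegisterFile:
    def __init__(self, rows):
        self.rows = [list(r) for r in rows]
        self.pattern = "row"  # internal permutation state

    def set_pattern(self, pattern):
        # Stands in for an extended state-setting instruction.
        if pattern not in ("row", "column"):
            raise ValueError("unsupported pattern in this sketch")
        self.pattern = pattern

    def read(self, index):
        if self.pattern == "row":
            return list(self.rows[index])       # conventional register read
        return [row[index] for row in self.rows]  # lane `index` of every row

pvrf = PermutationVectorRegisterFile(
    [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]]
)
row_view = pvrf.read(1)
pvrf.set_pattern("column")
column_view = pvrf.read(1)
```

In this model `row_view` is `[4, 5, 6, 7]` and `column_view` is `[1, 5, 9, 13]`: the transposed operand comes out of an ordinary read once the pattern state is set, with no separate shuffle step between the register file and the consumer.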
