Mingche Lai

National University of Defense Technology

Publication

Featured research published by Mingche Lai.

Design Automation Conference | 2008

A dynamically-allocated virtual channel architecture with congestion awareness for on-chip routers

Mingche Lai; Zhiying Wang; Lei Gao; Hongyi Lu; Kui Dai

In this paper, a dynamically allocated virtual channel (VC) architecture with congestion awareness is introduced. All buffers are shared among the VCs, whose structure varies with traffic conditions. At low traffic rates, this structure extends VC depth for continuous transfers to reduce packet latency. At high traffic rates, it provides many VCs and avoids congestion to improve throughput. We modify the VC controller and VC allocation modules and design simple congestion avoidance logic. Experiments show that the proposed routers outperform conventional ones under different traffic patterns. They provide an 8.3% throughput increase and a 19.6% latency decrease while saving 27.4% of area and 28.6% of power.
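The buffer-sharing idea in this abstract can be illustrated with a toy model. The sketch below is not the paper's router design: the class name `SharedBufferVCs`, its methods, and the policy of releasing a VC as soon as its queue empties are all assumptions made for illustration, and the congestion avoidance logic is omitted entirely.

```python
class SharedBufferVCs:
    """Toy model (hypothetical): a fixed pool of flit slots shared by VCs
    that are created on demand, one VC per in-flight packet."""

    def __init__(self, total_slots, max_vcs):
        self.total_slots = total_slots  # flit slots shared by all VCs
        self.max_vcs = max_vcs          # upper bound on simultaneous VCs
        self.vcs = {}                   # packet id -> list of buffered flits

    def used_slots(self):
        return sum(len(queue) for queue in self.vcs.values())

    def accept(self, packet_id, flit):
        """Buffer one flit; return False if the pool or the VC limit is exhausted."""
        if self.used_slots() >= self.total_slots:
            return False
        if packet_id not in self.vcs:
            if len(self.vcs) >= self.max_vcs:
                return False
            self.vcs[packet_id] = []    # allocate a new VC dynamically
        self.vcs[packet_id].append(flit)
        return True

    def drain(self, packet_id):
        """Forward the oldest flit of a packet; free the VC once it is empty.
        (Simplification: a real router would free the VC on the tail flit.)"""
        queue = self.vcs.get(packet_id)
        if not queue:
            return None
        flit = queue.pop(0)
        if not queue:
            del self.vcs[packet_id]
        return flit


# Light traffic: a single packet may use the whole pool as one deep VC.
light = SharedBufferVCs(total_slots=8, max_vcs=8)
print([light.accept("A", i) for i in range(9)])   # ninth flit is refused

# Heavy traffic: the same pool is split into many shallow VCs.
heavy = SharedBufferVCs(total_slots=8, max_vcs=8)
print([heavy.accept(f"P{i}", 0) for i in range(8)], len(heavy.vcs))
```

With the same eight slots, the model behaves as one eight-flit-deep VC under light traffic and as eight one-flit VCs under heavy traffic, which is the trade-off the abstract describes.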

International Conference on Computer Aided Design | 2009

An accurate and efficient performance analysis approach based on queuing model for Network on Chip

Mingche Lai; Lei Gao; Nong Xiao; Zhiying Wang

An accurate and highly efficient performance analysis approach is extremely important for early-stage network-on-chip (NoC) design. In this paper, novel M/G/1/N queuing models for generic routers are proposed to analyze various kinds of packet blocking, and a performance analysis algorithm is then presented to estimate key metrics such as packet latency and buffer utilization. For single-channel and multi-channel routers, comparisons between analytical and observed results show that the proposed approach achieves speed-ups of 240 and 210 times with mean errors of 6.9% and 7.8%, respectively. In our design methodology, this approach can not only effectively direct the NoC synthesis process but also be conveniently applied to multi-objective optimization to find the best mapping solutions. Categories and Subject Descriptors: B.4.3 [Hardware]: Input/Output and Data Communication-Interconnections. General Terms: Algorithms, Performance.
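As a rough illustration of queue-based latency estimation, the sketch below computes the mean waiting time of an infinite-buffer M/G/1 queue with the Pollaczek-Khinchine formula. This is a deliberate simplification and not the paper's model, which uses finite-buffer M/G/1/N queues and accounts for packet blocking; the function name and parameters are assumptions chosen for this example.

```python
def mg1_mean_wait(arrival_rate, mean_service, second_moment_service):
    """Mean time a packet waits in queue for an infinite-buffer M/G/1 queue.

    Pollaczek-Khinchine: W_q = lambda * E[S^2] / (2 * (1 - rho)),
    with utilization rho = lambda * E[S]. Assumes Poisson arrivals.
    """
    rho = arrival_rate * mean_service
    if not 0 <= rho < 1:
        raise ValueError("queue is unstable: utilization must be in [0, 1)")
    return arrival_rate * second_moment_service / (2 * (1 - rho))


# Deterministic service time of 1 cycle: E[S] = 1, E[S^2] = 1.
print(mg1_mean_wait(0.5, 1.0, 1.0))

# Exponential service time with mean 1 cycle: E[S] = 1, E[S^2] = 2.
print(mg1_mean_wait(0.5, 1.0, 2.0))
```

At the same utilization, the more variable service time doubles the waiting time, which is why a general (G) service distribution is needed rather than a mean alone.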

Complex, Intelligent and Software Intensive Systems | 2008

Memory System Design for a Multi-core Processor

Jianjun Guo; Mingche Lai; Zhengyuan Pang; Libo Huang; Fangyuan Chen; Kui Dai; Zhiying Wang

Multi-core processors have recently become a hot research area. Caches incur a high cost to maintain consistency between data copies in multi-core processors, especially in many-core processors. A hybrid memory architecture is proposed for a specific multi-core processor, which uses a cache for instructions and local storage for data. This paper focuses on the design and optimization of the proposed memory architecture. The L1 instruction cache, local data storage, DMA engine, L2 cache, and MMU are designed and optimized. The L2 cache replacement strategy is studied to reduce the total miss cost.

IEEE Transactions on Computers | 2014

Holistic Routing Algorithm Design to Support Workload Consolidation in NoCs

Sheng Ma; Natalie D. Enright Jerger; Zhiying Wang; Mingche Lai; Libo Huang

To provide efficient, high-performance routing algorithms, a holistic approach should be taken. The key aspects of routing algorithm design include adaptivity, path selection strategy, VC allocation, isolation, and hardware implementation cost; these design aspects are not independent. The key contribution of this work lies in the design of a novel selection strategy, Destination-Based Selection Strategy (DBSS), which targets interference that can arise in many-core systems running consolidation workloads. In the process of this design, we holistically consider all aspects to ensure an efficient design. Existing routing algorithms largely overlook issues associated with workload consolidation. Locally adaptive algorithms do not consider enough status information to avoid network congestion. Globally adaptive routing algorithms attack this issue by utilizing network status beyond neighboring nodes. However, they may suffer from interference, coupling the behavior of otherwise independent applications. To address these issues, DBSS leverages both local and nonlocal network status to provide more effective adaptivity. More importantly, by integrating the destination into the selection procedure, DBSS mitigates interference and offers dynamic isolation among applications. Results show that DBSS offers better performance than the best baseline selection strategy and improves the energy-delay product for medium and high injection rates; it is well suited for workload consolidation.

Parallel Computing | 2013

Efficient multimedia coprocessor with enhanced SIMD engines for exploiting ILP and DLP

Libo Huang; Nong Xiao; Zhiying Wang; Yongwen Wang; Mingche Lai

Multimedia applications have become increasingly important in daily computing. These applications are composed of heterogeneous regions of code that mix data-level parallelism (DLP) and instruction-level parallelism (ILP). A standard solution for a multimedia coprocessor integrates single-instruction multiple-data (SIMD) engines into architectures that exploit ILP at compile time, such as very long instruction word (VLIW) and transport triggered architecture (TTA). However, the ILP regions fail to scale with the increased vector length needed to achieve high performance in the DLP regions. Furthermore, the register-to-register nature of SIMD instructions limits current SIMD engines in handling memory alignment, data reorganization, and control flow. Many supporting instructions, such as data permutations, address generation, and loop branches, are required to aid the execution of the real SIMD computation instructions. To mitigate these problems, we propose optimized SIMD engines that combine VLIW or TTA processing with unified scalar and long-vector computation as well as efficient SIMD hardware for real computation. Our new architecture is based on TTA and is called the multimedia coprocessor (MCP). This architecture includes the following features: (1) a simple coprocessor structure with 8-way TTA, (2) cost-effective SIMD hardware capable of performing floating-point operations, (3) long vector capabilities built upon existing SIMD hardware and a single register file and processor data path for both scalar operands and vector elements, and (4) an optimized SIMD architecture that addresses the SIMD limitations. Our experimental evaluations show that MCP outperforms conventional SIMD techniques by an average of 39% and 12% for multimedia kernels and applications, respectively.

ACM Symposium on Applied Computing | 2008

Hierarchical memory system design for a heterogeneous multi-core processor

Jianjun Guo; Mingche Lai; Zhengyuan Pang; Libo Huang; Fangyuan Chen; Kui Dai; Zhiying Wang

Multi-core architecture has recently become a hot topic for both performance and power reasons. The memory system is the bottleneck in this setting. A multi-core architecture using simple cores based on transport triggered architecture is proposed. This architecture has a uniform programming view. The memory system design space is explored and optimized, and a hierarchical memory system is designed. It provides balanced memory bandwidth to the multi-core architecture.

Science in China Series F: Information Sciences | 2013

An accurate and highly-efficient performance evaluation approach based on queuing model for on-chip network

Mingche Lai; Lei Gao; Nong Xiao; Zhiying Wang

An accurate and highly efficient analysis approach is crucial for a designer to evaluate the performance of on-chip networks. To this end, novel M/G/1/N queuing models that capture the various blocking phenomena of the wormhole switching router are presented to compute the average waiting time accurately. With the M/G/1/N queuing models, this paper then presents a performance analysis algorithm to estimate key metrics such as packet latency and buffer utilization. Comparisons with SystemC simulation results show that the proposed approach achieves speedups of 117 and 101 times with mean errors of 6.9% and 7.8% for single-channel and multi-channel routers, respectively. In our design methodology, this approach can effectively direct the NoC synthesis process and can then be conveniently applied to multi-objective optimization to find the best mappings.

Microprocessors and Microsystems | 2011

A practical low-latency router architecture with wing channel for on-chip network

Mingche Lai; Lei Gao; Sheng Ma; Xiao Nong; Zhiying Wang

With an increasing number of cores, the communication latency of the Network-on-Chip becomes a dominant problem due to complex operations per node. In this paper, we try to reduce communication latency by proposing a single-cycle router architecture with a wing channel, which forwards incoming packets to free ports immediately after inspecting the switch allocation results. In addition, incoming packets granted the wing channel can fill the time slots of the crossbar switch and reduce contention with subsequent packets, thereby improving throughput. We design the proposed router in a 65 nm CMOS process, and the results show that it supports different routing schemes and outperforms the express virtual channel, prediction, and Kumar's single-cycle routers in terms of latency and throughput. Compared to the speculative router, it provides a 45.7% latency reduction and a 14.0% throughput improvement. Moreover, we show that the proposed design incurs a modest area overhead of 8.1%, while power consumption is reduced by 7.8% due to less arbitration activity.

IET Computers and Digital Techniques | 2010

Exploration and implementation of a highly efficient processor element for multimedia and signal processing domains

Mingche Lai; Lei Gao; Zhiying Wang

The exploration and design process of a highly efficient processor element for multimedia and signal processing domains is presented in this study. With the introduction of synchronous data-transfer architecture for high-performance embedded applications, effectively exploring the exponential-size architectural design space by detailed simulation is intractable. The authors attack this problem with an automated approach. First, a cost model is built to achieve fast, accurate, and scalable estimation. Then, a hierarchical design space exploration methodology involving a heuristic-based local process and an analytical global optimisation step is proposed to reach an approximate optimum with short time-to-market. For the target domains, the proposed method arrives at better optimised results within only 25 h when compared with other methods. A System on Chip (SoC) involving the optimised processor element has been implemented in a 0.13 µm complementary metal oxide semiconductor (CMOS) process, and the experimental results show that the processor element outperforms the TMS320C64 series and noticeably accelerates multimedia applications in the SoC.

Archive | 2011

A Low-Latency Virtual-Channel Router with Optimized Pipeline Stages for On-Chip Network

Shanshan Ren; Zhiying Wang; Mingche Lai; Hongyi Lu

Network-on-Chip (NoC), as the next-generation on-chip interconnection technology, facilitates high-bandwidth communication and noticeably improves communication performance. Since NoC performance depends closely on the intra-router latency, a low-latency virtual-channel router with optimized pipeline stages is proposed in this study. By using a look-ahead routing algorithm and a speculative switch allocation strategy, the pipeline depth is shortened to two stages. To further shorten the critical path of the design, this study also proposes a three-way parallel arbitration mechanism for the low-latency virtual-channel allocator and the speculative switch allocator. The experimental results show that the allocators scale well to different numbers of virtual channels and physical ports and that the pipeline stage delay is reduced by 33% compared with the traditional one. The proposed router provides a throughput increase of 19% and a latency decrease of 25% compared to the traditional router.