Ming'e Jing | Researchain

Publication

Featured research published by Ming'e Jing.

International Solid-State Circuits Conference | 2013

A 65nm 39GOPS/W 24-core processor with 11Tb/s/W packet-controlled circuit-switched double-layer network-on-chip and heterogeneous execution array

Peng Ou; Jiajie Zhang; Heng Quan; Yi Li; Maofei He; Zheng Yu; Xueqiu Yu; Shile Cui; Jie Feng; Shikai Zhu; Jie Lin; Ming'e Jing; Xiaoyang Zeng; Zhiyi Yu

With the increasing complexity and variety of applications, programmable multi-core processors are drawing attention due to their high flexibility and low implementation cost, yet their performance and energy efficiency still cannot fulfill the demands of many compute-intensive applications. This paper describes a high-performance energy-efficient 24-core processor for multimedia and communication applications, with the following key features: (1) a packet-controlled circuit-switched double-layer network-on-chip (NoC) which provides 11 Tb/s/W energy efficiency with 435 Gb/s bisection bandwidth; (2) a cluster-shared NoC-connected heterogeneous reconfigurable execution array, which can improve the performance of frequently used computations in multimedia and communication applications by over 6×; (3) memory hierarchy improvements, including a multi-page foreground and background register file, and memory splitting and sharing. The processor, implemented in TSMC 65 nm CMOS LP and occupying 18.8 mm² (Fig. 3.6.7), operates at 850 MHz at 1.2 V, with 523 mW power dissipation and 39 GOPS/W (26 pJ/operation) energy efficiency, which is 1.75× better than our former 16-core processor [3].
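The two energy figures quoted in this abstract are two views of the same quantity, which a short worked conversion makes explicit. The implied throughput on the last line is an assumption: it holds only if the 523 mW and 39 GOPS/W figures refer to the same operating point, which the abstract does not state.

```latex
% Energy per operation is the reciprocal of operations per joule:
% 39 GOPS/W = 39e9 operations per joule.
E_{\text{op}} = \frac{1}{39 \times 10^{9}\ \text{op/J}}
             \approx 25.6\ \text{pJ/op} \approx 26\ \text{pJ/op}

% Implied throughput (assumes both figures share one operating point):
39\ \text{GOPS/W} \times 0.523\ \text{W} \approx 20.4\ \text{GOPS}
```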

International Solid-State Circuits Conference | 2012

An 800MHz 320mW 16-core processor with message-passing and shared-memory inter-core communication mechanisms

Zhiyi Yu; Kaidi You; Ruijin Xiao; Heng Quan; Peng Ou; Yan Ying; Haofan Yang; Ming'e Jing; Xiaoyang Zeng

Almost all multicore processors use a shared-memory architecture due to its simple programming model. Recently, however, the message-passing mechanism is also drawing attention due to its potentially better scalability. In this work, we demonstrate that a hybrid communication mechanism supporting both message passing and shared memory can provide both higher performance and energy efficiency. This 16-core processor has 3 key features: (1) A cluster-based hierarchical architecture supporting both shared-memory and message-passing communication. (2) A cache-free memory hierarchy with an extended register file, small private memory and moderate shared memory to avoid complex cache coherence issues and achieve high energy efficiency by keeping data accesses local. (3) A hardware-aided mailbox mechanism to accelerate the synchronization procedure between different processor nodes. With these techniques, our multicore processor can provide high performance for many applications. Chip test results show that its maximum clock frequency is 800MHz and typical power consumption is 320mW, when running basic applications with clock gating at 1.2V at room temperature.

International Conference on ASIC | 2011

An optimized mapping algorithm based on Simulated Annealing for regular NoC architecture

Liulin Zhong; Jiayi Sheng; Ming'e Jing; Zhiyi Yu; Xiaoyang Zeng; Dian Zhou

Network-on-chip (NoC) architecture is viewed as a potential solution for the interconnect demands of emerging multi-core systems since it offers high performance, flexibility, and low cost. Mapping tasks onto different cores of the network is a critical phase in NoC design because it determines the energy consumption and packet latency. In order to reduce the energy consumption of applications running on multi-core architectures, we propose a new mapping strategy based on Simulated Annealing (SA). By allocating tasks that have a large communication volume to adjacent places on the mesh, the proposed method overcomes the shortcoming of blind search in traditional SA. The experimental results reveal that the solutions generated by the proposed algorithm reduce average energy consumption by 56.56% when mapping 16 tasks and by 66.32% when mapping 49 tasks, compared with traditional SA.
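For orientation, here is a minimal sketch of the traditional SA baseline that the abstract compares against, not the proposed adjacency-guided method. The cost model (communication volume times Manhattan hop count on a square mesh), the swap move, the cooling schedule, and the names `comm_cost` and `anneal` are all assumptions made for illustration.

```python
import math
import random


def comm_cost(mapping, volume, width):
    """Sum of volume x Manhattan hop distance over communicating task pairs.

    mapping[i] is the tile index of task i on a width x width mesh
    (hypothetical cost model, used here as a proxy for energy).
    """
    cost = 0
    for (a, b), v in volume.items():
        ar, ac = divmod(mapping[a], width)
        br, bc = divmod(mapping[b], width)
        cost += v * (abs(ar - br) + abs(ac - bc))
    return cost


def anneal(volume, n_tasks, width, steps=5000, t0=10.0, alpha=0.999, seed=0):
    """Plain simulated annealing with random pairwise swaps.

    Assumes n_tasks == width * width, so every tile hosts exactly one task.
    """
    rng = random.Random(seed)
    mapping = list(range(n_tasks))
    rng.shuffle(mapping)  # blind random starting point
    cur = comm_cost(mapping, volume, width)
    best, best_cost = mapping[:], cur
    t = t0
    for _ in range(steps):
        i, j = rng.sample(range(n_tasks), 2)
        mapping[i], mapping[j] = mapping[j], mapping[i]
        new = comm_cost(mapping, volume, width)
        # Accept improvements always, and worse moves with Boltzmann probability.
        if new <= cur or rng.random() < math.exp((cur - new) / t):
            cur = new
            if cur < best_cost:
                best, best_cost = mapping[:], cur
        else:
            mapping[i], mapping[j] = mapping[j], mapping[i]  # undo the swap
        t *= alpha  # geometric cooling
    return best, best_cost


if __name__ == "__main__":
    # Four tasks communicating in a chain, mapped onto a 2 x 2 mesh.
    chain = {(0, 1): 5, (1, 2): 5, (2, 3): 5}
    print(anneal(chain, n_tasks=4, width=2))
```

The random start and random swaps are the "blind search" the abstract refers to; seeding the initial placement so that heavily communicating tasks start on adjacent tiles is one way such a search could be guided.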

International Symposium on Circuits and Systems | 2012

Task-binding based branch-and-bound algorithm for NoC mapping

Liyang Zhou; Ming'e Jing; Liulin Zhong; Zhiyi Yu; Xiaoyang Zeng

Network-on-Chip (NoC) architecture is drawing intensive attention since it promises to maintain high performance in handling complex communication issues as the number of on-chip components increases. Mapping a given application onto the multi-core processors of an NoC to obtain high performance is a significant challenge. In this paper, we propose an optimized branch-and-bound (B&B) mapping algorithm that reduces the communication energy or improves the mapping efficiency by binding tasks together when they have a large communication volume. Experimental results show that the proposed algorithm can achieve high performance in a short time compared with the traditional algorithm. For example, when mapping 64 tasks onto an 8×8 NoC system at a comparable run time, it saves 14.72% and 64.11% of average energy consumption compared with the original B&B and simulated annealing (SA) algorithms, respectively.
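As a point of reference, the following is a minimal sketch of a plain branch-and-bound mapper, without the task-binding optimization the abstract proposes. The cost model (volume times Manhattan hops), the bound (cost of the already placed tasks, valid for non-negative volumes), and the names `hops` and `branch_and_bound` are assumptions for illustration.

```python
def hops(a, b, width):
    """Manhattan distance between tiles a and b on a mesh with `width` columns."""
    ar, ac = divmod(a, width)
    br, bc = divmod(b, width)
    return abs(ar - br) + abs(ac - bc)


def branch_and_bound(volume, n_tasks, width):
    """Exhaustive search over task-to-tile mappings with cost-based pruning.

    Tasks are placed in index order; requires n_tasks <= width * width
    and non-negative volumes so that the partial cost is a lower bound.
    """
    n_tiles = width * width
    best = {"cost": float("inf"), "mapping": None}
    placed = [None] * n_tasks
    used = [False] * n_tiles

    def added_cost(task, tile):
        # Cost of edges between `task` and every task placed before it.
        inc = 0
        for (a, b), v in volume.items():
            other = b if a == task else a if b == task else None
            if other is not None and other < task:
                inc += v * hops(tile, placed[other], width)
        return inc

    def search(task, cost):
        if cost >= best["cost"]:
            return  # bound: this branch cannot beat the incumbent
        if task == n_tasks:
            best["cost"], best["mapping"] = cost, placed[:]
            return
        for tile in range(n_tiles):
            if not used[tile]:
                used[tile] = True
                placed[task] = tile
                search(task + 1, cost + added_cost(task, tile))
                used[tile] = False
                placed[task] = None

    search(0, 0)
    return best["mapping"], best["cost"]


if __name__ == "__main__":
    # Three tasks on a 2 x 2 mesh; the heaviest pairs should end up adjacent.
    print(branch_and_bound({(0, 1): 3, (0, 2): 2, (1, 2): 1}, 3, 2))
```

The search tree grows factorially with the number of tasks, which is why reducing the number of independent placement decisions, for example by treating a bound group of tasks as one unit, can shorten the run time.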

International Symposium on Circuits and Systems | 2012

A pure software LDPC decoder on a multi-core processor platform with reduced inter-processor communication cost

Yan Ying; Kaidi You; Liyang Zhou; Heng Quan; Ming'e Jing; Zhiyi Yu; Xiaoyang Zeng

As an error correction code, Low Density Parity Check (LDPC) code has been widely used in various communication standards such as WiMAX and DVB-S2. However, these continuously evolving communication standards, together with the high development cost and low flexibility of hardwired ASIC solutions, have pushed LDPC researchers to turn to more cost-efficient and flexible implementations, and multi-core processor based implementations of LDPC decoders have gained increasing attention in the last few years. The performance of multi-core processor based implementations is nevertheless far below that of hardwired ASICs, one key reason being the very high cost of communication between processors. Three approaches are proposed in this paper to reduce the communication cost: optimized algorithm partitioning to reduce communication traffic, utilizing imbalanced communication between tasks to optimize mapping and reduce overall communication distance, and a simplified data sending-receiving mechanism to reduce the cost of identifying received data. With these approaches, communication accounts for only 12.2% of the total decoding time in the proposed LDPC decoder implementation, whereas it generally occupies 50% of the decoding time in previously reported LDPC decoders on multi-core processors. Our work also achieves better throughput under the same hardware conditions than other state-of-the-art works.

IEEE Transactions on Circuits and Systems | 2014

A 16-Core Processor With Shared-Memory and Message-Passing Communications

Zhiyi Yu; Ruijin Xiao; Kaidi You; Heng Quan; Peng Ou; Zheng Yu; Maofei He; Jiajie Zhang; Yan Ying; Haofan Yang; Jun Han; Xu Cheng; Zhang Zhang; Ming'e Jing; Xiaoyang Zeng

A 16-core processor with both message-passing and shared-memory inter-core communication mechanisms is implemented in 65 nm CMOS. Message-passing communication is enabled in a 3 × 6 Mesh packet-switched network-on-chip, and shared-memory communication is supported using the shared memory within each cluster. The processor occupies 9.1 mm² and is fully functional at a clock rate of 750 MHz at 1.2 V, with a maximum of 800 MHz at 1.3 V. Each core dissipates 34 mW under typical conditions at 750 MHz and 1.2 V while executing embedded applications such as an LDPC decoder, a 3780-point FFT module, an H.264 decoder and an LTE channel estimator.

International Symposium on Circuits and Systems | 2013

Time-Division-Multiplexer based routing algorithm for NoC system

Ming'e Jing; Zhiyi Yu; Xiaoyang Zeng; Liyang Zhou

In this paper, we present a routing algorithm based on the Time-Division-Multiplexer technique for routing-table-based Network-on-Chip (NoC) routers to reduce the system bandwidth demand while ensuring deadlock freedom. To fully use the communication resources of the NoC (its channels), the banker's algorithm is adopted to allocate and recycle the resources, and a weighted maze algorithm is utilized to determine whether there is an available path for the current communication process. Experimental results show that the bandwidth requirement with the proposed algorithm decreases by 71.4% compared with the odd-even algorithm.

International Conference on ASIC | 2011

A method of quadratic programming for mapping on NoC architecture

Jiayi Sheng; Liulin Zhong; Ming'e Jing; Zhiyi Yu; Xiaoyang Zeng

Network-on-Chip (NoC) architecture is drawing intensive attention since it promises to maintain high performance in handling complex communication issues as the number of on-chip components increases. An effective method of mapping multitask applications onto multiple cores is necessary to use the potential of NoC effectively. In this paper, we propose, for the first time, a quadratic programming (QP) formulation of the mapping problem. Because it needs fewer variables, it overcomes the unacceptable complexity of Integer Linear Programming (ILP) on large problems. Experimental results show that the QP method is at least 10 times faster than the ILP method on 20 given benchmarks.
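To make the variable-count argument concrete, here is a textbook quadratic-assignment form of the mapping problem. It is only a sketch: the abstract does not give the paper's actual formulation, and the symbols (x_{ik}, v_{ij}, d_{kl}, N tasks, M tiles) are assumptions introduced for illustration.

```latex
% x_{ik} = 1 if task i is placed on tile k, else 0
% v_{ij} = communication volume between tasks i and j
% d_{kl} = hop distance between tiles k and l
\min_{x} \; \sum_{i=1}^{N} \sum_{j=1}^{N} \sum_{k=1}^{M} \sum_{l=1}^{M}
    v_{ij}\, d_{kl}\, x_{ik}\, x_{jl}

\text{s.t.} \quad \sum_{k=1}^{M} x_{ik} = 1 \quad \forall i, \qquad
            \sum_{i=1}^{N} x_{ik} \le 1 \quad \forall k, \qquad
            x_{ik} \in \{0, 1\}
```

In this form the objective stays quadratic in the N·M placement variables. A standard ILP linearization instead introduces one extra variable per product x_{ik} x_{jl}, on the order of N²·M² of them, which is one plausible reading of the decrease in the number of variables that the abstract mentions.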

IEEE International Conference on Solid-State and Integrated Circuit Technology | 2014

A global-aware bandwidth-constraint routing scheme for Network-on-Chip

Liulin Zhong; Ming'e Jing; Zhiyi Yu; Xiaoyang Zeng

In this paper, we propose a global-aware, bandwidth-constrained, deadlock-free and low-cost minimal routing algorithm for Network-on-Chip. The routing scheme generates routing paths based on the global condition of the network and is supported by distributed minimal routing tables. Moreover, the proposed routing scheme can be easily adapted to high-dimensional meshes and irregular topologies, making it promising for 3D chips and for fault-tolerant and heterogeneous chip multicore processors (CMP). The experimental results show that the proposed routing algorithm decreases the average network latency by more than 25% compared with popular routing schemes.

IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems | 2007

A Novel Optimization Method for Parametric Yield: Uniform Design Mapping Distance Algorithm

Ming'e Jing; Yue Hao; Dian Zhou; Xuan Zeng

A novel algorithm, UDMDA, for parametric yield optimization of ICs is proposed in this paper. The algorithm integrates uniform design (UD) and mapping distance. An effective yet simple measurement of the uniformity of a set of points, namely k-nearest neighbor, is suggested in the UD. Compared with the available methods, the proposed algorithm does not need any gradient calculation or assumption about the initial point. Furthermore, this algorithm has a high convergence rate and is not sensitive to the size of the circuit. Therefore, it can be utilized to optimize the nominal performance as well as improve parametric yield. The efficiency of this algorithm is illustrated with two circuit examples.
