Publication


Featured research published by Yang Guo.


Journal of Electrical and Computer Engineering | 2015

Performance analysis of homogeneous on-chip large-scale parallel computing architectures for data-parallel applications

Xiaowen Chen; Zhonghai Lu; Axel Jantsch; Shuming Chen; Yang Guo; Shenggang Chen; Hu Chen

On-chip computing platforms are evolving from single-core bus-based systems to many-core network-based systems, which are referred to as On-chip Large-scale Parallel Computing Architectures (OLPCs) in this paper. Homogeneous OLPCs feature strong regularity and scalability due to their identical cores and routers. Data-parallel applications have parallel data subsets that are handled individually by the same program running on different cores, so they can obtain good speedup on homogeneous OLPCs. This paper addresses modeling the speedup of homogeneous OLPCs for data-parallel applications. When establishing the speedup model, the network communication latency and the ways data-parallel applications store their data are modeled and analyzed in detail. Two abstract concepts (equivalent serial packet and equivalent serial communication) are proposed to construct the network communication latency model. The uniform and hotspot traffic models are adopted to reflect the ways of storing data. Some useful suggestions are presented during the analysis of the performance models. Finally, three data-parallel applications are run on our cycle-accurate homogeneous OLPC experimental platform to validate the analytic results and demonstrate that our study provides a feasible way to estimate and evaluate the performance of data-parallel applications on homogeneous OLPCs.
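The paper's full speedup model is not reproduced here; as a rough illustration of the kind of model the abstract describes, the sketch below treats communication from all cores as serialized on the shared network (a simplified take on the "equivalent serial communication" idea). The function name, the linear cost terms, and the parameter values are assumptions for illustration only.

```python
def olpc_speedup(n_cores, t_compute, t_comm_per_core):
    """Estimate speedup of a data-parallel application on a homogeneous OLPC.

    Serial runtime: one core handles all n_cores data subsets.
    Parallel runtime: each core computes its own subset concurrently, but
    the communication of all cores is modeled as serialized on the shared
    network (simplified 'equivalent serial communication').
    """
    t_serial = n_cores * t_compute
    t_parallel = t_compute + n_cores * t_comm_per_core
    return t_serial / t_parallel

# With negligible communication, speedup approaches the core count:
print(round(olpc_speedup(16, t_compute=100.0, t_comm_per_core=0.1), 2))   # 15.75
# As communication grows, speedup saturates well below the core count:
print(round(olpc_speedup(16, t_compute=100.0, t_comm_per_core=10.0), 2))  # 6.15
```

The saturation in the second case is why the abstract emphasizes modeling network latency and data placement (uniform vs. hotspot) rather than compute alone.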


IEICE Electronics Express | 2014

Cooperative communication for efficient and scalable all-to-all barrier synchronization on mesh-based many-core NoCs

Xiaowen Chen; Zhonghai Lu; Axel Jantsch; Shuming Chen; Yang Guo; Hengzhu Liu

On many-core Network-on-Chips (NoCs), communication is on the critical path of system performance and contended synchronization requests may cause large performance penalty. Different from conventi ...


IEICE Electronics Express | 2017

A novel power-efficient IC test scheme

Ding Deng; Xiaowen Chen; Yang Guo

A novel power-efficient IC test scheme is proposed, comprising a parallel test application (PTA) architecture and its procedure. PTA parallelizes the stimulus assignments, and the vectors can be observed immediately once applied, which ensures shift safety in a timely manner; hence only a logic test is required. The procedure contains two phases for each pattern. In the shift phase, each clock chain is activated in turn and the vectors are assigned in parallel. In the capture phase, all chains capture simultaneously. Experimental results demonstrate that, compared with the traditional serial scan scheme, the proposal reduces average power by 88.48% and peak power by 53.36%.


IEICE Electronics Express | 2015

Express Ring: A Multi-layer and Non-blocking NoC Architecture

Chen Li; Sheng Ma; Shenggang Chen; Yang Guo; Peng Wang

As the Network-on-Chip (NoC) incurs significant hardware overheads, it becomes the performance and scalability bottleneck of System-on-Chip (SoC) design. To address this challenge, we propose a multi-layer, non-blocking ring NoC architecture. Multi-layer links with different bandwidths achieve high link utilization and avoid protocol-level deadlock. The non-blocking architecture leverages bufferless routers to reduce hardware overheads and simplifies the router pipeline to reduce zero-load latency. We also propose a scalable global signal control mechanism to eliminate starvation and avoid packet loss. Compared with the conventional ring network composed of dateline routers (DRing) and the Intel Nehalem-EX ring network (NRing), our design achieves 69.4% and 12.3% performance improvements, respectively. Compared with DRing, it also reduces hardware overheads.


IEICE Electronics Express | 2012

A novel parallel memory organization supporting multiple access types with matched memory modules

Sheng Liu; Shuming Chen; Hu Chen; Yang Guo

This paper introduces a Bilinear Skewed Parallel Memory (BilisPM), which can support multiple conflict-free access types and circular addressing in the X-Y directions of the 2D space. BilisPM features matched Memory Modules (MMs) and can effectively save on-chip area. We introduce the formal specifications of BilisPM and give its hardware implementation. Experimental results show that BilisPM can reduce the chip area by 22.7% on average (38.1% at most), and that its controller consumes less chip area at a reasonable critical-path delay, compared with traditional schemes with unmatched MMs.
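BilisPM's actual skewing functions are not given in this abstract; the sketch below only illustrates the general idea behind skewed parallel memories, namely that a skewed module-assignment function lets both a row and a column of the 2D space be fetched with no two elements landing in the same module. The simple `(x + y) mod M` skew and all names here are assumptions, not the paper's scheme.

```python
def module_of(x, y, n_modules):
    # Skewed mapping: element (x, y) is stored in module (x + y) mod M.
    # An unskewed mapping (x mod M) would put a whole column in one module.
    return (x + y) % n_modules

def conflict_free(coords, n_modules):
    """True if all accessed elements land in distinct memory modules,
    i.e. the access completes in one parallel cycle."""
    mods = [module_of(x, y, n_modules) for x, y in coords]
    return len(set(mods)) == len(mods)

M = 4
row = [(x, 2) for x in range(M)]  # M consecutive elements of one row
col = [(1, y) for y in range(M)]  # M consecutive elements of one column
print(conflict_free(row, M), conflict_free(col, M))  # True True
```

With the skew, both access types are conflict-free using the same M modules; the paper's contribution of "matched" MMs concerns sizing those modules so none is over-provisioned.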


International Symposium on Networks-on-Chip (NOCS) | 2017

Fairness-Oriented and Location-Aware NUCA for Many-Core SoC

Zicong Wang; Xiaowen Chen; Chen Li; Yang Guo

Non-uniform cache architecture (NUCA) is often employed to organize the last-level cache (LLC) over a Network-on-Chip (NoC). However, as the network size of Systems-on-Chip (SoC) scales up, two trends emerge. First, network latency is becoming the major component of cache access latency. Second, the communication distance and latency gap between different cores is increasing. This gap causes a network latency imbalance problem, aggravates the non-uniformity of cache access latencies, and thus worsens system performance. In this paper, we propose a novel NUCA-based scheme, named fairness-oriented and location-aware NUCA (FL-NUCA), to alleviate the network latency imbalance problem and achieve more uniform cache access. We strive to equalize network latencies, which are measured by three metrics: average latency (AL), latency standard deviation (LSD), and maximum latency (ML). In FL-NUCA, the memory-to-LLC mapping and the links are both non-uniformly distributed to better fit the network topology and traffic, thereby equalizing both non-contention latencies and contention latencies. The experimental results show that FL-NUCA can effectively improve the fairness of network latencies. Compared with the traditional static NUCA (S-NUCA), in simulations with synthetic traffic, the average improvements in AL, LSD, and ML are 20.9%, 36.3%, and 35.0%, respectively. In simulations with PARSEC benchmarks, the average improvements in AL, LSD, and ML are 6.3%, 3.6%, and 11.2%, respectively.
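The three fairness metrics named in the abstract (AL, LSD, ML) are standard statistics over per-access network latencies; a minimal sketch of computing them is below. Whether the paper uses the population or sample standard deviation is not stated, so the choice of `pstdev` here is an assumption, as are the sample latency values.

```python
import statistics

def latency_metrics(latencies):
    """Compute the three metrics FL-NUCA is evaluated on:
    average latency (AL), latency standard deviation (LSD),
    and maximum latency (ML)."""
    al = statistics.mean(latencies)
    lsd = statistics.pstdev(latencies)  # population std. dev.; an assumption
    ml = max(latencies)
    return al, lsd, ml

# A core near the LLC banks vs. one far away: very different latencies.
al, lsd, ml = latency_metrics([12, 14, 30, 44])
print(al, lsd, ml)  # 25 13.0 44
```

Lowering LSD and ML while holding AL is exactly the "more uniform cache access" the scheme targets: the mean can stay put while the spread and the worst case shrink.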


international conference on computer design | 2016

DLL: A dynamic latency-aware load-balancing strategy in 2.5D NoC architecture

Chen Li; Sheng Ma; Lu Wang; Zicong Wang; Xia Zhao; Yang Guo

As 3D stacking technology still faces several challenges, 2.5D stacking technology currently has better application prospects. With a silicon interposer, 2.5D stacking can improve the bandwidth and capacity of the memory system. To satisfy the communication requirements of the integrated memory system, the free routing resources in the interposer should be exploited to implement an additional network. Yet performance is strongly limited by the unbalanced loads between the CPU-layer network and the interposer-layer network. In this paper, to address this issue, we propose a dynamic latency-aware load-balancing (DLL) strategy. Our key innovations are detecting congestion of a network layer via the average latency of recent packets and making the network-layer selection at each source node. We leverage the free routing resources in the interposer to implement a latency propagation ring. With this ring, the latency information tracked at destination nodes is propagated back to source nodes, and we achieve load balance by using this information. Experimental results show that, compared with the baseline design, a destination-detection strategy, and a buffer-aware strategy, our DLL strategy achieves 45%, 14.9%, and 6.5% average throughput improvements with minor overheads.
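The core of the selection policy the abstract describes, choosing a layer per packet at the source based on recently observed delivery latencies, can be sketched as below. The window size, tie-breaking, and class interface are assumptions for illustration; the paper's hardware mechanism (the latency propagation ring) is only represented here by the `record` calls that feed latencies back to the source.

```python
from collections import deque

class LayerSelector:
    """Pick the less-congested network layer at a source node, using the
    average latency of recently delivered packets on each layer (fed back
    from destination nodes). Window size is an assumed parameter."""

    def __init__(self, window=8):
        self.recent = {"cpu": deque(maxlen=window),
                       "interposer": deque(maxlen=window)}

    def record(self, layer, latency):
        # Called when latency info for a delivered packet arrives back.
        self.recent[layer].append(latency)

    def avg(self, layer):
        q = self.recent[layer]
        return sum(q) / len(q) if q else 0.0  # empty history: assume idle

    def choose(self):
        # Inject the next packet into the layer with the lower recent average.
        return min(("cpu", "interposer"), key=self.avg)

sel = LayerSelector()
for lat in (20, 25, 30):
    sel.record("cpu", lat)
for lat in (12, 15):
    sel.record("interposer", lat)
print(sel.choose())  # interposer
```

Using a sliding window of recent packets rather than instantaneous buffer occupancy is what makes the strategy "latency-aware" as opposed to the buffer-aware baseline it is compared against.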


Conference on Advanced Computer Architecture | 2016

Overcoming and Analyzing the Bottleneck of Interposer Network in 2.5D NoC Architecture

Chen Li; Zicong Wang; Lu Wang; Sheng Ma; Yang Guo

As 3D stacking technology still faces many challenges, 2.5D stacking technology seems to have better application prospects. With a silicon interposer, 2.5D stacking can improve the bandwidth and capacity of memory. Moreover, the interposer's unused routing resources can be exploited to form an additional network for communication. In this paper, we conclude that using a concentrated Mesh as the topology of the interposer network creates a bottleneck at the edge portion, while using a Double-Butterfly topology overcomes this bottleneck. We analyze the causes of the bottleneck, compare the impacts of different topologies on it, and propose design goals for the interposer network.


IEEE Computer Society Annual Symposium on VLSI | 2015

Achieving Memory Access Equalization Via Round-Trip Routing Latency Prediction in 3D Many-Core NoCs

Xiaowen Chen; Zhonghai Lu; Yang Li; Axel Jantsch; Xueqian Zhao; Shuming Chen; Yang Guo; Zonglin Liu; Jianzhuang Lu; Jianghua Wan; Shuwei Sun; Shenggang Chen; Hu Chen

3D many-core NoCs are emerging architectures for future high-performance single chips, since they integrate many processor cores and memories by stacking multiple layers. In such an architecture, because processor cores and memories reside in different locations (center, corner, edge, etc.), memory accesses behave differently due to their different communication distances, and the performance (latency) gap between different memory accesses grows as the network size scales up. This phenomenon may lead to some memory accesses suffering very high latencies, degrading system performance. To achieve high performance, it is crucial to reduce the number of memory accesses with very high latencies. However, this must be done with care, since shortening the latency of one memory access can worsen the latency of another as a result of shared network resources. Therefore, the goal should be to narrow the latency difference between memory accesses. In this paper, we address this goal by prioritizing memory access packets based on predicting the round-trip routing latencies of memory accesses. The communication distance and the number of occupied buffer slots along the remaining routing path are used to predict the round-trip latency of a memory access. The predicted round-trip routing latency is then used as the basis for arbitrating memory access packets, so that accesses with potentially high latency are transferred as early and as fast as possible, equalizing memory access latencies as much as possible. Experiments with varied network sizes and packet injection rates show that our approach achieves memory access equalization and outperforms classic round-robin arbitration in terms of maximum latency, average latency, and latency standard deviation (LSD). In the experiments, the maximum improvements in maximum latency, average latency, and LSD are 80%, 14%, and 45%, respectively.
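The abstract names the two predictor inputs (remaining communication distance and occupied buffer slots along the remaining path) but not the predictor's exact form, so the linear combination, unit costs, and request format below are assumptions; the sketch only shows the arbitration principle of granting the request with the highest predicted round-trip latency first.

```python
def predict_round_trip(hops_remaining, occupied_slots, hop_cost=1, slot_cost=1):
    """Predicted round-trip latency of a memory access: remaining hop count
    plus the queued flits it must wait behind. Linear form and unit costs
    are assumptions for illustration."""
    return hop_cost * hops_remaining + slot_cost * occupied_slots

def arbitrate(requests):
    """Grant the request with the highest predicted latency first, so a
    potentially slow access is forwarded as early as possible (the opposite
    of round-robin, which ignores predicted latency)."""
    return max(requests, key=lambda r: predict_round_trip(r["hops"], r["occupied"]))

reqs = [
    {"id": "A", "hops": 3, "occupied": 2},  # predicted 5
    {"id": "B", "hops": 6, "occupied": 4},  # predicted 10
]
print(arbitrate(reqs)["id"])  # B
```

Consistently favoring the predicted-slowest access shrinks the tail of the latency distribution, which is why the reported gains are largest for maximum latency and LSD rather than the average.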


IEICE Electronics Express | 2014

An efficient floating-point multiplier for digital signal processors

Zonglin Liu; Sheng Ma; Yang Guo

Floating-point multiplication is one of the most basic and frequent digital signal processing operations, and its accuracy and throughput largely determine the overall accuracy and throughput of a digital signal processor. By vectorizing a conventional double-precision multiplier, we propose a multiple-precision floating-point multiplier. It supports either one double-precision multiplication for high accuracy or two parallel single-precision multiplications for high throughput. The evaluation results show that the proposed multiplier is suitable for embedded DSPs. It consumes 8.9% less area than two single-precision multipliers. Compared with the configuration of one single-precision multiplier and one double-precision multiplier, the proposed multiplier consumes 30.1% less area.

Collaboration

Yang Guo's top co-authors:

Shuming Chen, National University of Defense Technology
Jianzhuang Lu, National University of Defense Technology
Xiaowen Chen, National University of Defense Technology
Zonglin Liu, National University of Defense Technology
Bangjian Xu, National University of Defense Technology
Hengzhu Liu, National University of Defense Technology
Chen Li, National University of Defense Technology
Jianghua Wan, National University of Defense Technology
Zicong Wang, National University of Defense Technology
Zhonghai Lu, Royal Institute of Technology