Deyuan Gao
Northwestern Polytechnical University
Publication
Featured research published by Deyuan Gao.
International Conference on Signal Processing | 2012
Limin Han; Jianfeng An; Deyuan Gao; Xiaoya Fan; Xianglong Ren; Tao Yao
The technological parameters and constraints of shared-memory many-core processors demand new solutions to the cache coherence problem. Directory-based coherence protocols have recently emerged as a possible scalable alternative for CMP designs. Unfortunately, as the number of on-chip cores increases, many directory design strategies do not scale beyond a dozen cores because of the large energy and area costs of scaling the directories. We explore different NUCA design schemes for tiled many-core architectures, compare several conventional directory protocols (full-map, sparse, and duplicate-tag-based directory protocols, among others), and analyze several novel cache protocols designed for many-core processors. Finally, we propose several design directions for scalable and adaptive cache coherence protocols for many-core processors.
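The scaling problem mentioned in this abstract can be illustrated with a back-of-the-envelope estimate. The sketch below assumes a textbook full-map directory that keeps one presence bit per core for every tracked cache block; the 64-byte block size, the optional state bits, and the function name `full_map_overhead` are illustrative assumptions, not figures or names from the paper.

```python
def full_map_overhead(num_cores: int, block_bytes: int = 64, state_bits: int = 0) -> float:
    """Directory bits per tracked block divided by the data bits in that block.

    Assumes one presence bit per core (full-map) plus optional state bits.
    """
    data_bits = block_bytes * 8
    return (num_cores + state_bits) / data_bits


if __name__ == "__main__":
    # Overhead grows linearly with the core count for a fixed block size.
    for cores in (16, 64, 256, 1024):
        print(f"{cores:5d} cores: {full_map_overhead(cores):.1%} directory overhead")
```

Under these assumptions the directory needs as many bits as the data it tracks once the core count reaches 512, which is one common way to motivate sparse or duplicate-tag organizations.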
International Conference on Computer and Automation Engineering | 2010
De-Li Wang; Deyuan Gao; Danghui Wang
With the growing number of parallel programs written to make effective use of the parallel computing resources of a CMP (chip multiprocessor), it is becoming more important to study the data access characteristics of these programs on CMPs. In this paper, the cache behavior of parallel programs running on a CMP is simulated and analyzed. Based on the high degree of shared data accesses among different threads and the locality of those shared accesses, a sharing-aware active-push cache strategy is proposed. Shared data is actively pushed to the L1 caches of the slower threads before it is needed. This strategy takes advantage of the potentially high on-chip bandwidth while avoiding the increased off-chip bandwidth demand that prefetching incurs because of its inaccuracy. Experimental results show that this strategy speeds up the slower threads. The performance of the parallel programs improves by up to 15.1%.
IEEE International Conference on Computer Science and Information Technology | 2009
De-Li Wang; Deyuan Gao; Danghui Wang
A highly integrated system-on-chip (SOC) can be a viable replacement for a design based on legacy x86-series parts. Embedded processors for aviation and industrial control are examples of this approach. To reduce the large volume of the legacy system, we design and implement an SOC embodying a low-power, x86-instruction-compatible 32-bit CISC microprocessor. Outside the processor core, basic control modules are provided, such as a memory controller, an interrupt management system, and other important I/O devices. All the modules provide a high degree of compatibility with the existing system. Because the SOC is the control center of a real-time processing system, a fast interrupt response time is important. We employ techniques both inside and outside the processor core to meet this requirement. For a robust design, considerable attention is paid to the interface between the SOC and the outside world. The SOC is implemented in SMIC 0.18 µm technology and contains about 4 million transistors. The processor core and the external control parts operate at 133 MHz and 66 MHz, respectively. The whole chip consumes only about 550 mW in normal operation. Test results show that the SOC is reliable and achieves an average improvement of 21.9% in interrupt response time.
International Conference on Signal Processing | 2012
Tao Yao; Deyuan Gao; Xianglong Ren; Limin Han; Xiaoya Fan; Lei Yang
The multi-input floating-point (FP) adder is an attractive solution for accelerating algorithms that contain many addition operations. However, a dedicated multi-input adder adds too much area to the floating-point unit. In this paper, we propose a novel FP function unit that combines a 3-input FP adder with a traditional Multiply-Add-Fused (MAF) unit, which is widely employed in general-purpose processors and stream processors. The new FP function unit can therefore perform both A × B + C and A + B + C. First, an improved architecture for the 3-input FP adder is proposed; it has the same accuracy as an FP adder with infinite internal width, and it performs only one rounding operation for the two additions. Second, the architecture combining the 3-input FP adder and the MAF unit is presented; it is compatible with IEEE Std 754. Finally, single-precision implementation results in 180 nm CMOS technology are given. Compared with the MAF unit, the proposed function unit increases delay and area by only 2.8% and 30%, respectively, which means the new function unit can compute A + B + C nearly twice as fast as a MAF unit. A small-data-format version of the proposed architecture has been verified by exhaustive testing.
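The numerical property described here, a single rounding for two additions, can be modeled in software. The sketch below is not the hardware architecture from the paper: it uses Python's double-precision floats rather than single precision, and it obtains the "infinite internal width" sum with exact rational arithmetic. The function names are hypothetical.

```python
from fractions import Fraction


def add3_two_roundings(a: float, b: float, c: float) -> float:
    """Conventional chained addition: the result is rounded after each add."""
    return (a + b) + c


def add3_single_rounding(a: float, b: float, c: float) -> float:
    """Exact three-input sum followed by one rounding to the nearest double."""
    exact = Fraction(a) + Fraction(b) + Fraction(c)
    return float(exact)


if __name__ == "__main__":
    a, b, c = 1e16, 1.0, 1.0
    # Each chained add loses the small operand; the fused sum keeps both.
    print(add3_two_roundings(a, b, c))
    print(add3_single_rounding(a, b, c))
```

With these inputs the chained version returns 1e16 because each intermediate sum rounds back down, while the single-rounding version returns the exactly representable value 1e16 + 2.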
International Conference on Computer Engineering and Technology | 2010
De-Li Wang; Deyuan Gao; Danghui Wang
CMPs (chip multiprocessors) are becoming ubiquitous. Most processors used in servers, desktops, and even embedded applications feature several cores, and chips with many more cores are predicted to become widespread in the near future. Because cores on the same chip share the DRAM memory system, threads executing simultaneously interfere with each other's memory requests. While recent studies have focused largely on multi-programmed workloads, less work has been done on single multi-threaded applications. In this paper, we examine policies for managing the fairness of off-chip memory bandwidth among different threads in parallel applications. First, we show that fairness problems also exist between different threads of the same parallel program. We then find that one cause is that shared-data memory accesses are not explicitly classified. Based on this observation, we propose a sharing-aware DRAM scheduler design that provides fair service to different threads in parallel programs. Experimental results show that this scheme effectively improves fairness among threads and also improves the performance of the whole system.
IEEE International Conference on Computer Science and Information Technology | 2009
Jie Chen; Deyuan Gao; Qiaoshi Zheng
Low-power design for various applications has always been a challenge for system designers. Dynamic power management, which selectively shuts down idle components, is widely studied and considered effective in reducing power consumption. Management strategies based on online algorithms are easy to implement and fast. However, because they rely only on the historical distribution of idle periods, these strategies predict inaccurately when the real distribution of idle periods changes sharply. This paper presents an optimized adaptive dynamic power management policy for further power savings. We introduce a differential adjusting factor into the exponential-average algorithm so that the predicted idle period adjusts rapidly and accurately to the real distribution. Experimental results demonstrate that our power management policy can reduce processor power dissipation to a greater extent and can be used in diverse applications.
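The prediction idea can be sketched as follows. The first function is the standard exponential-average predictor. The second adds a correction proportional to the change between the last two observed idle periods; the abstract does not give the paper's actual differential adjusting factor, so this correction, the gain `k`, and both function names are assumptions made purely for illustration.

```python
def exp_average(idle_periods, a=0.5, initial=0.0):
    """Return the prediction made before each observed idle period."""
    pred, preds = initial, []
    for actual in idle_periods:
        preds.append(pred)
        # Standard exponential average of history and the latest observation.
        pred = a * actual + (1 - a) * pred
    return preds


def exp_average_with_diff(idle_periods, a=0.5, k=0.5, initial=0.0):
    """Hypothetical variant: add k times the latest change in idle length."""
    pred, prev, preds = initial, None, []
    for actual in idle_periods:
        preds.append(pred)
        trend = 0.0 if prev is None else actual - prev
        # Clamp at zero because an idle period cannot be negative.
        pred = max(0.0, a * actual + (1 - a) * pred + k * trend)
        prev = actual
    return preds


if __name__ == "__main__":
    trace = [1, 1, 9, 9]  # idle length jumps sharply at the third period
    print(exp_average(trace))
    print(exp_average_with_diff(trace))
```

On this trace the plain predictor is still far below 9 when the fourth period starts, while the variant with the difference term has already moved most of the way, which is the kind of behavior the abstract attributes to its adjusting factor.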
International Conference on Signal Processing | 2012
Xianglong Ren; Deyuan Gao; Jianfeng An; Tao Yao; Limin Han; Xiaoya Fan
The Improved Asymmetric Multi-Channel Structure can effectively reduce head-of-line blocking and improve router performance. However, allocating channels uniformly across channel groups (CGs) wastes buffer space and significantly increases power. To resolve this issue, we present a novel channel planning algorithm that customizes the router design in a Network-on-Chip (NoC). More precisely, given the traffic characteristics of the target application and the channel budget, our algorithm automatically assigns the number of channels for each CG, at the different input ports of each router, to match the traffic pattern so that overall power consumption is minimized. Experimental results show that, compared with uniform channel allocation, our algorithm saves about 15% to 27% of the power consumed by buffers while maintaining similar performance.
International Conference on Embedded Software and Systems | 2005
Danghui Wang; Xiaoya Fan; Deyuan Gao; Shengbing Zhang; Jianfeng An
The purpose of this paper is to develop a flexible, highly efficient test method for core-based system-on-a-chip (SOC) designs. The novel feature of the approach is the use of an embedded microprocessor/memory pair to test the remaining components of the SOC. Its characteristics are: (1) several IP cores can be tested in parallel; (2) the order of test tasks need not be fixed during test integration but is scheduled by the test program. The method is called microprocessor based self schedule and parallel BIST for SOC (MBSSP-BIST). The feasibility of MBSSP-BIST is demonstrated by analyzing the test data bandwidth. Finally, several SOCs from the ITC’02 benchmarks are used to verify the approach, and the results show that MBSSP-BIST can improve test efficiency significantly.
International Conference on Embedded Software and Systems | 2005
Yuanli Jing; Xiaoya Fan; Deyuan Gao; Jian Hu
Network-on-Chip is a new methodology for System-on-Chip design. It can be used to improve communication performance among the many computing nodes of parallel DSP architectures. Simulations based on the 16-node 2D-mesh DragonFly DSP architecture show that 72.9% of inter-node communication has a routing distance of 1. A fast local router is proposed to improve the performance of this communication. Experiments on our simulator show that the overall inter-node communication delay decreases by 59.4%.
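For context on the 72.9% figure, the sketch below computes what fraction of source/destination pairs would be one hop apart if traffic were spread uniformly over a 4 × 4 mesh. It assumes a mesh without wraparound links and minimal routing, so that routing distance equals Manhattan distance; the uniform-traffic baseline and the function names are assumptions for illustration and are not taken from the paper.

```python
from itertools import product


def hop_distance(src, dst):
    """Manhattan distance between two (x, y) mesh coordinates."""
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])


def fraction_at_distance(width, height, distance):
    """Share of ordered node pairs separated by exactly `distance` hops."""
    nodes = list(product(range(width), range(height)))
    pairs = [(s, d) for s in nodes for d in nodes if s != d]
    hits = sum(1 for s, d in pairs if hop_distance(s, d) == distance)
    return hits / len(pairs)


if __name__ == "__main__":
    # 24 mesh links give 48 ordered neighbor pairs out of 240 pairs in total.
    print(fraction_at_distance(4, 4, 1))
```

Under uniform traffic only 20% of pairs are neighbors, so a measured 72.9% share of distance-1 communication indicates strongly localized traffic, which is the case a fast local router targets.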
Archive | 2011
Deyuan Gao; Hangpei Tian; Xiaoya Fan; Shengbing Zhang; Danghui Wang; Tingcun Wei; Xiaoping Huang; Meng Zhang; Ran Zheng