Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Tianzhou Chen is active.

Publication


Featured research published by Tianzhou Chen.


computer and information technology | 2010

Homogeneous NoC-based FPGA: The Foundation for Virtual FPGA

Jie Yang; Like Yan; Lihan Ju; Yuan Wen; Shaobin Zhang; Tianzhou Chen

Reconfigurable computing based on FPGAs (Field Programmable Gate Arrays) has been a promising approach to improving performance with high flexibility. However, the limited physical capacity of FPGAs has prevented their wide adoption in the real world. In this paper, a homogeneous NoC-based FPGA architecture is proposed, in which reconfigurable and I/O resources are interconnected via a NoC so that reconfigurable modules can be placed anywhere, as long as enough space is available. Meanwhile, a virtual FPGA is proposed, with which a circuit larger than the physical device can be implemented on a limited-capacity FPGA. The experiments verified that our approach provides more flexible reconfiguration and that, by combining a NoC with the FPGA, resource utilization increases by 44.7%-53.5%, because fragmented configurable regions benefit from this kind of dynamic partial reconfiguration.
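The virtual-FPGA idea above resembles paging: when a circuit exceeds the physical capacity, its modules are configured into a limited set of regions on demand and the least recently used one is evicted. A minimal sketch of that analogy (the class, region count, and LRU policy are illustrative assumptions, not details from the paper):

```python
from collections import OrderedDict

class VirtualFPGA:
    """Toy model of a virtual FPGA: an over-large circuit's modules are
    configured into a limited number of physical reconfigurable regions
    on demand, with LRU eviction, much like virtual memory paging."""

    def __init__(self, physical_regions):
        self.capacity = physical_regions
        self.regions = OrderedDict()   # module -> configured, in LRU order
        self.reconfigs = 0             # partial reconfigurations performed
        self.hits = 0                  # module was already on the fabric

    def invoke(self, module):
        if module in self.regions:
            self.regions.move_to_end(module)      # refresh LRU position
            self.hits += 1
        else:
            if len(self.regions) >= self.capacity:
                self.regions.popitem(last=False)  # evict least recently used
            self.regions[module] = True
            self.reconfigs += 1
```

Under this model a circuit with three modules runs on two physical regions, at the cost of extra reconfigurations whenever the working set of modules exceeds the fabric.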


international conference on supercomputing | 2009

Less reused filter: improving L2 cache performance via filtering less reused lines

Lingxiang Xiang; Tianzhou Chen; Qingsong Shi; Wei Hu

The L2 cache is commonly managed with an LRU policy. For workloads whose working set is larger than the L2 cache, LRU behaves poorly, producing a great number of less reused lines that are never reused or are reused only a few times. In this case, cache performance can be improved by retaining a portion of the working set in the cache for a sufficiently long period. Previous schemes approach this by bypassing never-reused lines. Nevertheless, severely constrained by the number of never-reused lines, they sometimes deliver no benefit at all. This paper proposes a new filtering mechanism that filters out less reused lines rather than only never-reused lines. The extended scope of bypassing provides more opportunities to fit the working set into the cache. This paper also proposes the Less Reused Filter (LRF), a separate structure preceding the L2 cache, to implement this mechanism. The LRF employs a reuse frequency predictor to accurately identify less reused lines among incoming lines. Meanwhile, based on our observation that most less reused lines have a short life span, the LRF places the filtered lines into a small filter buffer to fully utilize them, avoiding extra misses. Our evaluation on 24 SPEC 2000 benchmarks shows that augmenting a 512KB LRU-managed L2 cache with an LRF having a 32KB filter buffer reduces the average MPKI by 27.5%, narrowing the gap between LRU and OPT by 74.4%.
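The filtering mechanism can be sketched as follows: a per-line reuse counter plays the role of the reuse frequency predictor, and lines predicted to be less reused bypass the main cache into a small filter buffer until they prove themselves. A minimal Python sketch (the sizes, the threshold, and the fully associative LRU model are illustrative assumptions, not the paper's configuration):

```python
from collections import OrderedDict, defaultdict

class LessReusedFilter:
    """Toy sketch of the LRF idea: lines predicted to be reused rarely
    bypass the main cache and live in a small filter buffer instead."""

    def __init__(self, l2_lines=64, buf_lines=8, reuse_threshold=2):
        self.l2 = OrderedDict()              # LRU-managed main cache
        self.buf = OrderedDict()             # small filter buffer
        self.l2_lines = l2_lines
        self.buf_lines = buf_lines
        self.threshold = reuse_threshold
        self.reuse_count = defaultdict(int)  # per-line predictor state
        self.hits = self.misses = 0

    def access(self, addr):
        if addr in self.l2:
            self.l2.move_to_end(addr)        # LRU update on hit
            self.reuse_count[addr] += 1
            self.hits += 1
        elif addr in self.buf:
            self.buf.move_to_end(addr)
            self.reuse_count[addr] += 1
            self.hits += 1
            if self.reuse_count[addr] >= self.threshold:
                # line proved reusable: promote it into the main cache
                del self.buf[addr]
                self._fill(self.l2, self.l2_lines, addr)
        else:
            self.misses += 1
            if self.reuse_count[addr] >= self.threshold:
                self._fill(self.l2, self.l2_lines, addr)
            else:
                # predicted less-reused: bypass into the filter buffer
                self._fill(self.buf, self.buf_lines, addr)

    @staticmethod
    def _fill(cache, capacity, addr):
        if len(cache) >= capacity:
            cache.popitem(last=False)        # evict LRU entry
        cache[addr] = True
```

In this toy model a one-pass streaming scan stays in the filter buffer and never evicts the hot lines that have been promoted into the main cache, which is the effect the paper's bypassing aims at.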


computer and information technology | 2010

An Efficient Power-Aware Optimization for Task Scheduling on NoC-based Many-core System

Wei Hu; Xingsheng Tang; Bin Xie; Tianzhou Chen; Dazhou Wang

With the development of the semiconductor industry, more processors can be integrated onto a single chip. Network-on-Chip (NoC) is an efficient interconnection solution for many-core systems with many processor cores on a chip. However, enhancing performance at lower power consumption is still a challenge, and the core issue is the mapping of applications to the NoC. A common method is to find processes that communicate heavily with each other and map them to neighboring cores, reducing the communication distance and avoiding unnecessary energy cost. This work proposes an online scheduling method that optimizes the task scheduling algorithm for low communication energy consumption. The communication status of applications at run time is analyzed first; the algorithm then computes the mapping dynamically and performs real-time scheduling online. Simulation-based experimental results show that the proposed algorithm achieves more than 30% communication energy saving with low complexity.
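The mapping idea, placing heavily communicating processes on neighboring tiles to cut hop count, can be sketched with a greedy placement on a mesh, scoring a mapping by traffic volume times Manhattan hop distance (a common energy proxy; the greedy heuristic and all numbers here are an illustration, not the paper's algorithm):

```python
def comm_energy(mapping, comm):
    """Energy proxy: traffic volume times Manhattan hops on a mesh.
    comm maps task pairs (a, b) to communication volume."""
    total = 0
    for (a, b), vol in comm.items():
        (xa, ya), (xb, yb) = mapping[a], mapping[b]
        total += vol * (abs(xa - xb) + abs(ya - yb))
    return total

def greedy_map(comm, mesh_w, mesh_h):
    """Place the heaviest-communicating pairs first; each new task goes
    on the free tile closest (by weighted hops) to its placed partners."""
    free = {(x, y) for x in range(mesh_w) for y in range(mesh_h)}
    mapping = {}

    def placed_cost(tile, task):
        c = 0
        for (p, q), v in comm.items():
            other = q if p == task else p if q == task else None
            if other in mapping:
                c += v * (abs(tile[0] - mapping[other][0])
                          + abs(tile[1] - mapping[other][1]))
        return c

    for (a, b), _ in sorted(comm.items(), key=lambda kv: -kv[1]):
        for task in (a, b):
            if task not in mapping:
                # sorted() makes the tie-break deterministic
                tile = min(sorted(free), key=lambda t: placed_cost(t, task))
                mapping[task] = tile
                free.remove(tile)
    return mapping
```

For a four-task communication chain on a 2x2 mesh, the greedy placement keeps every communicating pair one hop apart, while a careless placement pays two hops on the heavy edges.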


computer and information technology | 2010

Smartphone Software Development Course Design Based on Android

Wei Hu; Tianzhou Chen; Qingsong Shi; Xueqing Lou

Mobile computing has become popular now that wireless networks have been deployed almost everywhere. As the pivotal devices of mobile computing, smartphones have become important tools in our society, offering abundant functions including communication, entertainment, and mobile office work. Smartphone software development has accordingly become more important than before. Android, an open source platform, is one of the emerging leading operating systems for smartphones; many smartphones have adopted it and more will do so in the future. How to develop software for smartphones based on Android and similar platforms is thus an emerging question. In this paper we propose a smartphone software development course design based on Android. The course focuses on how to teach this development technology to students. The course design has two parts: the syllabus design and the hands-on lab design. The innovations are also described in detail, and they play a key role in the teaching.


computational science and engineering | 2010

Input-Driven Reconfiguration for Area and Performance Adaption of Reconfigurable Accelerators

Like Yan; Yuan Wen; Tianzhou Chen

Attaching a reconfigurable loop accelerator to a processor is a promising way to improve the performance and efficiency of a system, and it can be further enhanced by unrolling the loop to adjust its parallelism. The more a loop is unrolled, the more reconfigurable area it requires. However, the utilization of a loop accelerator depends heavily on the input, and in some situations over-unrolling the loop simply wastes area. Focusing on the balance between area and performance, this paper proposes a dynamically adaptive reconfigurable accelerator framework for the processor/RL architecture, in which reconfiguration of the accelerator is driven by the input. An accelerator selection model is presented for selecting an accelerator at run time among variants predefined for different input patterns. With the help of a detailed bzip2 case study, experimental results demonstrate the feasibility of the approach, showing that up to 69.21% of the reconfigurable area is saved at a cost of a 2.63% performance slowdown in the best case.
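An accelerator selection model of the kind described can be sketched as a runtime lookup: given per-variant area and a simple cycle model, pick the smallest-area unrolled variant whose runtime stays within a slowdown budget for the current input. The variants, cycle models, and budget below are invented for illustration and are not the paper's model:

```python
def select_accelerator(variants, input_size, max_slowdown=1.05):
    """Pick the least-area variant whose predicted runtime is within
    max_slowdown of the fastest variant for this input size."""
    timed = [(v, v['cycles'](input_size)) for v in variants]
    best = min(t for _, t in timed)
    feasible = [(v, t) for v, t in timed if t <= best * max_slowdown]
    return min(feasible, key=lambda vt: vt[0]['area'])[0]

# Hypothetical unroll variants: more unrolling uses more area and adds
# fixed configuration overhead, but processes more elements per cycle.
variants = [
    {'unroll': 1, 'area': 100, 'cycles': lambda n: n},
    {'unroll': 2, 'area': 180, 'cycles': lambda n: n // 2 + 10},
    {'unroll': 4, 'area': 340, 'cycles': lambda n: n // 4 + 30},
]
```

With these numbers, a small input keeps the compact variant on the fabric (saving area at negligible cost), while a large input justifies the widest unrolling, which mirrors the area/performance trade-off the framework adapts to.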


ieee international conference on high performance computing data and analytics | 2012

Exploring Potential Parallelism of Sequential Programs with Superblock Reordering

John M. Ye; Tianzhou Chen

The growing number of processing cores in a single CPU demands more parallelism from sequential programs, but in the past decades little work has succeeded in automatically exploiting enough parallelism, which casts a shadow over many-core architecture and automatic parallelization research. Moreover, little work has actually tried to understand the nature, inner structure, or amount of the parallelism potentially available in programs. In this paper we analyze at runtime the dynamic data dependencies among superblocks of sequential programs. We designed a Meta Reorder Buffer (Meta-RB) to measure and exploit the available parallelism: superblocks are dynamically analyzed, reordered, and dispatched to run in parallel on an ideal many-core processor, while data dependencies and program correctness are maintained. In our experiments we observed that, with superblock reordering, the potential speedup ranged from 1.627 to 95.275, reaching 22.852 on average. The results show that the potential parallelism of normal programs is still far from fully exploited by existing technologies, which makes automatic parallelization a promising research direction for many-core architectures.
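The dependence analysis behind superblock reordering can be sketched like this: a superblock depends on an earlier one if their memory accesses conflict (RAW, WAW, or WAR), independent superblocks can share an execution wave on the ideal machine, and the potential speedup is the block count divided by the number of waves. This is a toy model of the idea, not the Meta-RB itself:

```python
def schedule_waves(blocks):
    """blocks: list of (reads, writes) sets in program order.
    Returns the number of parallel waves needed on an ideal machine,
    assuming every superblock takes unit time."""
    # Build dependence edges: block j depends on earlier block i
    # if their accesses conflict (RAW, WAW, or WAR).
    deps = [set() for _ in blocks]
    for j, (rj, wj) in enumerate(blocks):
        for i in range(j):
            ri, wi = blocks[i]
            if wi & rj or wi & wj or ri & wj:
                deps[j].add(i)
    # List-schedule: a block runs one wave after its latest predecessor.
    wave_of = {}
    for j in range(len(blocks)):
        wave_of[j] = 1 + max((wave_of[i] for i in deps[j]), default=0)
    return max(wave_of.values(), default=0)
```

With four superblocks forming two independent chains, the model schedules them in two waves, a potential speedup of 2; fully independent blocks collapse into a single wave.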


computer and information technology | 2010

A Reconfigurable Processor Architecture Combining Multi-core and Reconfigurable Processing Unit

Like Yan; Binbin Wu; Yuan Wen; Shaobin Zhang; Tianzhou Chen

Adding a reconfigurable processing unit to a general purpose processor is a promising way to improve performance significantly. In this paper, a Reconfigurable Multi-Core (RMC) architecture combining a general multi-core and reconfigurable logic is proposed. The reconfigurable logic is logically divided into Reconfigurable Processing Units (RPUs), which are coupled with General Purpose Cores (GPCs) as co-processors via a configurable full crossbar switch, and an RPU-Manager (RPU-M) is designed to manage the RPUs. To verify RMC, a simulation methodology based on Simics and a Virtex 5 FPGA is adopted, which simplifies the simulation while preserving the accuracy of the hardware function core. Experimental results on the 3-DES, AES, and JPEG_ENC workloads show a 2.34X average speedup over the software implementation, with acceptable data and control transfer overhead.


Computing | 2014

Potential thread-level-parallelism exploration with superblock reordering

John M. Ye; Hui Yan; Honglun Hou; Tianzhou Chen

The growing number of processing cores in a single CPU demands more parallelism from sequential programs, but in the past decades little work has succeeded in automatically exploiting enough parallelism, which casts a shadow over many-core architecture and automatic parallelization research. Moreover, little work has actually tried to understand the nature, or amount, of the parallelism potentially available in programs. In this paper we analyze at runtime the dynamic data dependencies among superblocks of sequential programs. We designed a meta re-arrange buffer to measure and exploit the available parallelism, with which superblocks are dynamically analyzed, reordered, and dispatched to run in parallel on an ideal many-core processor, while data dependencies and program correctness are maintained. In our experiments we observed that, with superblock reordering, the potential speedup ranged from 1.08 to 89.60. The results showed that the potential parallelism of normal programs was still far from fully exploited by existing technologies. This observation makes automatic parallelization a promising research direction for many-core architectures.


The Journal of Supercomputing | 2013

An energy-aware online task mapping algorithm in NoC-based system

Bin Xie; Tianzhou Chen; Wei Hu; Xingsheng Tang; Dazhou Wang

With the development of semiconductor technology, more processors can be integrated onto a single chip, and Network-on-Chip (NoC) is an efficient communication solution for such many-core systems. However, enhancing performance at lower energy consumption is still a challenge, and one critical issue is mapping applications to the NoC. This work proposes an online mapping method that optimizes the task mapping algorithm to reduce communication energy consumption. The communication status of applications at runtime is analyzed first; the algorithm then computes the mapping placement dynamically and performs real-time mapping online. Simulation-based experimental results show that the proposed algorithm achieves more than 20% communication energy saving compared with first-fit mapping and nearest-neighbor mapping. The migration cost caused by the remapping process is also considered and can be calculated at runtime to estimate the effect of remapping.


Journal of Computer and System Sciences | 2013

Regional cache organization for NoC based many-core processors

John M. Ye; Man Cao; Zening Qu; Tianzhou Chen

As the number of Processing Elements (PEs) on a single chip keeps growing, we now face slower memory references due to longer wire delays, more intense on-chip resource contention, and heavier network traffic congestion. Network on Chip (NoC) is considered a promising inter-core connection paradigm for future many-core processors. In this paper we examine how regional cache organizations drastically reduce the average network latency, and propose a regional cache architecture with Delegate Memory Management Units (D-MMUs) for NoC-based processors. Experiments showed that L2 cache access latency is largely determined by the cache organization and its interconnection paradigm with the PEs in the NoC, and that the regional organization is essential for better NoC cache performance.

Collaboration


Dive into Tianzhou Chen's collaborations.

Top Co-Authors


Wei Hu

Zhejiang University


Minghui Wu

Zhejiang University City College
