Weiwei Fu | Researchain

Publication

Featured research published by Weiwei Fu.

high performance computing and communications | 2014

An Exploration on Quantity and Layout of Wireless Nodes for Hybrid Wireless Network-on-Chip

Mingmin Yuan; Weiwei Fu; Tianzhou Chen; Minghui Wu

As integration scales, massive remote communication has become the main bottleneck of system performance for network-on-chip (NoC). Most packets have to travel long distances from source to destination, leading to long latency and severe contention. Hybrid Wireless NoC (HWiNoC) has emerged as a popular method to handle remote transmission in NoC, in which packets can be modulated onto a wireless channel and delivered to remote nodes in just one hop. However, wireless nodes introduce non-trivial overhead. In this paper, we explore the quantity and layout of wireless nodes in HWiNoC. We first define three optimization rules, whose main optimization targets are minimizing the Maximum Distance (MD), the Average Distance (AD), and the Sum of Distances from each node to the wireless nodes (SD), respectively. We further determine the optimal wireless node count to ensure the maximum performance gain per wireless node, and propose a novel heuristic to find a near-optimal target layout. Experimental results show that minimizing AD and SD outperforms minimizing MD by 6.21% and 8.19% in terms of average packet latency under synthetic traffic patterns, while minimizing AD exhibits more stability than the other two rules. Our heuristic layout introduces less than 1% performance loss in terms of AD and SD compared with the enumerated optimal layout, while the computational complexity is reduced from O(N!) to O(N).
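The three layout metrics can be illustrated with a small sketch. The abstract does not give exact definitions, so everything here is an assumption made for illustration: a square 2D mesh, Manhattan hop counts, a wireless transfer costing one hop between any two distinct wireless nodes, MD and AD taken over all node pairs, and SD taken as the sum of each node's distance to its nearest wireless node. The function names `manhattan`, `hop_distance`, and `layout_metrics` are hypothetical.

```python
from itertools import product


def manhattan(a, b):
    """Hop count between two mesh coordinates over wired links only."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def hop_distance(a, b, wireless_nodes):
    """Shortest hop count from a to b, optionally using one wireless hop.

    Assumption: a wireless transfer between two distinct wireless nodes
    costs exactly one hop, and at most one wireless hop is used per packet.
    """
    best = manhattan(a, b)
    for wa in wireless_nodes:
        for wb in wireless_nodes:
            if wa != wb:
                best = min(best, manhattan(a, wa) + 1 + manhattan(wb, b))
    return best


def layout_metrics(mesh_size, wireless_nodes):
    """Return assumed MD, AD, and SD values for one wireless layout."""
    nodes = list(product(range(mesh_size), repeat=2))
    pair_dists = [
        hop_distance(a, b, wireless_nodes)
        for a in nodes
        for b in nodes
        if a != b
    ]
    nearest = [min(manhattan(n, w) for w in wireless_nodes) for n in nodes]
    return {
        "MD": max(pair_dists),                    # worst-case pair distance
        "AD": sum(pair_dists) / len(pair_dists),  # mean pair distance
        "SD": sum(nearest),                       # total distance to nearest wireless node
    }


if __name__ == "__main__":
    print(layout_metrics(4, [(0, 0), (3, 3)]))
```

Comparing the returned dictionary across candidate layouts shows how the three rules can rank the same layouts differently, which is the trade-off the abstract evaluates.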

Microprocessors and Microsystems | 2014

Packet triggered prediction based task migration for network-on-chip

Tianzhou Chen; Weiwei Fu; Bin Xie; Chao Wang

Advances in IC technology make Network-on-Chip (NoC) an attractive architecture for future systems. Task migration is important for the overall performance of NoCs, since the changing system state makes static task mapping unsuitable. The predictability of application behavior makes it possible to use prediction to guide task migration. The trigger that initiates task migration is also an important parameter. In this paper, we first defined and analyzed the predictability of applications using experimental results. We also compared different triggers for migration and concluded that a trigger based on the packets sent by a single node is the best choice. We modified Genetic Algorithm (GA) mapping for migration and proposed two algorithms, Simple Exchange (SE) and Benefit Assess (BA). A Node Lock mechanism is also used to reduce the number of migrations. These algorithms and the node lock are evaluated using real applications. According to the experimental results, SE eliminated 78.7% of migrations with 9% less latency reduction than GA; BA reduced latency by 27.2% and eliminated 72.0% of migrations with almost the same performance as GA. The node lock mechanism removed 37.3% and 46.0% of migrations in SE and BA, respectively, with almost the same performance.

high performance computing and communications | 2014

CABSR: Congestion Agent Based Source Routing for Network-on-Chip

Mingmin Yuan; Weiwei Fu; Tianzhou Chen; Wei Hu; Minghui Wu

Network-on-chip (NoC) has recently emerged as a primary paradigm for interconnecting the ever-increasing number of on-chip cores in future chip multiprocessors (CMPs). With the advent of the big data era, a high volume of traffic injection, combined with ever-changing traffic patterns, will exert severe pressure on some hotspots, leading to serious traffic congestion. These hot nodes quickly exhaust the communication resources in their areas and prompt the expansion of the congestion tree, which can degrade the global traffic condition. In this work, we propose a novel low-cost routing mechanism called Congestion Agent Based Source Routing (CABSR). We place special routing agents called Congestion Agents (CAs) on the edge of the congestion area, which maintain congestion information for an extended range of the area. CAs take charge of throttling flows through the congested link and calculating a new appropriate path to help the packet penetrate the congestion area. Experimental results show that, compared with oblivious DOR routing, local-adaptive routing, and regional-adaptive routing, CABSR achieves performance improvements of 8.3% and 9.24% on average in terms of average packet latency and throughput, respectively. This shows that CABSR can efficiently alleviate congestion, balance the load, and inhibit the expansion of the congestion tree.

Journal of Parallel and Distributed Computing | 2014

Direct distributed memory access for CMPs

Weiwei Fu; Li Liu; Tianzhou Chen

On-chip distributed memory has emerged as a promising memory organization for future many-core systems, since it efficiently exploits memory-level parallelism and can lighten the load on each memory module by providing a number of memory interfaces comparable to the number of on-chip cores. The packet-based memory access model (PDMA) provides a scalable and flexible solution for distributed memory management, but suffers from complicated and costly on-chip network protocol translation and massive interference among packets, which lead to unpredictable performance. In this paper we propose a direct distributed memory access (DDMA) model, in which remote memory can be directly accessed by local cores via remote-to-local virtualization, without network protocol translation. From the perspective of local cores, remote memory controllers (MCs) can be directly manipulated by accessing the local agent MC, which is responsible for accessing remote memory through high-performance inter-tile communication. We further discuss detailed architectural support for the DDMA model, including the memory interface design, the workflow, and the protocols involved. Simulation results from executing PARSEC benchmarks show that our DDMA architecture outperforms PDMA in terms of both average memory access latency and IPC, by 17.8% and 16.6% respectively on average. In addition, DDMA can better manage congested memory traffic, since a reduction of bandwidth when running memory-intensive SPEC2006 workloads incurs only an 18.9% performance penalty, compared with 38.3% for PDMA.

ieee international conference on high performance computing data and analytics | 2012

Design of a High-Throughput NoC Router with Neighbor Flow Regulation

Weiwei Fu; Jingcheng Shao; Bin Xie; Tianzhou Chen; Li Liu

The throughput of a Network-on-Chip (NoC) mainly depends on the specific router microarchitecture design. In addition, coordination among routers plays an important role in NoC performance, since router behaviors are closely related, especially between neighbors. In this work we focus on this strong tie between neighboring routers and propose a novel router microarchitecture design with neighbor flow regulation (NFR). We use simple logic and several low-cost wires to build an additional regulation network. Each module in the network collects information on flows in neighboring routers and regulates their arbitration schemes to prevent congestion and starvation. Simulation results show that our router design with NFR mechanisms can increase network throughput by 6.7% on average, and is able to achieve nearly 28% improvement in switch matching efficiency under hot-spot traffic.

international symposium on parallel and distributed computing | 2014

Agent-Based Memory Access for Many-Core CMPs

Weiwei Fu; Mingmin Yuan; Tianzhou Chen; Li Liu

The trend of increasing on-chip core counts and integrating memory controllers (MCs) makes core-to-memory communication a major obstacle to scaling memory access performance for many-core CMPs. Unmanaged on-chip traffic for long-distance memory accesses, combined with information asymmetry between cores and remote MCs, may lead to serious inefficiency in processing massively parallel memory accesses. In this paper we propose a novel agent-based memory access model for CMPs. We employ multiple agents inside the network to assist memory accesses to remote MCs. Their role lies in conducting memory requests from nearby cores, merging repetitive memory requests, and optimizing the scheduling under the backpressure of target MCs. We further describe a simple but effective case of agent-based memory access called Quad Agent, which deploys static agent modules in each quadrant of the network to serve requests towards MCs in other quadrants. The details of memory access merging, the scheduling schemes, and the architectural support are discussed. Simulation results show that Quad Agent can reduce memory access latency by 13.3% on average and achieve a 20.5% IPC speedup compared with the baseline. The performance improvement is due to an increased memory access merge rate and row buffer hit rate, as well as the prevention of traffic congestion and bank starvation.
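The idea of merging repetitive memory requests at an agent can be sketched as follows. This is only an illustrative model, not the paper's design: it assumes read requests keyed by block address, and the class `MemoryAgent` and its methods `submit` and `complete` are hypothetical names.

```python
from collections import OrderedDict


class MemoryAgent:
    """Toy agent that merges read requests to the same block address.

    Assumption: only reads are merged, and one outstanding request per
    address is forwarded to the remote memory controller.
    """

    def __init__(self):
        # block address -> list of core ids waiting for that block
        self.pending = OrderedDict()

    def submit(self, core_id, address):
        """Register a request; return True if it must be forwarded to the MC."""
        if address in self.pending:
            # A request for this block is already in flight: merge.
            self.pending[address].append(core_id)
            return False
        self.pending[address] = [core_id]
        return True

    def complete(self, address):
        """Return the cores to notify when the block arrives from the MC."""
        return self.pending.pop(address, [])


if __name__ == "__main__":
    agent = MemoryAgent()
    print(agent.submit(0, 0x100))  # first request is forwarded
    print(agent.submit(1, 0x100))  # second request is merged
    print(agent.complete(0x100))   # both cores receive the reply
```

Keeping the pending table in arrival order leaves room for a scheduling policy on top of it, which the abstract mentions but this sketch does not model.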

high performance computing and communications | 2014

SmartMig: A Case for Page Migration and Self-Interleaving for On-Chip Distributed Memory Systems

Weiwei Fu; Mingmin Yuan; Tianzhou Chen; Qingsong Shi; Li Liu; Minghui Wu

This paper tries to optimize the placement of data pages, which has a strong impact on system performance. We find that both core-to-memory distance and contention on MCs and interconnects are critical. Migrating pages to their page access center can minimize the average memory access distance, but may cause serious contention and congestion, necessitating further schemes for load balancing. Based on these observations, we propose a novel runtime mechanism called SmartMig, in which we migrate data pages to shorten the memory access distance, while employing page self-interleaving to balance the load across the nodes. We propose models and algorithms to decide the fate of candidate pages. Simulation results show that SmartMig achieves performance improvements of 26.9% and 21.0% in terms of normalized IPC and average memory access latency, respectively, as a result of a significant reduction in core-to-memory distance and load imbalance.
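One way to picture a page access center is as the mesh node that minimizes the access-weighted distance to all cores touching the page. The abstract does not define the term precisely, so this sketch assumes a square 2D mesh with Manhattan distances and a brute-force search. The function name `page_access_center` and its input format are hypothetical.

```python
def page_access_center(mesh_size, access_counts):
    """Find the node with the lowest access-weighted Manhattan distance.

    access_counts maps a core coordinate (x, y) to the number of accesses
    that core made to the page. Returns (best_node, weighted_cost).
    Ties are broken in favor of the first node in row-major order.
    """
    best_node, best_cost = None, None
    for x in range(mesh_size):
        for y in range(mesh_size):
            cost = sum(
                count * (abs(x - cx) + abs(y - cy))
                for (cx, cy), count in access_counts.items()
            )
            if best_cost is None or cost < best_cost:
                best_node, best_cost = (x, y), cost
    return best_node, best_cost


if __name__ == "__main__":
    # A page used heavily by core (0, 0) and rarely by core (3, 3).
    print(page_access_center(4, {(0, 0): 10, (3, 3): 1}))
```

Placing every hot page at such a node is exactly what can concentrate load on a few memory controllers, which is why the abstract pairs migration with a load-balancing scheme.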

high performance computing and communications | 2014

Benefit of Unbalanced Traffic Distribution for Improving Local Optimization Efficiency in Network-on-Chip

Weiwei Fu; Mingmin Yuan; Tianzhou Chen; Qingsong Shi; Li Liu; Minghui Wu

Recently proposed technologies for enhancing NoC performance can be applied to only part of the network to save hardware overhead and energy. This paper discusses the influence of network traffic distribution on local optimization efficiency. We introduce a two-dimensional finite difference method to quantify the non-uniformity of NoC traffic. We then describe hotspots and potential optimization nodes (PONs) in NoC mathematically, along with some low-cost architectural support to discover them. Experimental results show that an unbalanced traffic pattern may improve the efficiency of some local optimization methods, and that a PON is a better choice than a hotspot for single-node optimization.
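A two-dimensional finite difference over per-node traffic can be sketched with a standard five-point stencil. The abstract does not state which difference operator or summary statistic the paper uses, so both the stencil and the mean-absolute-value score below are assumptions chosen for illustration. The function names `traffic_laplacian` and `non_uniformity` are hypothetical.

```python
def traffic_laplacian(traffic):
    """Apply a five-point finite difference to a 2D traffic grid.

    traffic is a list of rows, one load value per mesh node. Border nodes
    use only the neighbors that exist, so a uniform grid maps to all zeros.
    """
    rows, cols = len(traffic), len(traffic[0])
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            neighbors = [
                traffic[nr][nc]
                for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1))
                if 0 <= nr < rows and 0 <= nc < cols
            ]
            out[r][c] = sum(neighbors) - len(neighbors) * traffic[r][c]
    return out


def non_uniformity(traffic):
    """Mean absolute finite difference: 0 for perfectly uniform traffic."""
    lap = traffic_laplacian(traffic)
    node_count = len(traffic) * len(traffic[0])
    return sum(abs(v) for row in lap for v in row) / node_count


if __name__ == "__main__":
    uniform = [[2, 2, 2], [2, 2, 2], [2, 2, 2]]
    hotspot = [[0, 0, 0], [0, 4, 0], [0, 0, 0]]
    print(non_uniformity(uniform), non_uniformity(hotspot))
```

In this sketch a strongly negative value marks a node carrying more load than its neighbors, which is one plausible way to flag a hotspot candidate.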

high performance computing and communications | 2014

HSR: Hierarchical Source Routing Model for Network-on-Chip

Mingmin Yuan; Weiwei Fu; Tianzhou Chen; Minghui Wu

The Distributed Routing Model (DRM) and the Centralized Routing Model (CRM) are the two mainstream routing models for Network-on-Chip (NoC). DRM is scalable and efficient, but its state learning is costly because network state must be exchanged by flooding, which may consume a large amount of bandwidth. CRM monitors the real-time network state and sends it to a central routing node as the basis for routing path computation. It can achieve the optimal routing decision for each packet, but suffers from huge design complexity and poor scalability. In this work, we propose a novel routing model called Hierarchical Source Routing (HSR), in which a tree-like hierarchical control network is built above the original network. Each non-leaf node in the tree is regarded as a control node that manages the network states and routing decisions of a group of nodes in the lower level. We propose two state update modes for the non-leaf nodes, named HSR-PATH and HSR-PORT, which apply the average minimum path load and the weighted port load, respectively, to optimal path calculation. Their implementation and overhead are also discussed. Our experiments show that HSR-PATH and HSR-PORT outperform DRM in terms of Saturated Injection Rate by 26.1% and 17.9% on average, respectively, and achieve 96.6% of the performance of CRM with much less power consumption. Finally, using the Throughput Energy Ratio (TER) as a comprehensive indicator, HSR-PATH and HSR-PORT outperform CRM by 59.4% and 13.4%, respectively, and DRM by 13.7% and 17.3%, respectively.

computer and information technology | 2014

Design and Evaluation of Virtual Channel-Based Optical-Electrical Interface for Optical Network-on-Chip

Weiwei Fu; Mingmin Yuan; Tianzhou Chen; Minghui Wu

Nanophotonic interconnects have recently emerged as a promising candidate for on-chip communication in future chip multiprocessors (CMPs), providing high-bandwidth chip-wide communication at low latency and power overhead. Optical-electrical interfaces are important components in an optical NoC (ONoC). Their role lies in transporting data packets from the network interfaces (NIs) of local processing units to the optical modulators, and dispatching the demodulated packets to the destination NIs. In this paper we propose a virtual channel (VC) based interface architecture and introduce its implementation in detail. Control table-based virtual channels are used as our interface organization, which may eliminate HoL blocking, trigger moderate nominations and show more flexibility. Simulation results show that our VC-based architecture can achieve up to 90% of the performance of the full VOQ while consuming only 71% of the static energy, and surpasses the full VOQ and m-VOQ in terms of throughput-energy ratio by 80% and 4%, respectively.
