Minghui Wu
Zhejiang University
Publications
Featured research published by Minghui Wu.
IEEE International Conference on Dependable, Autonomic and Secure Computing | 2011
Baozhong Yu; Jianliang Ma; Tianzhou Chen; Minghui Wu
Last-level caches (LLCs) grow large, with significant power consumption. As LLC capacity increases, it becomes increasingly inefficient: as recent studies show, a large percentage of cache blocks are dead for most of their time in the cache. There is a growing need for LLC management that reduces the number of dead blocks in the LLC; however, the placement and replacement operations involving dead blocks carry a significant power cost. In this paper, we introduce the global priority table predictor, a technique for determining a cache block's priority when it is inserted into the LLC. It is similar in spirit to previous predictors, such as reuse-distance and dead-block predictors. The global priority table is indexed by a hash of the block address and stores the priority value of the associated cache block. The priority value can be used to drive dead-block replacement and bypass optimizations, so a large number of dead blocks can be bypassed through the priority table. It achieves an average reduction of 13.2% in LLC misses for twenty single-thread workloads from the SPEC2006 suite and 29.9% for ten multi-programmed workloads. It also yields a geometric mean speedup of 8.6% for single-thread workloads and a geometric mean normalized weighted speedup of 39.1% for multi-programmed workloads.
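A minimal sketch of how such a table-based predictor might be organized (the table size, counter width, hash, and update rule below are illustrative assumptions, not the paper's parameters):

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Sketch of a global priority table predictor: one saturating counter per
// hashed block address, trained on reuse and non-reuse.
class PriorityTablePredictor {
    static constexpr size_t kEntries = 4096;    // assumed table size
    static constexpr uint8_t kMaxPriority = 7;  // assumed 3-bit counter
    std::array<uint8_t, kEntries> table{};

    static size_t hash(uint64_t blockAddr) {
        // Simple XOR-fold hash of the block address (assumption).
        return (blockAddr ^ (blockAddr >> 12)) % kEntries;
    }
public:
    // On an LLC insertion attempt: priority 0 predicts a dead block,
    // so the fill can bypass the LLC entirely.
    bool shouldBypass(uint64_t blockAddr) const {
        return table[hash(blockAddr)] == 0;
    }
    // Victim selection can prefer the lowest-priority block in a set.
    uint8_t priorityOf(uint64_t blockAddr) const {
        return table[hash(blockAddr)];
    }
    // Train on reuse: a hit suggests the block is live.
    void onHit(uint64_t blockAddr) {
        uint8_t& p = table[hash(blockAddr)];
        if (p < kMaxPriority) ++p;
    }
    // Train on eviction without reuse: the block was likely dead.
    void onEvictUnused(uint64_t blockAddr) {
        uint8_t& p = table[hash(blockAddr)];
        if (p > 0) --p;
    }
};
```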
High Performance Computing and Communications | 2014
Mingmin Yuan; Weiwei Fu; Tianzhou Chen; Minghui Wu
With the scaling of integration, massive remote communication has become the main bottleneck of system performance for networks-on-chip (NoCs). Most packets have to travel long distances from source to destination, leading to long latency and severe contention. The hybrid wireless NoC (HWiNoC) has emerged as a popular way to handle remote transmission in NoCs: packets can be modulated onto a wireless channel and delivered to remote nodes in just one hop. However, wireless nodes introduce non-trivial overhead. In this paper, we explore the quantity and layout of wireless nodes in HWiNoC. We first define three optimization rules, whose targets are minimizing the maximum distance (MD), the average distance (AD), and the sum of distances from each node to the wireless nodes (SD), respectively. We then determine the optimal wireless-node count that maximizes the performance gain per wireless node, and propose a novel heuristic to find a near-optimal layout. Experimental results show that minimizing AD and SD outperforms minimizing MD by 6.21% and 8.19% in average packet latency under synthetic traffic patterns, while minimizing AD is more stable than the other two rules. Our heuristic layout loses less than 1% in AD and SD compared with the enumerated optimal layout, while reducing the computational complexity from O(N!) to O(N).
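One plausible reading of the three layout metrics on a 2-D mesh, assuming Manhattan hop distance and that each node is served by its nearest wireless node (under this simple reading AD is just SD divided by the node count; the paper's exact definitions may differ):

```cpp
#include <algorithm>
#include <cstdlib>
#include <vector>

struct Node { int x, y; };

// Manhattan (hop) distance on a 2-D mesh.
int dist(const Node& a, const Node& b) {
    return std::abs(a.x - b.x) + std::abs(a.y - b.y);
}

// Metrics for one candidate wireless-node layout: MD = worst-case
// distance to a wireless node, SD = total distance, AD = SD / #nodes.
struct LayoutMetrics { int md; long sd; double ad; };

LayoutMetrics evaluate(const std::vector<Node>& nodes,
                       const std::vector<Node>& wireless) {
    int md = 0;
    long sd = 0;
    for (const Node& n : nodes) {
        int nearest = dist(n, wireless.front());  // assumes >= 1 wireless node
        for (const Node& w : wireless)
            nearest = std::min(nearest, dist(n, w));
        md = std::max(md, nearest);
        sd += nearest;
    }
    return {md, sd, static_cast<double>(sd) / nodes.size()};
}
```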
IEEE International Conference on High Performance Computing Data and Analytics | 2012
Jianliang Ma; Shaobin Zhang; Tongsen Hu; Minghui Wu; Tianzhou Chen
The Extensible Markup Language (XML) is a standard information-representation tool and plays an increasingly important role in many fields, such as databases and web services. XML parsing is a core task in XML processing, and many XML parsers have been proposed by both the software and hardware communities. To accelerate XML parsing, parallel parsing methods have been introduced. In this paper, we detail the design of a parallel speculative DOM-based XML parser (PSDXP) implemented on a Field Programmable Gate Array (FPGA). Both two-thread and four-thread versions of PSDXP are implemented on a Xilinx Virtex-5 board: the former achieves 0.5004 cycles per byte (CPB) and 1.998 Gbps, and the latter runs at 0.2505 CPB and 3.992 Gbps.
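The paper's parser is a speculative hardware design; as a rough software illustration of the underlying idea only, the sketch below splits a document into chunks, lets each thread scan its chunk speculatively (assuming it starts outside any tag), and fixes up the true element depths afterwards. It ignores self-closing tags, comments, and CDATA for brevity:

```cpp
#include <algorithm>
#include <string>
#include <thread>
#include <vector>

// Each thread records only the net change in element depth of its chunk;
// a cheap sequential prefix-sum pass then recovers each chunk's true
// starting depth, resolving the speculation.
struct ChunkResult { int netDepth = 0; };

ChunkResult scanChunk(const std::string& xml, size_t begin, size_t end) {
    ChunkResult r;
    for (size_t i = begin; i < end; ++i) {
        if (xml[i] != '<') continue;
        if (i + 1 < xml.size() && xml[i + 1] == '/') --r.netDepth;      // </tag>
        else if (i + 1 < xml.size() && xml[i + 1] != '?') ++r.netDepth; // <tag>
    }
    return r;
}

std::vector<int> parallelDepths(const std::string& xml, unsigned nThreads) {
    size_t chunk = xml.size() / nThreads + 1;
    std::vector<ChunkResult> results(nThreads);
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < nThreads; ++t)
        pool.emplace_back([&, t] {
            size_t b = t * chunk, e = std::min(xml.size(), b + chunk);
            results[t] = scanChunk(xml, b, e);
        });
    for (auto& th : pool) th.join();
    // Fix-up: prefix-sum the net depths to get each chunk's start depth.
    std::vector<int> startDepth(nThreads, 0);
    for (unsigned t = 1; t < nThreads; ++t)
        startDepth[t] = startDepth[t - 1] + results[t - 1].netDepth;
    return startDepth;
}
```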
High Performance Computing and Communications | 2014
Licheng Yu; Tianzhou Chen; Minghui Wu; Li Liu
With the rapid growth in demand for massive data processing and the limits of process scaling in microprocessors, GPGPUs have gained more and more attention for their huge data-parallel throughput. A tightly coupled CPU and GPGPU that share the LLC (last-level cache) enable fine-grained workload offloading between the two. In this paper, we focus on a data-transfer pattern where the data take the form of independent elements, each of which is to be processed by the other processor as soon as it is ready. Traditionally, the CPU prepares all the data to be processed by the GPGPU before starting it. This creates long waiting times, and the shared LLC may suffer from cache thrashing if the working set cannot fit in the LLC. To alleviate these problems, we propose the LLC buffer, a data-transfer mechanism between CPU and GPGPU built on the shared LLC. The LLC buffer uses part of the LLC storage as one or more stream buffers and stashes each data element as an independent transfer unit. With the LLC buffer, we achieve an average speedup of 1.48x and reduce memory writes (cache evictions) from the LLC by a factor of 1346.
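A software analogue of the stream-buffer idea, sketched as a fixed-capacity single-producer/single-consumer ring that hands over independent elements as soon as they are ready (the paper realizes this in LLC hardware; the capacity and interface here are assumptions):

```cpp
#include <array>
#include <atomic>
#include <cstddef>
#include <optional>

// Producer (CPU) enqueues each ready element; consumer (GPGPU) dequeues
// as soon as one is available, so neither side waits for the full set.
template <typename T, size_t N>   // N: power of two for clean wraparound
class StreamBuffer {
    std::array<T, N> slots;
    std::atomic<size_t> head{0}, tail{0};  // head: consumer, tail: producer
public:
    bool push(const T& v) {                // CPU side: enqueue one element
        size_t t = tail.load(std::memory_order_relaxed);
        if (t - head.load(std::memory_order_acquire) == N) return false; // full
        slots[t % N] = v;
        tail.store(t + 1, std::memory_order_release);
        return true;
    }
    std::optional<T> pop() {               // GPGPU side: dequeue when ready
        size_t h = head.load(std::memory_order_relaxed);
        if (h == tail.load(std::memory_order_acquire)) return std::nullopt;
        T v = slots[h % N];
        head.store(h + 1, std::memory_order_release);
        return v;
    }
};
```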
IEEE International Conference on High Performance Computing Data and Analytics | 2012
Wei Shi; Minghui Wu; Shuoping Wang; Min Guo; Bin Peng; Bin Ouyang; Tianzhou Chen
In recent years, with the advent of the mobile Internet, Mobile Widgets have developed rapidly as an important element and have become a mobile Internet hotspot. Mobile Widgets integrate Internet and local resources, and building them on open standards such as HTML5, CSS (Cascading Style Sheets), and JavaScript is the trend of technology development. However, because different Mobile Widget platforms adopt different standards and are not fully compatible with one another, cross-platform widget development remains a challenge, and one of the most difficult problems is implementing a cross-platform API. As a result, third-party widget providers must build multiple variants of the same widget to adapt a single business application to different platforms. To solve this problem, this article proposes a cross-platform implementation scheme for local resource access in Mobile Widgets, which has been experimentally validated on different platforms.
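The abstraction-layer idea can be sketched, in language-neutral form, as widget logic written once against a stable local-resource interface with a per-platform adapter underneath (all names here are hypothetical; the paper itself targets HTML5/JavaScript widget runtimes):

```cpp
#include <string>

// Widget code targets one stable interface for local resources.
class LocalResourceAPI {
public:
    virtual ~LocalResourceAPI() = default;
    virtual std::string readContact(int id) = 0;  // example local resource
};

// One adapter per widget platform hides that runtime's native API.
class PlatformAAdapter : public LocalResourceAPI {
public:
    std::string readContact(int id) override {
        return "contact-" + std::to_string(id) + " via platform A";
    }
};

// Widget logic is written once against the interface, so porting to a
// new platform means supplying a new adapter, not a new widget.
std::string showContact(LocalResourceAPI& api, int id) {
    return api.readContact(id);
}
```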
High Performance Computing and Communications | 2015
Yulong Pei; Licheng Yu; Minghui Wu; Tianzhou Chen
With their massive processing power, GPGPUs can execute thousands of threads in parallel, at the cost of high memory bandwidth to support the large number of concurrent memory requests. To alleviate this demand, GPGPUs adopt memory access coalescing to reduce the number of requests issued to the memory system. In this paper, we first introduce the concept of memory access distance and classify GPGPU programs into three types according to it. We discover that programs with large but equal memory access distances are common in GPGPU workloads, yet cannot be optimized by the original memory access coalescing. We therefore propose equidistant memory access coalescing, which can merge requests with any equal memory access distance. We evaluate our method on 30 benchmarks. Compared with the original memory access coalescing, equidistant coalescing improves the performance of 19 of them. For the benchmarks with equal and large memory access distances, the average speedup is 151% and the maximum speedup is 200%, while memory access requests are reduced to 32% on average.
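A sketch of the coalescing check itself (illustrative; the hardware details differ): baseline coalescing merges a warp's accesses only when they fall within the same cache line, while the equidistant variant also merges lanes separated by one common stride into a single base-plus-stride request:

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// A merged request: every lane's address is base + lane * stride.
struct StridedRequest { uint64_t base; int64_t stride; size_t lanes; };

// Try to coalesce one warp's per-lane addresses into a single strided
// request; returns nullopt when the distances are unequal, in which case
// the hardware would fall back to baseline (per-cache-line) coalescing.
std::optional<StridedRequest>
coalesceEquidistant(const std::vector<uint64_t>& laneAddrs) {
    if (laneAddrs.size() < 2) return std::nullopt;
    int64_t stride = static_cast<int64_t>(laneAddrs[1] - laneAddrs[0]);
    for (size_t i = 2; i < laneAddrs.size(); ++i)
        if (static_cast<int64_t>(laneAddrs[i] - laneAddrs[i - 1]) != stride)
            return std::nullopt;  // unequal distances: cannot merge
    return StridedRequest{laneAddrs[0], stride, laneAddrs.size()};
}
```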
High Performance Computing and Communications | 2014
Mingmin Yuan; Weiwei Fu; Tianzhou Chen; Wei Hu; Minghui Wu
The network-on-chip (NoC) has recently emerged as the primary paradigm for interconnecting the ever-increasing number of on-chip cores in future chip multiprocessors (CMPs). With the advent of the big data era, high traffic volume combined with ever-changing patterns will exert severe pressure on some hotspots, leading to serious traffic congestion. These hot nodes quickly exhaust the communication resources in their areas and prompt the expansion of a congestion tree, which can deteriorate the global traffic condition. In this work, we propose a novel low-cost routing mechanism called Congestion Agent Based Source Routing (CABSR). We place special routing agents called Congestion Agents (CAs) on the edge of the congested area; each maintains congestion information for an extended surrounding region. CAs take charge of throttling flows through congested links and calculating new, appropriate paths that help packets get past the congested area. Experimental results show that, compared with oblivious DOR routing and local- and regional-adaptive routing, CABSR improves average packet latency and throughput by 8.3% and 9.24% on average, respectively. This shows that CABSR can efficiently alleviate congestion, balance the load, and inhibit the expansion of the congestion tree.
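A toy sketch of a CA's decision, assuming a rectangular congested region and dimension-ordered hops (CABSR's real path computation and throttling policy are richer than this):

```cpp
#include <vector>

struct Coord { int x, y; };
struct Region {
    int x0, y0, x1, y1;  // congested rectangle the CA knows about
    bool contains(Coord c) const {
        return c.x >= x0 && c.x <= x1 && c.y >= y0 && c.y <= y1;
    }
};

// Append a dimension-ordered (X then Y) hop list from `from` to `to`.
void appendXY(std::vector<Coord>& path, Coord from, Coord to) {
    while (from.x != to.x) { from.x += (to.x > from.x) ? 1 : -1; path.push_back(from); }
    while (from.y != to.y) { from.y += (to.y > from.y) ? 1 : -1; path.push_back(from); }
}

// The CA rewrites a packet's source route to skirt the hot region
// (via one simple waypoint below it) instead of crossing it.
std::vector<Coord> caRoute(Coord src, Coord dst, const Region& hot) {
    std::vector<Coord> path{src};
    std::vector<Coord> direct;
    appendXY(direct, src, dst);
    bool crosses = false;
    for (Coord c : direct) if (hot.contains(c)) { crosses = true; break; }
    if (crosses && hot.y0 > 0) {
        Coord waypoint{dst.x, hot.y0 - 1};  // detour just below the region
        appendXY(path, src, waypoint);
        appendXY(path, waypoint, dst);
    } else {
        for (Coord c : direct) path.push_back(c);
    }
    return path;
}
```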
Journal of Systems Architecture | 2014
Licheng Yu; Xingsheng Tang; Minghui Wu; Tianzhou Chen
The general-purpose graphics processing unit (GPGPU) plays an important role in massively parallel computing nowadays. A GPGPU core typically holds thousands of threads, with hardware threads organized into warps. With the single instruction, multiple thread (SIMT) pipeline, GPGPUs can achieve high performance, but threads taking different branches within the same warp violate the SIMD execution style and cause branch divergence. To support this, a hardware stack is used to execute all branches sequentially; hence branch divergence leads to performance degradation. This article represents the PDOM (post-dominator) stack as a binary tree in which each leaf corresponds to a branch target. We propose a new PDOM stack, called PDOM-ASI, that can schedule all of the tree's leaves. The new stack can hide more long-operation latencies by exposing more schedulable warps, without the problem of warp over-subdivision. In addition, a multi-level warp scheduling policy is proposed that lets some warps run ahead, creating more opportunities to hide latency. Simulation results show that our policies achieve a 10.5% performance improvement over the baseline policies with only 1.33% hardware area overhead.
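A structural sketch of the binary-tree representation (the paper's hardware encoding differs): inner nodes are divergent branches with reconvergence PCs, leaves are warp-splits, and, unlike a stack whose top entry alone may issue, every leaf is schedulable:

```cpp
#include <cstdint>
#include <memory>
#include <vector>

// One node of the divergence tree. An inner node records the branch's
// post-dominator (reconvergence) PC; a leaf is one warp-split executing
// a branch target with its subset of the warp's threads.
struct SplitNode {
    uint64_t pc = 0;             // leaf: next PC of this warp-split
    uint32_t activeMask = 0;     // threads belonging to this split
    uint64_t reconvergePc = 0;   // inner node: post-dominator PC
    std::unique_ptr<SplitNode> taken, notTaken;  // null for leaves
    bool isLeaf() const { return !taken && !notTaken; }
};

// Collect all leaves: the set of splits the warp scheduler may pick from,
// which is what lets the tree hide more long-latency operations than a
// stack that only ever exposes its top entry.
void schedulableLeaves(SplitNode* n, std::vector<SplitNode*>& out) {
    if (!n) return;
    if (n->isLeaf()) { out.push_back(n); return; }
    schedulableLeaves(n->taken.get(), out);
    schedulableLeaves(n->notTaken.get(), out);
}
```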
High Performance Computing and Communications | 2016
Minghui Wu; Yulong Pei; Licheng Yu; Tianzhou Chen; Xueqing Lou; Tiefei Zhang
Recently, researchers have discovered that GPUs have advantages for non-graphics computing. CPU-GPU heterogeneous architectures combine the CPU and GPU on one chip and make it easier for the GPU to run non-graphics programs. Researchers have also proposed using the LLC (last-level cache) to store and exchange data between the CPU and GPU. We find that the LLC hit rate has a great influence on memory access performance and overall system performance. We therefore propose WAP (a warp-feature-aware prefetching method) to improve the LLC hit rate and memory access performance. We combine GPGPU-Sim and gem5 into a CPU-GPU heterogeneous many-core simulator, add an LLC to this simulator, and choose 10 representative benchmarks. We compare our method with the MAP method; the experimental results show that WAP improves the LLC hit rate by 11.8% and IPC by 11.39% over MAP.
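As a rough illustration of warp-aware prefetching, the sketch below tracks strides per (warp, load PC) pair and issues a prefetch once a stride repeats; WAP's actual warp features and prefetch policy are not reproduced here:

```cpp
#include <cstdint>
#include <optional>
#include <unordered_map>

// Keying the stride table on (warp, load PC) exploits the regularity of
// one warp's successive accesses to prefetch into the LLC ahead of
// demand misses.
class WarpStridePrefetcher {
    struct Entry { uint64_t lastAddr; int64_t stride; bool confident; };
    std::unordered_map<uint64_t, Entry> table;  // key: warpId<<32 | loadPc

public:
    // Observe one access; return a prefetch address once the same nonzero
    // stride has been seen twice in a row.
    std::optional<uint64_t> onAccess(uint32_t warpId, uint32_t loadPc,
                                     uint64_t addr) {
        uint64_t key = (static_cast<uint64_t>(warpId) << 32) | loadPc;
        auto it = table.find(key);
        if (it == table.end()) {
            table[key] = {addr, 0, false};
            return std::nullopt;
        }
        Entry& e = it->second;
        int64_t s = static_cast<int64_t>(addr - e.lastAddr);
        e.confident = (s != 0 && s == e.stride);
        e.stride = s;
        e.lastAddr = addr;
        if (e.confident) return addr + s;  // prefetch the next element
        return std::nullopt;
    }
};
```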
High Performance Computing and Communications | 2014
Mingmin Yuan; Weiwei Fu; Tianzhou Chen; Minghui Wu
The Distributed Routing Model (DRM) and the Centralized Routing Model (CRM) are the two mainstream routing models for networks-on-chip (NoCs). DRM is scalable and efficient, but its state learning is costly because network state must be exchanged by flooding, which can consume a large amount of bandwidth. CRM monitors and sends real-time network state to a central routing node as the basis for routing-path computation; it can achieve the optimal routing decision for each packet, but suffers from huge design complexity and poor scalability. In this work, we propose a novel routing model called Hierarchical Source Routing (HSR), in which a tree-like hierarchical control network is built above the original network. Each non-leaf node in the tree is a control node that manages the network state and routing decisions for a group of nodes at the lower level. We propose two state-update modes for the non-leaf nodes, HSR-PATH and HSR-PORT, which apply the average minimum path load and the weighted port load, respectively, for optimal-path calculation; their implementation and overhead are also discussed. Our experiments show that HSR-PATH and HSR-PORT outperform DRM in saturated injection rate by 26.1% and 17.9% on average, respectively, and achieve 96.6% of CRM's performance with much less power consumption. Finally, using the Throughput Energy Ratio (TER) as a comprehensive indicator, HSR-PATH and HSR-PORT outperform CRM by 59.4% and 13.4%, and DRM by 13.7% and 17.3%, respectively.
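A structural sketch of the control tree (the aggregation rule below is a simplified stand-in for HSR-PATH and HSR-PORT): each non-leaf node summarizes its subtree's state bottom-up, so a routing query descends the hierarchy instead of flooding the flat network:

```cpp
#include <memory>
#include <vector>

struct ControlNode {
    double load = 0.0;  // leaf: measured local link load
    std::vector<std::unique_ptr<ControlNode>> children;

    // Aggregate bottom-up: a parent's view is a summary of its subtree
    // (here, the mean child load; HSR's two modes use path and port loads).
    double aggregate() {
        if (children.empty()) return load;  // leaf reports its own load
        double sum = 0.0;
        for (auto& c : children) sum += c->aggregate();
        load = sum / children.size();
        return load;
    }

    // Query: descend toward the least-loaded subtree at each level, so
    // only O(log N) control nodes are consulted per routing decision.
    const ControlNode* leastLoadedLeaf() const {
        const ControlNode* n = this;
        while (!n->children.empty()) {
            const ControlNode* best = n->children.front().get();
            for (auto& c : n->children)
                if (c->load < best->load) best = c.get();
            n = best;
        }
        return n;
    }
};
```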