Mingfa Zhu | Researchain

Publication

Featured research published by Mingfa Zhu.

Concurrency and Computation: Practice and Experience | 2015

GPU accelerated sparse matrix-vector multiplication and sparse matrix-transpose vector multiplication

Yuan Tao; Yangdong Deng; Shuai Mu; Zhenzhong Zhang; Mingfa Zhu; Limin Xiao; Li Ruan

Many high-performance computing applications require computing both the sparse matrix-vector product (SMVP) and the sparse matrix-transpose vector product (SMTVP) for better overall performance. In such cases, it is critical to maintain similarly high throughput for both computing patterns with the underlying sparse matrix encoded in a single storage format. The compressed sparse block (CSB) format proposed by Buluç et al. allows both products to be computed on multi-core CPUs with nearly identical throughput. However, a direct port of CSB to graphics processing units (GPUs), which have recently been recognized as a powerful general-purpose computing platform, turns out to be inefficient. In this work, we propose a new data structure, designated expanded CSB (eCSB), that minimizes the throughput gap between SMVP and SMTVP computations on GPUs while also enabling high computing throughput. We also use a hybrid storage format for the elements in each block, which can be selected dynamically at runtime. Experimental results show that the proposed techniques, implemented on a Kepler GPU, deliver similar throughput on both SMVP and SMTVP, and that the throughput is up to 13 times faster than that of the CPU-based CSB implementation. In addition, our eCSB procedure outperforms previous GPU results by up to 188% and 914% in computing SMVP and SMTVP, respectively. We validate the effectiveness of eCSB using the wall-clock time of the bi-conjugate gradient algorithm, for which eCSB is 25% faster than Compressed Sparse Rows (CSR) and 6% faster than HYB.

Cluster Computing | 2013

Asymmetrical topology and entropy-based heterogeneous link for many-core massive data communication

Yuhang Liu; Mingfa Zhu; Limin Xiao; Jue Wang

As the need for data processing and communication increases, and as the number of processing cores placed on a single chip increases, improving the performance of interconnection networks is vital. In the present work, traditional topologies are re-examined. Torus is shown to be a good structure in terms of average latency and symmetry. When using torus in combination with high process levels, it is possible to design new, yet asymmetrical topologies that can meet the high communication performance requirements of many-core processors and also suit a large variety of traffic patterns. Firstly, this paper presents two novel torus-like topologies called xtorus and xxtorus, which are evaluated using both theoretical analysis and experimental simulation. For the theoretical analysis, an algorithm for computing link path diversity and link entropy is given. The analysis shows that, compared with mesh, xmesh and torus, the proposed topologies have better properties in terms of diameter, average latency, throughput, and path diversity. Although more links are added, the number of links is of the same order of magnitude as that of mesh, xmesh, and torus. The proposed topologies also take advantage of increasingly higher levels of the VLSI process. Simulations on GEM5 reveal that xtorus has better scalability, and that its average latency is lower than that of mesh, xmesh and torus by a significant margin, particularly when the network scale is larger. Moreover, for different traffic patterns, its performance swing is smaller than that of mesh. Furthermore, in the present work, the proposed topologies are both asymmetrical and based on the entropy difference of the links in the topology. A strategy for heterogeneous link design is presented, which enables designers to trade off between delay, power and area according to a concrete integrated circuit design scenario.
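The diameter and average-latency comparison between mesh and torus can be illustrated with a small hop-count calculation. The sketch below is not the paper's method: it covers only the standard k x k mesh and torus baselines (xtorus and xxtorus are not defined in the abstract), it uses hop count as a stand-in for latency, and the function names `grid_neighbors` and `hop_stats` are hypothetical.

```python
from collections import deque


def grid_neighbors(k, wrap):
    """Return a neighbor function for a k x k grid.

    wrap=False gives a mesh; wrap=True adds wrap-around links (torus).
    """
    def neighbors(node):
        x, y = node
        for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nx, ny = x + dx, y + dy
            if wrap:
                yield (nx % k, ny % k)
            elif 0 <= nx < k and 0 <= ny < k:
                yield (nx, ny)
    return neighbors


def hop_stats(k, wrap):
    """Return (diameter, average hop count) over all ordered node pairs.

    Assumes k >= 2 so that at least one pair of distinct nodes exists.
    """
    neighbors = grid_neighbors(k, wrap)
    nodes = [(x, y) for x in range(k) for y in range(k)]
    diameter = 0
    total = 0
    for src in nodes:
        # Breadth-first search gives shortest hop counts from src.
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in neighbors(u):
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        diameter = max(diameter, max(dist.values()))
        total += sum(dist.values())
    pairs = len(nodes) * (len(nodes) - 1)
    return diameter, total / pairs


mesh_diameter, mesh_avg = hop_stats(4, wrap=False)
torus_diameter, torus_avg = hop_stats(4, wrap=True)
```

For a 4 x 4 network, the wrap-around links reduce both the diameter and the average hop count relative to the mesh, which matches the abstract's point that torus is a good starting structure.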

Journal of Computer Science and Technology | 2017

A Power and Area Optimization Approach of Mixed Polarity Reed-Muller Expression for Incompletely Specified Boolean Functions

Zhenxue He; Limin Xiao; Li Ruan; Fei Gu; Zhisheng Huo; Guangjun Qin; Mingfa Zhu; Longbing Zhang; Rui Liu; Xiang Wang

The power and area optimization of Reed-Muller (RM) circuits has attracted wide attention. However, almost none of the existing power and area optimization approaches can obtain all the Pareto optimal solutions of the original problem while remaining efficient. Moreover, they do not consider the don't-care terms, which prevents the circuit performance from being optimized further. In this paper, we propose a power and area optimization approach of mixed polarity RM expression (MPRM) for incompletely specified Boolean functions based on Non-Dominated Sorting Genetic Algorithm II (NSGA-II). Firstly, the incompletely specified Boolean function is transformed into zero polarity incompletely specified MPRM (ISMPRM) by using a novel ISMPRM acquisition algorithm. Secondly, the polarity and the allocation of don't-care terms of the ISMPRM are encoded as a chromosome. Lastly, the Pareto optimal solutions are obtained by using NSGA-II, in which the MPRM corresponding to a given chromosome is obtained by using a chromosome conversion algorithm. The results on incompletely specified Boolean functions and MCNC benchmark circuits show that a significant power and area improvement can be made compared with the existing power and area optimization approaches of RM circuits.
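The notion of Pareto optimal solutions for two objectives can be made concrete with a small non-dominated filter. This is only a sketch of the dominance test at the core of non-dominated sorting, not the NSGA-II procedure or the paper's chromosome encoding; the `(power, area)` tuples and the function names are hypothetical, and both objectives are assumed to be minimized.

```python
def dominates(a, b):
    """True if a is no worse than b in every objective and strictly
    better in at least one (all objectives are minimized)."""
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def pareto_front(points):
    """Keep the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical (power, area) costs of five candidate circuits.
candidates = [(3, 5), (4, 4), (5, 3), (5, 5), (4, 6)]
front = pareto_front(candidates)
```

Here `(5, 5)` is dominated by `(4, 4)` and `(4, 6)` by `(3, 5)`, so the front keeps the three remaining trade-off points. The quadratic scan is chosen for clarity; NSGA-II itself ranks the whole population into successive fronts.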

International Conference on Communications | 2015

Flow entries installation based on distributed SDN controller

Rui Liu; Mingfa Zhu; Limin Xiao; Li Ruan; Yuanhao Zhou; Wenbo Duan; Deguo Li

Software-Defined Networking (SDN), a relatively new concept, proposes a more intelligent way to manage network resources. The controller is a key component in SDN; it sends management and forwarding policies to switches in the form of flow entries. As network scale increases, there are too many flows to install, especially in data centers (DCs), where the number of flow entries can reach 757,000. This exceeds the processing capability of a centralized controller. We therefore put forward a distributed architecture to install flows. There are two ways to install flow entries: proactive and reactive. In the proactive mode, flows are stored in distributed storage as key-value pairs, with the switch identifier used as the key. In the reactive mode, one controller instance sends flow entries directly to the other controller instances rather than relying on the synchronization function of the distributed storage. We build a prototype system on Floodlight to demonstrate our design and test the performance of our solution. According to the experiments, our flow installation design has good scalability and better performance. In the reactive mode, it is 10 times faster than the synchronization-based approach.

International Conference on Parallel and Distributed Systems | 2014

Atomic reduction based sparse matrix-transpose vector multiplication on GPUs

Yuan Tao; Yangdong Deng; Shuai Mu; Mingfa Zhu; Limin Xiao; Li Ruan; Zhibin Huang

Sparse Matrix-Transpose Vector Product (SMTVP) is a frequently used computation pattern in High Performance Computing applications. It is typically solved by transposition followed by a Sparse Matrix-Vector Product (SMVP) in current linear algebra packages. However, the transposition process can be a serious bottleneck on modern parallel computing platforms. A previous work proposed a relatively complex data structure for efficiently computing SMTVP with multi-core CPUs, but it proved to be inefficient on GPUs. In this work, we show that the Compressed Sparse Row (CSR) based SMVP algorithm can also be efficient for SMTVP computation on modern GPUs. The proposed method exploits atomic operations to perform the reduce operation in the computation of each inner product of a row in the transposed matrix and the vector. Experimental results show that the simple technique can outperform the SMTVP flow of transposition plus SMVP released in the CUSPARSE package by up to 405-fold.
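The idea of computing the transpose product directly from CSR storage, without building the transposed matrix, can be sketched serially. The code below is an illustration only, not the authors' GPU kernel: the function name `csr_transpose_matvec` and its parameters are hypothetical, and the in-place accumulation marked in the comment is where a parallel GPU version would need an atomic add.

```python
def csr_transpose_matvec(n_cols, row_ptr, col_idx, values, x):
    """Compute y = A^T x for a matrix A stored in CSR form.

    row_ptr, col_idx and values are the usual CSR arrays of A;
    x has one entry per row of A, and y has one entry per column.
    """
    y = [0.0] * n_cols
    for row in range(len(row_ptr) - 1):
        x_row = x[row]
        for k in range(row_ptr[row], row_ptr[row + 1]):
            # Scatter update: different rows can hit the same y[col].
            # In a parallel kernel this is the update that must be an
            # atomic add; in this serial sketch a plain += is enough.
            y[col_idx[k]] += values[k] * x_row
    return y


# A = [[1, 0, 2],
#      [0, 3, 0],
#      [4, 0, 5]]
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 1, 0, 2]
values = [1.0, 2.0, 3.0, 4.0, 5.0]
y = csr_transpose_matvec(3, row_ptr, col_idx, values, [1.0, 2.0, 3.0])
```

Each stored nonzero is visited exactly once, so the work matches an ordinary CSR matrix-vector product; the cost of avoiding the explicit transposition is the contention on entries of `y`.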

Archive | 2012

Analysis of Allocation Deviation in Multi-core Shared Cache Pseudo-partition

Zhibin Huang; Mingfa Zhu; Limin Xiao

Allocation deviation is a commonly encountered problem in cache partitioning mechanisms, especially pseudo-partitioning mechanisms such as PIPP. We add some bits to each line's status field to store the source core ID of incoming cache requests and sample the whole cache, then quantitatively analyze the allocation deviation of multi-core pseudo-partitioning in a shared last-level cache. We highlight some factors that influence allocation deviation, such as the cache quota and contention among concurrent working sets. Furthermore, we discuss flexible handling of allocation deviation to benefit overall performance according to the cache utility characteristics of the benchmarks. Through our experiments and analysis, we conclude that in pseudo-partitioning, allocation deviation happens frequently due to contention and improper cache quotas, and that it needs to be handled more flexibly.

Frontiers of Computer Science in China | 2018

EDOA: an efficient delay optimization approach for mixed-polarity Reed-Muller logic circuits under the unit delay model

Zhenxue He; Limin Xiao; Fei Gu; Li Ruan; Zhisheng Huo; Mingzhe Li; Mingfa Zhu; Longbing Zhang; Rui Liu; Xiang Wang

Delay optimization has recently attracted significant attention. However, few studies have focused on the delay optimization of mixed-polarity Reed-Muller (MPRM) logic circuits. In this paper, we propose an efficient delay optimization approach (EDOA) for MPRM logic circuits under the unit delay model, which can derive an optimal MPRM logic circuit with minimum delay. First, the simplest MPRM expression with the fewest number of product terms is obtained using a novel Reed-Muller expression simplification approach (RMESA) considering don’t-care terms. Second, a minimum delay decomposition approach based on a Huffman tree construction algorithm is utilized on the simplest MPRM expression. Experimental results on MCNC benchmark circuits demonstrate that compared to the Berkeley SIS 1.2 and ABC, the EDOA can significantly reduce delay for most circuits. Furthermore, for a few circuits, while reducing delay, the EDOA incurs an area penalty.
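A Huffman-style construction for minimum-delay decomposition can be sketched as follows. This is a generic illustration under stated assumptions, not the EDOA algorithm itself: it assumes a wide gate is decomposed into a tree of 2-input gates, each adding one unit of delay, and that each input signal has a known arrival time. The function name `min_delay_decomposition` is hypothetical.

```python
import heapq


def min_delay_decomposition(arrival_times):
    """Return the output arrival time of a 2-input gate tree built
    by repeatedly combining the two earliest-arriving signals.

    Under the unit delay model, a gate's output arrives one unit
    after the later of its two inputs.
    """
    if not arrival_times:
        raise ValueError("at least one input signal is required")
    heap = list(arrival_times)
    heapq.heapify(heap)
    while len(heap) > 1:
        first = heapq.heappop(heap)
        second = heapq.heappop(heap)
        # Combine the two earliest signals with one 2-input gate.
        heapq.heappush(heap, max(first, second) + 1)
    return heap[0]


balanced = min_delay_decomposition([0, 0, 0, 0])
skewed = min_delay_decomposition([0, 0, 0, 3])
```

With equal arrival times the result is a balanced tree of depth 2. With one late input (arrival time 3), the greedy pairing combines the early signals first and attaches the late one at the root, giving an output time of 4, whereas a balanced tree would give 5.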

The Computer Journal | 2015

Lessen Interflow Interference Using Virtual Channels Partitioning

Guangjun Qin; Mingfa Zhu; Limin Xiao; Li Ruan

Interconnection networks are a significant consideration for high-performance computing and datacenters. However, interflow interference seriously degrades communication performance and can even cause disastrous congestion. This paper reports a virtual channel (VC)-sharing scheme that separates VCs into groups and assigns them to data flows based on the destination address. The technique can effectively isolate different kinds of traffic into separate VC groups so that heavy loads have less influence on other, normal traffic. As a consequence, the congestion tree is slimmed. In the proposal, the routing algorithm is a two-stage selection consisting of port selection and VC group selection. Each stage has its own selection algorithm, so the overall routing algorithm is a combination formed by their Cartesian product. The experiments show that our scheme performs very well on adversarial traffic. With our scheme, the growth curve is linear and slow after crossing the saturation point. For benign traffic patterns, our scheme does not cause any obvious change in communication performance when the system receives a lower injection rate.

Journal of Systems Engineering and Electronics | 2014

Elastic pointer directory organization for scalable shared memory multiprocessors

Yuhang Liu; Mingfa Zhu; Limin Xiao

In the field of supercomputing, one key issue for scalable shared-memory multiprocessors is the design of the directory, which denotes the sharing state of a cache block. A good directory design aims to balance three key attributes: reasonable memory overhead, sharer position precision and implementation complexity. However, researchers often face the problem that gaining one attribute may result in losing another. The paper proposes an elastic pointer directory (EPD) structure based on an analysis of shared-memory applications, exploiting the fact that the number of sharers for each directory entry is typically small. Analysis results show that for 4096 nodes, the ratio of memory overhead to that of the full-map directory is 2.7%. Theoretical analysis and cycle-accurate execution-driven simulations on 16-node and 64-node cache coherence non-uniform memory access (CC-NUMA) multiprocessors show that the corresponding pointer overflow probability is reduced significantly. The performance is observed to be better than that of a limited pointers directory and almost identical to that of the full-map directory, at the cost of slightly higher implementation complexity. Using a directory cache to exploit directory access locality is also studied. The experimental results show that this is a promising approach for state-of-the-art high-performance computing.

Archive | 2011