
Publication


Featured research published by Yizhuo Wang.


International Conference on Parallel Processing | 2016

Sparse Matrix Format Selection with Multiclass SVM for SpMV on GPU

Akrem Benatia; Weixing Ji; Yizhuo Wang; Feng Shi

Sparse Matrix-Vector Multiplication (SpMV) kernel dominates the computing cost in numerous scientific applications. Many implementations based on different sparse formats have been proposed recently for this kernel on the GPU side. Since the performance of these sparse formats varies significantly with the sparsity characteristics of the input matrix and the hardware specifications, none of them can be considered the best one to use for every sparse matrix. In this paper, we address the problem of selecting the best representation for a given sparse matrix on GPU by using a machine learning approach. First, we present some interesting and easy-to-compute features for characterizing sparse matrices on GPU. Second, we use a multiclass Support Vector Machine (SVM) classifier to select the best format for each input matrix. We consider four popular formats (COO, CSR, ELL, and HYB), but our work can be extended to support more sparse representations. Experimental results on two different GPUs (Fermi GTX 580 and Maxwell GTX 980 Ti) show that we achieve more than 98% of the performance possible with a perfect selection.
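The feature-based selection idea can be illustrated with a small sketch. The snippet below computes a few simple sparsity features of the kind such a classifier would consume (the exact feature set used by the authors may differ); a trained multiclass SVM would then map a feature vector like this to one of COO, CSR, ELL, or HYB.

```python
def sparsity_features(n_rows, n_cols, coo_entries):
    """Compute simple sparsity features from a COO-format matrix.

    coo_entries: list of (row, col, value) triples.
    These features are illustrative; the paper's exact feature
    set may differ.
    """
    nnz = len(coo_entries)
    per_row = [0] * n_rows
    for i, _, _ in coo_entries:
        per_row[i] += 1
    mean_nnz = nnz / n_rows
    variance = sum((c - mean_nnz) ** 2 for c in per_row) / n_rows
    return {
        "nnz": nnz,
        "density": nnz / (n_rows * n_cols),
        "mean_nnz_per_row": mean_nnz,
        "max_nnz_per_row": max(per_row),
        "nnz_per_row_variance": variance,
    }
```

Features like the per-row nonzero variance matter because formats such as ELL pad every row to the longest one, so highly irregular matrices favor CSR or HYB instead.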


Parallel Computing | 2014

An adaptive and hierarchical task scheduling scheme for multi-core clusters

Yizhuo Wang; Yang Zhang; Yan Su; Xiaojun Wang; Xu Chen; Weixing Ji; Feng Shi

Highlights: An adaptive and hierarchical task scheduling scheme (AHS) is proposed. Work-sharing is used in conjunction with work-stealing. An initial partitioning is performed with respect to the pattern of task parallelism. A practical implementation of AHS is described. Theoretical, simulation, and experimental studies of AHS are presented.

Work-stealing and work-sharing are two basic paradigms for dynamic task scheduling. This paper introduces an adaptive and hierarchical task scheduling scheme (AHS) for multi-core clusters, in which work-stealing and work-sharing are adaptively used to achieve load balancing. Work-stealing has been widely used in task-based parallel programming languages and models, especially on shared memory systems. However, high inter-node communication costs hinder work-stealing from being directly performed on distributed memory systems. AHS addresses this issue with the following techniques: (1) initial partitioning, which reduces inter-node task migrations; (2) a hierarchical scheduling scheme, which performs work-stealing inside a node before going across the node boundary and adopts work-sharing to overlap computation and communication at the inter-node level; and (3) hierarchical and centralized control for inter-node task migration, which improves the efficiency of victim selection and termination detection. We evaluated AHS and existing work-stealing schemes on a 16-node multi-core cluster. Experimental results show that AHS outperforms existing schemes by 11-21.4% for the benchmarks studied in this paper.
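The hierarchical victim-selection idea described above can be sketched minimally as follows. This is an illustration only: the names are hypothetical, and AHS's centralized control, work-sharing overlap, and termination detection are omitted.

```python
from collections import deque


class Worker:
    """A worker with a task deque, pinned to a cluster node (sketch)."""

    def __init__(self, worker_id, node_id):
        self.worker_id = worker_id
        self.node_id = node_id
        self.tasks = deque()


def steal(thief, workers):
    """Hierarchical stealing: try victims on the thief's own node
    before crossing the node boundary, mirroring AHS's intra-node-first
    policy. Returns a stolen task, or None if every deque is empty."""
    local = [w for w in workers if w is not thief and w.node_id == thief.node_id]
    remote = [w for w in workers if w.node_id != thief.node_id]
    for victim in local + remote:
        if victim.tasks:
            return victim.tasks.popleft()  # steal the oldest queued task
    return None
```

Preferring local victims keeps most steals on cheap shared memory; only when a whole node runs dry does work migrate across the (expensive) interconnect.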


Design, Automation, and Test in Europe | 2013

A work-stealing scheduling framework supporting fault tolerance

Yizhuo Wang; Weixing Ji; Feng Shi; Qi Zuo

Fault tolerance and load balancing are critical for executing long-running parallel applications on multicore clusters. This paper addresses both by presenting a novel work-stealing task scheduling framework that supports hardware fault tolerance. In this framework, both transient and permanent faults are detected and recovered at task granularity. We incorporate task-based fault detection and recovery mechanisms into a hierarchical work-stealing scheme to establish the framework. The framework provides low-overhead fault tolerance and optimal load balancing by fully exploiting task parallelism.


IEEE International Conference on High Performance Computing, Data, and Analytics | 2012

A fault tolerant self-scheduling scheme for parallel loops on shared memory systems

Yizhuo Wang; Alexandru Nicolau; Rosario Cammarota; Alexander V. Veidenbaum

As the number of cores per chip increases, significant speedup for many applications can be achieved by exploiting loop-level parallelism (LLP). Meanwhile, continued device scaling makes multicore/multiprocessor systems suffer from increased reliability problems. The scheduling scheme plays a key role in exploiting LLP, and among existing dynamic loop scheduling schemes, self-scheduling is the most commonly used. This paper presents FTSS, a fault tolerant self-scheduling scheme which aims to execute parallel loops efficiently in the presence of hardware faults on shared memory systems. Our technique transforms a loop to ensure the correctness of the re-execution of loop iterations by buffering variables with anti-dependences, which makes it possible to design a fault tolerant loop scheduling scheme without checkpointing. FTSS combines work-stealing with self-scheduling, and uses a bidirectional execution model when work is stolen from a faulty core. Experimental results show that FTSS achieves better load balancing than existing self-scheduling schemes. Compared with checkpoint/restart implementations that save a checkpoint before executing each chunk of iterations and restart the whole chunk running on a faulty core, FTSS exhibits better runtime performance. In addition, FTSS greatly outperforms existing self-scheduling schemes in terms of performance and stability in heavily loaded runtime environments.
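The anti-dependence buffering idea can be sketched as follows: before a chunk of iterations runs, the values it may overwrite are saved, so a chunk interrupted by a transient fault can simply be rolled back and re-executed, with no checkpointing. This is a simplified single-threaded illustration, not the paper's implementation.

```python
def run_chunk(data, start, end, body, inject_fault=False):
    """Execute body() over data[start:end], rolling back and
    re-executing the whole chunk on a (simulated) transient fault.
    Buffering the old values resolves anti-dependences, so the
    re-execution sees the same inputs as the first attempt."""
    backup = data[start:end]  # buffer values the chunk may overwrite
    try:
        for i in range(start, end):
            data[i] = body(data[i])
            if inject_fault and i == (start + end) // 2:
                raise RuntimeError("simulated transient fault")
    except RuntimeError:
        data[start:end] = backup      # roll back the partial chunk
        for i in range(start, end):   # re-execute from clean state
            data[i] = body(data[i])
```

Without the backup, a chunk that writes `data[i]` in place could not be safely re-run, because its inputs would already have been clobbered by the failed attempt.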


Network and Parallel Computing | 2012

Knowledge-Based Adaptive Self-Scheduling

Yizhuo Wang; Weixing Ji; Feng Shi; Qi Zuo; Ning Deng

The loop scheduling scheme plays a critical role in the efficient execution of programs, especially loop-dominated applications. This paper presents KASS, a knowledge-based adaptive loop scheduling scheme. KASS consists of two phases: static partitioning and dynamic scheduling. In the static partitioning phase, a heuristic approach balances the workload by taking into account both the features of the loop and the capabilities of the processors. In the dynamic scheduling phase, an adaptive self-scheduling algorithm is applied, in which two tuning parameters control chunk sizes, aiming at load balancing while minimizing synchronization overhead. In addition, we extend KASS to loop nests and adjust the chunk sizes at runtime. The experimental results show that KASS performs 4.8% to 16.9% better than existing self-scheduling schemes, and up to 21% better than the affinity scheduling scheme.
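The self-scheduling phase can be illustrated with a guided-self-scheduling-style chunk calculation. The two tuning parameters below (alpha and k_min) stand in for KASS's parameters; their exact definitions in the paper may differ.

```python
def next_chunk(remaining, n_procs, alpha=2.0, k_min=4):
    """Return the next chunk size: a decreasing fraction of the
    remaining iterations, never below k_min. Larger alpha gives
    smaller chunks (better balance); k_min bounds the number of
    dispatches and hence synchronization overhead."""
    chunk = max(int(remaining / (alpha * n_procs)), k_min)
    return min(chunk, remaining)


def schedule(total, n_procs, **kw):
    """Yield successive chunk sizes until the loop is exhausted."""
    remaining = total
    while remaining > 0:
        c = next_chunk(remaining, n_procs, **kw)
        yield c
        remaining -= c
```

Early chunks are large (cheap to dispatch) and later chunks shrink toward k_min, so stragglers near the end of the loop cause little imbalance.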


Network and Parallel Computing | 2012

Communication Locality Analysis of Triplet-Based Hierarchical Interconnection Network in Chip Multiprocessor

Shahnawaz Talpur; Feng Shi; Yizhuo Wang

The interconnection topology inside a chip multiprocessor plays a fundamental role in communication locality, and data locality has long been a core assumption in compiler optimization for high-performance computing. However, exploiting data locality becomes harder when the network has two or more dimensions. In a two-dimensional mesh network, each core is connected to its four neighbors, so data locality can potentially be exploited in two dimensions from a given processor's perspective. The Triplet-Based Hierarchical Interconnection Network (TBHIN) has a straightforward topology with a fractal structure for chip multiprocessors. In this paper, a static (contention-free) performance analysis of TBHIN and a 2-D mesh is presented, based on the premise of locality in communication. Dynamic (contention) software simulation of TBHIN shows that the stronger the communication locality, the lower the communication delay.


Information Visualization | 2018

KernelGraph: Understanding the kernel in a graph

Jianjun Shi; Weixing Ji; Jingjing Zhang; Zhiwei Gao; Yizhuo Wang; Feng Shi

The Linux kernel has grown to 20 million lines of code, contributed by almost 14,000 programmers. This complexity challenges kernel maintenance and makes the kernel harder to comprehend for developers learning it. Automated tool support is crucial for comprehending such a large-scale program involving a high volume of code. In this article, we present KernelGraph, which enhances understanding of the Linux kernel by providing a visual representation of kernel internals. KernelGraph resembles online map systems and facilitates kernel code navigation in an intuitive and interactive way. We describe the key techniques used in KernelGraph to quickly process the vast amount of information in the kernel codebase. We also implemented two applications built atop KernelGraph to enhance kernel comprehension. KernelGraph was presented to 30 participants, who were asked several questions about their kernel comprehension in a controlled study. Our experimental results show that, compared with other source code comprehension tools, KernelGraph improves kernel comprehension by enabling people to visually browse the kernel code and by providing an effective means for exploring the kernel structure. The ability to switch seamlessly between high-level views and source code significantly reduces the gap between source code and high-level mental representation. KernelGraph can be easily extended to support visualizations of other large-scale codebases.


ACM Transactions on Architecture and Code Optimization | 2018

BestSF: A Sparse Meta-Format for Optimizing SpMV on GPU

Akrem Benatia; Weixing Ji; Yizhuo Wang; Feng Shi

The Sparse Matrix-Vector Multiplication (SpMV) kernel dominates the computing cost in numerous scientific applications. Many implementations based on different sparse formats were proposed to improve this kernel on the recent GPU architectures. However, it has been widely observed that there is no “best-for-all” sparse format for the SpMV kernel on GPU. Indeed, serious performance degradation of an order of magnitude can be observed without a careful selection of the sparse format to use. To address this problem, we propose in this article BestSF (Best Sparse Format), a new learning-based sparse meta-format that automatically selects the most appropriate sparse format for a given input matrix. To do so, BestSF relies on a cost-sensitive classification system trained using Weighted Support Vector Machines (WSVMs) to predict the best sparse format for each input sparse matrix. Our experimental results on two different NVIDIA GPU architectures using a large number of real-world sparse matrices show that BestSF achieved a noticeable overall performance improvement over using a single sparse format. While BestSF is trained to select the best sparse format in terms of performance (GFLOPS), our further experimental investigations revealed that using BestSF also led, in most of the test cases, to the best energy efficiency (MFLOPS/W). To prove its practical effectiveness, we also evaluate the performance and energy efficiency improvement achieved when using BestSF as a building block in a GPU-based Preconditioned Conjugate Gradient (PCG) iterative solver.
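The two metrics compared above can be computed directly. SpMV performs roughly 2·nnz floating-point operations (one multiply and one add per nonzero), which gives simple formulas for throughput and energy efficiency:

```python
def spmv_gflops(nnz, seconds):
    """Achieved SpMV performance in GFLOPS: the kernel performs
    about 2*nnz flops (one multiply and one add per nonzero)."""
    return 2.0 * nnz / seconds / 1e9


def spmv_mflops_per_watt(nnz, seconds, avg_watts):
    """Energy efficiency in MFLOPS/W, the metric used in the
    article, given average power draw during the kernel run."""
    return 2.0 * nnz / seconds / 1e6 / avg_watts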


International Green and Sustainable Computing Conference | 2016

Energy evaluation of Sparse Matrix-Vector Multiplication on GPU

Akrem Benatia; Weixing Ji; Yizhuo Wang; Feng Shi

Many recent studies suggest that energy efficiency should be a primary design goal, on par with performance, in building both hardware and software. As a first step toward finding a good compromise between these two conflicting design goals, we need a deep understanding of the performance and energy characteristics of different application kernels. In this paper, we focus on evaluating the energy efficiency of Sparse Matrix-Vector Multiplication (SpMV), a very challenging kernel given its irregularity both in memory access and control flow. We consider the SpMV kernel under four different sparse formats (COO, CSR, ELL, and HYB) on GPU. Our experimental results, obtained using real-world sparse matrices from the University of Florida collection on an NVIDIA Maxwell GPU (GTX 980 Ti), show that there is no universal best sparse format in terms of energy efficiency. Furthermore, we identified some sparsity characteristics that are related to the energy efficiency of the different sparse formats.


International Conference on Parallel and Distributed Systems | 2016

Machine Learning Approach for Predicting the Performance of SpMV on GPU

Akrem Benatia; Weixing Ji; Yizhuo Wang; Feng Shi

Sparse Matrix-Vector Multiplication (SpMV) kernel dominates the computing cost in numerous scientific applications. Many implementations based on different sparse formats have been proposed recently for optimizing this kernel on the GPU side. Since the performance of SpMV varies significantly with the sparsity characteristics of the input matrix and the hardware features, developing an accurate performance model for this kernel is a challenging task. The traditional approach of building such models by analytical modeling is difficult in practice and requires a thorough understanding of the interaction between the GPU hardware and the sparse code. In this paper, we propose a machine learning approach to predict the performance of the SpMV kernel using several sparse formats (COO, CSR, ELL, and HYB) on GPU. We used two popular machine learning algorithms, Support Vector Regression (SVR) and the Multilayer Perceptron neural network (MLP). Our experimental results on two different GPUs (Fermi GTX 580 and Maxwell GTX 980 Ti) show that the SVR models deliver the best accuracy, with average prediction error ranging between 7% and 14%.
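The reported accuracy is easiest to read through the average relative prediction error. A sketch of such a metric follows; whether the paper uses exactly this definition is an assumption.

```python
def avg_relative_error(predicted, measured):
    """Mean relative error between predicted and measured values
    (e.g., GFLOPS), as a fraction: multiply by 100 for percent."""
    if len(predicted) != len(measured) or not measured:
        raise ValueError("need two non-empty sequences of equal length")
    return sum(abs(p - m) / m for p, m in zip(predicted, measured)) / len(measured)
```

An average error of 0.07-0.14 under this metric corresponds to the 7%-14% range the abstract reports for the SVR models.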

Collaboration


Dive into Yizhuo Wang's collaboration.

Top Co-Authors

Weixing Ji (Beijing Institute of Technology)
Feng Shi (Beijing Institute of Technology)
Akrem Benatia (Beijing Institute of Technology)
Sensen Hu (Beijing Institute of Technology)
Jianjun Shi (Beijing Institute of Technology)
Lifu Huang (Beijing Institute of Technology)
Qi Zuo (Beijing Institute of Technology)
Xu Chen (Beijing Institute of Technology)