Zhiyuan Shao
Huazhong University of Science and Technology
Publications
Featured research published by Zhiyuan Shao.
international conference on parallel and distributed systems | 2011
Xiaowen Feng; Hai Jin; Ran Zheng; Kan Hu; Jingxiang Zeng; Zhiyuan Shao
Sparse Matrix-Vector multiplication (SpMV) is one of the most significant yet challenging kernels in computational science. It is a memory-bound application whose performance depends largely on the input matrix and the underlying architecture. Many researchers have explored a variety of optimization techniques for SpMV, and one of the most promising directions is adapting the storage format to the underlying architecture. Alternative storage formats can greatly reduce memory pressure, but they often leave computational resources underutilized. Therefore, a new storage format, called Compressed Sparse Row with Segmented Interleave Combination (SIC), is proposed. Stemming from the Compressed Sparse Row (CSR) format, SIC employs an interleave combination pattern that combines a certain number of CSR rows to form a new SIC row. To further improve performance, segmented processing is also introduced. Based on empirical data, we also develop an automatic SIC-based SpMV suitable for all matrices. Experimental results show that our approach outperforms the NVIDIA CSR vector kernel, achieving up to 12.6× speedup, and demonstrates comparable performance with the Hybrid format, with a peak speedup of 2.89×.
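As an illustration of the interleave-combination idea described above, the sketch below merges groups of CSR rows into a single interleaved stream, padding shorter rows so the layout stays regular. The group size, padding scheme, and helper names are assumptions for illustration; the actual SIC layout and its segmented processing in the paper may differ.

```python
import numpy as np

def csr_to_interleaved(values, col_idx, row_ptr, rows_per_group=4, pad_col=0):
    """Illustrative CSR -> interleaved layout: merge groups of CSR rows,
    interleaving their nonzeros so that adjacent slots in a group come from
    different rows (coalesced reads on a GPU). Shorter rows are zero-padded."""
    n_rows = len(row_ptr) - 1
    out_vals, out_cols, out_row_map = [], [], []
    for base in range(0, n_rows, rows_per_group):
        group = range(base, min(base + rows_per_group, n_rows))
        width = max(row_ptr[r + 1] - row_ptr[r] for r in group)
        for j in range(width):                  # j-th nonzero of each row in the group
            for r in group:
                if j < row_ptr[r + 1] - row_ptr[r]:
                    k = row_ptr[r] + j
                    out_vals.append(values[k]); out_cols.append(col_idx[k])
                else:                           # pad short rows to keep the layout regular
                    out_vals.append(0.0); out_cols.append(pad_col)
                out_row_map.append(r)
    return np.array(out_vals), np.array(out_cols), np.array(out_row_map)
```

With such a layout, threads assigned to one group read consecutive memory locations, and accumulating y[out_row_map[k]] += out_vals[k] * x[out_cols[k]] over all slots reproduces the CSR result.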
international symposium on parallel and distributed processing and applications | 2009
Zhiyuan Shao; Hai Jin; Yong Li
In this paper, we propose a scheme that manages the computational resources of virtual machines used to host high performance computing applications. Unlike the static configuration methodology employed by state-of-the-art virtual machine monitors, in our scheme the virtual machines are automatically configured according to the actual load generated by the applications. NPB, HPL, and kernel compilation are chosen as representative high performance computing applications to run inside virtual machines constructed with our scheme, and their performance is compared with that obtained from statically configured virtual machines. The comparison indicates that, besides the great flexibility it brings, the performance penalty incurred by our scheme is below 5% in most cases, and in some cases applications running inside the automatically configured virtual machines even outperform those running inside the statically configured ones.
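A minimal sketch of the load-driven configuration idea, assuming hypothetical hooks read_vm_cpu_util and set_vm_vcpus into the virtual machine monitor; the thresholds and control period are illustrative, not the paper's parameters.

```python
import time

HIGH, LOW = 0.85, 0.30          # per-VCPU utilization thresholds (illustrative)
MIN_VCPUS, MAX_VCPUS = 1, 8

def next_vcpu_count(util_per_vcpu, vcpus):
    """Grow when the guest is saturated, shrink when it is mostly idle."""
    if util_per_vcpu > HIGH and vcpus < MAX_VCPUS:
        return vcpus + 1
    if util_per_vcpu < LOW and vcpus > MIN_VCPUS:
        return vcpus - 1
    return vcpus

def autoscale(dom, read_vm_cpu_util, set_vm_vcpus, period=5.0):
    """Periodically resize a VM according to the load it actually generates."""
    vcpus = MIN_VCPUS
    while True:
        util = read_vm_cpu_util(dom) / vcpus    # average load per online VCPU
        new = next_vcpu_count(util, vcpus)
        if new != vcpus:
            set_vm_vcpus(dom, new)
            vcpus = new
        time.sleep(period)
```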
virtual execution environments | 2009
Huacai Chen; Hai Jin; Zhiyuan Shao; Ke Yu; Kun Tian
As an emerging trend, virtualization is more and more widely used in today's computing world. However, the introduction of virtual machines brings trouble for power management (PM for short), since the operating system can no longer directly access and control the hardware. Solutions have been proposed to manage power in the server consolidation case, but they are VMM-centric: the VMM gathers the PM decisions of the guests as hints and makes the final decision to manipulate the hardware. These solutions do not fit well for the virtualized desktop environment, which is highly interactive with users. In this paper, we propose a novel solution, called ClientVisor, to manage power in the virtualized desktop environment. The key idea of our scheme is to leverage the functionalities of the Commercial-Off-The-Shelf (COTS) operating system, which actually interacts with the user, to manage the power of the processor and the peripheral devices in all possible cases; the VMM coordinates the PM decisions of the guests only at key points. Through prototype implementation and experiments, we find that our scheme results in 22% lower power consumption in the static power usage scenario, and about 8% lower in the dynamic scenario, than the corresponding cases of Xen. Moreover, the experimental data show that deploying our scheme does not deteriorate the user experience.
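The following sketch illustrates only the coordination idea: the VMM lets guests make their own power-management decisions and intervenes just where requests from co-located guests must be reconciled. The policy shown (the host core goes only as deep as the shallowest requested C-state) is an assumption for illustration, not ClientVisor's actual mechanism.

```python
def host_cstate(guest_requests):
    """guest_requests maps guest id -> requested C-state (higher = deeper sleep).
    The physical core can only go as deep as the shallowest request."""
    if not guest_requests:
        return 0                                   # no hints: stay in C0 (running)
    return min(guest_requests.values())

# The interactive desktop guest requests C3 while a background service VM is busy.
print(host_cstate({"desktop": 3, "service_vm": 0}))   # -> 0, the core stays awake
print(host_cstate({"desktop": 3, "service_vm": 2}))   # -> 2, moderately deep idle
```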
international performance computing and communications conference | 2003
Zhiyuan Shao; Hai Jin; Bin Chen; Jie Xu; Jianhui Yue
Improving the availability of services is a key issue for the survivability of a cluster system. Many schemes have been proposed for this purpose, but most of them either enhance only service-level availability or are application-specific. In this paper, we propose a scheme called High Availability with Redundant TCP Stacks (HARTs), which provides connection-level availability by maintaining redundant TCP stacks for TCP connections at the server side. We present performance results from experiments on our HA cluster prototype. From the results, we find that the configuration of one primary server and one backup server running on a separate 100 Mbps Ethernet delivers acceptable performance for server-side applications while providing high availability.
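The sketch below only illustrates what connection-level state would need to be mirrored to a backup so that established TCP connections can survive a primary failure; the field names and the user-level transport are assumptions, and a real redundant-TCP-stack scheme such as HARTs operates inside the kernel's TCP stack.

```python
import json, socket
from dataclasses import dataclass, asdict

@dataclass
class TcpConnState:
    client_addr: str
    client_port: int
    snd_nxt: int          # next sequence number the server will send
    rcv_nxt: int          # next sequence number expected from the client
    unacked: bytes = b""  # data sent but not yet acknowledged

def mirror_to_backup(state: TcpConnState, backup=("10.0.0.2", 9000)):
    """Push a snapshot of one connection's state to the backup server
    over a dedicated replication link (addresses are placeholders)."""
    payload = asdict(state)
    payload["unacked"] = payload["unacked"].hex()   # bytes are not JSON-serializable
    with socket.create_connection(backup, timeout=1.0) as s:
        s.sendall(json.dumps(payload).encode())
```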
Concurrency and Computation: Practice and Experience | 2014
Xiaowen Feng; Hai Jin; Ran Zheng; Zhiyuan Shao; Lei Zhu
The challenge for Sparse Matrix-Vector multiplication (SpMV) performance is memory bandwidth, which depends largely on the input matrices and the underlying computing platforms. To address this challenge, many researchers have explored a variety of optimization techniques, and one of the most promising directions focuses on designing storage formats to represent sparse matrices. However, many prior storage formats cannot fully exploit the underlying computing platforms, resulting in unsatisfactory performance and a large memory footprint. Therefore, a novel storage format, called Segmented Hybrid ELL + Compressed Sparse Row (SHEC for short), is proposed to further improve throughput and reduce the memory footprint on the Graphics Processing Unit (GPU). The SHEC format employs an interleaved combination pattern, which combines a certain number of compressed rows to form a new SHEC row; segmentation is introduced to balance load and reduce the memory footprint. Based on empirical data, an automatic SHEC-based SpMV is developed to fit all matrices. Experimental results show that the SHEC approach outperforms the best results of the NVIDIA SpMV library and exhibits comparable performance with state-of-the-art storage formats on the standard dataset.
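For context, the sketch below shows the classic hybrid ELL + CSR split that SHEC builds on: the first few nonzeros of every row go into a padded, regular ELL block, and the overflow goes into a small CSR part. The interleaved combination and segmentation that distinguish SHEC are omitted, and the parameter names are illustrative.

```python
import numpy as np

def hybrid_ell_csr(rows, ell_width):
    """rows[i] is the list of (column, value) pairs of matrix row i.
    The first ell_width nonzeros of each row fill a padded ELL block
    (regular, GPU-friendly accesses); the overflow goes into a CSR part."""
    n = len(rows)
    ell_cols = np.zeros((n, ell_width), dtype=np.int32)
    ell_vals = np.zeros((n, ell_width), dtype=np.float64)
    csr_vals, csr_cols, csr_ptr = [], [], [0]
    for i, row in enumerate(rows):
        head, tail = row[:ell_width], row[ell_width:]
        for j, (c, v) in enumerate(head):
            ell_cols[i, j], ell_vals[i, j] = c, v
        for c, v in tail:                       # long rows spill into the CSR part
            csr_cols.append(c); csr_vals.append(v)
        csr_ptr.append(len(csr_vals))
    return (ell_cols, ell_vals), \
           (np.array(csr_vals), np.array(csr_cols), np.array(csr_ptr))
```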
grid and pervasive computing | 2013
Xiaohu Bai; Hai Jin; Xiaofei Liao; Xuanhua Shi; Zhiyuan Shao
Replica management has become a hot research topic in storage systems. This paper presents a dynamic replica management strategy based on response time, named RTRM. The RTRM strategy consists of replica creation, replica selection, and replica placement mechanisms. RTRM sets a threshold on response time: if the response time exceeds the threshold, RTRM increases the number of replicas by creating new ones. When a new request arrives, RTRM predicts the bandwidth among the replica servers and makes the replica selection accordingly. Replica placement refers to searching for a new replica placement location, which is an NP-hard problem; based on graph theory, this paper proposes a reduction algorithm to solve it. The simulation results show that the RTRM strategy outperforms the five built-in replica management strategies in terms of network utilization and service response time.
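A minimal sketch of the two decision points described above; the threshold value, the bandwidth predictor, and the replica-creation hook are hypothetical placeholders rather than RTRM's actual parameters.

```python
RESPONSE_THRESHOLD = 0.200   # seconds (illustrative)

def maybe_create_replica(observed_response_time, create_replica):
    """Add one more replica whenever service latency exceeds the threshold."""
    if observed_response_time > RESPONSE_THRESHOLD:
        create_replica()

def select_replica(replica_servers, predict_bandwidth, client):
    """Serve the request from the replica with the highest predicted
    bandwidth to the requesting client."""
    return max(replica_servers, key=lambda srv: predict_bandwidth(srv, client))
```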
Future Generation Computer Systems | 2013
Zhiyuan Shao; Ligang He; Zhiqiang Lu; Hai Jin
Nowadays, it is an important trend in the systems domain to use software-based virtualization technology to build execution environments (e.g., Clouds). After introducing the virtualization layer, there exist two schedulers: one in the hypervisor and the other inside the Guest Operating System (GOS). To fully understand the virtualized system and identify possible causes of performance problems incurred by virtualization, it is very important for system administrators and engineers to know the scheduling behavior of the hypervisor, in addition to understanding the scheduler inside the GOS. In this paper, we develop a virtualization scheduling analyzer, called VSA, to analyze the trace data of the Xen virtual machine monitor. With VSA, one can easily obtain the scheduling data associated with virtual processors (i.e., VCPUs) and physical processors (i.e., PCPUs), and further conduct scheduling analysis for a group of interacting VCPUs running in the same domain.
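The sketch below shows the kind of aggregation such an analyzer performs, assuming simplified scheduling records of the form (timestamp, pcpu, event, vcpu); real Xen trace data (e.g., from xentrace) is binary and far richer than this.

```python
from collections import defaultdict

def vcpu_runtime(records):
    """Accumulate how long each VCPU actually ran on each PCPU from a stream
    of simplified (timestamp, pcpu, event, vcpu) scheduling records."""
    running = {}                       # pcpu -> (vcpu, time it was switched in)
    runtime = defaultdict(float)       # (vcpu, pcpu) -> accumulated seconds
    for ts, pcpu, event, vcpu in records:
        if event == "switch_in":
            running[pcpu] = (vcpu, ts)
        elif event == "switch_out" and pcpu in running:
            prev_vcpu, start = running.pop(pcpu)
            runtime[(prev_vcpu, pcpu)] += ts - start
    return dict(runtime)

records = [(0.000, 0, "switch_in", "d1v0"), (0.004, 0, "switch_out", "d1v0"),
           (0.004, 0, "switch_in", "d2v0"), (0.010, 0, "switch_out", "d2v0")]
print(vcpu_runtime(records))   # d1v0 ran ~4 ms and d2v0 ~6 ms on PCPU 0
```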
modeling, analysis, and simulation on computer and telecommunication systems | 2011
Zhiyuan Shao; Qiang Wang; Xuejiao Xie; Hai Jin; Ligang He
Nowadays, it is an important trend in the systems domain to use software-based virtualization technology to build execution environments (e.g., Clouds) and serve high performance computing (HPC) applications. However, with the extra virtualization layer, application performance may be negatively affected. Studies have revealed that the communication performance of the MPI library, which is widely used by HPC applications, suffers a high penalty when a physical host machine becomes overcommitted with virtual processors (VCPUs). Unfortunately, this problem has not received enough attention and has not yet been solved in the literature. In this paper, we investigate the reasons behind the performance penalty and propose a solution to improve the communication performance of MPI applications running in overcommitted virtualized systems. The experimental results show that with our proposal, most HPC applications gain performance improvements to different extents in overcommitted systems, depending on their communication patterns and the overcommitting level.
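As a generic illustration of why overcommitment penalizes MPI-style communication (and explicitly not the paper's proposed mechanism): a receiver that spin-polls for an incoming message burns its entire VCPU timeslice even while the sender's VCPU is descheduled, whereas a receiver that yields returns the physical CPU to co-located VCPUs.

```python
import os

def recv_spin(poll):
    """Busy-wait for a message: wastes the whole VCPU timeslice whenever the
    peer VCPU is currently descheduled by the hypervisor."""
    while not poll():
        pass

def recv_yield(poll):
    """Yield while waiting: gives the physical CPU back so a co-located VCPU
    (possibly the sender's) can make progress."""
    while not poll():
        os.sched_yield()
```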
multimedia and ubiquitous engineering | 2008
Bo Li; Hai Jin; Zhiyuan Shao; Yong Li; Xin Liu
Ray tracing is a global-illumination-based rendering method that can produce very high-quality images, but rendering is also a very time-consuming procedure. In this paper, we map a Whitted-style recursive ray tracer onto the Cell Broadband Engine processor (Cell BE) with a number of optimization techniques tailored to the architectural characteristics of the Cell BE processor, including adaptive task scheduling, a software-managed cache, double buffering, and packets of primary rays processed through the SIMD unit. Through experiments, we show that our implementation can harness the processing power of the Cell BE to its full extent.
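A sketch of the "packets of primary rays" idea: camera rays for a small pixel tile are generated as arrays so that their x/y/z components can be processed with SIMD operations (NumPy stands in here for the Cell BE's vector units). The camera model and packet size are illustrative assumptions.

```python
import numpy as np

PACKET = 8   # 8x8 pixel tile per ray packet (illustrative)

def primary_ray_packet(tile_x, tile_y, width, height, fov=1.0):
    """Generate unit direction vectors for all primary rays of one tile,
    as a (PACKET, PACKET, 3) array suitable for vectorized processing."""
    px = tile_x * PACKET + np.arange(PACKET)
    py = tile_y * PACKET + np.arange(PACKET)
    u = (2.0 * (px + 0.5) / width - 1.0) * fov      # normalized image-plane x
    v = (1.0 - 2.0 * (py + 0.5) / height) * fov     # normalized image-plane y
    uu, vv = np.meshgrid(u, v)
    dirs = np.stack([uu, vv, -np.ones_like(uu)], axis=-1)
    dirs /= np.linalg.norm(dirs, axis=-1, keepdims=True)
    return dirs
```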
international workshop on education technology and computer science | 2009
Zhiyuan Shao; Hai Jin; Xiaowen Lu
A performance monitor for virtual machines can track the performance metrics of virtual machines and obtain their resource consumption status, thus providing a reliable basis for system performance evaluation and further management. However, at the current stage, such performance monitoring systems for virtualized environments are inadequate and inefficient. In this paper, we propose a lightweight performance monitoring system model for virtual machines, called PMonitor. Through prototype implementation and experiments, we find that PMonitor consumes very little processing power, much less than other open-source performance monitors running in virtualized environments.
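A minimal sketch of the low-overhead sampling such a monitor relies on, assuming a hypothetical read_vm_cpu_time hook that returns a virtual machine's cumulative CPU time (e.g., backed by the hypervisor's domain information); PMonitor's actual data sources and metrics may differ.

```python
import time

def sample_utilization(dom, read_vm_cpu_time, interval=1.0):
    """Read a cumulative per-VM CPU-time counter twice and report the
    utilization over the interval (fraction of one physical CPU used)."""
    t0, c0 = time.time(), read_vm_cpu_time(dom)
    time.sleep(interval)
    t1, c1 = time.time(), read_vm_cpu_time(dom)
    return (c1 - c0) / (t1 - t0)
```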