Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Mai Zheng is active.

Publication


Featured research published by Mai Zheng.


ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming | 2011

GRace: a low-overhead mechanism for detecting data races in GPU programs

Mai Zheng; Vignesh T. Ravi; Feng Qin; Gagan Agrawal

In recent years, GPUs have emerged as an extremely cost-effective means for achieving high performance. Many application developers, including those with no prior parallel programming experience, are now trying to scale their applications using GPUs. While languages like CUDA and OpenCL have eased GPU programming for non-graphical applications, they are still explicitly parallel languages. All parallel programmers, particularly novices, need tools that can help ensure the correctness of their programs. As in any multithreaded environment, data races on GPUs can severely affect program reliability. Thus, tool support for detecting race conditions can significantly benefit GPU application developers. Existing approaches for detecting data races on CPUs or GPUs have one or more of the following limitations: 1) being ill-suited for handling non-lock synchronization primitives on GPUs; 2) lacking scalability due to the state explosion problem; 3) reporting many false positives because of simplified modeling; and/or 4) incurring prohibitive runtime and space overhead. In this paper, we propose GRace, a new mechanism for detecting races in GPU programs that combines static analysis with a carefully designed dynamic checker for logging and analyzing information at runtime. Our design utilizes the GPU's memory hierarchy to log runtime data accesses efficiently. To improve performance, GRace leverages static analysis to reduce the number of statements that need to be instrumented. Additionally, by exploiting knowledge of the thread scheduling and execution model of the underlying GPUs, GRace can accurately detect data races with no false positives reported. Based on the above idea, we have built a prototype of GRace with two schemes, i.e., GRace-stmt and GRace-addr, for NVIDIA GPUs. Both schemes are integrated with the same static analysis. We have evaluated GRace-stmt and GRace-addr with three data race bugs in three GPU kernel functions and have also compared them with an existing approach, referred to as B-tool. Our experimental results show that both schemes of GRace are effective in detecting all evaluated cases with no false positives, whereas B-tool reports many false positives for one evaluated case. On the one hand, GRace-addr incurs low runtime overhead, i.e., 22-116%, and low space overhead, i.e., 9-18MB, for the evaluated kernels. On the other hand, GRace-stmt offers more help in diagnosing data races, at the cost of larger overhead.
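
To make the conflict-checking idea concrete, here is a minimal Python sketch (not GRace's actual implementation; the access-log representation is an assumption for illustration) of how a dynamic checker can flag two logged shared-memory accesses as racing: same address, different threads, and at least one write.

from collections import namedtuple

# One logged shared-memory access: thread id, address, write flag, source statement.
Access = namedtuple("Access", ["tid", "addr", "is_write", "stmt"])

def find_conflicts(log):
    """Report pairs of logged accesses to the same address from different
    threads where at least one access is a write."""
    conflicts = []
    by_addr = {}
    for acc in log:
        for prev in by_addr.get(acc.addr, []):
            if prev.tid != acc.tid and (prev.is_write or acc.is_write):
                conflicts.append((prev, acc))
        by_addr.setdefault(acc.addr, []).append(acc)
    return conflicts

# Example: two threads touch shared address 0x10 and one of them writes.
log = [Access(0, 0x10, True, "s1"), Access(1, 0x10, False, "s2")]
print(find_conflicts(log))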


Working Conference on Reverse Engineering | 2012

Modeling Software Execution Environment

Dawei Qi; William N. Sumner; Feng Qin; Mai Zheng; Xiangyu Zhang; Abhik Roychoudhury

Software execution environment, interfaced with software through library functions and system calls, constitutes an important aspect of the softwares semantics. Software analysis ought to take the execution environment into consideration. However, due to lack of source code and the inherent implementation complexity of these functions, it is quite difficult to co-analyze software and its environment. In this paper, we propose to extend program synthesis techniques to construct models for system and library functions. The technique samples the behavior of the original implementation of a function. The samples are used as the specification to synthesize the model, which is a C program. The generated model is iteratively refined. We have developed a prototype that can successfully construct models for a pool of system and library calls from their real world complex binary implementations. Moreover, our experiments have shown that the constructed models can improve dynamic test generation and failure tolerance.
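
The sample-and-synthesize loop can be illustrated with a toy example (the candidate grammar below is invented and far simpler than the paper's C-level models): collect input/output samples from an opaque function, then search a space of candidate expressions for one consistent with all samples.

def synthesize(samples):
    """Return the first candidate expression that matches every (input, output) sample."""
    # A tiny, hand-written candidate space; a real synthesizer searches a much richer one.
    candidates = {
        "x": lambda x: x,
        "x + 1": lambda x: x + 1,
        "abs(x)": lambda x: abs(x),
        "2 * x": lambda x: 2 * x,
    }
    for expr, fn in candidates.items():
        if all(fn(x) == y for x, y in samples):
            return expr
    return None

opaque = abs  # stands in for an opaque library or system call
samples = [(x, opaque(x)) for x in (-3, -1, 0, 2, 5)]
print(synthesize(samples))  # -> "abs(x)"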


IEEE Transactions on Parallel and Distributed Systems | 2014

GMRace: Detecting Data Races in GPU Programs via a Low-Overhead Scheme

Mai Zheng; Vignesh T. Ravi; Feng Qin; Gagan Agrawal

In recent years, GPUs have emerged as an extremely cost-effective means for achieving high performance. While languages like CUDA and OpenCL have eased GPU programming for nongraphical applications, they are still explicitly parallel languages. All parallel programmers, particularly novices, need tools that can help ensure the correctness of their programs. As in any multithreaded environment, data races on GPUs can severely affect program reliability. In this paper, we propose GMRace, a new mechanism for detecting races in GPU programs. GMRace combines static analysis with a carefully designed dynamic checker for logging and analyzing information at runtime. Our design utilizes the GPU's memory hierarchy to log runtime data accesses efficiently. To improve performance, GMRace leverages static analysis to reduce the number of statements that need to be instrumented. Additionally, by exploiting knowledge of the thread scheduling and execution model of the underlying GPUs, GMRace can accurately detect data races with no false positives reported. Our experimental results show that, compared to previous approaches, GMRace is more effective in detecting races in the evaluated cases and incurs much less runtime and space overhead.
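
One way to picture the static-pruning step (a simplification, not GMRace's actual analysis; the statement representation below is hypothetical) is as a filter that keeps for runtime logging only the shared-memory accesses that could participate in a race and leaves everything else uninstrumented.

def needs_instrumentation(stmt, shared_written_vars):
    """Only shared-memory accesses can race, and reads of data that no
    statement in the kernel ever writes cannot be part of a race either."""
    if stmt["space"] != "shared":
        return False
    return stmt["is_write"] or stmt["var"] in shared_written_vars

kernel_stmts = [
    {"id": "s1", "space": "shared", "var": "tile", "is_write": True},
    {"id": "s2", "space": "shared", "var": "coeff", "is_write": False},
    {"id": "s3", "space": "global", "var": "out", "is_write": True},
]
written = {s["var"] for s in kernel_stmts if s["space"] == "shared" and s["is_write"]}
print([s["id"] for s in kernel_stmts if needs_instrumentation(s, written)])  # ['s1']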


ACM Transactions on Computer Systems | 2017

Reliability Analysis of SSDs Under Power Fault

Mai Zheng; Joseph Tucek; Feng Qin; Mark David Lillibridge; Bill W. Zhao; Elizabeth S. Yang

Modern storage technology (solid-state disks (SSDs), NoSQL databases, commoditized RAID hardware, etc.) brings new reliability challenges to the already-complicated storage stack. Among other things, the behavior of these new components during power faults—which happen relatively frequently in data centers—is an important yet mostly ignored issue in this dependability-critical area. Understanding how new storage components behave under power fault is the first step towards designing new robust storage systems. In this article, we propose a new methodology to expose reliability issues in block devices under power faults. Our framework includes specially designed hardware to inject power faults directly to devices, workloads to stress storage components, and techniques to detect various types of failures. Applying our testing framework, we test 17 commodity SSDs from six different vendors using more than three thousand fault injection cycles in total. Our experimental results reveal that 14 of the 17 tested SSD devices exhibit surprising failure behaviors under power faults, including bit corruption, shorn writes, unserializable writes, metadata corruption, and total device failure.
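
A simplified post-crash check in the spirit of this methodology might look as follows (the bookkeeping and classification rules are illustrative assumptions, not the authors' exact design): the workload remembers what it wrote and had acknowledged, and a scan after the fault distinguishes plain loss of the tail from corrupted or shorn survivors and from ordering violations.

def classify(issued, readback):
    """issued: ordered list of (seq, payload) acknowledged before the fault;
    readback: dict seq -> payload recovered from the device afterwards."""
    issues = []
    missing_earlier = False
    for seq, payload in issued:
        data = readback.get(seq)
        if data is None:
            missing_earlier = True  # losing only the tail is tolerable
        elif data != payload:
            issues.append(("corruption_or_shorn_write", seq))
        elif missing_earlier:
            # a later write survived while an earlier acknowledged one is gone
            issues.append(("unserializable_write", seq))
    return issues

print(classify([(1, b"a"), (2, b"b"), (3, b"c")],
               {1: b"a", 3: b"c"}))  # -> [('unserializable_write', 3)]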


Modeling, Analysis, and Simulation of Computer and Telecommunication Systems | 2013

LiU: Hiding Disk Access Latency for HPC Applications with a New SSD-Enabled Data Layout

Dachuan Huang; Xuechen Zhang; Wei Shi; Mai Zheng; Song Jiang; Feng Qin

Unlike in the consumer electronics and personal computing areas, in the HPC environment hard disks can hardly be replaced by SSDs. The reasons include hard disks' large capacity, very low price, and decent peak throughput. However, when latency dominates the I/O performance (e.g., when accessing random data), the hard disk's performance can be compromised. If the issue of high latency could be effectively solved, the HPC community would enjoy large, affordable, and fast storage without having to replace disks completely with expensive SSDs. In this paper, we propose an almost latency-free, hard-disk-dominated storage system for HPC called LiU. The key technique is leveraging a limited amount of SSD storage for its low-latency access, and changing the data layout in a hybrid storage hierarchy with low-latency SSD at the top and high-latency hard disk at the bottom. If a segment of data would be randomly accessed, we lift its top part (the head) up in the hierarchy to the SSD and leave the remaining part (the body) untouched on the disk. As a result, the latency of accessing the whole segment can be removed, because the access latency of the body can be hidden by the access time of the head on the SSD. Combined with the effect of prefetching a large segment, LiU (Lift it Up) can effectively remove disk access latency, so the disk's high peak throughput can be fully exploited for data-intensive HPC applications. We have implemented a prototype of LiU in the PVFS parallel file system and evaluated it with representative MPI-IO micro-benchmarks, including mpi-io-test, mpi-tile-io, and ior-mpi-io, and one macro-benchmark, BTIO. Our experimental results show that LiU can effectively improve the I/O performance of HPC applications, with a throughput improvement ratio of up to 5.8. Furthermore, LiU brings much greater benefits to sequential-I/O MPI applications when the applications are interfered with by other workloads. For example, LiU improves the I/O throughput of mpi-io-test, which is under interference, by 1.1-3.4 times, while improving the same workload without interference by 15%.
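
A back-of-the-envelope reading of the layout rule (parameter values are made up, and this is not the paper's exact sizing policy): the head kept on the SSD should take at least as long to read as the disk needs to position for the body, so the body's seek and rotation latency stays hidden behind the head transfer.

def head_size_bytes(disk_positioning_s, ssd_read_bw_bytes_per_s):
    # Reading the head off the SSD must cover the time the disk spends
    # positioning its arm before the body can start streaming.
    return disk_positioning_s * ssd_read_bw_bytes_per_s

# e.g. ~10 ms average seek + rotation, ~400 MB/s SSD read bandwidth
print(head_size_bytes(0.010, 400e6) / 1e6, "MB per segment head")  # ~4 MB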


International Conference on Networking, Architecture, and Storage | 2016

Emulating Realistic Flash Device Errors with High Fidelity

Simeng Wang; Jinrui Cao; Danny V. Murillo; Yiliang Shi; Mai Zheng

Modern storage software is designed to guarantee data integrity and consistency based on decades of experience with the foibles of hard disk drives. However, recent research shows that flash-based SSDs may fail in different and surprising ways, breaking their contract with the software above them. This raises the question of whether the software stack's guarantees to users still hold when SSDs are substituted. In this position paper, we propose a framework to emulate the erroneous behaviors of SSDs for understanding the failure resilience of the storage software stack. We first model the device behaviors reported in previous work and create a database of realistic patterns of device errors. Based on the patterns, the framework manipulates the I/O commands at the driver level and emulates the device errors with minimal disturbance to the target software. Preliminary results show that the framework is able to emulate the device errors with high fidelity, which provides a solid foundation for further studying the failure handling of the storage software stack.
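
A user-level stand-in for this driver-level injector could be sketched as follows (the pattern names and the in-memory device are assumptions for illustration): every I/O command passes through an emulated device that consults an error-pattern table and perturbs reads and writes accordingly.

import random

class EmulatedDevice:
    def __init__(self, patterns, block_size=512):
        self.blocks = {}          # logical block address -> bytes
        self.patterns = patterns  # e.g. {"flip_bit": 1e-3, "drop_write": 5e-4}
        self.block_size = block_size

    def write(self, lba, data):
        if random.random() < self.patterns.get("drop_write", 0.0):
            return                # acknowledged but silently lost
        self.blocks[lba] = bytes(data)

    def read(self, lba):
        data = bytearray(self.blocks.get(lba, b"\x00" * self.block_size))
        if random.random() < self.patterns.get("flip_bit", 0.0):
            data[0] ^= 0x01       # single-bit corruption on the way out
        return bytes(data)

dev = EmulatedDevice({"flip_bit": 0.001, "drop_write": 0.0005})
dev.write(7, b"x" * 512)
assert len(dev.read(7)) == 512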


IEEE International Conference on High Performance Computing, Data, and Analytics | 2012

GMProf: A low-overhead, fine-grained profiling approach for GPU programs

Mai Zheng; Vignesh T. Ravi; Wenjing Ma; Feng Qin; Gagan Agrawal

Driven by cost-effectiveness and power efficiency, GPUs are being increasingly used to accelerate computations in many domains. However, developing highly efficient GPU implementations requires a lot of expertise and effort. Thus, tool support for tuning GPU programs is urgently needed; more specifically, low-overhead mechanisms for collecting fine-grained runtime information are critically required. Unfortunately, profiling tools and mechanisms available today either collect very coarse-grained information or have prohibitive overheads. This paper presents a low-overhead and fine-grained profiling technique developed specifically for GPUs, which we refer to as GMProf. GMProf uses two ideas to help reduce the overheads of collecting fine-grained information. The first idea involves exploiting a number of GPU architectural features to collect reasonably accurate information very efficiently, and the second idea is to use simple static analysis methods to reduce the overhead of runtime profiling. The specific implementation of GMProf we report in this paper focuses on shared memory usage. In particular, we help programmers understand (1) which locations in shared memory are infrequently accessed, and (2) which data elements in device memory are frequently accessed. We have evaluated GMProf using six popular GPU kernels with different characteristics. Our experimental results show that GMProf, with all optimizations, incurs a moderate overhead, e.g., 1.36 times on average for shared memory profiling. Furthermore, for three of the six evaluated kernels, GMProf verified that shared memory is used effectively, and for the remaining three kernels, it not only helped accurately identify the inefficient use of shared memory, but also helped tune the implementations. The resulting tuned implementations achieved a speedup of 15.18 times on average.
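
The core counting step can be pictured with a small host-side sketch (on a real GPU the counters would live in shared or device memory and be updated by instrumented code; the bucketing knob below is an illustrative simplification of the overhead/precision trade-off):

def profile_shared_accesses(accesses, shared_mem_words, bucket=1):
    """accesses: iterable of shared-memory word indices touched at runtime;
    larger 'bucket' values trade precision for lower counter overhead."""
    counts = [0] * ((shared_mem_words + bucket - 1) // bucket)
    for idx in accesses:
        counts[idx // bucket] += 1
    cold = [i for i, c in enumerate(counts) if c == 0]
    return counts, cold  # 'cold' buckets suggest shared memory that earns no reuse

counts, cold = profile_shared_accesses([0, 0, 0, 1, 5], shared_mem_words=8, bucket=2)
print(counts, cold)  # [4, 0, 1, 0] [1, 3]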


International Conference on Computing, Networking and Communications (ICNC) | 2017

A command-level study of Linux kernel bugs

Yiliang Shi; Danny V. Murillo; Simeng Wang; Jinrui Cao; Mai Zheng

As computer systems increase in size and complexity, bugs become ever subtler and more difficult to detect and diagnose. A bug could exist at different layers of a computer system (e.g., applications, shared libraries, file systems, device firmware), or could be caused by incompatibility among layers. In many cases, bugs require a very specific combination of events to be triggered and are difficult to replicate, making detection and diagnosis more complicated. Most existing debugging tools focus on a single layer of the system (e.g., applications); they are intrusive to the target layer and are fundamentally limited for analyzing issues involving multiple layers. As the first step towards building a multi-layer diagnostic framework, this paper presents our efforts to study the behaviors of Linux kernel bugs at the host-device interface. More specifically, we designed workloads to trigger known kernel bugs, recorded the SCSI commands observed under different kernels, and analyzed the impact of kernel bug patches after multiple runs of the workloads by counting the occurrences of individual SCSI commands. Our preliminary results show that it is possible to identify potential synchronization bugs in the kernel from information at the host-device interface level.
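
The command-counting comparison can be sketched as follows (the trace format is an assumption): build a histogram of SCSI opcodes per run, then diff the histograms produced by the same workload on two kernel versions; a change in, say, cache-flush commands after a patch is the kind of signal the study looks for.

from collections import Counter

def opcode_histogram(trace_lines):
    """trace_lines: strings whose first token is a SCSI opcode name,
    e.g. 'WRITE_10 lba=8' (the trace format is an assumption)."""
    return Counter(line.split()[0] for line in trace_lines if line.strip())

def histogram_diff(before, after):
    ops = set(before) | set(after)
    return {op: after.get(op, 0) - before.get(op, 0)
            for op in ops if before.get(op, 0) != after.get(op, 0)}

old = opcode_histogram(["WRITE_10 lba=8", "SYNCHRONIZE_CACHE", "WRITE_10 lba=16"])
new = opcode_histogram(["WRITE_10 lba=8", "WRITE_10 lba=16"])
print(histogram_diff(old, new))  # {'SYNCHRONIZE_CACHE': -1}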


1st Joint International Workshop on Parallel Data Storage and Data Intensive Scalable Computing Systems (PDSW-DISCS) | 2016

A generic framework for testing parallel file systems

Jinrui Cao; Simeng Wang; Dong Dai; Mai Zheng; Yong Chen

Large-scale parallel file systems are of prime importance today. However, despite their importance, their failure-recovery capability is much less studied compared with that of local storage systems. Recent studies of local storage systems have exposed various vulnerabilities that could lead to data loss under failure events, which raises concerns for parallel file systems built on top of them. This paper proposes a generic framework for testing the failure handling of large-scale parallel file systems. The framework captures all disk I/O commands on all storage nodes of the target system to emulate realistic failure states, and checks whether the target system can recover to a consistent state without incurring data loss. We have built a prototype for the Lustre file system. Our preliminary results show that the framework is able to uncover the internal I/O behavior of Lustre under different workloads and failure conditions, which provides a solid foundation for further analyzing the failure recovery of parallel file systems.
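
A schematic of the capture-and-replay idea (the interfaces and log format are hypothetical, not the prototype's actual ones): keep the per-node command logs, materialize the device state that would exist if each node only persisted a prefix of its log, and then let the file system's own recovery tools run against that state.

def emulate_failure_state(command_logs, cut_points):
    """command_logs: {node: [ {'op': 'write', 'lba': ..., 'data': ...}, ... ]};
    cut_points: {node: number of leading commands assumed durable}."""
    images = {}
    for node, cmds in command_logs.items():
        image = {}
        for cmd in cmds[:cut_points[node]]:
            if cmd["op"] == "write":
                image[cmd["lba"]] = cmd["data"]
        images[node] = image
    return images

state = emulate_failure_state(
    {"oss0": [{"op": "write", "lba": 0, "data": b"A"},
              {"op": "write", "lba": 1, "data": b"B"}]},
    {"oss0": 1})
print(state)  # {'oss0': {0: b'A'}}, i.e. the second write never became durable

A consistency checker would then mount or repair the target file system (e.g., Lustre) on top of these images and compare the surviving files against the workload's expected contents.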


International Conference on Supercomputing | 2018

PFault: A General Framework for Analyzing the Reliability of High-Performance Parallel File Systems

Jinrui Cao; Om Rameshwar Gatla; Mai Zheng; Dong Dai; Vidya Eswarappa; Yan Mu; Yong Chen

High-performance parallel file systems (PFSes) are of prime importance today. However, despite the importance, their reliability is much less studied compared with that of local storage systems, largely due to the lack of an effective analysis methodology. In this paper, we introduce PFault, a general framework for analyzing the failure handling of PFSes. PFault automatically emulates the failure state of each storage device in the target PFS based on a set of well-defined fault models, and enables systematically analyzing the recoverability of the PFS under faults. To demonstrate its practicality, we apply PFault to study Lustre, one of the most widely used PFSes. Our analysis reveals a number of cases where Lustre's checking and repair utility, LFSCK, fails with unexpected symptoms (e.g., I/O error, hang, reboot). Moreover, with the help of PFault, we are able to identify a resource-leak problem where a portion of Lustre's internal namespace and storage space becomes unusable even after running LFSCK. On the other hand, we also verify that the latest Lustre has made noticeable improvements in failure handling compared to a previous version. We hope our study and framework can help improve PFSes for reliable high-performance computing.
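
At a high level, the analysis loop can be sketched as follows (the fault-model names and helper callbacks are placeholders, not PFault's real interfaces): for each fault model and each storage node, emulate the failure state, run the checker, and record both its symptom and any post-repair resource leak.

def analyze(pfs_nodes, fault_models, inject_fault, run_checker, check_namespace):
    """inject_fault/run_checker/check_namespace stand in for the framework's
    fault emulation, LFSCK invocation, and post-repair verification."""
    results = []
    for model in fault_models:
        for node in pfs_nodes:
            inject_fault(node, model)   # emulate the device's failure state
            status = run_checker()      # e.g. 'ok', 'I/O error', 'hang', 'reboot'
            leaked = check_namespace()  # namespace or storage left unusable?
            results.append((model, node, status, leaked))
    return results

# Dummy callbacks just to show the shape of a run.
print(analyze(["mds0", "oss0"], ["device_failure"],
              lambda n, m: None, lambda: "ok", lambda: False))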

Collaboration


Dive into Mai Zheng's collaborations.

Top Co-Authors

Feng Qin (Ohio State University)
Jinrui Cao (New Mexico State University)
Li Li (Ohio State University)
Om Rameshwar Gatla (New Mexico State University)
Simeng Wang (New Mexico State University)