Kai Lu

National University of Defense Technology

Publication

Featured research published by Kai Lu.

Journal of Computer Science and Technology | 2011

The TianHe-1A Supercomputer: Its Hardware and Software

Xuejun Yang; Xiangke Liao; Kai Lu; Qing-Feng Hu; Junqiang Song; Jinshu Su

This paper presents an overview of the TianHe-1A (TH-1A) supercomputer, built by the National University of Defense Technology of China (NUDT). TH-1A adopts a hybrid architecture that integrates CPUs and GPUs, and its interconnect is a proprietary high-speed communication network. The theoretical peak performance of TH-1A is 4700 TFlops, and its LINPACK test result is 2566 TFlops. It ranked No. 1 on the TOP500 list released in November 2010. TH-1A is now deployed at the National Supercomputer Center in Tianjin and provides high-performance computing services. TH-1A has played an important role in many applications, such as oil exploration, weather forecasting, and biomedical research.
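
As a quick worked figure, the two performance numbers quoted in the abstract give the LINPACK efficiency of the machine. The symbols R_max (measured LINPACK result) and R_peak (theoretical peak) follow the usual TOP500 naming and are not used in the abstract itself.

```latex
\[
\text{LINPACK efficiency}
  = \frac{R_{\max}}{R_{\text{peak}}}
  = \frac{2566\ \text{TFlops}}{4700\ \text{TFlops}}
  \approx 0.546 \approx 54.6\%
\]
```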

ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming | 2014

Efficient deterministic multithreading without global barriers

Kai Lu; Xu Zhou; Tom Bergan; Xiaoping Wang

Multithreaded programs execute nondeterministically on conventional architectures and operating systems. This complicates many tasks, including debugging and testing. Deterministic multithreading (DMT) makes the output of a multithreaded program depend only on its inputs, which removes this source of difficulty. However, current DMT implementations suffer from a common inefficiency: they use frequent global barriers to enforce a deterministic ordering on memory accesses. In this paper, we eliminate that inefficiency using an execution model we call deterministic lazy release consistency (DLRC). Our execution model uses the Kendo algorithm to enforce a deterministic ordering on synchronization, and it uses a deterministic version of the lazy release consistency memory model to propagate memory updates across threads. Our approach guarantees that programs execute deterministically even when they contain data races. We implemented a DMT system based on these ideas (RFDet) and evaluated it using 16 parallel applications. Our implementation targets C/C++ programs that use POSIX threads. Results show that RFDet achieves nearly 2x speedup over DThreads, a state-of-the-art DMT system.

Asia-Pacific Workshop on Systems | 2012

NV-process: a fault-tolerance process model based on non-volatile memory

Xu Li; Kai Lu; Xiaoping Wang; Xu Zhou

The reliability wall is one of the most challenging problems for next-generation High Performance Computing (HPC) systems. Traditional system designs add extra fault-tolerance mechanisms. However, these mechanisms can themselves be very costly, which lowers the utilization of the HPC system. To address this problem, we present NV-process, a fault-tolerance process model based on NVRAM. NV-process instances run in a self-contained way in NVRAM and can therefore survive operating system reboots. NV-process provides an elegant way for applications to tolerate system crashes. We implement a prototype of NV-process on Linux and analyze its advantages over traditional fault-tolerance mechanisms for future HPC applications.

The Journal of Supercomputing | 2015

Detecting harmful data races through parallel verification

Zhendong Wu; Kai Lu; Xiaoping Wang; Xu Zhou; Chen Chen

Data races are widespread in concurrent programs, and harmful races have caused severe failures. To detect the harmful races, previous tools verify all the races and identify the harmful ones. However, efficiency suffers when a large number of races need to be verified. The multicore technology trend worsens this problem. Unlike previous work, we improve the efficiency of harmful race detection through parallel verification. We use imprecise race detection to find the races, including benign races and harmful races. The races are divided into many parts, and each part is sent to one machine for verification. On each machine, the races are verified dynamically, identifying the harmful races that would lead to program failures. To our knowledge, this is the first work that parallelizes race verification to improve efficiency. We have experimented on a number of real-world concurrent programs, and all the known harmful races in known benchmarks are detected. Additionally, our tool scales well as the number of machines increases, with speedup growing linearly with the number of machines. Compared with many previous tools, our work imposes lower runtime overhead.
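
The "divide the races into parts, one part per machine" step can be pictured with a small sketch. This is only an illustration under stated assumptions: the abstract does not say how races are represented or how the parts are chosen, so the `Race` tuple format, the `partition_races` helper, and the round-robin policy are all hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical representation: a candidate race is a pair of source locations.
Race = Tuple[str, str]


def partition_races(races: List[Race], num_machines: int) -> Dict[int, List[Race]]:
    """Split candidate races into one part per machine, round-robin.

    Round-robin is an assumption made for this sketch; the paper may
    use a different partitioning policy.
    """
    if num_machines <= 0:
        raise ValueError("num_machines must be positive")
    parts: Dict[int, List[Race]] = {m: [] for m in range(num_machines)}
    for i, race in enumerate(races):
        parts[i % num_machines].append(race)
    return parts


# Five made-up candidate races reported by an imprecise detector.
candidates: List[Race] = [
    ("a.c:10", "b.c:22"),
    ("a.c:14", "a.c:31"),
    ("c.c:5", "c.c:9"),
    ("d.c:7", "b.c:40"),
    ("e.c:3", "e.c:8"),
]
parts = partition_races(candidates, 2)
for machine, part in parts.items():
    print(machine, part)
```

Each machine would then verify only its own part dynamically, so adding machines shrinks the per-machine workload.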

ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming | 2013

RaceFree: an efficient multi-threading model for determinism

Kai Lu; Xu Zhou; Xiaoping Wang; Wenzhe Zhang; Gen Li

Current deterministic systems generally incur large overhead due to the difficulty of detecting and eliminating data races. This paper presents RaceFree, a novel multi-threading runtime that adopts a relaxed deterministic model to provide a data-race-free environment for parallel programs. This model cuts off unnecessary shared-memory communication by isolating threads in separated memories, which eliminates direct data races. Meanwhile, we leverage the happen-before relation defined by applications themselves as one-way communication pipes to perform necessary thread communication. Shared-memory communication is transparently converted to message-passing style communication by our Memory Modification Propagation (MMP) mechanism, which propagates local memory modifications to other threads through the happen-before relation pipes. The overhead of RaceFree is 67.2% according to our tests on parallel benchmarks.

Parallel, Distributed and Network-Based Processing | 2015

RaceChecker: Efficient Identification of Harmful Data Races

Kai Lu; Zhendong Wu; Xiaoping Wang; Chen Chen; Xu Zhou

Data races hidden in concurrent programs have caused severe failures. To improve reliability, many race detectors have been proposed. However, most of the reported races are not harmful, so identifying the harmful ones takes considerable manual effort. This paper proposes RaceChecker, which detects potential races and identifies the harmful ones effectively and efficiently. Unlike previous detectors, RaceChecker combines the happens-before relation and ad-hoc synchronization to prune infeasible races, so that fewer potential races need to be verified. Before verification, RaceChecker groups the remaining potential races, guaranteeing that the potential races in one group do not interfere with each other. Therefore, multiple potential races in one group can be verified together in one execution. To our knowledge, this is the first effective technique that groups potential races to improve efficiency. Unlike previous detectors that verify one potential race per execution, RaceChecker dynamically controls the thread scheduler to create real race conditions and verify multiple potential races in one execution, identifying the harmful races that cause program failures. We have implemented RaceChecker as a prototype tool and have experimented on a number of real-world concurrent programs. Results show that 66% of the potential races are infeasible and that the grouping strategy eliminates nearly 48% of the executions. The known harmful races are also identified effectively. By pruning and grouping, RaceChecker identifies the harmful races more efficiently. Compared with RaceMob and RaceFuzzer, the time is reduced significantly, by an average of 45% and 81% respectively.
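
To make the happens-before pruning idea concrete, here is a minimal sketch using vector clocks, a standard way to represent the happens-before relation. The abstract does not say how RaceChecker tracks happens-before, so the vector-clock encoding and the `happens_before` and `may_race` helpers are assumptions for illustration; ad-hoc synchronization handling is not modelled.

```python
from typing import Dict

# Assumed encoding: a vector clock maps thread id -> logical time.
VectorClock = Dict[str, int]


def happens_before(a: VectorClock, b: VectorClock) -> bool:
    """True if a <= b in every component and a < b in at least one."""
    threads = set(a) | set(b)
    no_greater = all(a.get(t, 0) <= b.get(t, 0) for t in threads)
    strictly_less = any(a.get(t, 0) < b.get(t, 0) for t in threads)
    return no_greater and strictly_less


def may_race(a: VectorClock, b: VectorClock) -> bool:
    """Two accesses stay race candidates only if neither is ordered before the other."""
    return not happens_before(a, b) and not happens_before(b, a)


# T1 writes, then releases a lock that T2 acquires before reading:
write_t1 = {"T1": 1}
read_t2 = {"T1": 1, "T2": 1}
# T2 writes with no synchronization against T1:
write_t2 = {"T2": 1}

print(may_race(write_t1, read_t2))   # ordered by the lock, so pruned
print(may_race(write_t1, write_t2))  # concurrent, so kept for verification
```

Pairs that are ordered by happens-before cannot execute concurrently, so a detector can drop them before the more expensive dynamic verification step.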

Archive | 2013

SCM-BSIM: A Non-Volatile Memory Simulator Based on BOCHS

Guoliang Zhu; Kai Lu; Xu Li

New storage-class memory (SCM) technologies, such as phase-change memory, are fast, non-volatile and byte-addressable. SCM opens a new realm for researchers seeking to boost system performance. However, most SCM devices are not yet available on the market, which hinders software research on leveraging the full feature set of SCM. In this paper we design and implement SCM-BSIM, an SCM device simulator built on BOCHS. SCM-BSIM can mimic the full feature set of SCM, such as non-volatility and different access latencies. It also gathers lifespan statistics during simulation to support endurance-related research. With a BOCHS-based interface, SCM-BSIM is easy to use.

International Conference on Computational and Information Sciences | 2012

SIM-PCM: A PCM Simulator Based on Simics

Xu Li; Kai Lu; Xu Zhou

Phase change memory (PCM) is an emerging memory technology that is fast, byte-addressable and non-volatile. Because of these attractive features, PCM is the most promising technology to replace DRAM as main memory. PCM also poses new challenges for researchers integrating it into the system memory hierarchy. Several approaches currently leverage the features of PCM to improve system performance. However, as PCM is not available on the market, current research can only use DRAM to simulate PCM. Moreover, no existing work simulates the full features of PCM. In this paper, we design and implement SIM-PCM, a device-level PCM simulator. SIM-PCM is the first PCM simulator that supports PCM features including latency and non-volatility. SIM-PCM is also very easy to use.

International Conference on Quality Software | 2013

ColFinder Collaborative Concurrency Bug Detection

Zhendong Wu; Kai Lu; Xiaoping Wang; Xu Zhou

Many concurrency bugs are extremely difficult to detect with random testing because of the huge input space and the huge interleaving space. The multicore technology trend worsens this problem. We propose an innovative, collaborative approach called ColFinder to detect concurrency bugs effectively and efficiently. ColFinder uses static analysis to identify potentially buggy statements. With respect to these statements, ColFinder uses program slicing to cut the original programs into smaller programs. Finally, it uses dynamic active testing to verify whether the potentially buggy statements trigger real bugs. We implement a prototype of ColFinder and evaluate it with several real-world programs. It significantly improves the probability of bug manifestation, from 0.75% to 89%. Additionally, program slicing reduces the time to bug manifestation by 33% on average.

Journal of Computer Science and Technology | 2015

An Efficient and Flexible Deterministic Framework for Multithreaded Programs

Kai Lu; Xu Zhou; Xiaoping Wang; Tom Bergan; Chen Chen

Determinism is very useful for debugging and testing multithreaded programs. Many deterministic approaches have been proposed, such as deterministic multithreading (DMT) and deterministic replay. However, these systems are either inefficient or target a single purpose, which limits their flexibility. In this paper, we propose an efficient and flexible deterministic framework for multithreaded programs. Our framework implements determinism in two steps: relaxed determinism and strong determinism. Relaxed determinism solves data races efficiently by using a proper weak memory consistency model. After that, we implement strong determinism by solving lock contentions deterministically. Since we can apply different approaches for these two steps independently, our framework provides a spectrum of deterministic choices, including a nondeterministic system (fast), a weak deterministic system (fast and conditionally deterministic), a DMT system, and a deterministic replay system. Our evaluation shows that the DMT configuration of this framework can even outperform a state-of-the-art DMT system.
