Is this you? Create Your Porfile

Mian Lu

Hong Kong University of Science and Technology

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Mian Lu is active.

Explore More

Publication

Featured researches published by Mian Lu.

ACM Transactions on Database Systems | 2009

Relational query coprocessing on graphics processors

Bingsheng He; Mian Lu; Ke Yang; Rui Fang; Naga K. Govindaraju; Qiong Luo; Pedro V. Sander

Graphics processors (GPUs) have recently emerged as powerful coprocessors for general purpose computation. Compared with commodity CPUs, GPUs have an order of magnitude higher computation power as well as memory bandwidth. Moreover, new-generation GPUs allow writes to random memory locations, provide efficient interprocessor communication through on-chip local memory, and support a general purpose parallel programming model. Nevertheless, many of the GPU features are specialized for graphics processing, including the massively multithreaded architecture, the Single-Instruction-Multiple-Data processing style, and the execution model of a single application at a time. Additionally, GPUs rely on a bus of limited bandwidth to transfer data to and from the CPU, do not allow dynamic memory allocation from GPU kernels, and have little hardware support for write conflicts. Therefore, a careful design and implementation is required to utilize the GPU for coprocessing database queries. In this article, we present our design, implementation, and evaluation of an in-memory relational query coprocessing system, GDB, on the GPU. Taking advantage of the GPU hardware features, we design a set of highly optimized data-parallel primitives such as split and sort, and use these primitives to implement common relational query processing algorithms. Our algorithms utilize the high parallelism as well as the high memory bandwidth of the GPU, and use parallel computation and memory optimizations to effectively reduce memory stalls. Furthermore, we propose coprocessing techniques that take into account both the computation resources and the GPU-CPU data transfer cost so that each operator in a query can utilize suitable processors—the CPU, the GPU, or both—for an optimized overall performance. We have evaluated our GDB system on a machine with an Intel quad-core CPU and an NVIDIA GeForce 8800 GTX GPU. Our workloads include microbenchmark queries on memory-resident data as well as TPC-H queries that involve complex data types and multiple query operators on data sets larger than the GPU memory. Our results show that our GPU-based algorithms are 2--27x faster than their optimized CPU-based counterparts on in-memory data. Moreover, the performance of our coprocessing scheme is similar to, or better than, both the GPU-only and the CPU-only schemes.

data management on new hardware | 2010

Supporting extended precision on graphics processors

Mian Lu; Bingsheng He; Qiong Luo

Scientific computing applications often require support for non-traditional data types, for example, numbers with a precision higher than 64-bit floats. As graphics processors, or GPUs, have emerged as a powerful accelerator for scientific computing, we design and implement a GPU-based extended precision library to enable applications with high precision requirement to run on the GPU. Our library contains arithmetic operators, mathematical functions, and data-parallel primitives, each of which can operate at either multi-term or multi-digit precision. The multi-term precision maintains an accuracy of up to 212 bits of signifcand whereas the multi-digit precision allows an accuracy of an arbitrary number of bits. Additionally, we have integrated the extended precision algorithms to a GPU-based query processing engine to support efficient query processing with extended precision on GPUs. To demonstrate the usage of our library, we have implemented three applications: parallel summation in climate modeling, Newtons method used in nonlinear physics, and high precision numerical integration in experimental mathematics. The GPU-based implementation is up to an order of magnitude faster, and achieves the same accuracy as their optimized, quadcore CPU-based counterparts.

international conference on parallel processing | 2011

GSNP: A DNA Single-Nucleotide Polymorphism Detection System with GPU Acceleration

Mian Lu; Jiuxin Zhao; Qiong Luo; Bingqiang Wang; Shaohua Fu; Zhe Lin

We have developed GSNP, a software package with GPU acceleration, for single-nucleotide polymorphism detection on DNA sequences generated from second-generation sequencing equipment. Compared with SOAPsnp, a popular, high-performance CPU-based SNP detection tool, GSNP has several distinguishing features: First, we design a sparse data representation format to reduce memory access as well as branch divergence. Second, we develop a multipass sorting network to efficiently sort a large number of small arrays on the GPU. Third, we compute a table of frequently used scores once to avoid repeated, expensive computation and to reduce random memory access. Fourth, we apply customized compression schemes to the output data to improve the I/O performance. As a result, on a server equipped with an Intel Xeon E5630 2.53 GHZ CPU and an NVIDIA Tesla M2050 GPU, it took GSNP about two hours to analyze a whole human genome dataset whereas the CPU-based, single-threaded SOAPsnp took three days for the same task on the same machine.

Distributed and Parallel Databases | 2012

High-performance short sequence alignment with GPU acceleration

Mian Lu; Yuwei Tan; Ge Bai; Qiong Luo

Sequence alignment is a fundamental task for computational genomics research. We develop G-Aligner, which adopts the GPU as a hardware accelerator to speed up the sequence alignment process. A leading CPU-based alignment tool is based on the Bi-BWT index; however, a direct implementation of this algorithm on the GPU cannot fully utilize the hardware power due to its irregular algorithmic structure. To better utilize the GPU hardware resource, we propose a filtering-verification algorithm employing both the Bi-BWT search and direct matching. We further improve this algorithm on the GPU through various optimizations, e.g., the split of a large kernel, the warp based implementation to avoid user-level synchronization. As a result, G-Aligner outperforms another state-of-the-art GPU-accelerated alignment tools SOAP3 by 1.8–3.5 times for in-memory sequence alignment.

asia-pacific web conference | 2013

Accelerating Topic Model Training on a Single Machine

Mian Lu; Ge Bai; Qiong Luo; Jie Tang; Jiuxin Zhao

We present the design and implementation of GLDA, a library that utilizes the GPU (Graphics Processing Unit) to perform Gibbs sampling of Latent Dirichlet Allocation (LDA) on a single machine. LDA is an effective topic model used in many applications, e.g., classification, feature selection, and information retrieval. However, training an LDA model on large data sets takes hours, even days, due to the heavy computation and intensive memory access. Therefore, we explore the use of the GPU to accelerate LDA training on a single machine. Specifically, we propose three memory-efficient techniques to handle large data sets on the GPU: (1) generating document-topic counts as needed instead of storing all of them, (2) adopting a compact storage scheme for sparse matrices, and (3) partitioning word tokens. Through these techniques, the LDA training which would take 10 GB memory originally, can be performed on a commodity GPU card with only 1 GB GPU memory. Furthermore, our GLDA achieves a speedup of 15X over the original CPU-based LDA for large data sets.

statistical and scientific database management | 2012

Integrating GPU-accelerated sequence alignment and SNP detection for genome resequencing analysis

Mian Lu; Yuwei Tan; Jiuxin Zhao; Ge Bai; Qiong Luo

DNA sequence alignment and single-nucleotide polymorphism (SNP) detection are two important tasks in genomics research. A common genome resequencing analysis workflow is to first perform sequence alignment and then detect SNPs among the aligned sequences. In practice, the performance bottleneck in this workflow is usually the intermediate result I/O due to the separation of the two components, especially when the in-memory computation has been accelerated, e.g., by graphics processors. To address this bottleneck, we propose to integrate the two tasks tightly so as to eliminate the I/O of intermediate results in the workflow. Specifically, we make the following three changes for the tight integration: (1) we adopt a partition-based approach so that the external sorting of alignment results, which was required for SNP detection, is eliminated; (2) we perform customized compression on alignment results to reduce memory footprint; and (3) we move the computation of a global matrix from SNP detection to sequence alignment to save a file scan. We have developed a GPU-accelerated system that tightly integrates sequence alignment and SNP detection. Our results with human genome data sets show that our GPU-acceleration of individual components in the traditional workflow improves the overall performance by 18 times and that the tight integration further improves the performance of the GPU-accelerated system by 2.3 times.

Proceedings of the 1st International Workshop on Big Data, Streams and Heterogeneous Source Mining | 2012

Accelerating minor allele frequency computation with graphics processors

Mian Lu; Jiuxin Zhao; Qiong Luo; Bingqiang Wang

The computation of minor allele frequency (MAF) is at the core of a Genome-Wide Association Study (GWAS). Due to the high computation intensity and high precision requirement, so far the scale of MAF computation analysis is up to hundreds of individuals. To enable the computation for thousands of individuals, we have developed GAMA, a high performance MAF computation program with GPU acceleration. Specifically, we design a parallel reduction algorithm that matches the GPUs data-parallel architecture. To implement the new algorithm efficiently on the GPU, we utilize the fast, on-chip local memory shared within each GPU multiprocessor effectively. To avoid user-level thread synchronization, we exploit the GPU thread-warp based scheduling. Furthermore, we address the floating point underflow issue through a logarithm transformation. As a result, GAMA enables MAF computation for up to a thousand individuals for the first time. On a server equipped with an NVIDIA Tesla C2070 GPU and two Intel Xeon E5520 2.27 GHz CPUs, GAMA outperforms a state-of-the-art single-threaded MAF computation tool and our optimized parallel implementation (16-threaded) on the CPU by around 47 and 3.5 times, respectively.

Big Data Management, Technologies, and Applications | 2014