Chunbin Lin

University of California, San Diego

Publication

Featured research published by Chunbin Lin.

international conference on management of data | 2017

An Experimental Study of Bitmap Compression vs. Inverted List Compression

Jianguo Wang; Chunbin Lin; Yannis Papakonstantinou; Steven Swanson

Bitmap compression has been studied extensively in the database area and many efficient compression schemes were proposed, e.g., BBC, WAH, EWAH, and Roaring. Inverted list compression is also a well-studied topic in the information retrieval community and many inverted list compression algorithms were developed as well, e.g., VB, PforDelta, GroupVB, Simple8b, and SIMDPforDelta. We observe that they essentially solve the same problem, i.e., how to store a collection of sorted integers with as few bits as possible and support query processing as fast as possible. Due to historical reasons, bitmap compression and inverted list compression were developed as two separate lines of research in the database area and information retrieval area. Thus, a natural question is: Which one is better between bitmap compression and inverted list compression? To answer the question, we present the first comprehensive experimental study to compare a series of 9 bitmap compression methods and 12 inverted list compression methods. We compare these 21 algorithms on synthetic datasets with different distributions (uniform, zipf, and markov) as well as 8 real-life datasets in terms of the space overhead, decompression time, intersection time, and union time. Based on the results, we provide many lessons and guidelines that practitioners can use to decide which technique to adopt in future systems and that researchers can use to develop new algorithms.
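To make the shared problem concrete, here is a minimal sketch of the two representations of one sorted integer list: a variable-byte (VB) encoding of the gaps between consecutive values, and an uncompressed bitmap. The function names and the byte layout (low 7 bits first, high bit meaning "more bytes follow") are assumptions chosen for illustration; real VB variants differ in these details, and this is not the code evaluated in the paper.

```python
def vb_encode(sorted_ids):
    """Encode a sorted list of non-negative ints as variable-byte gaps."""
    out = bytearray()
    prev = 0
    for x in sorted_ids:
        gap = x - prev
        prev = x
        # 7 payload bits per byte, low bits first; high bit set = more bytes follow.
        while gap >= 128:
            out.append((gap & 0x7F) | 0x80)
            gap >>= 7
        out.append(gap)
    return bytes(out)


def vb_decode(data):
    """Inverse of vb_encode: rebuild the sorted list from the gap bytes."""
    ids, prev, gap, shift = [], 0, 0, 0
    for b in data:
        gap |= (b & 0x7F) << shift
        if b & 0x80:
            shift += 7
        else:
            prev += gap
            ids.append(prev)
            gap, shift = 0, 0
    return ids


def to_bitmap(sorted_ids):
    """Uncompressed bitmap held in a Python int: bit i is set iff i is in the list."""
    bitmap = 0
    for x in sorted_ids:
        bitmap |= 1 << x
    return bitmap


ids = [3, 130, 131, 20000]
encoded = vb_encode(ids)
```

With the bitmap form, intersection and union reduce to `&` and `|` on the two integers, while the list form has to decode (or skip through) the gaps first. That difference in access pattern is the kind of trade-off the study measures.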

very large data bases | 2016

HippogriffDB: balancing I/O and GPU bandwidth in big data analytics

Jing Li; Hung-Wei Tseng; Chunbin Lin; Yannis Papakonstantinou; Steven Swanson

As data sets grow and conventional processor performance scaling slows, data analytics move towards heterogeneous architectures that incorporate hardware accelerators (notably GPUs) to continue scaling performance. However, existing GPU-based databases fail to deal with big data applications efficiently: their execution model suffers from scalability limitations on GPUs whose memory capacity is limited, and existing systems fail to consider the discrepancy between fast GPUs and slow storage, which can counteract the benefit of GPU accelerators. In this paper, we propose HippogriffDB, an efficient, scalable GPU-accelerated OLAP system. It tackles the bandwidth discrepancy using compression and an optimized data transfer path. HippogriffDB stores tables in a compressed format and uses the GPU for decompression, trading GPU cycles for improved I/O bandwidth. To improve the data transfer efficiency, HippogriffDB introduces a peer-to-peer, multi-threaded data transfer mechanism, directly transferring data from the SSD to the GPU. HippogriffDB adopts a query-over-block execution model that provides scalability using a stream-based approach. The model improves kernel efficiency with operator fusion and a double-buffering mechanism. We have implemented HippogriffDB using an NVMe SSD, which talks directly to a commercial GPU. Results on two popular benchmarks demonstrate its scalability and efficiency. HippogriffDB outperforms existing GPU-based databases (YDB) and in-memory data analytics (MonetDB) by 1-2 orders of magnitude.

very large data bases | 2017

MILC: inverted list compression in memory

Jianguo Wang; Chunbin Lin; Ruining He; Moojin Chae; Yannis Papakonstantinou; Steven Swanson

Inverted list compression is a topic that has been studied for 50 years due to its fundamental importance in numerous applications including information retrieval, databases, and graph analytics. Typically, an inverted list compression algorithm is evaluated on its space overhead and query processing time. Earlier list compression designs mainly focused on minimizing the space overhead to reduce expensive disk I/O time in disk-oriented systems. But the recent trend has shifted towards reducing query processing time because the underlying systems tend to be memory-resident. Although there are many highly optimized compression approaches in main memory, there is still a considerable performance gap between query processing over compressed lists and uncompressed lists, which motivates this work. In this work, we set out to bridge this performance gap for the first time by proposing a new compression scheme, namely, MILC (memory inverted list compression). MILC relies on a series of techniques including offset-oriented fixed-bit encoding, dynamic partitioning, in-block compression, cache-aware optimization, and SIMD acceleration. We conduct experiments on three real-world datasets in information retrieval, databases, and graph analytics to demonstrate the high performance and low space overhead of MILC. We compare MILC with 12 recent compression algorithms and experimentally show that MILC improves the query performance by up to 13.2× and reduces the space overhead by up to 4.7×.
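The abstract names offset-oriented fixed-bit encoding without describing it, so the following is only one plausible reading, sketched under stated assumptions: the list is cut into fixed-size blocks, each block keeps its first value as a base, and every element is stored as an offset from that base using the same number of bits. The block size, the tuple layout, and the function names are hypothetical, and the sketch omits the dynamic partitioning, cache-aware, and SIMD parts entirely.

```python
from bisect import bisect_right

BLOCK = 4  # hypothetical fixed block size, chosen only for the example


def encode(sorted_ids, block=BLOCK):
    """Split into blocks; store (base, bit width, count, packed offsets) per block."""
    blocks = []
    for i in range(0, len(sorted_ids), block):
        chunk = sorted_ids[i:i + block]
        base = chunk[0]
        # Every offset in the block uses the width needed by the largest one.
        width = max(1, (chunk[-1] - base).bit_length())
        packed = 0
        for j, x in enumerate(chunk):
            packed |= (x - base) << (j * width)
        blocks.append((base, width, len(chunk), packed))
    return blocks


def get(block, j):
    """Read the j-th value of a block directly, without decoding its neighbours."""
    base, width, _count, packed = block
    return base + ((packed >> (j * width)) & ((1 << width) - 1))


def contains(blocks, x):
    """Membership test: binary search over block bases, then inside one block."""
    bases = [b[0] for b in blocks]  # recomputed per call to keep the sketch short
    k = bisect_right(bases, x) - 1
    if k < 0:
        return False
    blk = blocks[k]
    lo, hi = 0, blk[2] - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        v = get(blk, mid)
        if v == x:
            return True
        if v < x:
            lo = mid + 1
        else:
            hi = mid - 1
    return False


ids = [5, 9, 12, 40, 1000, 1003, 1010]
blocks = encode(ids)
```

Because offsets are relative to the block base rather than to the previous element, any position can be read in constant time, which is what allows the binary search inside a block without a sequential decode.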

very large data bases | 2016

Fast in-memory SQL analytics on typed graphs

Chunbin Lin; Benjamin Mandel; Yannis Papakonstantinou; Matthias Springer

We study a class of graph analytics SQL queries, which we call relationship queries. These queries, involving aggregation, join, semijoin, intersection and selection, are a wide superset of fixed-length graph reachability queries and of tree pattern queries. We present real-world OLAP scenarios, where efficient relationship queries are needed. However, row stores, column stores and graph databases are unacceptably slow in such OLAP scenarios. We propose a GQ-Fast database, which is an indexed database that roughly corresponds to efficient encoding of annotated adjacency lists that combines salient features of column-based organization, indexing and compression. GQ-Fast uses a bottom-up fully pipelined query execution model, which (a) enables aggressive compression (e.g., compressed bitmaps and Huffman) and (b) avoids intermediate results that consist of row IDs (which are typical in column databases). GQ-Fast compiles query plans into executable C++ source code. Besides achieving runtime efficiency, GQ-Fast also reduces main memory requirements because, unlike column databases, GQ-Fast selectively allows dense forms of compression including heavy-weight compressions, which do not support random access. We used GQ-Fast to accelerate queries for two OLAP dashboards in the biomedical field. GQ-Fast outperforms PostgreSQL by 2-4 orders of magnitude and MonetDB, Vertica and Neo4j by 1-3 orders of magnitude when all of them are running on RAM. Our experiments dissect GQ-Fast's advantage between (i) the use of compiled code, (ii) the bottom-up pipelining execution strategy, and (iii) the use of dense structures. Other analysis and experiments show the space savings of GQ-Fast due to the appropriate use of compression methods. We also show that the runtime penalty incurred by the dense compression methods decreases as the number of CPU cores increases.

international conference on data engineering | 2017

Fast and Scalable Distributed Set Similarity Joins for Big Data Analytics

Chuitian Rong; Chunbin Lin; Yasin N. Silva; Jianguo Wang; Wei Lu; Xiaoyong Du

Set similarity join is an essential operation in big data analytics, e.g., data integration and data cleaning, that finds similar pairs from two collections of sets. To cope with the increasing scale of the data, distributed algorithms are called for to support large-scale set similarity joins. Multiple techniques have been proposed to perform similarity joins using MapReduce in recent years. These techniques, however, usually produce huge amounts of duplicates in order to perform parallel processing successfully as MapReduce is a shared-nothing framework. The large number of duplicates incurs both a large shuffle cost and unnecessary computation cost, which significantly decreases performance. Moreover, these approaches do not provide a load balancing guarantee, which results in a skewness problem and negatively affects the scalability properties of these techniques. To address these problems, in this paper, we propose a duplicate-free framework, called FS-Join, to perform set similarity joins efficiently by utilizing an innovative vertical partitioning technique. FS-Join employs three powerful filtering methods to prune dissimilar string pairs without computing their similarity scores. To further improve the performance and scalability, FS-Join integrates horizontal partitioning. Experimental results on three real datasets show that FS-Join outperforms the state-of-the-art methods by one order of magnitude on average, which demonstrates the good scalability and performance qualities of the proposed technique.
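As a point of reference for what "pruning without computing similarity scores" can look like, here is a single-machine sketch of a set similarity join under Jaccard similarity with the standard length filter. It is a swapped-in textbook technique, not FS-Join: the abstract does not say which three filters FS-Join uses, and nothing here models its vertical or horizontal partitioning. The function names and the choice of Jaccard are assumptions.

```python
def jaccard(a, b):
    """Jaccard similarity of two non-empty sets."""
    return len(a & b) / len(a | b)


def similarity_join(R, S, t):
    """Return index pairs (i, j) with jaccard(R[i], S[j]) >= t (nested-loop sketch)."""
    out = []
    for i, r in enumerate(R):
        for j, s in enumerate(S):
            if not r or not s:
                continue  # skip empty sets to avoid division by zero
            # Length filter: Jaccard can never exceed min(|r|,|s|) / max(|r|,|s|),
            # so pairs whose sizes are too far apart are dropped without scoring.
            if min(len(r), len(s)) < t * max(len(r), len(s)):
                continue
            if jaccard(r, s) >= t:
                out.append((i, j))
    return out


R = [{"a", "b", "c"}, {"x", "y"}]
S = [{"a", "b", "c", "d"}, {"x"}, {"a", "b", "c"}]
pairs = similarity_join(R, S, 0.7)
```

In this example four of the six candidate pairs are discarded by the size comparison alone, so the set intersection and union are computed for only two pairs.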

international world wide web conferences | 2017

Location-sensitive Query Auto-completion

Chunbin Lin; Jianguo Wang; Jiaheng Lu

This paper studies the location-sensitive auto-completion problem. We propose an efficient algorithm, SQA, that runs on a native index combining an IR-tree and a trie index. The experiments on real-life datasets demonstrate that SQA outperforms baseline methods by one order of magnitude.
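The problem statement can be illustrated with a small sketch: complete a typed prefix, then prefer completions whose location is close to the user. The sketch below uses a plain trie and ranks candidates by Euclidean distance computed by brute force; it deliberately replaces the IR-tree with a linear scan and is not the SQA algorithm. All names, the coordinate model, and the ranking rule are assumptions for illustration.

```python
import math


class TrieNode:
    def __init__(self):
        self.children = {}
        self.items = []  # (text, x, y) entries whose text ends at this node


def insert(root, text, x, y):
    """Add one located string to the trie."""
    node = root
    for ch in text:
        node = node.children.setdefault(ch, TrieNode())
    node.items.append((text, x, y))


def complete(root, prefix, qx, qy, k=3):
    """Return up to k completions of prefix, nearest to (qx, qy) first."""
    node = root
    for ch in prefix:
        node = node.children.get(ch)
        if node is None:
            return []
    # Collect every entry in the subtree below the prefix node.
    found, stack = [], [node]
    while stack:
        n = stack.pop()
        found.extend(n.items)
        stack.extend(n.children.values())
    # Rank by distance to the query location; ties broken alphabetically.
    found.sort(key=lambda it: (math.hypot(it[1] - qx, it[2] - qy), it[0]))
    return [it[0] for it in found[:k]]


root = TrieNode()
insert(root, "starbucks", 0, 0)
insert(root, "star cafe", 5, 5)
insert(root, "stadium", 1, 1)
insert(root, "subway", 0, 1)
```

For example, `complete(root, "sta", 0, 0, 2)` considers the three entries under the prefix and returns the two nearest to the origin. A spatial index would avoid visiting the whole subtree, which is where the brute-force scan here becomes the bottleneck.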

international world wide web conferences | 2017

SpiderX: Fast XML Exploration System

Chunbin Lin; Jianguo Wang

Keyword search in XML has gained popularity as it enables users to easily access XML data without the need of learning query languages and studying complex data schemas. In XML keyword search, query semantics is based on the concept of Lowest Common Ancestor (LCA), e.g., SLCA and ELCA. However, LCA-based search methods depend heavily on hierarchical structures of XML data, which may result in meaningless answers. To obtain desired answers, a successful system should be able to (i) match a semantic entity for each keyword, (ii) discover the relationships of the matched entities, (iii) support efficient query processing, (iv) release users from having the knowledge of the XML content, and (v) visualize the search results. None of the existing XML keyword search systems completely meets the above requirements. In this paper, we design a system called SpiderX to completely solve the above challenges. We propose a query semantics Entity-Relationship Graph (ERG), which adopts the RDF subject-predicate-object semantics to capture the information of search entities along with associated attributes and the relationships between entities. SpiderX proposes a novel index structure, which has small space cost by combining the optimizations of column databases and the data compression schemes. In addition, SpiderX processes queries in a bottom-up way to achieve high performance, which is about 100X faster than the state-of-the-art algorithms. To demonstrate the high performance of SpiderX, we implement an online demo for SpiderX, which operates on three real-life datasets. The demo also provides (1) query auto-completion to guide users to formulate queries; and (2) visualization panel to display the query answers, which interacts with users by providing zoom-in and zoom-out exploration features. Demo link: http://chunbinlin.com/spiderx.
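For readers unfamiliar with the LCA semantics that the abstract contrasts SpiderX with, a short sketch may help. It assumes XML nodes are identified by Dewey labels (the path of child positions from the root), a common convention in XML keyword search that the abstract itself does not mention; under that assumption, the lowest common ancestor of a set of nodes is simply their longest common label prefix. The function name is hypothetical.

```python
def lca(labels):
    """Lowest common ancestor of nodes given as Dewey labels (tuples of ints)."""
    prefix = []
    # zip stops at the shortest label, so the prefix never overruns any node.
    for parts in zip(*labels):
        if len(set(parts)) != 1:
            break
        prefix.append(parts[0])
    return tuple(prefix)


# Two keyword matches deep in the same subtree share the ancestor (1, 2).
answer = lca([(1, 2, 3), (1, 2, 5, 1)])
```

The weakness noted in the abstract shows up directly: when the matches sit in unrelated branches, the common prefix shrinks toward the root, and the returned ancestor covers most of the document without relating the matched entities.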

international conference on data engineering | 2017

GQFast: Fast Graph Exploration with Context-Aware Autocompletion

Chunbin Lin; Jianguo Wang; Yannis Papakonstantinou

There is an increasing demand to explore similar entities in big graphs. For example, in domains like biomedical science, identifying similar entities may contribute to developing new drugs or discovering new diseases. In this paper, we demonstrate a graph exploration system, called GQFast, which provides a graphical interface to help users efficiently explore similar entities. Methodologically, GQFast first builds efficient indices combining column database optimizations and compression techniques, then it explores similar entities by using the indices. GQFast operates on the real-world Pubmed dataset consisting of over 23 million biomedical entities and 1.3 billion relationships. Relying on GQFast's high performance, GQFast provides (i) type-ahead-search to instantly visualize search results while a user is typing a query, and (ii) context-aware query completion to guide users typing queries.

international joint conference on artificial intelligence | 2016