Jiefeng Cheng

The Chinese University of Hong Kong

Publication

Featured research published by Jiefeng Cheng.

international conference on data engineering | 2008

Fast Graph Pattern Matching

Jiefeng Cheng; Jeffrey Xu Yu; Bolin Ding; Philip S. Yu; Haixun Wang

Due to the rapid growth of Internet technology and new scientific/technological advances, the number of applications that model data as graphs is increasing, because graphs have high expressive power to model complicated structures. The dominance of graphs in real-world applications calls for new graph data management so that users can access graph data effectively and efficiently. In this paper, we study a graph pattern matching problem over a large data graph. The problem is to find all patterns in a large data graph that match a user-given graph pattern. We propose a new two-step R-join (reachability join) algorithm with a filter step and a fetch step based on a cluster-based join index with graph codes. We consider the filter step as an R-semijoin, and propose a new optimization approach by interleaving R-joins with R-semijoins. We conducted extensive performance studies, which confirm the efficiency of our proposed approaches.

knowledge discovery and data mining | 2010

Mining uncertain data with probabilistic guarantees

Liwen Sun; Reynold Cheng; David W. Cheung; Jiefeng Cheng

Data uncertainty is inherent in applications such as sensor monitoring systems, location-based services, and biological databases. To manage this vast amount of imprecise information, probabilistic databases have been recently developed. In this paper, we study the discovery of frequent patterns and association rules from probabilistic data under the Possible World Semantics. This is technically challenging, since a probabilistic database can have an exponential number of possible worlds. We propose two efficient algorithms, which discover frequent patterns in bottom-up and top-down manners. Both algorithms can be easily extended to discover maximal frequent patterns. We also explain how to use these patterns to generate association rules. Extensive experiments, using real and synthetic datasets, were conducted to validate the performance of our methods.
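To make the possible-world challenge concrete, here is a minimal Python sketch under a simplified model that is an assumption of this example, not something stated in the abstract: each transaction contains a given itemset independently with a known probability. The function names are hypothetical. The dynamic program is the standard Poisson-binomial recurrence, not the bottom-up or top-down algorithms of the paper; the brute-force version enumerates every possible world and is included only to show what the recurrence avoids.

```python
from itertools import product


def frequentness_probability(probs, minsup):
    """Probability that at least `minsup` transactions contain the itemset.

    probs[i] is the (assumed independent) probability that transaction i
    contains the itemset. Runs in O(n^2) time instead of O(2^n).
    """
    # pmf[k] = probability that exactly k of the transactions seen so far
    # contain the itemset.
    pmf = [1.0]
    for p in probs:
        nxt = [0.0] * (len(pmf) + 1)
        for k, mass in enumerate(pmf):
            nxt[k] += mass * (1.0 - p)   # itemset absent from this transaction
            nxt[k + 1] += mass * p       # itemset present in this transaction
        pmf = nxt
    return sum(pmf[minsup:])


def frequentness_by_enumeration(probs, minsup):
    """Same quantity, computed by summing over all 2^n possible worlds."""
    total = 0.0
    for world in product([0, 1], repeat=len(probs)):
        weight = 1.0
        for present, p in zip(world, probs):
            weight *= p if present else 1.0 - p
        if sum(world) >= minsup:
            total += weight
    return total


if __name__ == "__main__":
    probs = [0.9, 0.5, 0.2]
    print(frequentness_probability(probs, 2))
    print(frequentness_by_enumeration(probs, 2))
```

Both functions return the same value on small inputs; only the first remains practical as the number of transactions grows.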

extending database technology | 2008

Fast computing reachability labelings for large graphs with high compression rate

Jiefeng Cheng; Jeffrey Xu Yu; Xuemin Lin; Haixun Wang; Philip S. Yu

There are numerous applications that need to deal with a large graph and query reachability between nodes in the graph. A 2-hop cover can compactly represent the whole edge transitive closure of a graph in $O(|V| \cdot |E|^{1/2})$ space, and can be used to answer reachability queries efficiently. However, it is challenging to compute a 2-hop cover. The existing approaches suffer from either large resource consumption or a low compression rate. In this paper, we propose a hierarchical partitioning approach to partition a large graph G into two subgraphs repeatedly in a top-down fashion. The unique feature of our approach is that we compute the 2-hop cover while partitioning. In brief, in every iteration of top-down partitioning, we provide techniques to compute the 2-hop cover for connections between the two subgraphs first. A cover is computed to cut the graph into two subgraphs, which results in an overall cover with high compression for the entire graph G. Two approaches are proposed, namely a node-oriented approach and an edge-oriented approach. Our approach can efficiently compute a 2-hop cover for a large graph with a high compression rate. Our extensive experimental studies show that the 2-hop cover for a graph with 1,700,000 nodes and 169 billion connections can be obtained in less than 30 minutes with a compression rate of about 40,000 using a PC.
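As a rough illustration of how a 2-hop cover answers a reachability query, the Python sketch below uses a toy graph and hand-built labels; both are assumptions made for this example. Each node u carries a set L_out(u) of centers it can reach and a set L_in(u) of centers that can reach it, and u reaches v exactly when the two sets share a center. The sketch shows only the query side. It does not implement the partitioning-based construction described in the abstract.

```python
from collections import deque

# Toy directed graph (assumed for illustration): a->c, b->c, c->d, c->e.
EDGES = {"a": ["c"], "b": ["c"], "c": ["d", "e"], "d": [], "e": []}

# Hand-built 2-hop labels. Every node lists itself, and "c" serves as the
# shared center for all paths that pass through it.
L_OUT = {"a": {"a", "c"}, "b": {"b", "c"}, "c": {"c"}, "d": {"d"}, "e": {"e"}}
L_IN = {"a": {"a"}, "b": {"b"}, "c": {"c"}, "d": {"c", "d"}, "e": {"c", "e"}}


def reaches(u, v):
    """2-hop query: u reaches v iff L_out(u) and L_in(v) share a center."""
    return not L_OUT[u].isdisjoint(L_IN[v])


def reaches_bfs(u, v):
    """Reference answer by breadth-first search, used to check the labels."""
    seen = {u}
    queue = deque([u])
    while queue:
        x = queue.popleft()
        if x == v:
            return True
        for y in EDGES[x]:
            if y not in seen:
                seen.add(y)
                queue.append(y)
    return False


if __name__ == "__main__":
    for u in EDGES:
        for v in EDGES:
            assert reaches(u, v) == reaches_bfs(u, v)
    print(reaches("a", "d"), reaches("d", "e"))
```

The query touches only the two label sets, never the graph itself, which is why the size of the labeling matters so much.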

extending database technology | 2006

Fast computation of reachability labeling for large graphs

Jiefeng Cheng; Jeffrey Xu Yu; Xuemin Lin; Haixun Wang; Philip S. Yu

The need to process graph reachability queries stems from many applications that manage complex data as graphs. The applications include transportation networks, Internet traffic analysis, Web navigation, the semantic web, chemical informatics and bioinformatics systems, and computer vision. A graph reachability query, as one of the primary tasks, is to find whether two given data objects, u and v, are related in any way in a large and complex dataset. Formally, the query asks whether v is reachable from u in a large directed graph. In this paper, we focus on building a reachability labeling for a large directed graph, in order to process reachability queries efficiently. Such a labeling needs to be small so that queries can be answered efficiently, and needs to be fast to compute so that it can be constructed efficiently. As such a labeling, the 2-hop cover was proposed for arbitrary graphs with theoretical bounds on both the construction cost and the size of the resulting labeling. However, in practice, as reported, the construction cost of a 2-hop cover is very high even on very powerful machines. In this paper, we propose a novel geometry-based algorithm that computes a high-quality 2-hop cover fast. Our experimental results verify the effectiveness of our techniques over large real and synthetic graph datasets.

extending database technology | 2009

On-line exact shortest distance query processing

Jiefeng Cheng; Jeffrey Xu Yu

Shortest-path query processing has long served as an established routine for numerous applications, and is increasingly popular for supporting novel graph applications in very large databases. For a large graph, there is a new scenario in which arbitrary nodes are queried intensively, asking for node distances or even shortest paths to be returned quickly. Traditional main-memory algorithms and shortest-path materialization become inadequate. We are interested in graph labelings that encode the underlying graphs and assign labels to nodes to support efficient query processing. Surprisingly, the existing work in this category mainly emphasizes reachability query processing, while insufficient effort has been given to distance labelings that support querying exact shortest distances between nodes. Distance labelings must be developed on the graph as a whole to correctly retain node distance information. This makes many existing methods inapplicable. We focus on fast computation of distance-aware 2-hop covers, which can encode the all-pairs shortest paths of a graph in $O(|V| \cdot |E|^{1/2})$ space. Our approach exploits the collapsing of strongly connected components and graph partitioning to gain speed, while overcoming the challenges of correctly retaining node distance information and appropriately encoding all-pairs shortest paths with small overhead. Furthermore, our approach avoids pre-computing all-pairs shortest paths, which can be prohibitive over large graphs. We conducted extensive performance studies, which confirm the efficiency of our proposed approaches.
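The following Python sketch illustrates how a distance-aware 2-hop labeling can answer an exact shortest-distance query. The toy graph, unit edge weights, and hand-built labels are assumptions of this example, and the construction techniques from the abstract (component collapsing, partitioning) are not shown. Each label maps a center w to a distance, and the answer for (u, v) is the minimum of d(u, w) + d(w, v) over centers w present in both labels.

```python
from collections import deque

# Toy directed graph with unit edge weights (assumed): a->c, b->c, c->d, c->e.
EDGES = {"a": ["c"], "b": ["c"], "c": ["d", "e"], "d": [], "e": []}

# Hand-built distance-aware labels: center -> distance.
# L_OUT[u][w] is the distance from u to w; L_IN[v][w] is the distance from w to v.
L_OUT = {"a": {"a": 0, "c": 1}, "b": {"b": 0, "c": 1},
         "c": {"c": 0}, "d": {"d": 0}, "e": {"e": 0}}
L_IN = {"a": {"a": 0}, "b": {"b": 0}, "c": {"c": 0},
        "d": {"c": 1, "d": 0}, "e": {"c": 1, "e": 0}}


def distance(u, v):
    """Shortest distance from u to v via shared centers, or None if unreachable."""
    common = L_OUT[u].keys() & L_IN[v].keys()
    if not common:
        return None
    return min(L_OUT[u][w] + L_IN[v][w] for w in common)


def distance_bfs(u, v):
    """Reference answer by breadth-first search on the unit-weight graph."""
    dist = {u: 0}
    queue = deque([u])
    while queue:
        x = queue.popleft()
        if x == v:
            return dist[x]
        for y in EDGES[x]:
            if y not in dist:
                dist[y] = dist[x] + 1
                queue.append(y)
    return None


if __name__ == "__main__":
    print(distance("a", "d"), distance("c", "e"), distance("d", "a"))
```

For the labels to be correct, every reachable pair must share at least one center that lies on a shortest path between them; reachability alone is not enough, which is the point the abstract makes about retaining distance information.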

Managing and Mining Graph Data | 2010

Graph Reachability Queries: A Survey

Jeffrey Xu Yu; Jiefeng Cheng

There are numerous applications that need to deal with a large graph, including bioinformatics, social science, link analysis, citation analysis, and collaborative networks. A fundamental query asks whether a node is reachable from another node in a large graph; this is called a reachability query. In this survey, we discuss several existing approaches to processing reachability queries. In addition, we discuss how to answer reachability queries with the shortest distance, and graph pattern matching over a large graph.

automated software engineering | 2010

Matching dependence-related queries in the system dependence graph

Xiaoyin Wang; David Lo; Jiefeng Cheng; Lu Zhang; Hong Mei; Jeffrey Xu Yu

In software maintenance and evolution, developers commonly want to apply a change to a number of similar places. Due to the size and complexity of the code base, it is challenging for developers to locate all the places that need the change. A main challenge is that these places share certain common dependence conditions, but existing code searching techniques can hardly handle dependence relations satisfactorily. In this paper, we propose a technique that enables developers to make queries involving dependence conditions and textual conditions on the system dependence graph of the program. We carried out an empirical evaluation on four searching tasks taken from the development history of two real-world projects. The results of our evaluation indicate that, compared with code-clone detection, our technique is able to locate many required code elements that code-clone detection cannot locate, and compared with text search, our technique is able to effectively reduce false positives without losing any required code elements.

international conference on data engineering | 2015

VENUS: Vertex-centric streamlined graph computation on a single PC

Jiefeng Cheng; Qin Liu; Zhenguo Li; Wei Fan; John C. S. Lui; Cheng He

Recent studies show that disk-based graph computation on just a single PC can be highly competitive with cluster-based computing systems on large-scale problems. Inspired by this remarkable progress, we develop VENUS, a disk-based graph computation system which is able to handle billion-scale problems efficiently on a commodity PC. VENUS adopts a novel computing architecture that features vertex-centric "streamlined" processing: the graph is sequentially loaded and the update functions are executed in parallel on the fly. VENUS deliberately avoids loading batch edge data by separating read-only structure data from mutable vertex data on disk. Furthermore, it minimizes random I/Os by caching vertex data in main memory. The streamlined processing is realized with an efficient sequential scan over massive structure data and fast feeding of a large number of update functions. Extensive evaluation on large real-world and synthetic graphs has demonstrated the efficiency of VENUS. For example, VENUS takes just 8 minutes with a hard disk for PageRank on the Twitter graph with 1.5 billion edges. In contrast, Spark takes 8.1 minutes with 50 machines and 100 CPUs, and GraphChi takes 13 minutes using a fast SSD.
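The vertex-centric idea described above can be sketched in a few lines of Python. This is only an in-memory toy under stated assumptions, not the VENUS implementation: the read-only structure data is a list of (vertex, in-neighbors) records scanned in order, the mutable vertex values live in a separate array, and a per-vertex update function is invoked as each record streams past. The names `update`, `pagerank_streamed`, `STRUCTURE`, and `OUT_DEGREE` are hypothetical, and the updates run sequentially here rather than in parallel.

```python
def update(in_neighbors, values, out_degree, damping, n):
    """Vertex-centric PageRank update: new value of one vertex from its in-neighbors."""
    incoming = sum(values[u] / out_degree[u] for u in in_neighbors)
    return (1.0 - damping) / n + damping * incoming


def pagerank_streamed(structure, out_degree, damping=0.85, iterations=30):
    """Run PageRank by repeatedly scanning read-only structure records in order.

    structure: list of (vertex, in_neighbors) records, one per vertex.
    out_degree: out_degree[u] is the number of outgoing edges of u (all > 0 here).
    """
    n = len(out_degree)
    values = [1.0 / n] * n              # mutable vertex data, kept apart from structure
    for _ in range(iterations):
        new_values = [0.0] * n
        for v, in_neighbors in structure:   # one sequential pass over the structure
            new_values[v] = update(in_neighbors, values, out_degree, damping, n)
        values = new_values
    return values


# Toy graph (assumed): 0->1, 0->2, 1->2, 2->0.
STRUCTURE = [(0, [2]), (1, [0]), (2, [0, 1])]
OUT_DEGREE = [2, 1, 1]

if __name__ == "__main__":
    ranks = pagerank_streamed(STRUCTURE, OUT_DEGREE)
    print([round(r, 4) for r in ranks])
```

Keeping the structure records immutable means each pass only reads them front to back, while all writes go to the small vertex array; that separation is the part of the design the sketch is meant to highlight.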

international conference on data engineering | 2013

Top-k graph pattern matching over large graphs

Jiefeng Cheng; Xianggang Zeng; Jeffrey Xu Yu

There exist many graph-based applications, including bioinformatics, social science, link analysis, citation analysis, and collaborative work, all of which need to deal with a large data graph. Given a large data graph, in this paper, we study finding top-k answers for a graph pattern query (kGPM), and in particular, we focus on top-k cyclic graph queries where a graph query is cyclic and can be complex. The capability of supporting kGPM provides much more flexibility for a user to search graphs, and the problem itself is challenging. In this paper, we propose a new framework for processing kGPM with on-the-fly ranked lists based on spanning trees of the cyclic graph query. We observe a multidimensional representation for using multiple ranked lists to answer a given kGPM query. Under this representation, we propose a cost model to estimate the least number of tree answers to be consumed in each ranked list for a given kGPM query. This leads to a query optimization approach for kGPM processing, and a top-k algorithm to process kGPM with the optimal query plan. We conducted extensive performance studies using a synthetic dataset and a real dataset, which confirm the efficiency of our proposed approach.

IEEE Transactions on Knowledge and Data Engineering | 2011

Graph Pattern Matching: A Join/Semijoin Approach

Jiefeng Cheng; Jeffrey Xu Yu; Philip S. Yu

Due to rapid growth of the Internet and new scientific/technological advances, there exist many new applications that model data as graphs, because graphs have sufficient expressiveness to model complicated structures. The dominance of graphs in real-world applications demands new graph processing techniques to access large data graphs effectively and efficiently. In this paper, we study a graph pattern matching problem, which is to find all patterns in a large data graph that match a user-given graph pattern. We propose new two-step R-join (reachability join) algorithms with a filter step (R-semijoin) and a fetch step (R-join) by utilizing a new cluster-based join index with graph codes in a relational database context. We also propose two optimization approaches to further optimize sequences of R-joins/R-semijoins. The first approach is based on R-join order selection followed by R-semijoin enhancement, and the second approach is to interleave R-joins with R-semijoins. We conducted extensive performance studies, and confirm the efficiency of our proposed new approaches.
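A loose Python sketch of the filter-then-fetch idea follows. It is an illustration under stated assumptions, not the paper's index or algorithms: the toy graph, node labels, and 2-hop labels are invented for the example, the per-center clusters stand in for a cluster-based join index, and all function names are hypothetical. For a pattern edge X -> Y, the R-semijoin keeps the X-labeled nodes that reach some Y-labeled node, and the R-join produces the matching (x, y) pairs.

```python
# Toy graph (assumed): a->c, b->c, c->d, c->e, with one label per node.
NODE_LABEL = {"a": "A", "b": "A", "c": "C", "d": "D", "e": "D"}

# Hand-built 2-hop labels for that graph, with "c" as the shared center.
L_OUT = {"a": {"a", "c"}, "b": {"b", "c"}, "c": {"c"}, "d": {"d"}, "e": {"e"}}
L_IN = {"a": {"a"}, "b": {"b"}, "c": {"c"}, "d": {"c", "d"}, "e": {"c", "e"}}


def build_clusters(l_out, l_in):
    """Group nodes by center: f_cluster[w] reaches w, t_cluster[w] is reached from w."""
    f_cluster, t_cluster = {}, {}
    for u, centers in l_out.items():
        for w in centers:
            f_cluster.setdefault(w, set()).add(u)
    for v, centers in l_in.items():
        for w in centers:
            t_cluster.setdefault(w, set()).add(v)
    return f_cluster, t_cluster


def r_semijoin(x_label, y_label, f_cluster, t_cluster):
    """Filter step: X-labeled nodes that reach at least one Y-labeled node."""
    survivors = set()
    for w in f_cluster.keys() & t_cluster.keys():
        if any(NODE_LABEL[v] == y_label for v in t_cluster[w]):
            survivors |= {u for u in f_cluster[w] if NODE_LABEL[u] == x_label}
    return survivors


def r_join(x_label, y_label, f_cluster, t_cluster):
    """Fetch step: all (x, y) pairs where x is X-labeled, y is Y-labeled, x reaches y."""
    pairs = set()
    for w in f_cluster.keys() & t_cluster.keys():
        xs = [u for u in f_cluster[w] if NODE_LABEL[u] == x_label]
        ys = [v for v in t_cluster[w] if NODE_LABEL[v] == y_label]
        pairs |= {(x, y) for x in xs for y in ys}
    return pairs


if __name__ == "__main__":
    f, t = build_clusters(L_OUT, L_IN)
    print(sorted(r_semijoin("A", "D", f, t)))
    print(sorted(r_join("A", "D", f, t)))
```

The semijoin returns only one side of the pattern edge, so it can prune candidates cheaply before the pair-producing join runs, which is the intuition behind interleaving the two.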
