Lizhu Zhou
Tsinghua University
Publication
Featured research published by Lizhu Zhou.
very large data bases | 2009
Zhiping Zeng; Anthony K. H. Tung; Jianyong Wang; Jianhua Feng; Lizhu Zhou
Graph data have become ubiquitous, and manipulating them based on similarity is essential for many applications. Graph edit distance is one of the most widely accepted measures of similarity between graphs and has extensive applications in fields such as pattern recognition and computer vision. Unfortunately, computing graph edit distance is NP-hard in general. Accordingly, in this paper we introduce three novel methods to compute upper and lower bounds for the edit distance between two graphs in polynomial time. Applying these methods, two algorithms, AppFull and AppSub, are introduced to perform different kinds of graph search on graph databases. Comprehensive experimental studies are conducted on both real and synthetic datasets to examine various aspects of the methods for bounding graph edit distance. The results show that these methods achieve good scalability in terms of both the number of graphs and the size of graphs. The effectiveness of these algorithms also confirms the usefulness of our bounds in filtering and searching graphs.
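To illustrate how a cheap bound can filter candidates before any exact edit distance computation, here is a minimal sketch of a simple label-count lower bound. It is not one of the three methods proposed in the paper; it assumes vertex-labeled graphs, unlabeled edges, and unit cost for every edit operation, and the function name `label_lower_bound` is hypothetical.

```python
from collections import Counter

def label_lower_bound(labels1, edges1, labels2, edges2):
    """Lower bound on unit-cost graph edit distance (illustrative assumption,
    not the paper's bound). Vertex and edge operations are counted separately,
    so their minimum counts can be added."""
    # Vertices whose labels can be matched without any relabeling.
    shared = sum((Counter(labels1) & Counter(labels2)).values())
    # Every remaining vertex needs an insertion, deletion, or relabeling.
    vertex_ops = max(len(labels1), len(labels2)) - shared
    # At least this many edge insertions or deletions are required.
    edge_ops = abs(len(edges1) - len(edges2))
    return vertex_ops + edge_ops

g1_labels = ["A", "B", "C"]
g1_edges = [(0, 1), (1, 2)]
g2_labels = ["A", "B", "D"]
g2_edges = [(0, 1), (1, 2), (0, 2)]

bound = label_lower_bound(g1_labels, g1_edges, g2_labels, g2_edges)
print(bound)
```

In a similarity search with threshold tau, a database graph whose lower bound already exceeds tau can be discarded without computing the exact distance.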
knowledge discovery and data mining | 2006
Zhiping Zeng; Jianyong Wang; Lizhu Zhou; George Karypis
Frequent coherent subgraphs can provide valuable knowledge about the underlying internal structure of a graph database, and mining frequently occurring coherent subgraphs from large dense graph databases has witnessed several applications and received considerable attention in the graph mining community recently. In this paper, we study how to efficiently mine the complete set of coherent closed quasi-cliques from large dense graph databases, which is an especially challenging task because the downward-closure property no longer holds. By fully exploring some properties of quasi-cliques, we propose several novel optimization techniques, which can prune the unpromising and redundant sub-search spaces effectively. Meanwhile, we devise an efficient closure checking scheme to facilitate the discovery of only closed quasi-cliques. We also develop a coherent closed quasi-clique mining algorithm, Cocain. A thorough performance study shows that Cocain is very efficient and scalable for large dense graph databases.
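The loss of downward closure can be seen on a small example. The sketch below assumes the commonly used definition of a gamma-quasi-clique (every vertex in the set is adjacent to at least ceil(gamma * (|S| - 1)) other vertices of the set); the abstract itself does not spell out the definition, and the helper `is_quasi_clique` is hypothetical.

```python
import math

def is_quasi_clique(adj, vertices, gamma):
    """Check the assumed gamma-quasi-clique condition on a vertex set."""
    vs = set(vertices)
    need = math.ceil(gamma * (len(vs) - 1))
    # Each vertex must have at least `need` neighbors inside the set.
    return all(len(adj[v] & vs) >= need for v in vs)

# A 4-cycle a-b-c-d-a, stored as adjacency sets.
adj = {
    "a": {"b", "d"},
    "b": {"a", "c"},
    "c": {"b", "d"},
    "d": {"a", "c"},
}

whole = is_quasi_clique(adj, {"a", "b", "c", "d"}, 0.5)
part = is_quasi_clique(adj, {"a", "c"}, 0.5)
print(whole, part)
```

The full cycle satisfies the condition for gamma = 0.5, but its subset {a, c} does not, so a miner cannot prune a vertex set merely because it fails the test.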
knowledge discovery and data mining | 2009
Yuzhou Zhang; Jianyong Wang; Yi Wang; Lizhu Zhou
Graphs or networks can be used to model complex systems. Detecting community structures from large network data is a classic and challenging task. In this paper, we propose a novel community detection algorithm, which utilizes a dynamic process by contradicting the network topology and the topology-based propinquity, where the propinquity is a measure of the probability that a pair of nodes is involved in a coherent community structure. Through several rounds of mutual reinforcement between topology and propinquity, the community structures are expected to emerge naturally. The overlapping vertices shared between communities can also be easily identified by a simple additional postprocessing step. To achieve better efficiency, the propinquity is calculated incrementally. We implement the algorithm on a vertex-oriented bulk synchronous parallel (BSP) model so that the mining load can be distributed across thousands of machines. We obtained interesting experimental results on several real network datasets.
ACM Transactions on Database Systems | 2007
Zhiping Zeng; Jianyong Wang; Lizhu Zhou; George Karypis
Due to the ability of graphs to represent more generic and more complicated relationships among different objects, graph mining has played a significant role in data mining, attracting increasing attention in the data mining community. In addition, frequent coherent subgraphs can provide valuable knowledge about the underlying internal structure of a graph database, and mining frequently occurring coherent subgraphs from large dense graph databases has witnessed several applications and received considerable attention in the graph mining community recently. In this article, we study how to efficiently mine the complete set of coherent closed quasi-cliques from large dense graph databases, which is an especially challenging task due to the fact that the downward-closure property no longer holds. By fully exploring some properties of quasi-cliques, we propose several novel optimization techniques which can prune the unpromising and redundant subsearch spaces effectively. Meanwhile, we devise an efficient closure checking scheme to facilitate the discovery of closed quasi-cliques only. Since large databases cannot be held in main memory, we also design an out-of-core solution with efficient index structures for mining coherent closed quasi-cliques from large dense graph databases. We call this Cocain*. A thorough performance study shows that Cocain* is very efficient and scalable for large dense graph databases.
international conference on data engineering | 2006
Jianyong Wang; Zhiping Zeng; Lizhu Zhou
Most previously proposed frequent graph mining algorithms are intended to find the complete set of all frequent, closed subgraphs. However, in many cases only a subset of the frequent subgraphs with a certain topology is of special interest. Thus, the method of mining the complete set of all frequent subgraphs is not suitable for mining these frequent subgraphs of special interest as it wastes considerable computing power and space on uninteresting subgraphs. In this paper we develop a new algorithm, CLAN, to mine the frequent closed cliques, the most coherent structures in the graph setting. By exploring some properties of the clique pattern, we can simplify the canonical label design and the corresponding clique (or subclique) isomorphism testing. Several effective pruning methods are proposed to prune the search space, while the clique closure checking scheme is used to remove the non-closed clique patterns. Our empirical results show that CLAN is very efficient for large dense graph databases with which the traditional graph mining algorithms fail. The novelty of our method is further demonstrated by the application of CLAN in mining highly correlated stocks from large stock market data.
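The simplification of canonical labeling for cliques can be sketched as follows. This is an illustration under stated assumptions (vertex-labeled graphs with unlabeled edges), not CLAN's actual code: because every pair of vertices in a clique is adjacent, a clique's structure is fully determined by its multiset of vertex labels. The helper names are hypothetical.

```python
from collections import Counter

def canonical_form(labels):
    """Canonical form of a clique: its vertex labels in sorted order."""
    return tuple(sorted(labels))

def is_subclique(small, large):
    """A clique is a subclique of another iff its label multiset is contained."""
    # Counter subtraction keeps only positive counts, so an empty result
    # means every label of `small` is covered by `large`.
    return not (Counter(small) - Counter(large))

same = canonical_form(["c", "a", "b"]) == canonical_form(["b", "c", "a"])
contained = is_subclique(["a", "b"], ["a", "b", "c"])
not_contained = is_subclique(["a", "a"], ["a", "b", "c"])
print(same, contained, not_contained)
```

Under these assumptions, isomorphism testing between cliques becomes a comparison of sorted label sequences rather than a general graph isomorphism test.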
asia-pacific web conference | 2010
Ju Fan; Hao Wu; Guoliang Li; Lizhu Zhou
Query term suggestion that interactively expands queries is an indispensable technique for helping users formulate high-quality queries and has attracted much attention in the web search community. Existing methods usually suggest terms based on statistics in documents as well as query logs and external dictionaries, and they neglect the fact that topic information is crucial because it helps retrieve topically relevant documents. To give users gratification, we propose a novel term suggestion method: as the user types in queries letter by letter, we instantly suggest the terms that are topically coherent with the query and could retrieve relevant documents. To suggest highly relevant terms effectively, we propose a generative model that incorporates the topical coherence of terms. The model learns the topics from the underlying documents based on Latent Dirichlet Allocation (LDA). To achieve instant query suggestion, we use a trie structure to index and access terms. We devise an efficient top-k algorithm to suggest terms as users type in queries. Experimental results show that our approach not only improves the effectiveness of term suggestion, but also achieves better efficiency and scalability.
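The role of the trie in letter-by-letter suggestion can be sketched as below. This is a minimal illustration, not the paper's top-k algorithm: it scans the whole subtree under the typed prefix, and the scores are hypothetical placeholders standing in for whatever relevance the topic model would assign. The class names `TrieNode` and `TermTrie` are assumptions.

```python
import heapq

class TrieNode:
    def __init__(self):
        self.children = {}
        self.score = None  # set only on nodes that end an indexed term

class TermTrie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, term, score):
        node = self.root
        for ch in term:
            node = node.children.setdefault(ch, TrieNode())
        node.score = score

    def suggest(self, prefix, k):
        """Return up to k indexed terms starting with prefix, best score first."""
        node = self.root
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return []
        # Collect every term in the subtree below the prefix node.
        found = []
        stack = [(node, prefix)]
        while stack:
            cur, text = stack.pop()
            if cur.score is not None:
                found.append((cur.score, text))
            for ch, child in cur.children.items():
                stack.append((child, text + ch))
        return [text for _, text in heapq.nlargest(k, found)]

trie = TermTrie()
for term, score in [("graph", 0.9), ("grammar", 0.4), ("gradient", 0.7), ("tree", 0.8)]:
    trie.insert(term, score)

suggestions = trie.suggest("gr", 2)
print(suggestions)
```

A production system would likely avoid the full subtree scan, for example by caching the best scores at internal nodes, but that optimization is outside this sketch.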
extending database technology | 2009
Zhiping Zeng; Jianyong Wang; Jun Zhang; Lizhu Zhou
To the best of our knowledge, all existing graph pattern mining algorithms can only mine closed, maximal, or complete sets of frequent subgraphs rather than graph generators, which in some applications are preferable to closed subgraphs according to the Minimum Description Length principle. In this paper, we study a new problem in frequent subgraph mining, called frequent connected graph generator mining, which poses significant challenges due to the underlying complexity associated with frequent subgraph mining as well as the absence of the Apriori property for graph generators. Nevertheless, we present an efficient solution, FOGGER, for this new problem. By exploring some properties of graph generators, two effective pruning techniques, backward edge pruning and forward edge pruning, are proposed to prune the branches of the well-known DFS code enumeration tree that do not contain graph generators. To further improve efficiency, an effective index structure, ADI++, is also devised to facilitate subgraph isomorphism checking. We experimentally evaluate various aspects of FOGGER using both real and synthetic datasets. Our results demonstrate that the two pruning techniques are effective in pruning the unpromising parts of the search space, and that FOGGER is efficient and scalable in terms of the base size of input databases. Meanwhile, the performance study of a graph generator-based classification model shows that the generator-based model is much simpler and can achieve almost the same accuracy for classifying chemical compounds as the closed subgraph-based model.
web intelligence | 2007
Ling Lin; Lizhu Zhou
Mood classification for blogs is useful in helping user-to-agent interaction for a variety of applications involving the web, such as user modeling, recommendation systems, and user interface fields. It is challenging at the same time because of the diversity of the characteristics of bloggers, their experiences, and the way moods are expressed. As an attempt to handle the diversity, we combine multiple sources of evidence for a mood type. A support vector machine based mood classifier (SVMMC) is integrated with a mood flow analyzer (MFA) that incorporates commonsense knowledge obtained from the general public (i.e., ConceptNet), the Affective Norms for English Words (ANEW) list, and mood transitions. In combining the two different approaches, we employ a statistically weighted voting scheme based on the support vector machine (SVM). For evaluation, we have built a mood corpus consisting of over 4000 manually annotated blogs. Our proposed method outperforms SVMMC by 5.68% in precision. The improvement is attributed to the strategy of choosing more trustworthy classification results in an interleaving fashion between the SVMMC and our MFA.
Data-rich webpages are providing an increasingly important data source for web applications. While the problem of data object recognition is intensively discussed, it is mostly addressed as a process separate from the frontier task of relevant webpage identification. In this paper, we propose a method to leverage the classification result of data-rich webpages for efficient and scalable data object recognition. A novel kind of context information is proposed, which can be inferred from the webpage classification and exploited in the bottom-up data object recognition. Experimental results show that the context information brings a 19% improvement in the running efficiency of the bottom-up data object recognition.
knowledge discovery and data mining | 2008
Zhiping Zeng; Jianyong Wang; Lizhu Zhou
Distinguishing patterns represent strong distinguishing knowledge and are very useful for constructing powerful, accurate and robust classifiers. Distinguishing graph patterns (DGPs) are able to capture structural differences between any two categories of graph datasets. However, few previous studies have addressed the discovery of DGPs. In this paper, we are the first to study the problem of mining the complete set of minimal DGPs with any number of positive graphs and arbitrary positive and negative support. We propose a novel algorithm, MDGP-Mine, to discover the complete set of minimal DGPs. The empirical results show that MDGP-Mine is efficient and scalable.
RED'09 Proceedings of the 2nd international conference on Resource discovery | 2009
Ling Lin; Lizhu Zhou
Web databases provide different types of query interfaces to access the data records stored in the backend databases. While most existing works exploit a complex query interface with multiple input fields to perform schema identification of web databases (WDBs), little attention has been paid to how to identify the schema of web databases through a simple query interface (SQI), which has only a single query text input field. This paper proposes a new method of instance-based query probing to identify a WDB's interface and result schema for an SQI. The interface schema identification problem is defined as generating the full-condition query of the SQI, and a novel query probing strategy is proposed. The result schema is also identified based on the result webpages of the SQI's full-condition query, and an extended identification of the non-query attributes is proposed to improve the attribute recall rate. Experimental results on web databases of online shopping for books, movies and mobile phones show that our method is effective and efficient.