Zhenglu Yang
Nankai University
Publication
Featured research published by Zhenglu Yang.
web age information management | 2007
Zhenglu Yang; Botao Wang; Masaru Kitsuregawa
Recent studies on personalized search have shown that user preferences can be learned implicitly. As far as we know, however, these studies neglect that user preferences are likely to change over time. This paper introduces an adaptive scheme to learn the changes of user preferences from click-history data, and a novel ranking mechanism to bias the search results of each user. We propose independent models for long-term and short-term user preferences to compose our user profile. The proposed user profile contains a taxonomic hierarchy for the long-term model and a recently visited page-history buffer for the short-term model. Dynamic adaptation strategies are devised to capture the accumulation and degradation of user preferences, and to adjust the content and structure of the user profile to these changes. Experimental results demonstrate that our scheme efficiently models the up-to-date user profile, and that the ranking mechanism based on this scheme helps web search systems return adequate results in terms of user satisfaction, yielding about 29.14% average improvement over the compared ranking mechanisms in experiments.
international conference on data engineering | 2005
Zhenglu Yang; Masaru Kitsuregawa
Sequential pattern mining is an important research problem because it is the basis of many other applications. Yet mining efficiently is difficult due to an inherent characteristic of the problem: the large size of the data set. In this paper, building on SPAM, we propose a new algorithm called LAst Position INduction Sequential PAttern Mining (abbreviated as LAPIN-SPAM), which can efficiently find all the frequent sequential patterns in a large database. The main difference between our strategy and previous work lies in how to judge whether a sequence is a pattern: previous algorithms use an S-Matrix built by scanning the projected database (PrefixSpan), count support by joining (SPADE), or AND with the candidate item (SPAM). In contrast, LAPIN-SPAM bases this process on the following fact: if an item’s last position is smaller than the current prefix position, the item cannot appear behind the current prefix in the same customer sequence. LAPIN-SPAM can largely reduce the search space during the mining process and is considerably effective in mining sequential patterns. Our experimental results show that LAPIN-SPAM outperforms SPAM by up to three times on all kinds of datasets.
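The last-position fact stated in this abstract can be illustrated with a small sketch. This is not the authors' LAPIN-SPAM implementation: it assumes each customer sequence is a flat list of single items (no itemsets), and the function names `last_positions`, `extendable_items`, and `count_extensions` are hypothetical, introduced only for illustration.

```python
from collections import Counter
from typing import Dict, List, Optional, Sequence


def last_positions(sequence: Sequence[str]) -> Dict[str, int]:
    """Map each item to the index of its last occurrence in one sequence."""
    # Later occurrences overwrite earlier ones, so the final value is the last index.
    return {item: idx for idx, item in enumerate(sequence)}


def extendable_items(sequence: Sequence[str], prefix_end: int) -> List[str]:
    """Items that can still appear behind a prefix ending at index prefix_end.

    An item whose last position is not greater than prefix_end cannot occur
    behind the prefix, so it is skipped without scanning the sequence again.
    """
    last = last_positions(sequence)
    return sorted(item for item, pos in last.items() if pos > prefix_end)


def count_extensions(database: List[Sequence[str]],
                     prefix_ends: List[Optional[int]]) -> Counter:
    """Count, per item, how many sequences allow extending the prefix with it.

    prefix_ends[i] is the index where the prefix ends in database[i],
    or None if the prefix does not occur in that sequence.
    """
    counts: Counter = Counter()
    for sequence, end in zip(database, prefix_ends):
        if end is not None:
            counts.update(extendable_items(sequence, end))
    return counts


if __name__ == "__main__":
    db = [["a", "b", "c"], ["a", "c", "b"], ["b", "a"]]
    # The prefix <a> ends at index 0, 0 and 1 in the three sequences.
    print(count_extensions(db, [0, 0, 1]))
```

In the third sequence the last position of `b` (index 0) is smaller than the prefix position (index 1), so `b` is ruled out by a single comparison.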
Archive | 2013
Guandong Xu; Yu Zong; Zhenglu Yang
Data mining has witnessed substantial advances in recent decades. New research questions and practical challenges have arisen from emerging areas and applications within the various fields closely related to human daily life, e.g. social media and social networking. This book aims to bridge the gap between traditional data mining and the latest advances in newly emerging information services. It explores the extension of well-studied algorithms and approaches into these new research arenas.
database systems for advanced applications | 2007
Zhenglu Yang; Yitong Wang; Masaru Kitsuregawa
Sequential pattern mining is very important because it is the basis of many applications. Although there has been a great deal of effort on sequential pattern mining in recent years, its performance is still far from satisfactory because of two main challenges: large search spaces and ineffectiveness in handling dense datasets. To address these challenges, we have proposed a series of novel algorithms, called LAst Position INduction (LAPIN) sequential pattern mining, based on the simple idea that the last position of an item α is the key to judging whether or not a frequent k-length sequential pattern can be extended to a frequent (k+1)-length pattern by appending the item α to it. LAPIN can largely reduce the search space during the mining process, and is very effective in mining dense datasets. Our performance study demonstrates that LAPIN outperforms PrefixSpan [4] by up to an order of magnitude on long pattern dense datasets.
asia pacific web conference | 2006
Zhenglu Yang; Yitong Wang; Masaru Kitsuregawa
The WWW provides a simple yet effective medium for users to search, browse, and retrieve information on the Web. Web log mining is a promising tool to study user behaviors, which could further benefit web-site designers with better organization and services. Although there are many existing systems that can be used to analyze the traversal paths of web-site visitors, their performance is still far from satisfactory. In this paper, we propose an effective Web log mining system consisting of data preprocessing, sequential pattern mining and visualization. In particular, we propose an efficient sequential mining algorithm (LAPIN_WEB: LAst Position INduction for WEB log), an extension of the previous LAPIN algorithm, to extract user access patterns from traversal paths in Web logs. Our experimental results and performance studies demonstrate that LAPIN_WEB is very efficient and outperforms the well-known PrefixSpan by up to an order of magnitude on real Web log datasets. Moreover, we also implement a visualization tool to help interpret mining results as well as predict users’ future requests.
World Wide Web | 2013
Guandong Xu; Zhenglu Yang; Peter Dolog; Yanchun Zhang; Masaru Kitsuregawa
Keyword-based Web search is a widely used approach for locating information on the Web. However, Web users often have difficulty organizing and formulating appropriate input queries because they lack sufficient domain knowledge, which greatly affects search performance. An effective way to meet the information needs of a search engine user is to suggest Web queries that are topically related to their initial inquiry. Accurately computing query-to-query similarity scores is key to improving the quality of these suggestions. Because queries are short, traditional pseudo-relevance or implicit-relevance based approaches expand the expression of the queries for the similarity computation. They explicitly use a search engine as a complementary source and directly extract additional features (such as terms or URLs) from the top-listed or clicked search results. In this paper, we propose a novel approach that uses the hidden topic as an expandable feature. It has two steps. In the offline model-learning step, a hidden topic model is trained, and for each candidate query, its posterior distribution over the hidden topic space is determined to re-express the query instead of the lexical expression. In the online query suggestion step, after inferring the topic distribution for an input query in a similar way, we calculate the similarity between candidate queries and the input query in terms of their corresponding topic distributions, and produce a suggestion list of candidate queries based on the similarity scores. Our experimental results on two real data sets show that the hidden topic based suggestion is much more efficient than the traditional term or URL based approach, and is effective in finding topically related queries for suggestion.
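The online suggestion step described in this abstract can be sketched as follows. The abstract does not name the similarity measure, so cosine similarity is an assumption here, and the topic distributions are made-up values standing in for the output of a trained hidden topic model. The function names `cosine` and `suggest` are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def cosine(p: Sequence[float], q: Sequence[float]) -> float:
    """Cosine similarity between two topic distributions of equal length."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / norm if norm else 0.0


def suggest(input_topics: Sequence[float],
            candidates: Dict[str, Sequence[float]],
            k: int = 3) -> List[str]:
    """Return the k candidate queries whose topic distributions are most similar."""
    scored = [(cosine(input_topics, dist), query)
              for query, dist in candidates.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [query for _, query in scored[:k]]


if __name__ == "__main__":
    # Assumed posterior topic distributions over three hidden topics.
    candidate_topics = {
        "cheap flights": [0.8, 0.1, 0.1],
        "python tutorial": [0.1, 0.8, 0.1],
        "hotel deals": [0.7, 0.2, 0.1],
    }
    print(suggest([0.9, 0.05, 0.05], candidate_topics, k=2))
```

Because candidate distributions are computed offline, the online step reduces to one inference for the input query plus a similarity scan over stored vectors.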
IEEE Transactions on Knowledge and Data Engineering | 2017
Jun Wang; Jin-Mao Wei; Zhenglu Yang; Shuqin Wang
Feature selection approaches based on mutual information can be roughly categorized into two groups. The first group minimizes the redundancy among features. The second group maximizes the new classification information that features provide for the selected subset. A critical issue is that large new information does not signify little redundancy, and vice versa. Features with large new information but high redundancy may be selected by the second group, and features with low redundancy but little relevance to the classes may be highly scored by the first group. Existing approaches fail to balance the importance of both terms. As such, a new information term denoted as Independent Classification Information is proposed in this paper. It assembles the newly provided information and the preserved information negatively correlated with the redundant information. Redundancy and new information are properly unified and equally treated in the new term. This strategy helps find predictive features that provide large new information and little redundancy. Moreover, independent classification information is proved to be a loose upper bound of the total classification information of the feature subset. Its maximization is conducive to achieving a high global discriminative performance. Comprehensive experiments demonstrate the effectiveness of the new approach.
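The two quantities contrasted in this abstract can be made concrete with a short sketch for discrete data. It computes redundancy as the mutual information I(f; s) between a candidate feature and an already selected feature, and new classification information as the conditional mutual information I(f; C | s). It does not implement the paper's Independent Classification Information term, whose exact formula is not given here; the helper names are hypothetical.

```python
from collections import Counter
from math import log2
from typing import Hashable, Sequence


def entropy(*columns: Sequence[Hashable]) -> float:
    """Joint Shannon entropy (in bits) of one or more aligned discrete columns."""
    rows = list(zip(*columns))
    n = len(rows)
    return -sum((c / n) * log2(c / n) for c in Counter(rows).values())


def mutual_information(x: Sequence[Hashable], y: Sequence[Hashable]) -> float:
    """I(X; Y) = H(X) + H(Y) - H(X, Y)."""
    return entropy(x) + entropy(y) - entropy(x, y)


def conditional_mutual_information(x: Sequence[Hashable],
                                   y: Sequence[Hashable],
                                   z: Sequence[Hashable]) -> float:
    """I(X; Y | Z) = H(X, Z) + H(Y, Z) - H(X, Y, Z) - H(Z)."""
    return entropy(x, z) + entropy(y, z) - entropy(x, y, z) - entropy(z)


if __name__ == "__main__":
    # Toy data: the class label is the XOR of two binary features.
    selected = [0, 0, 1, 1]
    candidate = [0, 1, 0, 1]
    label = [0, 1, 1, 0]
    print("redundancy I(f; s):", mutual_information(candidate, selected))
    print("relevance  I(f; C):", mutual_information(candidate, label))
    print("new info   I(f; C | s):",
          conditional_mutual_information(candidate, label, selected))
```

In this toy XOR case the candidate feature looks irrelevant on its own, yet it carries one full bit of new classification information once the selected feature is known, which shows why the two terms have to be evaluated separately.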
international database engineering and applications symposium | 2006
Zhenglu Yang; Masaru Kitsuregawa; Yitong Wang
Sequential pattern mining is very important because it is the basis of many applications. Yet mining efficiently is difficult due to an inherent characteristic of the problem: the large size of the dataset. Although there has been a great deal of effort on sequential pattern mining in recent years, its performance is still far from satisfactory. In this paper, we propose a new algorithm called passed item deduced sequential pattern mining (abbreviated as PAID), which can efficiently find all the frequent sequential patterns in a large database. The main difference between our strategy and existing work is that other algorithms accumulate the candidate support from scratch in each iteration, whereas PAID reuses the temporary results (support values) of k-length frequent patterns when discovering (k+1)-length patterns, which can greatly reduce the search space in mining sequential patterns. Our experimental results and performance studies show that PAID outperforms previous work by meaningful margins on large datasets.
international joint conference on artificial intelligence | 2011
Zhenglu Yang; Masaru Kitsuregawa
Measuring the semantic similarity between words is an important issue because it is the basis for many applications, such as word sense disambiguation, document summarization, and so forth. Although it has been explored for several decades, most studies focus on improving effectiveness, i.e., precision and recall. In this paper, we address the efficiency issue: given a collection of words, how to efficiently discover the top-k words most semantically similar to the query. This issue is very important for real applications, yet the existing state-of-the-art strategies cannot offer users reasonable performance. Efficient strategies for searching the top-k semantically similar words are proposed. We provide an extensive comparative experimental evaluation demonstrating the advantages of the introduced strategies over the state-of-the-art approaches.
web information systems engineering | 2011
Guandong Xu; Yanhui Gu; Yanchun Zhang; Zhenglu Yang; Masaru Kitsuregawa
Social Annotation Systems have emerged as a popular application with the advance of Web 2.0 technologies. Tags, which users generate from arbitrary words to express their own opinions and perceptions of various resources, provide a new intermediate dimension between users and resources, and are deemed to convey user preference information. Using clustering for topic extraction and combining it with the capture of user preference and resource affiliation is becoming an effective practice in tag-based recommender systems. In this paper, we aim to address these challenges via a topic graph approach. We first propose a Topic Oriented Graph (TOG), which models the user preference and resource affiliation on various topics. Based on the graph, we devise a Topic-Oriented Tag-based Recommendation System (TOAST) that uses preference propagation on the graph. We conduct experiments on two real datasets to demonstrate that our approach outperforms other state-of-the-art algorithms.