Guihong Cao
Université de Montréal
Publications
Featured research published by Guihong Cao.
International ACM SIGIR Conference on Research and Development in Information Retrieval | 2008
Guihong Cao; Jian-Yun Nie; Jianfeng Gao; Stephen E. Robertson
Pseudo-relevance feedback assumes that the most frequent terms in the pseudo-feedback documents are useful for retrieval. In this study, we re-examine this assumption and show that it does not hold in reality: many expansion terms identified by traditional approaches are in fact unrelated to the query and harmful to retrieval. We also show that good expansion terms cannot be distinguished from bad ones merely by their distributions in the feedback documents and in the whole collection. We therefore propose to integrate a term classification process to predict the usefulness of expansion terms; multiple additional features can be incorporated in this process. Our experiments on three TREC collections show that retrieval effectiveness improves markedly when term classification is used. In addition, we demonstrate that good terms should be identified directly according to their likely impact on retrieval effectiveness, i.e., using supervised rather than unsupervised learning.
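The core idea, selecting expansion terms with a supervised classifier rather than by frequency alone, can be sketched as follows. This is an illustrative sketch only: the perceptron, the two features, and the toy data are assumptions, not the paper's actual classifier or feature set.

```python
import math

def term_features(term, feedback_docs, collection_df, n_docs):
    """Features for a candidate expansion term: its frequency in the
    pseudo-feedback documents (token lists) and its inverse document
    frequency in the whole collection."""
    tf = sum(doc.count(term) for doc in feedback_docs)
    idf = math.log(n_docs / (1 + collection_df.get(term, 0)))
    return [tf, idf]

class PerceptronTermClassifier:
    """Tiny perceptron standing in for the paper's term classifier:
    predicts +1 (good expansion term) or -1 (bad)."""
    def __init__(self, n_features):
        self.w = [0.0] * n_features
        self.b = 0.0

    def predict(self, x):
        score = sum(wi * xi for wi, xi in zip(self.w, x)) + self.b
        return 1 if score > 0 else -1

    def train(self, data, epochs=10):
        # classic perceptron updates on (features, label) pairs
        for _ in range(epochs):
            for x, y in data:
                if self.predict(x) != y:
                    self.w = [wi + y * xi for wi, xi in zip(self.w, x)]
                    self.b += y
```

At query time, candidate expansion terms predicted as -1 would simply be dropped before building the expanded query model.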
Conference on Information and Knowledge Management | 2005
Jing Bai; Dawei Song; Peter D. Bruza; Jian-Yun Nie; Guihong Cao
Language Modeling (LM) has been successfully applied to Information Retrieval (IR). However, most existing LM approaches rely only on term occurrences in documents, queries, and document collections. In traditional unigram-based models, terms (or words) are usually considered independent. In some recent studies, dependence models have been proposed to incorporate term relationships into LM, so that links can be created between words in the same sentence and term relationships (e.g., synonymy) can be used to expand the document model. In this study, we further extend this family of dependence models in two ways: (1) term relationships are used to expand the query model instead of the document model, so that the query expansion process can be implemented naturally; (2) we exploit more sophisticated inferential relationships extracted with Information Flow (IF). Information flow relationships are not simply pairwise term relationships, as in previous studies, but hold between a set of terms and another term; they allow for context-dependent query expansion. Our experiments on TREC collections show large and significant improvements with our approach. This study shows that LM is an appropriate framework for implementing effective query expansion.
International ACM SIGIR Conference on Research and Development in Information Retrieval | 2005
Guihong Cao; Jian-Yun Nie; Jing Bai
In this paper, we propose a novel dependency language modeling approach for information retrieval. The approach extends the existing language modeling approach by relaxing the independence assumption. Our goal is to build a language model into which various word relationships can be integrated. In this work, we integrate two types of relationship, extracted from WordNet and from co-occurrence statistics respectively. The integrated model has been tested on several TREC collections. The results show that our model achieves substantial and significant improvements over models without these relationships, clearly demonstrating the benefit of integrating word relationships into language models for IR.
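The kind of integration described, letting a word receive probability from related words in the document model, can be illustrated with a small interpolation sketch. The weight `lam`, the toy relation table, and the function name are assumptions for illustration, not the paper's trained parameters.

```python
def dependency_lm_prob(w, doc_model, relations, lam=0.7):
    """P(w|D): interpolate the word's direct maximum-likelihood probability
    in the document with probability propagated from related words
    (e.g. WordNet synonyms or co-occurring terms).
    relations[w] maps each related word w2 to P(w | w2)."""
    direct = doc_model.get(w, 0.0)
    related = sum(p * doc_model.get(w2, 0.0)
                  for w2, p in relations.get(w, {}).items())
    return lam * direct + (1 - lam) * related

# A document mentioning "car" now gives "automobile" non-zero probability.
doc_model = {"car": 0.1, "engine": 0.05}
relations = {"automobile": {"car": 1.0}}
```

The same interpolation applied to every vocabulary word amounts to a relationship-based expansion of the document model.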
International ACM SIGIR Conference on Research and Development in Information Retrieval | 2007
Jing Bai; Jian-Yun Nie; Guihong Cao; Hugues Bouchard
A user query is an element that specifies an information need, but it is not the only one. Studies in the literature have identified many contextual factors that strongly influence the interpretation of a query. Recent studies have tried to account for a user's interests by creating a user profile. However, a single profile may not be adequate for the variety of queries a user issues. In this study, we propose to use query-specific contexts instead of user-centric ones, including the context around the query and the context within the query. The former specifies the environment of a query, such as the domain of interest, while the latter refers to context words within the query itself, which are particularly useful for selecting relevant term relations. Both types of context are integrated into an IR model based on language modeling. Our experiments on several TREC collections show that each context factor brings significant improvements in retrieval effectiveness.
Language and Technology Conference | 2006
Chin-Yew Lin; Guihong Cao; Jianfeng Gao; Jian-Yun Nie
Until recently there were no common, convenient, and repeatable evaluation methods that could easily be applied to support fast turnaround development of automatic text summarization systems. In this paper, we introduce an information-theoretic approach to the automatic evaluation of summaries based on the Jensen-Shannon divergence between the word distributions of an automatic summary and a set of reference summaries. Several variants of the approach are also considered and compared. The results indicate that the JS-divergence-based evaluation method achieves performance comparable to the common automatic evaluation method ROUGE on the single-document summarization task, while outperforming ROUGE on the multi-document summarization task.
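The measure itself is easy to state concretely. A minimal sketch follows; the tokenization and fixed-vocabulary handling are simplifying assumptions, and the paper's variants add smoothing on top of this.

```python
import math

def kl(p, q):
    # Kullback-Leibler divergence in bits; terms with p_i = 0 contribute 0.
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence between two word distributions given as
    probability lists over the same vocabulary. Symmetric and bounded by 1
    (in bits), unlike plain KL divergence."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def word_dist(tokens, vocab):
    # Unigram distribution of a summary over a fixed vocabulary.
    return [tokens.count(w) / len(tokens) for w in vocab]
```

A candidate summary would then be scored by its average JS divergence to each reference summary's distribution; lower means closer to the references.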
Conference on Information and Knowledge Management | 2007
Guihong Cao; Jianfeng Gao; Jian-Yun Nie; Jing Bai
Dictionary-based approaches to query translation have been widely used in Cross-Language Information Retrieval (CLIR) experiments. However, translation is limited not only by the coverage of the dictionary but also by translation ambiguities. In this paper we propose a novel query translation method that combines other types of term relation to complement dictionary-based translation. This extends the literal query translation to related words, producing a beneficial query-expansion effect in CLIR. We model query translation with Markov chains (MC), where query translation is viewed as a process of expanding query terms to their semantically similar terms in a different language. In the MC, terms and their relationships are modeled as a directed graph, and query translation is performed as a random walk on the graph that propagates probabilities to related terms. This framework allows us to incorporate different types of term relation, either between two languages or within the source or target language. In addition, the iterative training process of the MC assigns higher probabilities to the target terms most related to the original query, thus offering a solution to the translation ambiguity problem. We evaluated our method on three CLIR benchmark collections and obtained significant improvements over traditional dictionary-based approaches.
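The random-walk view can be sketched as a walk with restart over a toy term graph. The restart weight `alpha`, the edge weights, and the English-to-French example terms below are illustrative assumptions, not the paper's trained Markov chain.

```python
def random_walk_expand(query_terms, transitions, alpha=0.8, steps=10):
    """Propagate probability mass from source-language query terms over a
    term graph whose edges can mix dictionary translations and monolingual
    relations. transitions[t] maps t to {neighbor: weight}, weights
    summing to 1. Returns a probability distribution over terms."""
    p0 = {t: 1.0 / len(query_terms) for t in query_terms}
    p = dict(p0)
    for _ in range(steps):
        # restart mass on the original query terms ...
        nxt = {t: (1 - alpha) * p0.get(t, 0.0)
               for t in set(p) | set(transitions)}
        # ... plus mass walked along the graph edges
        for t, mass in p.items():
            for nb, w in transitions.get(t, {}).items():
                nxt[nb] = nxt.get(nb, 0.0) + alpha * mass * w
        p = nxt
    return p
```

With a graph where "bank" translates to "banque" (weight 0.6) and "rive" (weight 0.4), the walk concentrates more mass on "banque", mimicking how the chain favors better-connected translations.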
ACM Transactions on Asian Language Information Processing | 2006
Jian-Yun Nie; Guihong Cao; Jing Bai
Language modeling (LM) has been widely used in IR in recent years. An important operation in LM is smoothing of the document language model. However, current smoothing techniques merely redistribute a portion of term probability according to term frequencies in the whole document collection; no relationships between terms are considered and no inference is involved. In this article, we propose several inferential language models capable of inference using term relationships. The inference operation is carried out through semantic smoothing of either the document model or the query model, resulting in document or query expansion. The proposed models implement some of the logical inference capabilities proposed in previous studies of logical models, but with the simplifications necessary to make them tractable; they are a good compromise between inference power and efficiency. The models have been tested on several TREC collections, in both English and Chinese. We show that integrating term relationships into the language modeling framework consistently improves retrieval effectiveness compared with traditional language models. This study shows that language modeling is a suitable framework for implementing basic inference operations in IR effectively.
Meeting of the Association for Computational Linguistics | 2002
Jianfeng Gao; Joshua T. Goodman; Guihong Cao; Hang Li
The n-gram model is a stochastic model that predicts the next word (the predicted word) given the previous words (the conditional words) in a word sequence. The cluster n-gram model is a variant in which similar words are grouped into the same cluster. It has been demonstrated that using different clusters for predicted and conditional words yields cluster models superior to classical cluster models, which use the same clusters for both. This is the basis of the asymmetric cluster model (ACM) discussed in our study. In this paper, we first present a formal definition of the ACM and then describe in detail the methodology for constructing it. The effectiveness of the ACM is evaluated on a realistic application, namely Japanese Kana-Kanji conversion. Experimental results show substantial improvements of the ACM over classical cluster models and word n-gram models at the same model size. Our analysis shows that the high performance of the ACM lies in the asymmetry of the model.
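The factorization behind the ACM, with separate clusterings for predicted and conditional words, can be sketched as below. The toy cluster assignments and probability tables are assumptions; the actual model is trained from data and includes smoothing and back-off.

```python
def acm_bigram_prob(w, prev, pred_cluster, cond_cluster, p_cc, p_wc):
    """Asymmetric cluster bigram: P(w | prev) is approximated as
    P(C_pred(w) | C_cond(prev)) * P(w | C_pred(w)), where the predicted
    and conditional words use *different* clusterings."""
    cw = pred_cluster[w]        # cluster of the predicted word
    cprev = cond_cluster[prev]  # cluster of the conditional word
    return p_cc.get((cw, cprev), 0.0) * p_wc.get((w, cw), 0.0)
```

Because the two clusterings are optimized separately, the cluster that best predicts a word need not be the cluster that best conditions on it, which is the asymmetry the paper exploits.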
Web Intelligence | 2005
Jing Bai; Jian-Yun Nie; Guihong Cao
Text classification usually assumes a word-based document representation. In this paper, we propose a new approach to integrating compound terms into Bayesian text classification. Compound terms are used as complementary features to single words. An acute problem is accounting for their dependence on their component words. We propose to use smoothing techniques to combine the compound-term and word representations. Experiments were conducted on two corpora; the results show that this approach slightly but consistently improves classification performance on both test corpora.
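One way to realize such smoothing, backing off a compound term's class-conditional probability to its component words, can be sketched as follows. The interpolation weight `beta`, the add-one smoothing, and the `'_'` compound convention are illustrative assumptions, not the paper's exact scheme.

```python
def smoothed_term_prob(term, counts, total, vocab_size, beta=0.5):
    """Class-conditional probability of a feature with add-one smoothing.
    For a compound term (components joined by '_'), interpolate its direct
    estimate with the average estimate of its component words, so that
    sparse compounds still receive sensible probabilities."""
    def p(t):
        return (counts.get(t, 0) + 1) / (total + vocab_size)
    parts = term.split("_")
    if len(parts) == 1:
        return p(term)
    return beta * p(term) + (1 - beta) * sum(p(w) for w in parts) / len(parts)
```

A naive Bayes classifier would then sum the logs of these probabilities over both word and compound features of a document.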
Asia-Pacific Web Conference | 2004
Guihong Cao; Dawei Song; Peter D. Bruza
One way of representing semantics is via a high-dimensional conceptual space constructed from lexical co-occurrence. Concepts (words) are represented as vectors whose dimensions are other words. Because words are represented as dimensional objects, clustering techniques can be applied to compute word clusters. Conventional clustering algorithms, e.g., the K-means method, normally produce crisp clusters in which an object is assigned to exactly one cluster, which is sometimes undesirable. A fuzzy membership function can therefore be applied to K-means clustering, modeling the degree to which an object belongs to each cluster. This paper introduces a fuzzy K-means clustering algorithm and shows how it is applied to word clustering in the high-dimensional semantic space constructed by a cognitively motivated semantic space model, namely the Hyperspace Analogue to Language. A case study demonstrates that the method is promising.
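A minimal fuzzy K-means (fuzzy c-means) sketch in the standard form is given below. The fuzzifier `m = 2` and the deterministic initialization are illustrative choices; the paper applies the algorithm to high-dimensional HAL vectors rather than the small vectors used here.

```python
def fuzzy_kmeans(points, k, m=2.0, iters=20):
    """Fuzzy c-means over vectors (lists of floats). Returns (centroids, u)
    where u[i][j] is the degree to which point i belongs to cluster j;
    each row of u sums to 1, unlike crisp K-means assignments."""
    n, dim = len(points), len(points[0])
    # deterministic init: evenly spaced points as starting centroids
    cents = [list(points[(i * n) // k]) for i in range(k)]
    u = []
    for _ in range(iters):
        # membership update: u_ij = 1 / sum_l (d_ij / d_il)^(2/(m-1))
        u = []
        for p in points:
            d = [max(1e-12, sum((pi - ci) ** 2
                                for pi, ci in zip(p, c)) ** 0.5)
                 for c in cents]
            u.append([1.0 / sum((d[j] / d[l]) ** (2.0 / (m - 1.0))
                                for l in range(k))
                      for j in range(k)])
        # centroid update: weighted mean with weights u_ij^m
        for j in range(k):
            wsum = sum(u[i][j] ** m for i in range(n))
            cents[j] = [sum(u[i][j] ** m * points[i][t]
                            for i in range(n)) / wsum
                        for t in range(dim)]
    return cents, u
```

For word clustering, each point would be a word's HAL co-occurrence vector, and the graded memberships let a polysemous word belong partly to several clusters.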