Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Yuanhua Lv is active.

Publication


Featured research published by Yuanhua Lv.


Conference on Information and Knowledge Management | 2009

A comparative study of methods for estimating query language models with pseudo feedback

Yuanhua Lv; ChengXiang Zhai

We systematically compare five representative state-of-the-art methods for estimating query language models with pseudo feedback in ad hoc information retrieval, including two variants of the relevance language model, two variants of the mixture feedback model, and the divergence minimization estimation method. Our experimental results show that a variant of the relevance model and a variant of the mixture model tend to outperform the other methods. We further propose several heuristics that are intuitively related to the good retrieval performance of an estimation method, and show that the variations in how these heuristics are implemented in different methods provide a good explanation of many empirical observations.
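
For concreteness, here is a minimal sketch of one of the compared methods, an RM1-style relevance model: feedback documents are weighted by their Dirichlet-smoothed query likelihood, and their word distributions are averaged accordingly. The smoothing constant, fallback probabilities, and toy data below are illustrative, not the paper's exact setup.

```python
import math
from collections import Counter

def relevance_model(query, feedback_docs, collection_lm, mu=1000.0):
    """RM1-style estimate: p(w|R) is a weighted average of feedback-document
    language models, each weighted by its Dirichlet-smoothed query likelihood."""
    doc_weights = []
    for doc in feedback_docs:
        tf, dlen = Counter(doc), len(doc)
        log_pqd = sum(math.log((tf[w] + mu * collection_lm.get(w, 1e-9)) / (dlen + mu))
                      for w in query)
        doc_weights.append(math.exp(log_pqd))
    z = sum(doc_weights) or 1.0
    p_w_r = Counter()
    for doc, w_d in zip(feedback_docs, doc_weights):
        tf, dlen = Counter(doc), len(doc)
        for w, c in tf.items():
            p_w_r[w] += (w_d / z) * (c / dlen)
    return p_w_r

# toy usage with illustrative documents and collection probabilities
docs = [["query", "language", "model", "feedback"],
        ["pseudo", "feedback", "model", "model"]]
coll = {"query": 0.1, "language": 0.1, "model": 0.2, "feedback": 0.1, "pseudo": 0.05}
print(relevance_model(["feedback", "model"], docs, coll).most_common(3))
```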


International ACM SIGIR Conference on Research and Development in Information Retrieval | 2010

Positional relevance model for pseudo-relevance feedback

Yuanhua Lv; ChengXiang Zhai

Pseudo-relevance feedback is an effective technique for improving retrieval results. Traditional feedback algorithms use a whole feedback document as a unit to extract words for query expansion, which is not optimal as a document may cover several different topics and thus contain much irrelevant information. In this paper, we study how to effectively select from feedback documents those words that are focused on the query topic based on positions of terms in feedback documents. We propose a positional relevance model (PRM) to address this problem in a unified probabilistic way. The proposed PRM is an extension of the relevance model to exploit term positions and proximity so as to assign more weight to words closer to query words, based on the intuition that words closer to query words are more likely to be related to the query topic. We develop two methods to estimate PRM based on different sampling processes. Experimental results on two large retrieval datasets show that the proposed PRM is effective and robust for pseudo-relevance feedback, significantly outperforming the relevance model in both document-based feedback and passage-based feedback.
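
A minimal sketch of the proximity intuition, not the paper's full PRM estimation: credit each position by a Gaussian kernel of its distance to the nearest query-term occurrence, so words near query words receive more weight. The kernel choice and width here are assumptions.

```python
import math
from collections import Counter

def positional_weights(doc, query_terms, sigma=25.0):
    """Toy position-based weighting in the spirit of PRM: each position's term
    gets credit from a Gaussian kernel of its distance to the nearest
    query-term occurrence, so words near query words count more."""
    q_positions = [i for i, w in enumerate(doc) if w in query_terms]
    if not q_positions:
        return {}
    weights = Counter()
    for i, w in enumerate(doc):
        d = min(abs(i - j) for j in q_positions)
        weights[w] += math.exp(-d * d / (2.0 * sigma * sigma))
    total = sum(weights.values())
    return {w: v / total for w, v in weights.items()}

doc = "retrieval feedback model uses term positions near query words".split()
print(positional_weights(doc, {"query", "feedback"}))
```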


Conference on Information and Knowledge Management | 2009

Adaptive relevance feedback in information retrieval

Yuanhua Lv; ChengXiang Zhai

Relevance feedback has proven very effective for improving retrieval accuracy. A difficult yet important problem in all relevance feedback methods is how to optimally balance the original query and the feedback information. In current feedback methods, the balance parameter is usually set to a fixed value across all queries and collections. However, due to differences in queries and feedback documents, this balance parameter should be optimized for each query and each set of feedback documents. In this paper, we present a learning approach to adaptively predict the optimal balance coefficient for each query and each collection. We propose three heuristics to characterize the balance between the query and the feedback information. Taking these three heuristics as a road map, we explore a number of features and combine them using a regression approach to predict the balance coefficient. Our experiments show that the proposed adaptive relevance feedback is more robust and effective than regular fixed-coefficient feedback.
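
The balance in question is the interpolation coefficient between the original query model and the feedback model. A minimal sketch follows, with a hypothetical linear predictor standing in for the paper's learned regressor; features and weights are illustrative.

```python
def interpolate(query_lm, feedback_lm, alpha):
    """Updated query model: (1 - alpha) * p(w|Q) + alpha * p(w|F)."""
    vocab = set(query_lm) | set(feedback_lm)
    return {w: (1 - alpha) * query_lm.get(w, 0.0)
               + alpha * feedback_lm.get(w, 0.0) for w in vocab}

def predict_alpha(features, weights, bias=0.5):
    """Hypothetical linear regressor for the balance coefficient, clipped to
    [0, 1]; the paper learns such a predictor from heuristic features of the
    query and the feedback documents."""
    a = bias + sum(x * w for x, w in zip(features, weights))
    return min(1.0, max(0.0, a))

q_lm = {"jaguar": 1.0}
f_lm = {"car": 0.6, "jaguar": 0.4}
print(interpolate(q_lm, f_lm, predict_alpha([0.3, 0.1], [0.5, -0.2])))
```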


Conference on Information and Knowledge Management | 2011

Lower-bounding term frequency normalization

Yuanhua Lv; ChengXiang Zhai

In this paper, we reveal a common deficiency of the current retrieval models: the component of term frequency (TF) normalization by document length is not lower-bounded properly; as a result, very long documents tend to be overly penalized. In order to analytically diagnose this problem, we propose two desirable formal constraints to capture the heuristic of lower-bounding TF, and use constraint analysis to examine several representative retrieval functions. Analysis results show that all these retrieval functions can only satisfy the constraints for a certain range of parameter values and/or for a particular set of query terms. Empirical results further show that the retrieval performance tends to be poor when the parameter is out of the range or the query term is not in the particular set. To solve this common problem, we propose a general and efficient method to introduce a sufficiently large lower bound for TF normalization which can be shown analytically to fix or alleviate the problem. Our experimental results demonstrate that the proposed method, incurring almost no additional computational cost, can be applied to state-of-the-art retrieval functions, such as Okapi BM25, language models, and the divergence from randomness approach, to significantly improve the average precision, especially for verbose queries.
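
Applied to BM25, this fix yields the variant commonly known as BM25+. A minimal sketch of the per-term score under that formulation, with a constant delta as the lower bound (parameter values are illustrative):

```python
import math

def bm25_plus(tf, df, doc_len, avg_doc_len, num_docs, k1=1.2, b=0.75, delta=1.0):
    """Per-term BM25+ score for a term that occurs in the document: the added
    delta lower-bounds the length-normalized TF component, so a very long
    document still earns at least idf * delta for a matching term."""
    idf = math.log((num_docs + 1) / df)
    norm = k1 * (1 - b + b * doc_len / avg_doc_len)
    return idf * ((k1 + 1) * tf / (norm + tf) + delta)

# one occurrence in a very long document: plain BM25's TF part nearly
# vanishes, while BM25+ keeps the idf * delta floor
print(bm25_plus(tf=1, df=100, doc_len=5000, avg_doc_len=300, num_docs=10000))
```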


International ACM SIGIR Conference on Research and Development in Information Retrieval | 2011

A boosting approach to improving pseudo-relevance feedback

Yuanhua Lv; ChengXiang Zhai; Wan Chen

Pseudo-relevance feedback has proven effective for improving average retrieval performance. Unfortunately, many experiments have shown that although pseudo-relevance feedback helps many queries, it also often hurts many other queries, limiting its usefulness in real retrieval applications. Thus an important yet difficult challenge is to improve the overall effectiveness of pseudo-relevance feedback without sacrificing the performance of individual queries too much. In this paper, we propose a novel learning algorithm, FeedbackBoost, based on the boosting framework, to improve pseudo-relevance feedback by optimizing the combination of a set of basis feedback algorithms using a loss function defined to directly measure both robustness and effectiveness. FeedbackBoost can potentially accommodate many basis feedback methods as features in the model, making the proposed method a general optimization framework for pseudo-relevance feedback. As an application, we apply FeedbackBoost to improve pseudo feedback based on language models by combining different document weighting strategies. The experimental results demonstrate that FeedbackBoost can achieve better average precision while dramatically reducing the number and magnitude of feedback failures, as compared to three representative pseudo feedback methods and a standard learning-to-rank approach for pseudo feedback.
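
A toy sketch of the two ingredients described above, a weighted combination of basis feedback models and a loss that counts per-query failures; the actual FeedbackBoost objective and boosting procedure are more involved, and the weights here are simply given rather than learned.

```python
def combine_feedback(basis_models, alphas):
    """Weighted combination of basis feedback language models. FeedbackBoost
    learns such weights by boosting; here they are fixed inputs."""
    combined = {}
    for model, a in zip(basis_models, alphas):
        for w, p in model.items():
            combined[w] = combined.get(w, 0.0) + a * p
    z = sum(combined.values()) or 1.0
    return {w: p / z for w, p in combined.items()}

def robustness_aware_loss(baseline_ap, feedback_ap):
    """Toy objective in the paper's spirit: penalize per-query failures
    (feedback falling below the no-feedback baseline) in addition to
    rewarding the average gain."""
    failures = sum(max(0.0, b - f) for b, f in zip(baseline_ap, feedback_ap))
    avg_gain = sum(f - b for b, f in zip(baseline_ap, feedback_ap)) / len(baseline_ap)
    return failures - avg_gain

m1 = {"car": 0.7, "jaguar": 0.3}
m2 = {"cat": 0.5, "jaguar": 0.5}
print(combine_feedback([m1, m2], [0.8, 0.2]))
print(robustness_aware_loss([0.2, 0.3], [0.25, 0.28]))
```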


Conference on Information and Knowledge Management | 2011

Adaptive term frequency normalization for BM25

Yuanhua Lv; ChengXiang Zhai

A key component of BM25 contributing to its success is its sublinear term frequency (TF) normalization formula. The scale and shape of this TF normalization component are controlled by a parameter k1, which is generally set to a term-independent constant. We hypothesize, and show empirically, that in order to optimize retrieval performance this parameter should be set in a term-specific way. Following this intuition, we propose an information gain measure to directly estimate the contributions of repeated term occurrences, which is then exploited to fit the BM25 function and predict a term-specific k1. Our experimental results show that the proposed approach, without needing any training data, can efficiently and automatically estimate a term-specific k1, and is more effective and robust than standard BM25.
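
For reference, this is the role k1 plays in BM25's TF saturation, which the paper proposes to set per term; length normalization is omitted and the values below are illustrative.

```python
def bm25_tf(tf, k1):
    """BM25's sublinear TF saturation (length normalization omitted);
    k1 controls how quickly repeated occurrences stop adding score."""
    return (k1 + 1) * tf / (k1 + tf)

# a smaller, term-specific k1 saturates faster: for such a term the second
# and later occurrences contribute little extra evidence
for k1 in (0.5, 1.2, 2.0):
    print(k1, [round(bm25_tf(tf, k1), 3) for tf in (1, 2, 5, 10)])
```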


Conference on Information and Knowledge Management | 2014

Revisiting the Divergence Minimization Feedback Model

Yuanhua Lv; ChengXiang Zhai

Pseudo-relevance feedback (PRF) has proven to be an effective strategy for improving retrieval accuracy. In this paper, we revisit a PRF method based on statistical language models, namely the divergence minimization model (DMM). DMM not only has an apparently sound theoretical foundation, but has also been shown to satisfy most retrieval constraints. However, it turns out to perform surprisingly poorly in many previous experiments. We investigate the cause and reveal that DMM inappropriately handles the entropy of the feedback model, which yields a highly skewed feedback model. To address this problem, we propose a maximum-entropy divergence minimization model (MEDMM) that introduces an entropy term to regularize DMM. Our experiments on various TREC collections demonstrate that MEDMM not only works much better than DMM but also outperforms several other state-of-the-art PRF methods, especially on web collections. Moreover, unlike existing PRF models that have to be combined with the original query to perform well, MEDMM works effectively even without being combined with the original query.
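
A rough sketch of the exponential-family closed form such entropy-regularized estimates take; the exact weighting in the paper's MEDMM may differ from this simplification, and the fallback probabilities are illustrative.

```python
import math

def medmm_like(feedback_lms, collection_lm, lam=1.0, beta=0.5):
    """Sketch of an entropy-regularized divergence-minimization estimate:
    p(w) proportional to exp((mean_i log p(w|d_i) - beta * log p(w|C)) / lam).
    A larger lam (stronger entropy regularization) flattens the distribution,
    countering the skew attributed to plain DMM."""
    vocab = set().union(*feedback_lms)
    n = len(feedback_lms)
    scores = {}
    for w in vocab:
        avg_log = sum(math.log(lm.get(w, 1e-9)) for lm in feedback_lms) / n
        scores[w] = math.exp((avg_log - beta * math.log(collection_lm.get(w, 1e-9))) / lam)
    z = sum(scores.values())
    return {w: s / z for w, s in scores.items()}

lms = [{"topic": 0.5, "model": 0.5}, {"topic": 0.3, "noise": 0.7}]
print(medmm_like(lms, {"topic": 0.01, "model": 0.02, "noise": 0.3}))
```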


ACM Transactions on Information Systems | 2015

A Pólya Urn Document Language Model for Improved Information Retrieval

Ronan Cummins; Jiaul H. Paik; Yuanhua Lv

The multinomial language model has been one of the most effective models of retrieval for more than a decade. However, the multinomial distribution does not model one important linguistic phenomenon relating to term dependency—that is, the tendency of a term to repeat itself within a document (i.e., word burstiness). In this article, we model document generation as a random process with reinforcement (a multivariate Pólya process) and develop a Dirichlet compound multinomial language model that captures word burstiness directly. We show that the new reinforced language model can be computed as efficiently as current retrieval models, and with experiments on an extensive set of TREC collections, we show that it significantly outperforms the state-of-the-art language model for a number of standard effectiveness metrics. Experiments also show that the tuning parameter in the proposed model is more robust than that in the multinomial language model. Furthermore, we develop a constraint for the verbosity hypothesis and show that the proposed model adheres to the constraint. Finally, we show that the new language model essentially introduces a measure closely related to idf, which gives theoretical justification for combining the term and document event spaces in tf-idf type schemes.
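
A minimal sketch of the Dirichlet compound multinomial (multivariate Pólya) likelihood the article builds on; the pseudo-count vector alpha and the toy data are illustrative.

```python
import math

def polya_log_likelihood(doc_tf, alpha):
    """Log-likelihood of a bag of words under a Dirichlet compound multinomial
    (multivariate Polya) model, dropping the count-only multinomial
    coefficient. Each repeat of a word effectively raises its own probability,
    which is how the model captures burstiness."""
    alpha0 = sum(alpha.values())
    n = sum(doc_tf.values())
    ll = math.lgamma(alpha0) - math.lgamma(alpha0 + n)
    for w, c in doc_tf.items():
        a = alpha.get(w, 1e-6)
        ll += math.lgamma(a + c) - math.lgamma(a)
    return ll

# a bursty document (repeats) vs. a spread-out one of the same length
print(polya_log_likelihood({"model": 3, "urn": 1},
                           {"model": 0.5, "urn": 0.2, "the": 2.0}))
print(polya_log_likelihood({"model": 1, "urn": 1, "the": 2},
                           {"model": 0.5, "urn": 0.2, "the": 2.0}))
```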


European Conference on Information Retrieval | 2012

A log-logistic model-based interpretation of TF normalization of BM25

Yuanhua Lv; ChengXiang Zhai

The effectiveness of the BM25 retrieval function is mainly due to its sublinear term frequency (TF) normalization component, which is controlled by a parameter k1. Although BM25 was derived from the classic probabilistic retrieval model, it has so far been unclear how to interpret its parameter k1 probabilistically, making it hard to optimize the setting of this parameter. In this paper, we provide a novel probabilistic interpretation of the BM25 TF normalization and its parameter k1, based on a log-logistic model for the probability of seeing a document in the collection with a given level of TF. The proposed interpretation allows us to derive different approaches to estimating k1 based solely on the current collection, without requiring any training data, thus effectively eliminating one free parameter from BM25. Our experimental results show that the proposed approaches can accurately predict the optimal k1 without training data and achieve retrieval performance better than or comparable to a well-tuned BM25 in which k1 is optimized on training data.
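
A sketch of the key identity plus one plausible training-free estimator; the paper derives its estimators more carefully from the log-logistic fit, so the median-based rule below is an illustrative stand-in.

```python
def bm25_tf_saturation(tf, k1):
    """The normalized BM25 TF component tf / (tf + k1) is exactly a
    log-logistic CDF with shape 1 and scale k1, so k1 can be read as the
    median of the fitted TF distribution."""
    return tf / (tf + k1)

def estimate_k1(tf_samples):
    """Hypothetical moment-style estimator: set k1 to the sample median so the
    fitted curve matches the observed term frequencies in the collection."""
    s = sorted(tf_samples)
    return s[len(s) // 2]

print(estimate_k1([1, 1, 2, 3, 8]))               # -> 2
print(bm25_tf_saturation(2, estimate_k1([1, 1, 2, 3, 8])))  # -> 0.5 at the median
```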


Conference on Information and Knowledge Management | 2012

Mining long-lasting exploratory user interests from search history

Bin Tan; Yuanhua Lv; ChengXiang Zhai

A user's web search history contains many valuable search patterns. In this paper, we study search patterns that represent a user's long-lasting and exploratory search interests. By focusing on long-lastingness and exploratoriness, we are able to discover the search patterns that are most useful for recommending new and relevant information to the user. Our approach is based on language modeling and clustering, and is specifically designed to handle web search logs. We run our algorithm on a real web search log collection and evaluate its performance in a novel simulation study on the same search log dataset. Experimental results support our hypothesis that long-lastingness and exploratoriness are necessary for generating successful recommendations. Our algorithm is shown to effectively discover such search interest patterns and is thus directly useful for making recommendations based on personal search history.
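
A toy illustration of scoring a candidate interest cluster on the two properties the paper targets; both the representation and the scores are illustrative, not the paper's exact definitions.

```python
def cluster_score(cluster):
    """Toy scoring of a search-interest cluster: long-lastingness as the time
    span the pattern covers, exploratoriness as vocabulary growth per query.
    `cluster` is a list of (timestamp_in_days, query_terms) pairs."""
    times = [t for t, _ in cluster]
    vocab = set().union(*(terms for _, terms in cluster))
    return max(times) - min(times), len(vocab) / len(cluster)

history = [(1, {"python", "tutorial"}), (30, {"python", "asyncio"}),
           (90, {"python", "typing", "generics"})]
print(cluster_score(history))  # long time span, growing vocabulary
```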

Collaboration


Dive into Yuanhua Lv's collaborations.

Top Co-Authors

Jiawei Zhang
University of Illinois at Chicago

Philip S. Yu
University of Illinois at Chicago

Zhaohui Wu
Pennsylvania State University