Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Yanyan Lan is active.

Publication


Featured research published by Yanyan Lan.


International World Wide Web Conference | 2013

A biterm topic model for short texts

Xiaohui Yan; Jiafeng Guo; Yanyan Lan; Xueqi Cheng

Uncovering the topics within short texts, such as tweets and instant messages, has become an important task for many content analysis applications. However, directly applying conventional topic models (e.g. LDA and PLSA) on such short texts may not work well. The fundamental reason is that conventional topic models implicitly capture document-level word co-occurrence patterns to reveal topics, and thus suffer from the severe data sparsity in short documents. In this paper, we propose a novel way of modeling topics in short texts, referred to as the biterm topic model (BTM). Specifically, in BTM we learn the topics by directly modeling the generation of word co-occurrence patterns (i.e. biterms) in the whole corpus. The major advantages of BTM are that 1) BTM explicitly models the word co-occurrence patterns to enhance the topic learning; and 2) BTM uses the aggregated patterns in the whole corpus for learning topics to solve the problem of sparse word co-occurrence patterns at the document level. We carry out extensive experiments on real-world short text collections. The results demonstrate that our approach can discover more prominent and coherent topics, and significantly outperforms baseline methods on several evaluation metrics. Furthermore, we find that BTM can outperform LDA even on normal texts, showing the potential generality and wider applicability of the new topic model.
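To make the biterm idea concrete, below is a minimal Python sketch (not the authors' released code) of how biterms can be extracted from each short document and aggregated over the whole corpus before topic inference; the tokenization and the use of distinct words per document are illustrative assumptions.

```python
from itertools import combinations
from collections import Counter

def extract_biterms(docs):
    """Collect unordered word pairs (biterms) within each short document
    and aggregate their counts over the whole corpus."""
    corpus_biterms = Counter()
    for doc in docs:
        tokens = set(doc.lower().split())  # naive tokenization, distinct words only (assumption)
        for w1, w2 in combinations(sorted(tokens), 2):
            corpus_biterms[(w1, w2)] += 1
    return corpus_biterms

# Toy usage on tweet-like short texts
docs = ["apple releases new iphone",
        "new iphone camera review",
        "stock market falls again"]
print(extract_biterms(docs).most_common(3))
```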


IEEE Transactions on Knowledge and Data Engineering | 2014

BTM: Topic Modeling over Short Texts

Xueqi Cheng; Xiaohui Yan; Yanyan Lan; Jiafeng Guo

Short texts are popular on today's web, especially with the emergence of social media. Inferring topics from large-scale short texts has become a critical but challenging task for many content analysis applications. Conventional topic models such as latent Dirichlet allocation (LDA) and probabilistic latent semantic analysis (PLSA) learn topics from document-level word co-occurrences by modeling each document as a mixture of topics, so their inference suffers from the sparsity of word co-occurrence patterns in short texts. In this paper, we propose a novel way of short text topic modeling, referred to as the biterm topic model (BTM). BTM learns topics by directly modeling the generation of word co-occurrence patterns (i.e., biterms) in the corpus, making the inference effective with the rich corpus-level information. To cope with large-scale short text data, we further introduce two online algorithms for BTM for efficient topic learning. Experiments on real-world short text collections show that BTM can discover more prominent and coherent topics, and significantly outperforms the state-of-the-art baselines. We also demonstrate the appealing performance of the two online BTM algorithms in both time efficiency and topic learning.
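Under the generative story sketched in the abstract, the probability of a biterm b = (w_i, w_j) is obtained by marginalizing over corpus-level topics; in the usual notation (θ for the corpus topic distribution, φ for the topic-word distributions), this is roughly:

```latex
P(b) = \sum_{z} P(z)\, P(w_i \mid z)\, P(w_j \mid z) = \sum_{z} \theta_z \, \phi_{i \mid z} \, \phi_{j \mid z}
```

Both words of a biterm are thus drawn from the same topic, which is what lets BTM pool co-occurrence evidence across the whole corpus rather than within a single sparse document.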


International Conference on Machine Learning | 2009

Generalization analysis of listwise learning-to-rank algorithms

Yanyan Lan; Tie-Yan Liu; Zhi-Ming Ma; Hang Li

This paper presents a theoretical framework for ranking, and demonstrates how to perform generalization analysis of listwise ranking algorithms using the framework. Many learning-to-rank algorithms have been proposed in recent years. Among them, the listwise approach has shown higher empirical ranking performance compared to the other approaches. However, as far as we know, there has been no theoretical study of the listwise approach. In this paper, we propose a theoretical framework for ranking, which can naturally describe various listwise learning-to-rank algorithms. Within this framework, we prove a theorem that gives a generalization bound for a listwise ranking algorithm on the basis of the Rademacher average of the class of compound functions. The compound functions take listwise loss functions as outer functions and ranking models as inner functions. We then compute the Rademacher averages for the existing listwise algorithms ListMLE, ListNet, and RankCosine. We also discuss the tightness of the bounds in different situations with regard to the list length and the transformation function.
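For orientation, Rademacher-average bounds typically take the following generic shape (this is the standard form for a loss bounded in [0, 1], not the paper's exact theorem); with probability at least 1 - δ over a sample of size m, for every f in the compound function class F:

```latex
R(f) \;\le\; \hat{R}_m(f) \;+\; 2\,\mathfrak{R}_m(\mathcal{F}) \;+\; \sqrt{\frac{\ln(1/\delta)}{2m}}
```

The paper's contribution is then to bound the Rademacher average of the compound class induced by each listwise loss and ranking model, which yields the dependence on list length and transformation function discussed above.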


International ACM SIGIR Conference on Research and Development in Information Retrieval | 2014

Learning for search result diversification

Yadong Zhu; Yanyan Lan; Jiafeng Guo; Xueqi Cheng; Shuzi Niu

Search result diversification has gained attention as a way to tackle the ambiguous or multi-faceted information needs of users. Most existing methods for this problem utilize a heuristic, predefined ranking function, where limited features can be incorporated and extensive tuning is required for different settings. In this paper, we address search result diversification as a learning problem, and introduce a novel relational learning-to-rank approach to formulate the task. However, the definitions of the ranking function and loss function for the diversification problem are challenging. In our work, we first show, from both empirical and theoretical aspects, that diverse ranking is in general a sequential selection process. On this basis, we define the ranking function as the combination of a relevance score and a diversity score between the current document and those previously selected, and the loss function as the likelihood loss of the ground truth based on the Plackett-Luce model, which can naturally model the sequential generation of a diverse ranking list. Stochastic gradient descent is then employed to conduct the unconstrained optimization, and the prediction of a diverse ranking list is produced by a sequential selection process based on the learned ranking function. The experimental results on the public TREC datasets demonstrate the effectiveness and robustness of our approach.
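The sequential selection view can be illustrated with a minimal Python sketch: at each step the next document maximizes a combined relevance-plus-diversity score against those already selected. The scoring functions below are toy placeholders, not the paper's learned model.

```python
def diverse_ranking(candidates, relevance, diversity, k):
    """Greedy sequential selection: relevance(d) scores a document on its own,
    diversity(d, selected) scores its novelty w.r.t. already-selected documents."""
    selected, remaining = [], list(candidates)
    while remaining and len(selected) < k:
        best = max(remaining, key=lambda d: relevance(d) + diversity(d, selected))
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy usage: the second "sports" document is skipped in favor of a novel topic
rel = {"doc_sports_1": 0.9, "doc_sports_2": 0.8, "doc_music_1": 0.7}
topic = lambda d: d.split("_")[1]
div = lambda d, sel: -0.5 * sum(topic(d) == topic(s) for s in sel)
print(diverse_ranking(rel, rel.get, div, 2))  # ['doc_sports_1', 'doc_music_1']
```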


International ACM SIGIR Conference on Research and Development in Information Retrieval | 2012

Top-k learning to rank: labeling, ranking and evaluation

Shuzi Niu; Jiafeng Guo; Yanyan Lan; Xueqi Cheng

In this paper, we propose a novel top-k learning-to-rank framework, which involves a labeling strategy, a ranking model and an evaluation measure. The motivation comes from the difficulty of obtaining reliable relevance judgments from human assessors when applying learning to rank in real search systems. The traditional absolute relevance judgment method is difficult in both gradation specification and human assessing, resulting in a high level of disagreement among judgments. The pairwise preference judgment, a good alternative, is often criticized for increasing the complexity of judgment from O(n) to O(n log n). Considering the fact that users mainly care about top-ranked search results, we propose a novel top-k labeling strategy which adopts the pairwise preference judgment to generate the top-k ordered items from n documents (i.e. the top-k ground truth) in a manner similar to that of HeapSort. As a result, the complexity of judgment is reduced to O(n log k). With the top-k ground truth, traditional ranking models (e.g. pairwise or listwise models) and evaluation measures (e.g. NDCG) no longer fit the data set. Therefore, we introduce a new ranking model, namely FocusedRank, which fully captures the characteristics of the top-k ground truth. We also extend the widely used evaluation measures NDCG and ERR to be applicable to the top-k ground truth, referred to as κ-NDCG and κ-ERR, respectively. Finally, we conduct extensive experiments on benchmark data collections to demonstrate the efficiency and effectiveness of our top-k labeling strategy and ranking models.
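A minimal sketch of the selection idea, assuming only a pairwise preference oracle prefer(a, b) (in practice answered by a human assessor): a size-k heap keeps the comparison cost near O(n log k). This mirrors the HeapSort-style strategy in spirit; the paper's exact labeling protocol may differ.

```python
import heapq
from functools import total_ordering

def top_k_by_preference(docs, prefer, k):
    """Select the k most preferred items using only pairwise judgments
    prefer(a, b) -> True iff a is preferred over b."""

    @total_ordering
    class Wrapped:                        # adapts the oracle to heapq's ordering
        def __init__(self, d): self.d = d
        def __lt__(self, other): return prefer(other.d, self.d)  # "smaller" = less preferred
        def __eq__(self, other): return self.d == other.d

    heap = []
    for d in docs:
        if len(heap) < k:
            heapq.heappush(heap, Wrapped(d))
        elif prefer(d, heap[0].d):        # d beats the weakest of the current top-k
            heapq.heapreplace(heap, Wrapped(d))
    ranked = [heapq.heappop(heap).d for _ in range(len(heap))]
    return ranked[::-1]                   # best first

# Toy oracle: prefer higher scores
scores = {"a": 3, "b": 1, "c": 5, "d": 2}
print(top_k_by_preference(scores, lambda x, y: scores[x] > scores[y], 2))  # ['c', 'a']
```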


International Conference on Machine Learning | 2008

Query-level stability and generalization in learning to rank

Yanyan Lan; Tie-Yan Liu; Tao Qin; Zhi-Ming Ma; Hang Li

This paper is concerned with the generalization ability of learning to rank algorithms for information retrieval (IR). We point out that the key for addressing the learning problem is to look at it from the viewpoint of query. We define a number of new concepts, including query-level loss, query-level risk, and query-level stability. We then analyze the generalization ability of learning to rank algorithms by giving query-level generalization bounds to them using query-level stability as a tool. Such an analysis is very helpful for us to derive more advanced algorithms for IR. We apply the proposed theory to the existing algorithms of Ranking SVM and IRSVM. Experimental results on the two algorithms verify the correctness of the theoretical analysis.
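As a point of reference, uniform-stability arguments generally yield bounds of the following shape (a generic form in the spirit of Bousquet and Elisseeff, not the paper's query-level theorem), where β is the stability coefficient, n the number of training queries, and the loss is bounded by M:

```latex
R(f) \;\le\; \hat{R}_n(f) \;+\; 2\beta \;+\; \bigl(4n\beta + M\bigr)\sqrt{\frac{\ln(1/\delta)}{2n}}
```

The query-level analysis in the paper plays the same role: once the query-level stability of an algorithm such as Ranking SVM or IRSVM is bounded, a generalization bound of this flavor follows.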


International ACM SIGIR Conference on Research and Development in Information Retrieval | 2015

Learning Maximal Marginal Relevance Model via Directly Optimizing Diversity Evaluation Measures

Long Xia; Jun Xu; Yanyan Lan; Jiafeng Guo; Xueqi Cheng

In this paper we address the issue of learning a ranking model for search result diversification. In this task, a model concerned with both query-document relevance and document diversity is automatically created from training data. Ideally, a diverse ranking model would be designed to meet the criterion of maximal marginal relevance, selecting documents that have the least similarity to previously selected documents. Also, an ideal learning algorithm for diverse ranking would train a ranking model that directly optimizes the diversity evaluation measures with respect to the training data. Existing methods, however, either fail to model the marginal relevance, or train ranking models by minimizing loss functions that are only loosely related to the evaluation measures. To deal with this problem, we propose a novel learning algorithm under the framework of the Perceptron, which adopts a ranking model that maximizes marginal relevance at ranking time and can optimize any diversity evaluation measure in training. The algorithm, referred to as PAMM (Perceptron Algorithm using Measures as Margins), first constructs positive and negative diverse rankings for each training query, and then repeatedly adjusts the model parameters so that the margins between the positive and negative rankings are maximized. Experimental results on three benchmark datasets show that PAMM significantly outperforms the state-of-the-art baseline methods.
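A minimal Python sketch of the perceptron-with-margins idea, assuming a linear model over feature vectors of entire rankings; the feature map and the measure-based margin below are illustrative assumptions, not the released PAMM implementation.

```python
import numpy as np

def pamm_style_update(w, phi_pos, phi_neg, margin, lr=1.0):
    """One perceptron-style step: if the positive (more diverse) ranking does not
    beat the negative ranking by the required margin, move the weights toward the
    positive ranking's features and away from the negative ranking's features."""
    if np.dot(w, phi_pos) - np.dot(w, phi_neg) < margin:
        w = w + lr * (phi_pos - phi_neg)
    return w

# Toy usage: margin set to the gap in a diversity measure (e.g. alpha-NDCG)
w = np.zeros(3)
phi_pos, phi_neg = np.array([1.0, 0.5, 0.2]), np.array([0.3, 0.1, 0.9])
for _ in range(5):
    w = pamm_style_update(w, phi_pos, phi_neg, margin=0.4)
print(w)  # weights now separate the two rankings by at least the margin
```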


International ACM SIGIR Conference on Research and Development in Information Retrieval | 2016

Modeling Document Novelty with Neural Tensor Network for Search Result Diversification

Long Xia; Jun Xu; Yanyan Lan; Jiafeng Guo; Xueqi Cheng

Search result diversification has attracted considerable attention as a means to tackle the ambiguous or multi-faceted information needs of users. One of the key problems in search result diversification is novelty, that is, how to measure the novelty of a candidate document with respect to other documents. In heuristic approaches, predefined document similarity functions are directly utilized to define the novelty. In learning approaches, the novelty is characterized by a set of handcrafted features. Both the similarity functions and the features are difficult to design manually in the real world due to the complexity of modeling document novelty. In this paper, we propose to model the novelty of a document with a neural tensor network. Instead of manually defining the similarity functions or features, the new method automatically learns a nonlinear novelty function based on the preliminary representations of the candidate document and the other documents. New diverse learning-to-rank models can be derived under the relational learning-to-rank framework. To determine the model parameters, loss functions are constructed and optimized with stochastic gradient descent. Extensive experiments on three public TREC datasets show that the newly derived algorithms significantly outperform the baselines, including the state-of-the-art relational learning-to-rank models.
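A minimal numpy sketch of a neural-tensor-style novelty score, assuming d is the candidate document's vector and c a representation of the previously selected documents; the dimensions, the final linear layer and the random parameters are illustrative assumptions.

```python
import numpy as np

def ntn_novelty(d, c, W, V, b, u):
    """Neural-tensor-style score: one bilinear term d^T W[k] c per tensor slice,
    plus a linear term over the concatenation [d; c], passed through tanh and
    combined by a final linear layer u."""
    bilinear = np.array([d @ W[k] @ c for k in range(W.shape[0])])
    hidden = np.tanh(bilinear + V @ np.concatenate([d, c]) + b)
    return float(u @ hidden)

# Toy usage: 4-dimensional document vectors, 3 tensor slices, random parameters
rng = np.random.default_rng(0)
d, c = rng.normal(size=4), rng.normal(size=4)
W, V = rng.normal(size=(3, 4, 4)), rng.normal(size=(3, 8))
b, u = rng.normal(size=3), rng.normal(size=3)
print(ntn_novelty(d, c, W, V, b, u))
```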


Neurocomputing | 2015

Recommending high-utility search engine queries via a query-recommending model

JianGuo Wang; Joshua Zhexue Huang; Jiafeng Guo; Yanyan Lan

Query recommendation technology is of great importance for search engines, because it can assist users to find the information they require. Many query recommendation algorithms have been proposed, but they all aim to recommend similar queries and cannot guarantee the usefulness of the recommended queries. In this paper, we argue that it is more important to recommend high-utility queries, i.e., queries that would induce users to search for more useful information. For this purpose, we propose a query-recommending model to rank candidate queries according to their utilities and to recommend those that are useful to users. The query-recommending model ranks a candidate query by assessing the joint probability that the query is selected by the user, that the obtained search results are subsequently clicked by the user, and that the clicked search results ultimately satisfy the user's information need. Three utilities were defined to solve the model: query-level utility, representing the attractiveness of a query to the user; perceived utility, measuring the user's probability of clicking on the search results; and posterior utility, measuring the useful information obtained by the user from the clicked search results. The methods that were used to compute these three utilities from the query log data are presented. The experimental results that were obtained by using real query log data demonstrated that the proposed query-recommending model outperformed six other baseline methods in generating more useful recommendations.
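Reading the abstract literally, the recommendation score of a candidate query factorizes into the three utilities as a joint probability; a sketch of that decomposition in our own notation (the paper may parameterize it differently):

```latex
U(q) \;=\; \underbrace{P(q)}_{\text{query-level utility}} \;\times\; \underbrace{P(\mathrm{click} \mid q)}_{\text{perceived utility}} \;\times\; \underbrace{P(\mathrm{useful} \mid \mathrm{click},\, q)}_{\text{posterior utility}}
```

Candidate queries are then ranked by U(q), so a query is only recommended if it is likely to be selected, to attract clicks, and to yield genuinely useful results.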


International Joint Conference on Natural Language Processing | 2015

Learning Word Representations by Jointly Modeling Syntagmatic and Paradigmatic Relations

Fei Sun; Jiafeng Guo; Yanyan Lan; Jun Xu; Xueqi Cheng

Vector space representations of words have been widely used to capture fine-grained linguistic regularities, and have proven successful in various natural language processing tasks in recent years. However, existing models for learning word representations focus on either syntagmatic or paradigmatic relations alone. In this paper, we argue that it is beneficial to jointly model both relations so that we can not only encode different types of linguistic properties in a unified way, but also boost the representation learning through the mutual enhancement between these two types of relations. We propose two novel distributional models for word representation that use both syntagmatic and paradigmatic relations via a joint training objective. The proposed models are trained on a public Wikipedia corpus, and the learned representations are evaluated on word analogy and word similarity tasks. The results demonstrate that the proposed models can perform significantly better than all the state-of-the-art baseline methods on both tasks.
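Schematically, a joint training objective of the kind described above combines a syntagmatic term (words that co-occur in the same text region) with a paradigmatic term (words that share similar contexts); a generic sketch, not the paper's exact objective:

```latex
\mathcal{L}(\Theta) \;=\; \mathcal{L}_{\text{syntagmatic}}(\Theta) \;+\; \lambda\, \mathcal{L}_{\text{paradigmatic}}(\Theta)
```

Optimizing both terms over shared word vectors Θ is what lets each type of relation regularize and enhance the other.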

Collaboration


Dive into Yanyan Lan's collaborations.

Top Co-Authors

Jiafeng Guo | Chinese Academy of Sciences
Xueqi Cheng | Chinese Academy of Sciences
Jun Xu | Chinese Academy of Sciences
Liang Pang | Chinese Academy of Sciences
Shuzi Niu | Chinese Academy of Sciences
Yixing Fan | Chinese Academy of Sciences
Shengxian Wan | Chinese Academy of Sciences
Pengfei Wang | Chinese Academy of Sciences
Yadong Zhu | Chinese Academy of Sciences
Ruqing Zhang | Chinese Academy of Sciences