Yixing Fan
Chinese Academy of Sciences
Publication
Featured research published by Yixing Fan.
conference on information and knowledge management | 2016
Jiafeng Guo; Yixing Fan; Qingyao Ai; W. Bruce Croft
In recent years, deep neural networks have led to exciting breakthroughs in speech recognition, computer vision, and natural language processing (NLP) tasks. However, there have been few positive results of deep models on ad-hoc retrieval tasks. This is partially due to the fact that many important characteristics of the ad-hoc retrieval task have not yet been well addressed in deep models. In existing work using deep models, the ad-hoc retrieval task is typically formalized as a matching problem between two pieces of text and treated as equivalent to many NLP tasks such as paraphrase identification, question answering, and automatic conversation. However, we argue that the ad-hoc retrieval task is mainly about relevance matching while most NLP matching tasks concern semantic matching, and there are some fundamental differences between these two matching tasks. Successful relevance matching requires proper handling of the exact matching signals, query term importance, and diverse matching requirements. In this paper, we propose a novel deep relevance matching model (DRMM) for ad-hoc retrieval. Specifically, our model employs a joint deep architecture at the query term level for relevance matching. By using matching histogram mapping, a feed-forward matching network, and a term gating network, we can effectively deal with the three relevance matching factors mentioned above. Experimental results on two representative benchmark collections show that our model can significantly outperform some well-known retrieval models as well as state-of-the-art deep matching models.
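As a rough illustration of the matching histogram mapping mentioned in the abstract, the sketch below buckets the cosine similarities between one query term and every document term into fixed bins, with the last bin reserved for exact matches. The abstract does not specify the bin layout or the similarity function, so the function names, the number of bins, and the use of raw counts are all assumptions made for illustration and are not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def matching_histogram(query_vec, doc_vecs, num_bins=5):
    """Count query-term/document-term similarities per bin.

    Assumed layout (hypothetical, for illustration only): the first
    num_bins - 1 bins evenly split [-1, 1), and the last bin counts
    exact matches, i.e. similarity equal to 1.
    """
    hist = [0] * num_bins
    width = 2.0 / (num_bins - 1)
    for doc_vec in doc_vecs:
        sim = cosine(query_vec, doc_vec)
        if sim >= 1.0 - 1e-9:
            hist[-1] += 1  # exact-match bin
        else:
            idx = int((sim + 1.0) / width)
            idx = min(max(idx, 0), num_bins - 2)  # guard against rounding
            hist[idx] += 1
    return hist


if __name__ == "__main__":
    query_term = [1.0, 0.0]
    doc_terms = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [1.0, 1.0]]
    print(matching_histogram(query_term, doc_terms))
```

Because the histogram has a fixed length regardless of document length, one such vector per query term could be fed to a feed-forward network; how the counts are scaled before that step is left open here.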
conference on information and knowledge management | 2016
Jiafeng Guo; Yixing Fan; Qingyao Ai; W. Bruce Croft
A common limitation of many information retrieval (IR) models is that relevance scores are solely based on exact (i.e., syntactic) matching of words in queries and documents under the simple Bag-of-Words (BoW) representation. This not only leads to the well-known vocabulary mismatch problem, but also does not allow semantically related words to contribute to the relevance score. Recent advances in word embedding have shown that semantic representations for words can be efficiently learned by distributional models. A natural generalization is then to represent both queries and documents as Bag-of-Word-Embeddings (BoWE), which provides a better foundation for semantic matching than BoW. Based on this representation, we introduce a novel retrieval model by viewing the matching between queries and documents as a non-linear word transportation (NWT) problem. With this formulation, we define the capacity and profit of a transportation model designed for the IR task. We show that this transportation problem can be efficiently solved via pruning and indexing strategies. Experimental results on several representative benchmark datasets show that our model can outperform many state-of-the-art retrieval models as well as recently introduced word embedding-based models. We also conducted extensive experiments to analyze the effect of different settings on our semantic matching model.
conference on information and knowledge management | 2017
Yixing Fan; Jiafeng Guo; Yanyan Lan; Jun Xu; Liang Pang; Xueqi Cheng
When applying learning-to-rank algorithms to Web search, a large number of features are usually designed to capture the relevance signals. Most of these features are computed based on the extracted textual elements, link analysis, and user logs. However, Web pages are not solely linked texts, but have a structured layout organizing a large variety of elements in different styles. Such a layout can itself convey useful visual information, indicating the relevance of a Web page. For example, the query-independent layout (i.e., raw page layout) can help identify the page quality, while the query-dependent layout (i.e., page rendered with matched query words) can further provide rich structural information (e.g., size, position, and proximity) about the matching signals. However, such visual layout information has seldom been utilized in Web search in the past. In this work, we propose to learn rich visual features automatically from the layout of Web pages (i.e., Web page snapshots) for relevance ranking. Both query-independent and query-dependent snapshots are considered as the new inputs. We then propose a novel visual perception model, inspired by human visual search behaviors in page viewing, to extract the visual features. This model can be learned end-to-end together with traditional human-crafted features. We also show that such visual features can be efficiently acquired in the online setting with an extended inverted indexing scheme. Experiments on benchmark collections demonstrate that learning visual features from Web page snapshots can significantly improve the performance of relevance ranking in ad-hoc Web retrieval tasks.
international acm sigir conference on research and development in information retrieval | 2018
Yixing Fan; Jiafeng Guo; Yanyan Lan; Jun Xu; ChengXiang Zhai; Xueqi Cheng
Assessing relevance between a query and a document is challenging in ad-hoc retrieval due to its diverse patterns, i.e., a document could be relevant to a query as a whole or partially, as long as it provides sufficient information for the user's need. Such diverse relevance patterns require an ideal retrieval model to be able to assess relevance at the right granularity adaptively. Unfortunately, most existing retrieval models compute relevance at a single granularity, either document-wide or passage-level, or use a fixed combination strategy, restricting their ability to capture diverse relevance patterns. In this work, we propose a data-driven method to allow relevance signals at different granularities to compete with each other for final relevance assessment. Specifically, we propose a HIerarchical Neural maTching model (HiNT) which consists of two stacked components, namely a local matching layer and a global decision layer. The local matching layer focuses on producing a set of local relevance signals by modeling the semantic matching between a query and each passage of a document. The global decision layer accumulates local signals into different granularities and allows them to compete with each other to decide the final relevance score. Experimental results demonstrate that our HiNT model outperforms existing state-of-the-art retrieval models significantly on benchmark ad-hoc retrieval datasets.
conference on information and knowledge management | 2018
Ruqing Zhang; Jiafeng Guo; Yixing Fan; Yanyan Lan; Jun Xu; Huanhuan Cao; Xueqi Cheng
In this paper, we introduce and tackle the Question Headline Generation (QHG) task. The motivation comes from the investigation of a real-world news portal, where we find that news articles with question headlines often receive a much higher click-through ratio than those with non-question headlines. The QHG task can be viewed as a specific form of the Question Generation (QG) task, with the emphasis on creating a natural question from a given news article by taking the entire article as the answer. A good QHG model should thus be able to generate a question by summarizing the essential topics of an article. Based on this idea, we propose a novel dual-attention sequence-to-sequence model (DASeq2Seq) for the QHG task. Unlike traditional sequence-to-sequence models, which only employ the attention mechanism in the decoding phase for better generation, our DASeq2Seq further introduces a self-attention mechanism in the encoding phase to help generate a good summary of the article. We investigate two variants of the self-attention mechanism, namely global self-attention and distributed self-attention. In addition, we employ a vocabulary gate over both generic and question vocabularies to better capture the question patterns. Through offline experiments, we show that our approach can significantly outperform state-of-the-art question generation and headline generation models. Furthermore, we conduct an online evaluation using an A/B test to demonstrate the effectiveness of our approach.
China Conference on Information Retrieval | 2018
Tonglei Guo; Jiafeng Guo; Yixing Fan; Yanyan Lan; Jun Xu; Xueqi Cheng
The initial retrieval stage of information retrieval aims to generate as many relevant candidate documents as possible in a simple yet efficient way. Traditional term-based retrieval methods such as BM25 rely on the Bag-of-Words (BoW) representation, so they focus only on exact (i.e., syntactic) matching and do not account for semantically related words. This causes the typical vocabulary mismatch problem and reduces recall. Advances in distributed representations (i.e., embeddings) of words and documents provide an efficient way to measure the semantic relevance between words. Since embeddings can alleviate the vocabulary mismatch problem, they are suitable for the initial retrieval task. We conduct several experiments to compare term-based models with embedding-based models in terms of recall on three representative retrieval tasks (Web-QA, ad-hoc retrieval, and CQA). The results show that embedding-based and term-based methods are complementary, and that higher recall can be achieved by combining the two types of models based on either scores or ranking positions. We find that combining the two types of models based on ranking position usually performs better than combining them based on score. Furthermore, since queries and documents take different forms in different application scenarios, we observe that the relative performance of the two types of models is almost the same across scenarios, while their absolute performance differs significantly.
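The two combination strategies named in the abstract, by score and by ranking position, can be sketched as follows. The abstract does not say which fusion formulas were used, so this example substitutes two common stand-ins: min-max normalized score summation and reciprocal rank fusion with a smoothing constant `k`. The function names, the value of `k`, and the toy runs are hypothetical.

```python
def fuse_by_score(run_a, run_b):
    """Combine two runs (doc_id -> score) by summing min-max normalized scores."""

    def normalize(run):
        if not run:
            return {}
        lo, hi = min(run.values()), max(run.values())
        span = hi - lo
        # If all scores are equal, fall back to 0.0 to avoid dividing by zero.
        return {d: ((s - lo) / span if span else 0.0) for d, s in run.items()}

    norm_a, norm_b = normalize(run_a), normalize(run_b)
    fused = {
        d: norm_a.get(d, 0.0) + norm_b.get(d, 0.0)
        for d in set(norm_a) | set(norm_b)
    }
    # Sort by fused score (descending), breaking ties by doc id.
    return sorted(fused, key=lambda d: (-fused[d], d))


def fuse_by_rank(run_a, run_b, k=60):
    """Combine two runs by reciprocal rank fusion: sum of 1 / (k + rank)."""

    def ranks(run):
        ordered = sorted(run, key=lambda d: (-run[d], d))
        return {d: i + 1 for i, d in enumerate(ordered)}

    fused = {}
    for ranking in (ranks(run_a), ranks(run_b)):
        for doc_id, position in ranking.items():
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + position)
    return sorted(fused, key=lambda d: (-fused[d], d))


if __name__ == "__main__":
    # Toy runs: a term-based run (e.g. BM25 scores) and an embedding-based run.
    term_run = {"d1": 12.0, "d2": 8.0, "d3": 2.0}
    embedding_run = {"d2": 0.9, "d4": 0.8, "d1": 0.1}
    print(fuse_by_score(term_run, embedding_run))
    print(fuse_by_rank(term_run, embedding_run))
```

Rank-based fusion ignores the scale of the raw scores, which makes it a natural choice when the two models produce scores that are not directly comparable.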
China Conference on Information Retrieval | 2017
Yixing Fan; Jiafeng Guo; Yanyan Lan; Jun Xu; Xueqi Cheng
Academic reading plays an important role in researchers’ daily life. To alleviate the burden of seeking relevant literature from rapidly growing academic repositories, different kinds of recommender systems have been introduced in recent years. However, most existing work has focused on adopting traditional recommendation techniques, like content-based filtering or collaborative filtering, in the literature recommendation scenario. Little work has yet been done on analyzing academic reading behaviors to understand the reading patterns and information needs of real-world academic users, which would be a foundation for improving existing recommender systems or designing new ones. In this paper, we aim to tackle this problem by carrying out empirical analysis over large-scale academic access data, which can be viewed as a proxy of academic reading behaviors. We conduct global, group-based, and sequence-based analyses to address the following questions: (1) Are there any regularities in users’ academic reading behaviors? (2) Do users with different levels of activeness exhibit different information needs? (3) How can one’s future demands be correlated with his/her historical behaviors? By answering these questions, we not only unveil useful patterns and strategies for literature recommendation, but also identify some challenging problems for future development.
arXiv: Information Retrieval | 2017
Yixing Fan; Liang Pang; Jianpeng Hou; Jiafeng Guo; Yanyan Lan; Xueqi Cheng
meeting of the association for computational linguistics | 2018
Ruqing Zhang; Jiafeng Guo; Yixing Fan; Yanyan Lan; Jun Xu; Xueqi Cheng
arXiv: Information Retrieval | 2018
Yan Xiao; Jiafeng Guo; Yixing Fan; Yanyan Lan; Jun Xu; Xueqi Cheng