Jianwu Yang | Researchain

Publication

Featured researches published by Jianwu Yang.

International ACM SIGIR Conference on Research and Development in Information Retrieval | 2008

Multi-document summarization using cluster-based link analysis

Xiaojun Wan; Jianwu Yang

The Markov Random Walk model has recently been exploited for multi-document summarization by making use of the link relationships between sentences in the document set, under the assumption that all the sentences are indistinguishable from each other. However, a given document set usually covers a few topic themes, with each theme represented by a cluster of sentences. The topic themes are usually not equally important, and the sentences in an important theme cluster are deemed more salient than the sentences in a trivial theme cluster. This paper proposes the Cluster-based Conditional Markov Random Walk Model (ClusterCMRW) and the Cluster-based HITS Model (ClusterHITS) to fully leverage the cluster-level information. Experimental results on the DUC2001 and DUC2002 datasets demonstrate the effectiveness of the proposed summarization models. The results also demonstrate that the ClusterCMRW model is more robust than the ClusterHITS model with respect to different cluster numbers.
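As a rough illustration of the baseline this abstract starts from, the sketch below runs a damped random walk over a sentence similarity graph. It is not the ClusterCMRW or ClusterHITS model: it treats all sentences uniformly and ignores cluster-level information. The function name `random_walk_scores`, the damping factor, and the similarity values are assumptions made for the example.

```python
def random_walk_scores(sim, damping=0.85, iters=100, tol=1e-10):
    """Score sentences by a damped random walk on a similarity graph.

    sim is a square matrix (list of lists) of non-negative sentence
    similarities. All names and defaults here are illustrative assumptions.
    """
    n = len(sim)
    # Row-normalize similarities (ignoring self-links) into transition probabilities.
    trans = []
    for i in range(n):
        row = [sim[i][j] if i != j else 0.0 for j in range(n)]
        total = sum(row)
        # A sentence with no links jumps uniformly to any sentence.
        trans.append([v / total for v in row] if total > 0 else [1.0 / n] * n)

    scores = [1.0 / n] * n
    for _ in range(iters):
        new = [
            (1 - damping) / n
            + damping * sum(scores[i] * trans[i][j] for i in range(n))
            for j in range(n)
        ]
        delta = sum(abs(a - b) for a, b in zip(new, scores))
        scores = new
        if delta < tol:
            break
    return scores


# Hypothetical similarities: sentence 0 is strongly linked to both others.
example_sim = [
    [0.0, 0.8, 0.6],
    [0.8, 0.0, 0.1],
    [0.6, 0.1, 0.0],
]
example_scores = random_walk_scores(example_sim)
```

In this toy graph the best-connected sentence receives the highest score. A cluster-aware variant would additionally condition the transition probabilities on theme clusters, which this sketch does not attempt.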

North American Chapter of the Association for Computational Linguistics | 2006

Improved Affinity Graph Based Multi-Document Summarization

Xiaojun Wan; Jianwu Yang

This paper describes an affinity graph based approach to multi-document summarization. We incorporate a diffusion process to acquire semantic relationships between sentences, and then compute the information richness of sentences with a graph-ranking algorithm on differentiated intra-document and inter-document links between sentences. A greedy algorithm is employed to impose a diversity penalty on sentences, and the sentences with both high information richness and high information novelty are selected for the summary. Experimental results on task 2 of DUC 2002 and task 2 of DUC 2004 demonstrate that the proposed approach outperforms existing state-of-the-art systems.
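The greedy diversity-penalty step can be sketched as follows. This is a minimal illustration under assumptions, not the paper's exact procedure: the penalty form (subtracting the selected sentence's richness weighted by similarity), the function name `greedy_select`, and all numbers are hypothetical.

```python
def greedy_select(richness, sim, k, penalty=1.0):
    """Pick k sentences, penalizing those similar to already chosen ones.

    richness: list of information-richness scores, one per sentence.
    sim: square similarity matrix between sentences.
    The penalty formula below is an assumption made for illustration.
    """
    scores = list(richness)
    remaining = set(range(len(scores)))
    chosen = []
    while remaining and len(chosen) < k:
        # Take the sentence with the highest current (penalized) score.
        best = max(remaining, key=lambda i: scores[i])
        chosen.append(best)
        remaining.remove(best)
        # Lower the scores of sentences that overlap with the chosen one.
        for j in remaining:
            scores[j] -= penalty * sim[j][best] * richness[best]
    return chosen


# Hypothetical data: sentences 0 and 1 are near-duplicates, sentence 2 is novel.
example_richness = [0.9, 0.85, 0.5]
example_sim = [
    [1.0, 0.9, 0.1],
    [0.9, 1.0, 0.1],
    [0.1, 0.1, 1.0],
]
example_summary = greedy_select(example_richness, example_sim, k=2)
```

With these values the second pick is sentence 2 rather than sentence 1, because sentence 1 is heavily penalized for overlapping with the first pick.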

International ACM SIGIR Conference on Research and Development in Information Retrieval | 2007

CollabSum: exploiting multiple document clustering for collaborative single document summarizations

Xiaojun Wan; Jianwu Yang

Almost all existing methods summarize each single document separately, under the assumption that the documents are independent of each other. This paper proposes a novel framework called CollabSum for collaborative single document summarizations by making use of the mutual influences of multiple documents within a cluster context. In this study, CollabSum is implemented by first employing a clustering algorithm to obtain appropriate document clusters and then exploiting a graph-ranking based algorithm for collaborative document summarizations within each cluster. Both the within-document and cross-document relationships between sentences are incorporated in the algorithm. Experiments on the DUC2001 and DUC2002 datasets demonstrate the encouraging performance of the proposed approach. Different clustering algorithms have been investigated, and we find that the summarization performance depends positively on the quality of the document clusters.

Information Processing and Management | 2008

Towards a unified approach to document similarity search using manifold-ranking of blocks

Xiaojun Wan; Jianwu Yang; Jianguo Xiao

Document similarity search (i.e. query by example) aims to retrieve a ranked list of documents similar to a query document in a text corpus or on the Web. Most existing approaches to similarity search first compute the pairwise similarity score between each document and the query using a retrieval function or similarity measure (e.g. Cosine), and then rank the documents by the similarity scores. In this paper, we propose a novel retrieval approach based on manifold-ranking of document blocks (i.e. blocks of coherent text about a subtopic) to re-rank a small set of documents initially retrieved by some existing retrieval function. The proposed approach can make full use of the intrinsic global manifold structure of the document blocks by propagating the ranking scores between the blocks on a weighted graph. First, the TextTiling algorithm and the VIPS algorithm are employed to segment text documents and web pages, respectively, into blocks. Then, each block is assigned a ranking score by the manifold-ranking algorithm. Lastly, a document gets its final ranking score by fusing the scores of its blocks. Experimental results on the TDT data and the ODP data demonstrate that the proposed approach can significantly improve retrieval performance over baseline approaches. The document block is validated to be a better unit than the whole document in the manifold-ranking process.
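The score propagation step can be illustrated with the standard manifold-ranking iteration f = alpha * S * f + (1 - alpha) * y, where S is the symmetrically normalized weight matrix and y marks the query blocks. The sketch below assumes that standard form; the function name `manifold_rank`, the value of `alpha`, and the edge weights are illustrative, and the segmentation and score-fusion steps of the abstract are not shown.

```python
import math


def manifold_rank(weights, prior, alpha=0.6, iters=200):
    """Propagate ranking scores over a weighted block graph.

    weights: symmetric matrix of non-negative block similarities (zero diagonal).
    prior: initial scores y, e.g. 1.0 for query blocks and 0.0 otherwise.
    alpha and iters are assumed values chosen for the example.
    """
    n = len(weights)
    deg = [sum(row) for row in weights]
    # Symmetric normalization: S = D^(-1/2) W D^(-1/2).
    s = [
        [
            weights[i][j] / math.sqrt(deg[i] * deg[j])
            if deg[i] > 0 and deg[j] > 0
            else 0.0
            for j in range(n)
        ]
        for i in range(n)
    ]
    f = list(prior)
    for _ in range(iters):
        f = [
            alpha * sum(s[i][j] * f[j] for j in range(n)) + (1 - alpha) * prior[i]
            for i in range(n)
        ]
    return f


# Hypothetical chain of four blocks; block 0 comes from the query document.
example_weights = [
    [0.0, 0.9, 0.0, 0.0],
    [0.9, 0.0, 0.8, 0.0],
    [0.0, 0.8, 0.0, 0.7],
    [0.0, 0.0, 0.7, 0.0],
]
example_scores = manifold_rank(example_weights, prior=[1.0, 0.0, 0.0, 0.0])
```

In this chain, blocks farther from the query block receive progressively lower scores, yet even block 3, which has no direct link to the query, gets a nonzero score through propagation.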

Web Intelligence | 2006

Using Cross-Document Random Walks for Topic-Focused Multi-Document Summarization

Xiaojun Wan; Jianwu Yang; Jianguo Xiao

Graph-ranking based methods have been developed for generic multi-document summarization in recent years, and they make uniform use of the relationships between sentences to extract salient sentences. This paper proposes to integrate the relevance of the sentences to the specified topic into the graph-ranking based method for topic-focused multi-document summarization. The cross-document relationships and the within-document relationships between sentences are differentiated, and we apply the graph-ranking based method using each individual kind of sentence relationship and explore their relative importance for topic-focused multi-document summarization. Experimental results on DUC2003 and DUC2005 demonstrate the great importance of the cross-document relationships between sentences for topic-focused multi-document summarization. Even the approach based only on the cross-document sentence relationships can perform better than or at least as well as the approaches based on both kinds of sentence relationships.

International World Wide Web Conferences | 2007

Learning information diffusion process on the web

Xiaojun Wan; Jianwu Yang

Many text documents on the Web are not originally created but forwarded or copied from other source documents. The phenomenon of document forwarding or transmission between various web sites is denoted as Web information diffusion. This paper focuses on mining information diffusion processes for specific topics on the Web. A novel system called LIDPW is proposed to address this problem using machine learning techniques. The source site and source document of each document are identified, and the diffusion process, composed of a sequence of diffusion relationships, is visually presented to users. The effectiveness of LIDPW is validated on a real data set. A preliminary user study is performed, and the results show that LIDPW helps users monitor the information diffusion process of a specific topic and discover the diffusion start and diffusion center of the topic.

Asia Information Retrieval Symposium | 2015

Knowledge-Based Query Expansion in Real-Time Microblog Search

Chao Lv; Runwei Qiang; Feifan Fan; Jianwu Yang

Since the length of microblog texts, such as tweets, is strictly limited to 140 characters, traditional Information Retrieval techniques usually suffer severely from the vocabulary mismatch problem and cannot yield good performance in the microblogosphere. To address this critical challenge, we propose a new language modeling approach for microblog retrieval that infers various types of context information. In particular, we expand the query using knowledge terms derived from Freebase so that the expanded query can better reflect the information need. In addition, to further address users’ real-time information needs, we incorporate temporal evidence into the expansion methods so that the proposed approach can boost recent tweets in the retrieval results with respect to a given topic. Experimental results on two official TREC Twitter corpora demonstrate the significant superiority of our approach over baseline methods.

Asia Pacific Web Conference | 2006

WordRank-Based lexical signatures for finding lost or related web pages

Xiaojun Wan; Jianwu Yang

A lexical signature of a web page consists of several key words carefully chosen from the web page and is used to generate a robust hyperlink to find the web page when its URL fails. In this paper, we propose a novel method based on WordRank to compute lexical signatures, which can take into account the semantic relatedness between words and choose the most representative and salient words as the lexical signature. Experiments show that the DF-based lexical signatures are best at uniquely identifying web pages, and hybrid lexical signatures are good candidates for retrieving the desired web pages, while WordRank-based lexical signatures are best for retrieving highly relevant web pages when the desired web page cannot be extracted.

Conference on Information and Knowledge Management | 2015

Improving Microblog Retrieval with Feedback Entity Model

Feifan Fan; Runwei Qiang; Chao Lv; Jianwu Yang

When searching microblogs, users prefer queries that include terms representing specific entities. Meanwhile, tweets, though limited to 140 characters, often mention one or more entities. Entities, as an important part of tweets, usually convey rich information for modeling relevance from new perspectives. In this paper, we propose a feedback entity model and integrate it into an adaptive language modeling framework in order to improve the retrieval performance. The feedback entity model is estimated from the latest entity-associated tweets based upon a regularized maximum likelihood criterion. More specifically, we assume that the entity-associated tweets are generated by a mixture model, which consists of the entity model, the domain-specific language model and the collection language model. Experimental results on two public Text Retrieval Conference (TREC) Twitter corpora demonstrate the significant superiority of our approach over the state-of-the-art baselines.
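To make the mixture assumption concrete, the sketch below estimates an entity word distribution with plain EM, holding the domain-specific and collection language models fixed. It is an illustration only: it omits the regularization term mentioned in the abstract, and the function name `estimate_entity_model`, the mixture weights, the smoothing floor, and the toy word counts are all assumptions.

```python
def estimate_entity_model(counts, domain_lm, collection_lm,
                          weights=(0.5, 0.2, 0.3), iters=50):
    """Estimate an entity language model from a three-component mixture via EM.

    counts: word -> frequency in the entity-associated tweets.
    domain_lm, collection_lm: fixed background word distributions.
    weights: assumed mixing weights (entity, domain, collection).
    Unregularized EM is used here as a simplification.
    """
    lam_e, lam_d, lam_c = weights
    vocab = list(counts)
    total = sum(counts.values())
    # Start from the maximum likelihood estimate over the tweets.
    entity = {w: counts[w] / total for w in vocab}
    for _ in range(iters):
        # E-step: probability that each word occurrence came from the entity model.
        resp = {}
        for w in vocab:
            from_entity = lam_e * entity[w]
            background = (lam_d * domain_lm.get(w, 1e-9)
                          + lam_c * collection_lm.get(w, 1e-9))
            resp[w] = from_entity / (from_entity + background)
        # M-step: re-estimate the entity model from the attributed counts.
        norm = sum(counts[w] * resp[w] for w in vocab)
        entity = {w: counts[w] * resp[w] / norm for w in vocab}
    return entity


# Hypothetical counts and background models.
example_counts = {"the": 10, "obama": 6, "speech": 4}
example_domain = {"the": 0.3, "obama": 0.01, "speech": 0.05}
example_collection = {"the": 0.4, "obama": 0.001, "speech": 0.002}
example_model = estimate_entity_model(example_counts, example_domain, example_collection)
```

Because the background models already explain common words such as "the", EM shifts probability mass in the entity model toward words that are distinctive of the entity.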

Asia-Pacific Web Conference | 2010

Named Entity Resolution in Chinese News Comments on the Web

Liang Zong; Xiaojun Wan; Lihong Zhao; Jianwu Yang; Yuqian Wu

News comments are a new text genre that people use to express their opinions on recent news events. Unlike normal text corpora, news comments have some particular properties. Named entities in news comments are often written with misspelled words, informal abbreviations or aliases, which makes machine detection and understanding very difficult. This paper addresses the issue of named entity resolution in Chinese news comments on the web, which is a special case of coreference resolution. Traditional resolution algorithms have some limitations for this special task. In this paper, we first define the special task, and then propose a novel resolution algorithm with new features to improve the resolution performance. We manually labeled a benchmark dataset with 60 pieces of news and their corresponding comments downloaded from a popular Chinese news portal, and the experimental results on the dataset show that our algorithm is effective for this special task.