Donghong Ji | Researchain

Publication

Featured researches published by Donghong Ji.

international conference on machine learning and cybernetics | 2006

MSBGA: A Multi-Document Summarization System Based on Genetic Algorithm

Yanxiang He; Dexi Liu; Donghong Ji; Hua Yang; Chong Teng

The multi-document summarizer using genetic algorithm-based sentence extraction (MSBGA) treats summarization as an optimization problem in which the optimal summary is chosen from a set of candidate summaries formed by combining sentences from the original articles. To solve this NP-hard optimization problem, MSBGA adopts a genetic algorithm, which can search for the optimal summary globally. The evaluation function employs four features according to the criteria of a good summary: satisfied length, high coverage, high informativeness and low redundancy. To improve the accuracy of term frequency, MSBGA employs a novel method, TFS, which takes word sense into account when calculating term frequency. The experiments on DUC04 data show that the strategy is effective and that its ROUGE-1 score is only 0.55% lower than that of the best participant in DUC04.
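The idea of treating sentence extraction as a search over candidate summaries can be illustrated with a minimal sketch. This is not the MSBGA system: the fitness function below is a toy that covers only three of the four criteria (length, coverage, redundancy), uses plain word overlap rather than TFS, and all function names and parameters (`fitness`, `evolve`, `max_words`, `pop_size`) are assumptions made for illustration.

```python
import random
from itertools import combinations

def words(sentence):
    return set(sentence.lower().split())

def jaccard(a, b):
    return len(words(a) & words(b)) / len(words(a) | words(b))

def fitness(mask, sentences, max_words):
    """Toy score: reward coverage, penalize redundancy and excess length."""
    chosen = [s for s, bit in zip(sentences, mask) if bit]
    if not chosen:
        return 0.0
    vocab = set().union(*(words(s) for s in sentences))
    covered = set().union(*(words(s) for s in chosen))
    coverage = len(covered) / len(vocab)
    pairs = list(combinations(chosen, 2))
    redundancy = sum(jaccard(a, b) for a, b in pairs) / len(pairs) if pairs else 0.0
    length = sum(len(s.split()) for s in chosen)
    length_penalty = max(0, length - max_words) / max_words
    return coverage - redundancy - length_penalty

def evolve(sentences, max_words, pop_size=20, generations=40, seed=0):
    """Each chromosome is a 0/1 mask over sentences (needs at least 2 sentences)."""
    rng = random.Random(seed)
    n = len(sentences)
    population = [[rng.randint(0, 1) for _ in range(n)] for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=lambda m: fitness(m, sentences, max_words), reverse=True)
        parents = population[: pop_size // 2]      # keep the better half
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randint(1, n - 1)            # one-point crossover
            child = a[:cut] + b[cut:]
            flip = rng.randrange(n)                # single-bit mutation
            child[flip] = 1 - child[flip]
            children.append(child)
        population = parents + children
    best = max(population, key=lambda m: fitness(m, sentences, max_words))
    return [s for s, bit in zip(sentences, best) if bit]

docs = [
    "the cat sat on the mat",
    "the cat sat on the mat",
    "dogs chase cars in the street",
]
summary = evolve(docs, max_words=12)
```

Encoding a summary as a bitmask keeps crossover and mutation trivial; a real system would replace the toy fitness with the paper's four weighted features.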

international conference on computational linguistics | 2006

Multi-document summarization based on BE-Vector clustering

Dexi Liu; Yanxiang He; Donghong Ji; Hua Yang

In this paper, we propose a novel multi-document summarization strategy based on Basic Element (BE) vector clustering. In this strategy, sentences are represented by BE vectors instead of word or term vectors before clustering. A BE is a head-modifier-relation triple representation of sentence content, and using BEs as semantic units is more precise than using words. The BE-vector clustering is realized with the k-means clustering method, and a novel clustering analysis method is employed to automatically detect the number of clusters, K. The experimental results indicate the superiority of the proposed strategy over the traditional summarization strategy based on word vector clustering. The summaries generated by the proposed strategy achieve a ROUGE-1 score of 0.37291 on DUC04 task 2, which is better than the score of those generated by the traditional strategy (0.36936).
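A minimal sketch of clustering sentences by BE vectors follows. It assumes the BE triples have already been extracted (the triples below are hand-written, hypothetical examples), uses cosine similarity, and takes K as given; the paper's automatic detection of K is not reproduced here.

```python
import math
from collections import Counter

def cosine(u, v):
    dot = sum(u[k] * v[k] for k in u if k in v)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def centroid(vectors):
    total = Counter()
    for v in vectors:
        total.update(v)
    return {k: c / len(vectors) for k, c in total.items()}

def kmeans(vectors, k, iterations=10):
    """Cosine k-means; seeded with the first k vectors so the run is deterministic."""
    centers = [dict(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        labels = [max(range(k), key=lambda c: cosine(v, centers[c])) for v in vectors]
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:
                centers[c] = centroid(members)
    return labels

# Each sentence is a bag of (head, modifier, relation) triples (hypothetical data).
sentence_bes = [
    [("flood", "city", "nn"), ("hit", "flood", "subj")],
    [("vote", "senate", "nn"), ("pass", "bill", "obj")],
    [("hit", "flood", "subj"), ("damage", "home", "obj")],
    [("pass", "bill", "obj"), ("sign", "president", "subj")],
]
vectors = [Counter(bes) for bes in sentence_bes]
labels = kmeans(vectors, k=2)
```

Because each vector dimension is a whole triple, two sentences only count as similar when they share a head, modifier and relation together, which is stricter than sharing individual words.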

international conference on machine learning and cybernetics | 2006

A Novel Chinese Multi-Document Summarization Using Clustering Based Sentence Extraction

Dexi Liu; Yan-xiang He; Donghong Ji; Hua Yang

This paper proposes a strategy for Chinese multi-document summarization based on clustering and sentence extraction. It adopts the term vector to represent the linguistic units in Chinese documents, which to a certain extent achieves higher representation quality than the traditional word-based vector space model. As for clustering, we propose two heuristics to automatically detect the proper number of clusters: the first makes full use of the summary length fixed by the user; the second is a stability method, which has been applied to other unsupervised learning problems. We also discuss a global searching method for sentence selection from the clusters. To evaluate our summarization strategy, an extrinsic evaluation method based on a classification task is adopted. Experimental results on a news document set show that the new strategy can significantly enhance the performance of Chinese multi-document summarization.

artificial intelligence and computational intelligence | 2009

A Study on Pseudo Labeled Document Constructed for Document Re-ranking

Chong Teng; Yanxiang He; Donghong Ji; Guimin Lin; Zhewei Mai

Document re-ranking is an intermediate module in an information retrieval system. Its goal is to place documents that are more relevant to the query at higher ranks, which benefits automatic query expansion and improves the performance of the retrieval system as a whole. In this paper, we construct a pseudo labeled document based on the pseudo-relevance feedback principle, and discuss how re-ranking performance depends on two parameters: the number of top documents taken from the initial retrieval and the number of key terms drawn from those documents when constructing the pseudo labeled document. Experiments show that the constructed pseudo labeled document is greatly helpful to document re-ranking, which is the main contribution of the paper. Moreover, experiments show that re-ranking performance decreases as the number of top documents increases, and increases as the number of key terms from these documents increases.
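The two parameters discussed in this abstract can be made concrete with a small sketch. It assumes key terms are simply the most frequent terms in the top documents and that re-ranking uses cosine similarity to the pseudo labeled document; the paper's actual term selection and scoring are not given in the abstract, and the names `build_pseudo_document`, `top_docs` and `key_terms` are hypothetical.

```python
import math
from collections import Counter

def term_vector(text):
    return Counter(text.lower().split())

def cosine(u, v):
    dot = sum(u[k] * v[k] for k in u if k in v)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def build_pseudo_document(ranked_docs, top_docs, key_terms):
    """Pool the top-ranked documents and keep their most frequent terms."""
    counts = Counter()
    for doc in ranked_docs[:top_docs]:
        counts.update(term_vector(doc))
    return dict(counts.most_common(key_terms))

def rerank(ranked_docs, top_docs=2, key_terms=2):
    """Re-order the initial ranking by similarity to the pseudo labeled document."""
    pseudo = build_pseudo_document(ranked_docs, top_docs, key_terms)
    # sorted() is stable, so ties keep their initial-retrieval order.
    return sorted(ranked_docs, key=lambda d: cosine(term_vector(d), pseudo), reverse=True)

initial_ranking = [
    "solar panel efficiency",
    "solar panel cost",
    "football match report",
    "solar panel installation guide",
]
reranked = rerank(initial_ranking)
```

In this toy ranking, the off-topic third document drops to the bottom because it shares no terms with the pseudo labeled document built from the top two results.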

pacific rim international conference on artificial intelligence | 2006

Chinese multi-document summarization using adaptive clustering and global search strategy

Dexi Liu; Yanxiang He; Donghong Ji; Hua Yang; Zhao Wu

Multi-document summarization has become a key technology in natural language processing. This paper proposes a strategy for Chinese multi-document summarization based on clustering and sentence extraction. As for clustering, we propose two heuristics to automatically detect the proper number of clusters: the first makes full use of the summary length fixed by the user; the second is a stability method, which has been applied to other unsupervised learning problems. We also discuss a global searching method for sentence selection from the clusters. To evaluate our summarization strategy, an extrinsic evaluation method based on a classification task is adopted. Experimental results on a news document set show that the new strategy can significantly enhance the performance of Chinese multi-document summarization.

web information systems engineering | 2006

A hybrid sentence ordering strategy in multi-document summarization

Yanxiang He; Dexi Liu; Hua Yang; Donghong Ji; Chong Teng; Wenqing Qi

In extractive summarization, a proper arrangement of extracted sentences must be found if we want to generate a logical, coherent and readable summary. This issue is particularly relevant in multi-document summarization. In this paper, several existing methods, each of which generates a reference relation, are combined through a linear combination of the resulting relations. We use four types of relationships between sentences (chronological, positional, topical and dependent relations) to build a graph model in which the vertices are sentences and the edges are weighted relationships of the four types. We then apply a variant of PageRank to obtain the ordering of sentences for multi-document summaries. We tested our hybrid model with two automatic methods: distance to manual ordering and ROUGE score. Evaluation results show a significant improvement in ordering over strategies that omit some of the relations. The results also indicate that this hybrid model is robust for articles of different genres, as used in DUC2004 and DUC2005.
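The combination of relations and the graph ranking step can be sketched as follows. This is one plausible reading, not the paper's algorithm: it assumes each relation is a matrix where entry [i][j] is the evidence that sentence i should precede sentence j, and it runs a standard PageRank in which each sentence passes its score back to its predecessors. The function names and weights are hypothetical.

```python
def combine(relations, weights):
    """Linear combination of relation matrices into one weighted graph."""
    n = len(relations[0])
    return [
        [sum(w * r[i][j] for w, r in zip(weights, relations)) for j in range(n)]
        for i in range(n)
    ]

def pagerank_order(precedes, damping=0.85, iterations=50):
    """precedes[i][j] > 0 means evidence that sentence i comes before sentence j.

    Each sentence j distributes its score to its predecessors, so sentences
    that precede many others accumulate score and are placed first.
    """
    n = len(precedes)
    score = [1.0 / n] * n
    for _ in range(iterations):
        new = [(1 - damping) / n] * n
        for j in range(n):
            incoming = sum(precedes[i][j] for i in range(n))
            for i in range(n):
                if incoming > 0:
                    new[i] += damping * score[j] * precedes[i][j] / incoming
                else:
                    new[i] += damping * score[j] / n  # no predecessors: spread evenly
        score = new
    return sorted(range(n), key=lambda i: score[i], reverse=True)

# Two of the four relation types, for three extracted sentences (toy data).
chronological = [[0, 1, 0], [0, 0, 0], [1, 1, 0]]
positional = [[0, 1, 0], [0, 0, 0], [1, 0, 0]]
graph = combine([chronological, positional], weights=[0.5, 0.5])
order = pagerank_order(graph)
```

Here both relations agree that sentence 2 precedes sentence 0, which precedes sentence 1, so the resulting order is sentence 2, then 0, then 1. Adding topical and dependent matrices only requires passing more entries to `combine`.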

web information systems engineering | 2016

Twitter Normalization via 1-to-N Recovering

Yafeng Ren; Jiayuan Deng; Donghong Ji

Twitter messages are written in an informal style, which hinders many information retrieval and natural language processing applications. Existing normalization systems have two major drawbacks. The first is that these methods largely require large-scale annotated training data. The second is that these systems assume that a nonstandard token is recovered to one standard word. However, many nonstandard tokens should be recovered to two or more standard words, so the problem remains highly challenging. To address these issues, we propose an unsupervised normalization system based on context similarity. The proposed system does not require any annotated data, and it can recover a nonstandard token to one or more standard words. Results show that the proposed approach achieves state-of-the-art performance.

international conference on intelligent computing | 2015

Semantic Role Labeling for Biomedical Corpus Using Maximum Entropy Classifier

Lei Han; Donghong Ji; Han Ren

Semantic role labeling (SRL) is a natural language processing (NLP) task that finds shallow semantic representations from sentences. In this paper, we construct a biomedical proposition bank and train a biomedical semantic role labeling system that can be used to facilitate relation extraction and information retrieval in the biomedical domain. Firstly, we construct a proposition bank on the basis of the GENIA TreeBank following the Penn PropBank annotation. Secondly, we use GenPropBank to train a biomedical SRL system, which uses maximum entropy as a classifier. Our experimental results show that a newswire SRL system that achieves an F1 of 85.56% in the newswire domain can only maintain an F1 of 65.43% when ported to the biomedical domain. By using our annotated biomedical corpus, we can increase that F1 by 19.2%.

international conference on intelligent computing | 2009

Entropy-based clustering for improving document re-ranking

Chong Teng; Yanxiang He; Donghong Ji; Cheng Zhou; Yixuan Geng; Shu Chen

Document re-ranking sits between initial retrieval and query expansion in an information retrieval system. In this paper, we propose an entropy-based clustering approach for document re-ranking. The value of within-cluster entropy determines whether two classes should be merged, and the value of between-cluster entropy determines how many clusters are reasonable. The next step is to find a suitable cluster in the clustering result, construct a pseudo labeled document from it, and conduct document re-ranking as in our previous method. We focus on the clustering strategy for documents after initial retrieval. Experiments with NTCIR-5 data show that the approach can improve on the performance of initial retrieval and helps improve the quality of document re-ranking.

IEEE International Workshop on Semantic Computing and Systems | 2008

A Research on Connectivity of Undirected Basic Element Complex Network of Topic-Related Documents

Hua Yang; Yanxiang He; Donghong Ji; Dexi Liu

We study the connectivity of the undirected Basic Element complex network generated from a topic-related document set and obtain key properties of the network. The networks are mostly connected, apart from a few components that contain a very small proportion of the network's nodes, and the key semantic units do not occur in these small components. These properties will simplify the work if network methods are applied to natural language processing tasks.
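The kind of connectivity analysis described here can be illustrated with a short sketch. It assumes each Basic Element contributes an undirected edge between its head and modifier (the abstract does not specify how nodes and edges are defined), and the edge list below is hypothetical toy data.

```python
from collections import defaultdict, deque

def connected_components(edges):
    """Return the components of an undirected graph, largest first."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)
    seen, components = set(), []
    for start in graph:
        if start in seen:
            continue
        seen.add(start)
        queue, component = deque([start]), {start}
        while queue:                       # breadth-first search from start
            node = queue.popleft()
            for neighbor in graph[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    component.add(neighbor)
                    queue.append(neighbor)
        components.append(component)
    return sorted(components, key=len, reverse=True)

# (head, modifier) pairs taken from hypothetical BE triples.
be_edges = [("quake", "hit"), ("hit", "city"), ("city", "rescue"), ("stock", "fell")]
components = connected_components(be_edges)
total_nodes = sum(len(c) for c in components)
largest_ratio = len(components[0]) / total_nodes
```

The ratio of nodes in the largest component is the quantity of interest: a value close to 1 corresponds to the abstract's observation that the network is mostly connected, with only small leftover components.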
