Hongsong Li
Microsoft
Publication
Featured research published by Hongsong Li.
international conference on management of data | 2012
Wentao Wu; Hongsong Li; Haixun Wang; Kenny Q. Zhu
Knowledge is indispensable to understanding. The ongoing information explosion highlights the need to enable machines to better understand electronic text in human language. Much work has been devoted to creating universal ontologies or taxonomies for this purpose. However, none of the existing ontologies has the needed depth and breadth for universal understanding. In this paper, we present a universal, probabilistic taxonomy that is more comprehensive than any existing one. It contains 2.7 million concepts harnessed automatically from a corpus of 1.68 billion web pages. Unlike traditional taxonomies that treat knowledge as black and white, it uses probabilities to model the inconsistent, ambiguous, and uncertain information it contains. We present details of how the taxonomy is constructed, its probabilistic modeling, and its potential applications in text understanding.
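The probabilistic treatment the abstract describes can be illustrated with a minimal sketch. The extraction counts and the scoring function below are hypothetical toy values, not the paper's actual data or model; the idea is only that a concept's typicality for an instance is estimated from how often the (concept, instance) pair is observed in extraction patterns.

```python
from collections import defaultdict

# Hypothetical toy extractions, e.g. from "C such as I" patterns:
# (concept, instance, observation count).
extractions = [
    ("fruit", "apple", 12),
    ("company", "apple", 8),
    ("company", "microsoft", 20),
]

pair_count = defaultdict(int)
instance_count = defaultdict(int)
for concept, instance, n in extractions:
    pair_count[(instance, concept)] += n
    instance_count[instance] += n

def p_concept_given_instance(concept, instance):
    """P(concept | instance): how typical the concept is for the instance."""
    return pair_count[(instance, concept)] / instance_count[instance]

print(p_concept_given_instance("fruit", "apple"))    # 0.6
print(p_concept_given_instance("company", "apple"))  # 0.4
```

Because "apple" is observed more often under "fruit" than under "company", the taxonomy can rank both senses by probability instead of committing to one.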
international joint conference on artificial intelligence | 2011
Yangqiu Song; Haixun Wang; Zhongyuan Wang; Hongsong Li; Weizhu Chen
Most text mining tasks, including clustering and topic detection, are based on statistical methods that treat text as bags of words. Semantics in the text is largely ignored in the mining process, and mining results often have low interpretability. One particular challenge faced by such approaches lies in short text understanding, as short texts lack enough content from which statistical conclusions can be drawn easily. In this paper, we improve text understanding by using a probabilistic knowledge base that is as rich as our mental world in terms of the concepts (of worldly facts) it contains. We then develop a Bayesian inference mechanism to conceptualize words and short text. We conducted comprehensive experiments on conceptualizing textual terms and on clustering short pieces of text such as Twitter messages. Compared to purely statistical methods such as latent semantic topic modeling or methods that use existing knowledge bases (e.g., WordNet, Freebase, and Wikipedia), our approach brings significant improvements in short text understanding as reflected by the clustering accuracy.
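A toy sketch of Bayesian conceptualization in the spirit of this abstract: given a short text, pick the concept that best explains its terms. The knowledge base, probabilities, and the naive-Bayes-style scoring below are illustrative assumptions, not the paper's actual inference mechanism.

```python
import math

# Hypothetical toy knowledge base: P(term | concept) and concept priors.
# All numbers are made up for illustration.
likelihood = {
    "fruit":   {"apple": 0.5, "banana": 0.5},
    "company": {"apple": 0.4, "google": 0.6},
}
prior = {"fruit": 0.5, "company": 0.5}

def conceptualize(terms, floor=1e-6):
    """Pick the concept maximizing P(c) * prod_t P(t | c), in log space."""
    best, best_score = None, float("-inf")
    for concept, p_term in likelihood.items():
        score = math.log(prior[concept])
        score += sum(math.log(p_term.get(t, floor)) for t in terms)
        if score > best_score:
            best, best_score = concept, score
    return best

print(conceptualize(["apple", "banana"]))  # fruit
print(conceptualize(["apple", "google"]))  # company
```

Note how the ambiguous term "apple" is disambiguated by its context terms, which is exactly why conceptualization helps where bag-of-words statistics on a short text would not.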
international conference on management of data | 2010
Changping Wang; Jianmin Wang; Xuemin Lin; Wei Wang; Haixun Wang; Hongsong Li; Wanpeng Tian; Jun Xu; Rui Li
Near duplicate detection benefits many applications, e.g., online news selection over the Web via keyword search. The purpose of this demo is to show the design and implementation of MapDupReducer, a MapReduce-based system capable of detecting near duplicates over massive datasets efficiently.
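The map/group/verify pattern behind such a system can be sketched in miniature. This is not MapDupReducer's actual algorithm; it is a single-machine analogue, with hypothetical documents, in which the "map" step emits (shingle, doc id) pairs, the "reduce" step groups documents sharing a shingle into candidate pairs, and an exact Jaccard check verifies the candidates.

```python
from collections import defaultdict
from itertools import combinations

def shingles(text, k=2):
    """Set of k-word shingles of a document."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    return len(a & b) / len(a | b)

docs = {
    "d1": "the quick brown fox jumps over the lazy dog",
    "d2": "the quick brown fox jumps over a lazy dog",
    "d3": "completely different news article text here today",
}

# "Map": emit (shingle, doc_id). "Reduce": group doc ids per shingle,
# so only documents sharing at least one shingle become candidates.
index = defaultdict(set)
for doc_id, text in docs.items():
    for s in shingles(text):
        index[s].add(doc_id)

candidates = set()
for ids in index.values():
    candidates.update(combinations(sorted(ids), 2))

# Verify candidates with an exact Jaccard similarity check.
near_dups = [(a, b) for a, b in sorted(candidates)
             if jaccard(shingles(docs[a]), shingles(docs[b])) >= 0.5]
print(near_dups)  # [('d1', 'd2')]
```

The grouping step is what makes the approach scale: totally dissimilar documents (d3) never form a candidate pair, so the expensive pairwise verification is avoided for them.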
IEEE Transactions on Knowledge and Data Engineering | 2018
Zhixu Li; Ying He; Binbin Gu; An Liu; Hongsong Li; Haixun Wang; Xiaofang Zhou
Semantic drift is a common problem in iterative information extraction. Previous approaches for minimizing semantic drift may incur substantial loss in recall. We observe that most semantic drifts are introduced by a small number of questionable extractions in the earlier rounds of iterations. These extractions subsequently introduce a large number of questionable results, which lead to the semantic drift phenomenon. We call these questionable extractions Drifting Points (DPs). If erroneous extractions are the “symptoms” of semantic drift, then DPs are the “causes” of semantic drift. In this paper, we propose a method to minimize semantic drift by identifying the DPs and removing the effect introduced by them. We use isA (concept-instance) extraction as an example to describe our approach to cleaning information extraction errors caused by semantic drift, but we perform experiments on several relation extraction processes over three large real-world data extraction collections. The experimental results show that our DP cleaning method enables us to clean around 90 percent of incorrect instances or patterns with about 90 percent precision, which outperforms the previous approaches we compare with.
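The DP idea can be sketched with a toy provenance graph. The graph, the audit labels, and the threshold below are hypothetical illustrations, not the paper's algorithm or data: each extraction records which earlier extraction's pattern produced it, and an early extraction whose downstream results are mostly wrong is flagged as a Drifting Point and removed together with everything derived from it.

```python
def descendants(node, parent_of):
    """All extractions transitively derived from `node`'s pattern."""
    out, frontier = set(), [node]
    while frontier:
        cur = frontier.pop()
        for child, parent in parent_of.items():
            if parent == cur and child not in out:
                out.add(child)
                frontier.append(child)
    return out

# Hypothetical provenance graph: child extraction -> the earlier
# extraction whose pattern produced it. Labels come from a small audit.
parent_of = {"e3": "e2", "e7": "e2", "e4": "e3", "e5": "e3"}
incorrect = {"e3", "e4", "e5"}
all_nodes = ["e2", "e3", "e4", "e5", "e7"]

def is_drifting_point(node, threshold=0.8):
    """Flag extractions whose downstream results are mostly wrong."""
    d = descendants(node, parent_of)
    return bool(d) and len(d & incorrect) / len(d) >= threshold

dps = [n for n in all_nodes if is_drifting_point(n)]
removed = set(dps)
for dp in dps:
    removed |= descendants(dp, parent_of)  # drop the DP and its effects
remaining = [n for n in all_nodes if n not in removed]

print(dps)        # ['e3']
print(remaining)  # ['e2', 'e7']
```

Removing the single cause e3 eliminates three incorrect extractions at once while the correct ones survive, which mirrors the abstract's point that treating causes rather than symptoms preserves recall.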
IEEE Transactions on Knowledge and Data Engineering | 2017
Wentao Wu; Hongsong Li; Haixun Wang; Kenny Q. Zhu
Knowledge acquisition is an iterative process. Most prior work used syntactic bootstrapping approaches, while semantic bootstrapping was proposed recently. Unlike syntactic bootstrapping, semantic bootstrapping bootstraps directly on knowledge rather than on syntactic patterns; that is, it uses existing knowledge to understand the text and acquire more knowledge. It has been shown that semantic bootstrapping can achieve superb precision while retaining good recall on extracting isA relations. Nonetheless, the working mechanism of semantic bootstrapping remains elusive. In this extended abstract, we present a theoretical analysis as well as an experimental study to provide deeper insights into semantic bootstrapping.
extending database technology | 2014
Zhixu Li; Hongsong Li; Haixun Wang; Yi Yang; Xiangliang Zhang; Xiaofang Zhou
Semantic drift is a common problem in iterative information extraction. Previous approaches for minimizing semantic drift may incur substantial loss in recall. We observe that most semantic drifts are introduced by a small number of questionable extractions in the earlier rounds of iterations. These extractions subsequently introduce a large number of questionable results, which lead to the semantic drift phenomenon. We call these questionable extractions Drifting Points (DPs). If erroneous extractions are the “symptoms” of semantic drift, then DPs are the “causes” of semantic drift. In this paper, we propose a method to minimize semantic drift by identifying the DPs and removing the effect introduced by them. We use isA (concept-instance) extraction as an example to demonstrate the effectiveness of our approach in cleaning information extraction errors caused by semantic drift. We perform experiments on an iterative isA relation extraction, in which 90.5 million isA pairs were automatically extracted from 1.6 billion web documents with low precision. The experimental results show that our DP cleaning method enables us to clean more than 90% of incorrect instances with 95% precision, which outperforms the previous approaches we compare with. As a result, our method greatly improves the precision of this large isA data set from less than 50% to over 90%.
Archive | 2011
Wentao Wu; Hongsong Li; Haixun Wang; Kenny Qili Zhu
Transactions of the Association for Computational Linguistics | 2013
Hongsong Li; Kenny Q. Zhu; Haixun Wang
international conference on data engineering | 2013
Zhixian Zhang; Kenny Q. Zhu; Haixun Wang; Hongsong Li
Archive | 2011
Hongsong Li; Haixun Wang