Publication


Featured research published by Cheng Niu.


North American Chapter of the Association for Computational Linguistics | 2003

InfoXtract: a customizable intermediate level information extraction engine

Rohini K. Srihari; Wei Li; Cheng Niu; Thomas L. Cornell

Information extraction (IE) systems assist analysts in assimilating information from electronic documents. This paper focuses on IE tasks designed to support information discovery applications. Since information discovery implies examining large volumes of documents drawn from various sources for situations that cannot be anticipated a priori, such applications require IE systems to have breadth as well as depth. This implies the need for a domain-independent IE system that can easily be customized for specific domains: end users must be given tools to customize the system on their own. It also implies the need for defining new intermediate-level IE tasks that are richer than the subject-verb-object (SVO) triples produced by shallow systems, yet not as complex as the domain-specific scenarios defined by the Message Understanding Conference (MUC). This paper describes a robust, scalable IE engine designed for such purposes. It describes new IE tasks, such as entity profiles and concept-based general events, which represent realistic near-term goals while providing useful, actionable information. These new tasks also facilitate the correlation of output from an IE engine with existing structured data. Benchmarking results for the core engine and for applications utilizing the engine are presented.
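
As a concrete, hedged illustration of the entity profile task mentioned above, the sketch below shows one way such a template might consolidate aliases, relationships, and events across documents; the field names and merge logic are assumptions for illustration, not the engine's actual schema.

```python
# Minimal sketch of an "entity profile" template as described above. The field
# names and merge logic are illustrative assumptions, not the engine's schema.
from dataclasses import dataclass, field
from typing import List

@dataclass
class EntityProfile:
    canonical_name: str                                      # preferred surface form
    entity_type: str                                         # PERSON, ORGANIZATION, LOCATION, ...
    aliases: List[str] = field(default_factory=list)         # name variants, resolved anaphora
    relationships: List[str] = field(default_factory=list)   # e.g. affiliation, position, age
    events: List[str] = field(default_factory=list)          # concept-based events involving the entity

    def merge(self, other: "EntityProfile") -> None:
        """Consolidate a profile extracted from another document into this one."""
        for attr in ("aliases", "relationships", "events"):
            seen = getattr(self, attr)
            seen.extend(item for item in getattr(other, attr) if item not in seen)
```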


Meeting of the Association for Computational Linguistics | 2004

Weakly Supervised Learning for Cross-document Person Name Disambiguation Supported by Information Extraction

Cheng Niu; Wei Li; Rohini K. Srihari

It is fairly common that different people share the same name. In tracking person entities in a large document pool, it is important to determine whether multiple mentions of the same name across documents refer to the same entity or not. Previous approaches to this problem measure context similarity based only on co-occurring words. This paper presents a new algorithm that uses information extraction support in addition to co-occurring words. A learning scheme with minimal supervision is developed within the Bayesian framework. Maximum entropy modeling is then used to represent the probability distribution of context similarities based on heterogeneous features. Statistical annealing is applied to derive the final entity coreference chains by globally fitting the pairwise context similarities. Benchmarking shows that the new approach significantly outperforms the existing algorithm, by 25 percentage points in overall F-measure.
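
The overall shape of the approach, scoring context pairs on heterogeneous features and then merging mentions into coreference chains, might look roughly like the sketch below; the hand-weighted overlap scorer stands in for the maximum entropy model and greedy threshold-based merging stands in for statistical annealing.

```python
# Sketch of the cross-document disambiguation pipeline described above. The
# hand-weighted overlap scorer stands in for the maximum entropy model, and
# threshold-based merging stands in for statistical annealing; both are
# simplifications for illustration.
from itertools import combinations
from typing import Dict, List, Set

def pairwise_score(feats_a: Dict[str, Set[str]], feats_b: Dict[str, Set[str]],
                   weights: Dict[str, float]) -> float:
    """Combine overlap of heterogeneous feature sets (co-occurring words,
    extracted relationships, events, ...) into a single similarity score."""
    score = 0.0
    for feat_type, w in weights.items():
        a, b = feats_a.get(feat_type, set()), feats_b.get(feat_type, set())
        if a or b:
            score += w * len(a & b) / len(a | b)
    return score

def coreference_chains(mentions: List[Dict[str, Set[str]]], weights: Dict[str, float],
                       threshold: float = 0.3) -> List[Set[int]]:
    """Greedily merge mentions whose pairwise similarity clears the threshold."""
    chains: List[Set[int]] = [{i} for i in range(len(mentions))]
    for i, j in combinations(range(len(mentions)), 2):
        if pairwise_score(mentions[i], mentions[j], weights) >= threshold:
            chain_i = next(c for c in chains if i in c)
            chain_j = next(c for c in chains if j in c)
            if chain_i is not chain_j:
                chain_i |= chain_j
                chains.remove(chain_j)
    return chains
```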


Meeting of the Association for Computational Linguistics | 2003

A Bootstrapping Approach to Named Entity Classification Using Successive Learners

Cheng Niu; Wei Li; Jihong Ding; Rohini K. Srihari

This paper presents a new bootstrapping approach to named entity (NE) classification. The approach only requires a few common noun/pronoun seeds that correspond to the concept for the target NE type, e.g. he/she/man/woman for PERSON NE. The entire bootstrapping procedure is implemented as training two successive learners: (i) a decision list is used to learn parsing-based, high-precision NE rules; (ii) a Hidden Markov Model is then trained to learn string sequence-based NE patterns. The second learner uses the training corpus automatically tagged by the first learner. The resulting NE system approaches supervised NE performance for some NE types. The system also demonstrates intuitive support for tagging user-defined NE types. The differences between this approach and co-training-based NE bootstrapping are also discussed.
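
A minimal skeleton of the two-learner bootstrapping procedure might look like the sketch below; only the control flow reflects the description above, while the decision-list and HMM learners are passed in as hypothetical callables because the first learner operates on parsing structures not modeled here.

```python
# Skeleton of the two-learner bootstrapping procedure described above. Only the
# control flow follows the description; the decision-list and HMM learners are
# passed in as callables because the real first learner works on parsing
# structures that are not modeled here.
from typing import Callable, Dict, List, Set, Tuple

# Concept-based seeds: common nouns/pronouns standing for the target NE concept.
SEEDS: Dict[str, Set[str]] = {"PERSON": {"he", "she", "man", "woman"}}

def bootstrap_ne_tagger(
    corpus: List[List[str]],
    train_decision_list: Callable[[List[List[str]], Dict[str, Set[str]]], Callable],
    train_hmm: Callable[[List[List[Tuple[str, str]]]], Callable],
) -> Callable:
    # Learner 1: high-precision, parsing-based NE rules grown from the seeds.
    rule_tagger = train_decision_list(corpus, SEEDS)
    # Auto-annotate the raw corpus with learner 1 as (token, NE-tag) pairs.
    auto_tagged = [[(tok, rule_tagger(sent, i)) for i, tok in enumerate(sent)]
                   for sent in corpus]
    # Learner 2: an HMM over string sequences, trained on the auto-tagged corpus.
    return train_hmm(auto_tagged)
```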


Meeting of the Association for Computational Linguistics | 2003

An Expert Lexicon Approach to Identifying English Phrasal Verbs

Wei Li; Xiuhong Zhang; Cheng Niu; Yuankai Jiang; Rohini K. Srihari

Phrasal verbs are an important feature of the English language. Properly identifying them provides the basis for an English parser to decode the related structures. Phrasal verbs have been a challenge for Natural Language Processing (NLP) because they sit at the borderline between lexicon and syntax. Traditional NLP frameworks that separate the lexicon module from the parser make it difficult to handle this problem properly. This paper presents a finite state approach that integrates a phrasal verb expert lexicon between shallow parsing and deep parsing to handle morpho-syntactic interaction. With combined precision/recall performance benchmarked consistently at 95.8%-97.5%, the phrasal verb identification problem has essentially been solved by the presented method.
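
In the spirit of the expert lexicon described above, a toy matcher for separable phrasal verbs could look like the following; the lexicon entries, the three-token separation window, and the restriction to base verb forms are illustrative simplifications rather than the paper's actual lexicon or finite-state machinery.

```python
# Toy expert-lexicon matcher for separable phrasal verbs, in the spirit of the
# approach above. The lexicon entries, the 3-token separation window, and the
# restriction to base verb forms are illustrative simplifications.
from typing import List, Optional, Tuple

PHRASAL_VERBS = {("give", "up"), ("turn", "off"), ("look", "after")}
MAX_GAP = 3   # tokens allowed between verb and particle, e.g. "turn the lights off"

def find_phrasal_verb(tokens: List[str]) -> Optional[Tuple[int, int]]:
    """Return (verb_index, particle_index) of the first phrasal verb found."""
    lowered = [t.lower() for t in tokens]
    for i, tok in enumerate(lowered):
        for verb, particle in PHRASAL_VERBS:
            if tok == verb:
                # The particle may be adjacent or separated by a short object.
                for j in range(i + 1, min(i + 1 + MAX_GAP + 1, len(lowered))):
                    if lowered[j] == particle:
                        return i, j
    return None

print(find_phrasal_verb("Please turn the lights off".split()))   # (1, 4)
```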


Natural Language Engineering | 2008

InfoXtract: A customizable intermediate level information extraction engine

Rohini K. Srihari; Wei Li; Thomas L. Cornell; Cheng Niu

Information Extraction (IE) systems assist analysts in assimilating information from electronic documents. This paper focuses on IE tasks designed to support information discovery applications. Since information discovery implies examining large volumes of heterogeneous documents for situations that cannot be anticipated a priori, such applications require IE systems to have breadth as well as depth. This implies the need for a domain-independent IE system that can easily be customized for specific domains: end users must be given tools to customize the system on their own. It also implies the need for defining new intermediate-level IE tasks that are richer than the subject-verb-object (SVO) triples produced by shallow systems, yet not as complex as the domain-specific scenarios defined by the Message Understanding Conference (MUC). This paper describes InfoXtract, a robust, scalable, intermediate-level IE engine that can be ported to various domains. It describes new IE tasks such as the synthesis of entity profiles and the extraction of concept-based general events, which represent realistic near-term goals focused on deriving useful, actionable information. Entity profiles consolidate information about a person, organization, location, etc. within a document and across documents into a single template; this takes into account aliases and anaphoric references as well as key relationships and events pertaining to that entity. Concept-based events attempt to normalize information such as time expressions (e.g., yesterday) as well as ambiguous location references (e.g., Buffalo). These new tasks facilitate the correlation of output from an IE engine with structured data to enable text mining. InfoXtract's hybrid architecture, comprising grammatical processing and machine learning, is described in detail. Benchmarking results for the core engine and for applications utilizing the engine are presented.
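
One concrete facet of the concept-based event normalization mentioned above is resolving relative time expressions against the document date; the tiny mapping below is an assumed illustrative subset, not InfoXtract's actual normalization rules.

```python
# Illustration of the time-expression normalization mentioned above: a relative
# expression such as "yesterday" is resolved against the document's date. The
# mapping table is a tiny illustrative subset, not InfoXtract's actual rules.
from datetime import date, timedelta

RELATIVE_DAYS = {"today": 0, "yesterday": -1, "tomorrow": 1}

def normalize_time(expr: str, doc_date: date) -> str:
    """Map a relative time expression to an ISO date, given the document date."""
    offset = RELATIVE_DAYS.get(expr.lower())
    if offset is None:
        return expr                       # leave unrecognized expressions untouched
    return (doc_date + timedelta(days=offset)).isoformat()

print(normalize_time("yesterday", date(2003, 5, 27)))   # 2003-05-26
```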


Conference on Computational Natural Language Learning | 2005

Word Independent Context Pair Classification Model for Word Sense Disambiguation

Cheng Niu; Wei Li; Rohini K. Srihari; Huifeng Li

Traditionally, word sense disambiguation (WSD) involves a different context classification model for each individual word. This paper presents a weakly supervised learning approach to WSD based on learning a word-independent context pair classification model. Statistical models are trained not for classifying the word contexts, but for classifying a pair of contexts, i.e. determining whether a pair of contexts of the same ambiguous word refers to the same or different senses. Using this approach, the annotated corpus of a target word A can be leveraged to disambiguate senses of a different word B. Hence, only a limited amount of existing annotated data is required in order to disambiguate the entire vocabulary. In this research, maximum entropy modeling is used to train the word-independent context pair classification model. Then, based on the context pair classification results, clustering is performed on word mentions extracted from a large raw corpus. The resulting context clusters are mapped onto the external thesaurus WordNet. This approach shows great flexibility in efficiently integrating heterogeneous knowledge sources, e.g. trigger words and parsing structures. Based on the Senseval-3 Lexical Sample standards, this approach achieves state-of-the-art performance in the unsupervised learning category and performs comparably with the supervised Naive Bayes system.
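
The transfer idea at the heart of this approach, training one classifier on pairs of contexts so that it generalizes across words, can be sketched as follows; the bag-of-words contexts, sense labels, and pairing scheme are toy assumptions standing in for the paper's maximum entropy model and richer features.

```python
# Sketch of how training data for the word-independent context-pair model can be
# built: sense-annotated contexts of a single word yield (context, context,
# same-sense?) examples, and a classifier trained on them transfers to contexts
# of unrelated words. The bag-of-words contexts and sense labels below are toys.
from itertools import combinations
from typing import List, Set, Tuple

def build_pair_examples(contexts: List[Set[str]], senses: List[str]
                        ) -> List[Tuple[Set[str], Set[str], bool]]:
    """Pair up annotated contexts; the label only says whether the two agree,
    so the learned model is independent of the particular ambiguous word."""
    return [(ca, cb, sa == sb)
            for (ca, sa), (cb, sb) in combinations(zip(contexts, senses), 2)]

# Toy annotated contexts for the ambiguous word "bank".
contexts = [{"river", "water"}, {"loan", "money"}, {"fishing", "river"}]
senses = ["bank/shore", "bank/finance", "bank/shore"]
for ca, cb, same in build_pair_examples(contexts, senses):
    print(sorted(ca), sorted(cb), same)
```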


International Conference on Computational Linguistics | 2002

Extracting exact answers to questions based on structural links

Wei Li; Rohini K. Srihari; Xiaoge Li; Munirathnam Srikanth; Xiuhong Zhang; Cheng Niu

This paper presents a novel approach to extracting phrase-level answers in a question answering system. This approach uses structural support provided by an integrated Natural Language Processing (NLP) and Information Extraction (IE) system. Both questions and the sentence-level candidate answer strings are parsed by this NLP/IE system into binary dependency structures. Phrase-level answer extraction is modelled by comparing the structural similarity involving the question-phrase and the candidate answer-phrase. There are two types of structural support. The first type involves predefined, specific entity associations such as Affiliation, Position, and Age for a person entity. If a question asks about one of these associations, the answer-phrase can be determined as long as the system decodes such pre-defined dependency links correctly, despite syntactic differences in expression between the question and the candidate answer string. The second type involves generic grammatical relationships such as V-S (verb-subject) and V-O (verb-object). Preliminary experimental results show an improvement in both precision and recall in extracting phrase-level answers, compared with a baseline system which only uses Named Entity constraints. The proposed methods are particularly effective in cases where the question-phrase does not correspond to a known named entity type and in cases where there are multiple candidate answer-phrases satisfying the named entity constraints.
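
The second, generic type of structural support might be sketched as matching the dependency triple that contains the question phrase against the candidate sentence's triples and returning the phrase that fills the same slot; the triple format, relation names, and example sentences are illustrative assumptions, with the real structures coming from the integrated NLP/IE parser.

```python
# Sketch of the generic structural support described above: match the dependency
# triple containing the question phrase against the candidate sentence's triples
# and return the phrase filling the same slot. Triple format, relation names,
# and example sentences are illustrative.
from typing import List, Optional, Tuple

Triple = Tuple[str, str, str]   # (head, relation, dependent)

def extract_answer(question_triples: List[Triple], question_phrase: str,
                   candidate_triples: List[Triple]) -> Optional[str]:
    """Return the candidate phrase occupying the question phrase's structural slot."""
    for q_head, q_rel, q_dep in question_triples:
        if q_dep != question_phrase:
            continue
        for c_head, c_rel, c_dep in candidate_triples:
            if c_head == q_head and c_rel == q_rel:
                return c_dep
    return None

# "Who founded Acme?" vs. "Jane Doe founded Acme in 1999."
q = [("founded", "V-S", "who"), ("founded", "V-O", "Acme")]
c = [("founded", "V-S", "Jane Doe"), ("founded", "V-O", "Acme")]
print(extract_answer(q, "who", c))   # Jane Doe
```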


International Journal on Artificial Intelligence Tools | 2004

Orthographic Case Restoration Using Supervised Learning Without Manual Annotation

Cheng Niu; Wei Li; Jihong Ding; Rohini K. Srihari

One challenge in text processing is the treatment of case-insensitive documents such as speech recognition results. The traditional approach is to re-train a language model excluding case-related features. This paper presents an alternative two-step approach whereby a preprocessing module (Step 1) is designed to restore the case-sensitive form, which is subsequently processed by the original system (Step 2). Step 1 is mainly implemented as a Hidden Markov Model trained on a large raw corpus of case-sensitive documents. It is demonstrated that this approach (i) outperforms the feature exclusion approach for named entity tagging, (ii) leads to limited degradation for parsing, relationship extraction, and case-insensitive question answering, (iii) reduces system complexity, and (iv) has wide applicability: the restored text can be used in both statistical and rule-based systems.
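
The core idea, learning casing statistics from raw case-sensitive text and applying them to case-insensitive input, is sketched below; for brevity a per-token most-frequent-casing table replaces the Hidden Markov Model the paper actually trains, so this is a deliberate simplification.

```python
# Minimal sketch of case restoration learned from a case-sensitive corpus. A
# per-token most-frequent-casing table replaces the Hidden Markov Model the
# paper trains, so this is a deliberate simplification of the idea: learn casing
# statistics from raw case-sensitive text, then apply them to case-insensitive input.
from collections import Counter, defaultdict
from typing import Dict, List

def learn_casing(case_sensitive_corpus: List[List[str]]) -> Dict[str, str]:
    counts: Dict[str, Counter] = defaultdict(Counter)
    for sentence in case_sensitive_corpus:
        for token in sentence:
            counts[token.lower()][token] += 1
    # Keep the most frequent surface form for each lowercased token.
    return {low: forms.most_common(1)[0][0] for low, forms in counts.items()}

def restore_case(tokens: List[str], casing: Dict[str, str]) -> List[str]:
    return [casing.get(tok.lower(), tok) for tok in tokens]

corpus = [["Cheng", "Niu", "works", "on", "information", "extraction"],
          ["The", "HMM", "is", "trained", "on", "raw", "text"]]
model = learn_casing(corpus)
print(restore_case(["cheng", "niu", "trained", "an", "hmm"], model))
# ['Cheng', 'Niu', 'trained', 'an', 'HMM']
```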


North American Chapter of the Association for Computational Linguistics | 2003

Bootstrapping for named entity tagging using concept-based seeds

Cheng Niu; Wei Li; Jihong Ding; Rohini K. Srihari

A novel bootstrapping approach to Named Entity (NE) tagging using concept-based seeds and successive learners is presented. This approach only requires a few common noun or pronoun seeds that correspond to the concept for the targeted NE, e.g. he/she/man/woman for PERSON NE. The bootstrapping procedure is implemented as training two successive learners. First, a decision list is used to learn the parsing-based NE rules. Then, a Hidden Markov Model is trained on a corpus automatically tagged by the first learner. The resulting NE system approaches supervised NE performance for some NE types.
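
Complementing the pipeline skeleton given earlier, the snippet below zooms in on the first learner: candidate parsing-based rules are retained only if they pick out the seed concept with high precision; the rule strings, precision threshold, and support cutoff are illustrative assumptions.

```python
# Zooming in on the first learner: keep only candidate parsing-based rules that
# identify the seed concept with high precision. The rule strings, the 0.9
# precision threshold, and the support cutoff are illustrative assumptions.
from typing import Dict, List, Tuple

def high_precision_rules(rule_stats: Dict[str, Tuple[int, int]],
                         min_precision: float = 0.9,
                         min_support: int = 5) -> List[str]:
    """rule_stats maps a candidate rule to (matches_on_seed_concept, total_matches)."""
    kept = []
    for rule, (hits, total) in rule_stats.items():
        if total >= min_support and hits / total >= min_precision:
            kept.append(rule)
    return kept

stats = {"subject-of:said": (47, 50),      # mostly matches PERSON-like seeds
         "object-of:located-in": (1, 40)}  # mostly does not
print(high_precision_rules(stats))          # ['subject-of:said']
```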


Meeting of the Association for Computational Linguistics | 2003

Question Answering on a Case Insensitive Corpus

Wei Li; Rohini K. Srihari; Cheng Niu; Xiaoge Li

Most question answering (QA) systems rely on both a keyword index and Named Entity (NE) tagging. The corpus from which a QA system attempts to retrieve answers is usually mixed-case text. However, there are numerous corpora that consist of case-insensitive documents, e.g. speech recognition results. This paper presents a successful approach to QA on a case-insensitive corpus, whereby a preprocessing module is designed to restore the case-sensitive form. The document pool with the restored case then feeds the QA system, which remains unchanged. The case restoration preprocessing is implemented as a Hidden Markov Model trained on a large raw corpus of case-sensitive documents. It is demonstrated that this approach leads to very limited degradation in QA benchmarking (2.8%), mainly due to the limited degradation in the underlying information extraction support.
