Radu Florian
IBM
Publications
Featured research published by Radu Florian.
North American Chapter of the Association for Computational Linguistics | 2003
Radu Florian; Abraham Ittycheriah; Hongyan Jing; Tong Zhang
This paper presents a classifier-combination experimental framework for named entity recognition in which four diverse classifiers (a robust linear classifier, maximum entropy, transformation-based learning, and a hidden Markov model) are combined under different conditions. When no gazetteer or other additional training resources are used, the combined system attains a performance of 91.6 F-measure on the English development data; integrating name, location, and person gazetteers, and named entity systems trained on additional, more general data, reduces the F-measure error by 15 to 21% on the English data.
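The core idea of combining diverse classifiers can be illustrated with a minimal sketch of (optionally weighted) per-token voting; the paper's actual combination methods are more sophisticated, and the classifier outputs and labels below are purely illustrative:

```python
from collections import Counter

def combine_votes(predictions, weights=None):
    """Combine per-token label sequences from several classifiers
    by (optionally weighted) majority vote."""
    weights = weights or [1.0] * len(predictions)
    combined = []
    for i in range(len(predictions[0])):
        scores = Counter()
        for labels, w in zip(predictions, weights):
            scores[labels[i]] += w
        combined.append(scores.most_common(1)[0][0])
    return combined

# Four hypothetical classifiers labeling the same three tokens
preds = [
    ["B-PER", "I-PER", "O"],
    ["B-PER", "O",     "O"],
    ["B-ORG", "I-PER", "O"],
    ["B-PER", "I-PER", "O"],
]
print(combine_votes(preds))  # ['B-PER', 'I-PER', 'O']
```

In practice, per-classifier weights would be tuned on held-out data rather than set uniformly.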
Meeting of the Association for Computational Linguistics | 2005
Imed Zitouni; Jeffrey S. Sorensen; Xiaoqiang Luo; Radu Florian
Arabic presents an interesting challenge to natural language processing, being a highly inflected and agglutinative language. In particular, this paper presents an in-depth investigation of the entity detection and recognition (EDR) task for Arabic. We start by highlighting why segmentation is a necessary prerequisite for EDR, continue by presenting a finite-state statistical segmenter, and then examine how the resulting segments can be better incorporated into a mention detection system and an entity recognition system; both systems are statistical, built around the maximum entropy principle. Experiments on a clearly stated partition of the ACE 2004 data show that stem-based features can significantly improve the performance of the EDR system by 2 absolute F-measure points. The system presented here had a competitive performance in the ACE 2004 evaluation.
Empirical Methods in Natural Language Processing | 2008
Imed Zitouni; Radu Florian
While significant effort has been put into annotating linguistic resources for several languages, there are still many languages that have only small amounts of such resources. This paper investigates a method of propagating information (specifically mention detection information) into such low-resource languages from richer ones. Experiments run on three language pairs (Arabic-English, Chinese-English, and Spanish-English) show that one can achieve relatively decent performance by propagating information from a language with richer resources, such as English, into a foreign language alone (no resources or models in the foreign language). Furthermore, while examining the performance using various degrees of linguistic information in a statistical framework, results show that propagated features from English help improve the source-language system performance even when used in conjunction with all feature types built from the source language. The experiments also show that using propagated features in conjunction with lexically derived features only (as can be obtained directly from a mention-annotated corpus) yields similar performance to using feature types derived from many linguistic resources.
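The propagation step can be sketched as projecting mention labels across a word alignment; this is a minimal illustration under assumed inputs (the labels, sentence, and alignment below are hypothetical, and the paper's feature propagation is richer than plain label projection):

```python
def propagate_mentions(source_labels, alignment, target_len):
    """Project mention labels from a resource-rich language onto
    aligned target-language tokens; unaligned tokens stay 'O'.
    `alignment` is a list of (source_index, target_index) pairs."""
    target_labels = ["O"] * target_len
    for src_i, tgt_i in alignment:
        if source_labels[src_i] != "O":
            target_labels[tgt_i] = source_labels[src_i]
    return target_labels

# Hypothetical English sentence: "Barack Obama visited Madrid"
eng_labels = ["B-PER", "I-PER", "O", "B-GPE"]
# Word alignment to a 4-token target sentence with different word order
alignment = [(0, 1), (1, 2), (2, 3), (3, 0)]
print(propagate_mentions(eng_labels, alignment, 4))
# ['B-GPE', 'B-PER', 'I-PER', 'O']
```

The projected labels would then feed into the target-language model as features rather than being used as final output.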
Empirical Methods in Natural Language Processing | 2003
Hongyan Jing; Radu Florian; Xiaoqiang Luo; Tong Zhang; Abraham Ittycheriah
When building a Chinese named entity recognition system, one must deal with certain language-specific issues, such as whether the model should be based on characters or words. While there is no unique answer to this question, we discuss in detail the advantages and disadvantages of each model, identify problems in segmentation and suggest possible solutions, presenting our observations, analysis, and experimental results. The second topic of this paper is classifier combination. We present four classifiers for Chinese named entity recognition and describe various methods for combining their outputs. The results demonstrate that classifier combination is an effective technique for improving system performance: experiments over a large annotated corpus of fine-grained entity types exhibit a 10% relative reduction in F-measure error.
ACM Transactions on Asian Language Information Processing | 2009
Imed Zitouni; Radu Florian
In the last two decades, significant effort has been put into annotating linguistic resources in several languages. Despite this valiant effort, there are still many languages left that have only small amounts of such resources. The goal of this article is to present and investigate a method of propagating information (specifically mention detection) from a resource-rich language into a relatively resource-poor language such as Arabic. Part of the investigation is to quantify the contribution of propagating information under different conditions, based on the availability of resources in the target language. Experiments on the Arabic-English language pair show that one can achieve relatively decent performance by propagating information from a language with richer resources, such as English, into Arabic alone (no resources or models in the source language Arabic). Furthermore, results show that propagated features from English do help improve the Arabic system performance even when used in conjunction with all feature types built from the source language. Experiments also show that using propagated features in conjunction with lexically derived features only (as can be obtained directly from a mention-annotated corpus) brings the system performance to the level obtained in the target language by using features derived from many linguistic resources, therefore improving the system when such resources are not available. In addition to the Arabic-English language pair, we investigate the effectiveness of our approach on other language pairs such as Chinese-English and Spanish-English.
IEEE Transactions on Audio, Speech, and Language Processing | 2009
Imed Zitouni; Xiaoqiang Luo; Radu Florian
This paper presents a fully statistical approach to Arabic mention detection and chaining, built around the maximum entropy principle. The presented system takes a cascade approach to processing an input document, first detecting mentions in the document and then chaining the identified mentions into entities. Both system components use a common maximum entropy framework, which allows the integration of a large array of feature types, including lexical, morphological, syntactic, and semantic features. Arabic offers additional challenges for this task (compared with English, for example), since segmentation is a necessary processing step for correctly identifying and resolving enclitic pronouns. The system presented has obtained very competitive performance in the automatic content extraction (ACE) evaluation program.
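The maximum entropy (log-linear) scoring that underlies both components can be sketched as follows; the feature names and weights here are made up for illustration, and a real system would learn the weights from annotated data:

```python
import math

def maxent_probs(active_features, weights, labels):
    """Log-linear (maximum entropy) classification: each
    (feature, label) pair has a learned weight; the probability of
    a label is the normalized exponentiated sum of the weights of
    the active features for that label."""
    scores = {
        y: math.exp(sum(weights.get((f, y), 0.0) for f in active_features))
        for y in labels
    }
    z = sum(scores.values())
    return {y: s / z for y, s in scores.items()}

# Hypothetical weights mixing lexical, contextual, and stem features
weights = {
    ("word=Qasim", "PER"): 2.0,
    ("prev=Mr.",   "PER"): 1.5,
    ("stem=qsm",   "PER"): 0.5,
    ("word=Qasim", "O"):  -1.0,
}
probs = maxent_probs(["word=Qasim", "prev=Mr.", "stem=qsm"],
                     weights, ["PER", "O"])
print(probs)
```

The appeal noted in the abstract is that any feature type, lexical through semantic, plugs into this framework as just another (feature, label) weight.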
Meeting of the Association for Computational Linguistics | 2016
Avirup Sil; Radu Florian
Entity linking (EL) is the task of disambiguating mentions in text by associating them with entries in a predefined database of entities (persons, organizations, etc.). Most previous EL research has focused mainly on one language, English, with less attention being paid to other languages, such as Spanish or Chinese. In this paper, we introduce LIEL, a Language Independent Entity Linking system, which provides an EL framework that, once trained on one language, works remarkably well on a number of different languages without change. LIEL makes a joint global prediction over the entire document, employing a discriminative reranking framework with many domain- and language-independent feature functions. Experiments on numerous benchmark datasets show that the proposed system, once trained on one language, English, outperforms several state-of-the-art systems in English (by 4 points), and the trained model also works very well on Spanish (14 points better than a competitor system), demonstrating the viability of the approach.
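Discriminative reranking in this setting amounts to scoring each candidate link with a linear model over language-independent feature functions and keeping the best; a minimal sketch (the feature names, weights, and candidates below are invented for illustration):

```python
def rerank(candidates, weights):
    """Return the candidate whose feature vector scores highest
    under a linear model (a simplified discriminative reranker)."""
    def score(features):
        return sum(weights.get(name, 0.0) * value
                   for name, value in features.items())
    return max(candidates, key=lambda c: score(c["features"]))

# Hypothetical feature functions: surface similarity of the mention
# to the candidate's title, and coherence with the document's other links
weights = {"title_similarity": 1.0, "document_coherence": 2.0}
candidates = [
    {"entity": "Paris",
     "features": {"title_similarity": 0.9, "document_coherence": 0.1}},
    {"entity": "Paris, Texas",
     "features": {"title_similarity": 0.6, "document_coherence": 0.5}},
]
best = rerank(candidates, weights)
print(best["entity"])  # Paris, Texas
```

Because the feature functions avoid language-specific cues, the same learned weights can be reused across languages, which is the property LIEL exploits.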
North American Chapter of the Association for Computational Linguistics | 2003
Yaser Al-Onaizan; Radu Florian; Martin Franz; Hany Hassan; Young-Suk Lee; J. Scott McCarley; Kishore Papineni; Salim Roukos; Jeffrey S. Sorensen; Christoph Tillmann; Todd Ward; Fei Xia
Searching online information is increasingly a daily activity for many people. The multilinguality of online content is also increasing (e.g., the proportion of English web users, which has been decreasing as a fraction of the growing population of web users, dipped below 50% in the summer of 2001). To improve the ability of an English speaker to search multilingual content, we built a system that supports cross-lingual search of an Arabic newswire collection and provides on-demand translation of Arabic web pages into English. The cross-lingual search engine supports a fast search capability (sub-second response for typical queries) and achieves state-of-the-art performance in the high-precision region of the result list. The on-demand statistical machine translation uses the Direct Translation model along with a novel statistical Arabic morphological analyzer to yield state-of-the-art translation quality. The on-demand SMT uses an efficient dynamic programming decoder that achieves reasonable speed for translating web documents.
North American Chapter of the Association for Computational Linguistics | 2009
Xiaoqiang Luo; Radu Florian; Todd Ward
In this paper, we propose the use of metadata contained in documents to improve coreference resolution. Specifically, we quantify the impact of speaker and turn information on the performance of our coreference system, and show that the metadata can be effectively encoded as features of a statistical resolution system, which leads to a statistically significant improvement in performance.
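Encoding speaker and turn metadata as features of a mention-antecedent pair can be sketched as below; the feature names and the pronoun heuristic are illustrative assumptions, not the paper's actual feature set:

```python
def metadata_features(mention, antecedent):
    """Binary speaker/turn features for a mention-antecedent pair,
    to be fed into a statistical coreference resolver."""
    return {
        "same_speaker": mention["speaker"] == antecedent["speaker"],
        "same_turn": mention["turn"] == antecedent["turn"],
        # A first-person pronoun likely corefers with a mention of
        # its speaker's name uttered by someone else.
        "pronoun_names_speaker": (
            mention["text"].lower() == "i"
            and antecedent["text"] == mention["speaker"]
        ),
    }

mention = {"text": "I", "speaker": "John Smith", "turn": 4}
antecedent = {"text": "John Smith", "speaker": "Mary Jones", "turn": 3}
print(metadata_features(mention, antecedent))
```

Each boolean would typically be conjoined with other mention-pair features inside the statistical model rather than used in isolation.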
Proceedings of BioNLP 15 | 2015
Ramesh Nallapati; Radu Florian
A typical NLP system for medical fact coding uses multiple layers of supervision involving fact-attributes, relations, and coding. Training such a system involves an expensive and laborious annotation process spanning all layers of the pipeline. In this work, we investigate the feasibility of a shallow medical coding model that trains only on fact annotations, while disregarding fact-attributes and relations, potentially saving considerable annotation time and costs. Our results show that the shallow system, despite using less supervision, is only 1.4% F1 points behind the multi-layered system on Disorders, and, contrary to expectation, is able to improve over the latter by about 2.4% F1 points on Procedure facts. Further, our experiments also show that training the shallow system using only sentence-level fact labels with no span information has no negative effect on performance, indicating further cost savings through weak supervision.