Óscar Ferrández
University of Alicante
Publications
Featured research published by Óscar Ferrández.
Journal of the American Medical Informatics Association | 2013
Óscar Ferrández; Brett R. South; Shuying Shen; F. Jeffrey Friedlin; Matthew H. Samore; Stéphane M. Meystre
OBJECTIVE De-identification allows faster and more collaborative clinical research while protecting patient confidentiality. Clinical narrative de-identification is a tedious process that can be alleviated by automated natural language processing methods. The goal of this research is the development of an automated text de-identification system for Veterans Health Administration (VHA) clinical documents. MATERIALS AND METHODS We devised a novel stepwise hybrid approach designed to improve the current strategies used for text de-identification. The proposed system is based on a previous study on the best de-identification methods for VHA documents. This best-of-breed automated clinical text de-identification system (aka BoB) tackles the problem as two separate tasks: (1) maximize patient confidentiality by redacting as much protected health information (PHI) as possible; and (2) leave de-identified documents in a usable state preserving as much clinical information as possible. RESULTS We evaluated BoB with a manually annotated corpus of a variety of VHA clinical notes, as well as with the 2006 i2b2 de-identification challenge corpus. We present evaluations at the instance- and token-level, with detailed results for BoB's main components. Moreover, an existing text de-identification system was also included in our evaluation. DISCUSSION BoB's design efficiently takes advantage of the methods implemented in its pipeline, resulting in high sensitivity values (especially for sensitive PHI categories) and a limited number of false positives. CONCLUSIONS Our system successfully addressed VHA clinical document de-identification, and its hybrid stepwise design demonstrates robustness and efficiency, prioritizing patient confidentiality while leaving most clinical information intact.
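To make the redaction step concrete, here is a minimal, purely illustrative sketch of the kind of pattern-based rule a stepwise de-identification pipeline might apply; the regular expressions and PHI categories below are assumptions for illustration, not BoB's actual rules.

```python
import re

# Hypothetical PHI patterns; a real system combines many such rules with
# dictionaries and machine-learning classifiers.
PHI_PATTERNS = {
    "PHONE": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
    "SSN":   re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "DATE":  re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),
}

def redact(text: str) -> str:
    """Replace every matched PHI span with a category placeholder,
    keeping the surrounding clinical text intact."""
    for category, pattern in PHI_PATTERNS.items():
        text = pattern.sub(f"[{category}]", text)
    return text

print(redact("Pt called 555-123-4567 on 03/14/2012."))
# -> "Pt called [PHONE] on [DATE]."
```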
Meeting of the Association for Computational Linguistics | 2007
Óscar Ferrández; Daniel Micol; Rafael Muñoz; Manuel Palomar
The textual entailment recognition system that we discuss in this paper represents a perspective-based approach composed of two modules that analyze text-hypothesis pairs from strictly lexical and syntactic perspectives, respectively. We attempt to prove that the textual entailment recognition task can be addressed by performing individual analyses that extract the maximum amount of information each single perspective can provide. We compare this approach with the system we presented in the previous edition of the PASCAL Recognising Textual Entailment Challenge, obtaining an accuracy rate 17.98% higher.
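As a rough illustration of the two-perspective idea, the sketch below scores a text-hypothesis pair independently from each perspective and accepts entailment if either analysis is confident; the scoring functions and the threshold are assumptions, not the paper's actual modules.

```python
def lexical_score(text: str, hypothesis: str) -> float:
    """Toy lexical perspective: fraction of hypothesis tokens found in the text."""
    t, h = set(text.lower().split()), set(hypothesis.lower().split())
    return len(t & h) / len(h) if h else 0.0

def syntactic_score(text: str, hypothesis: str) -> float:
    """Stand-in for a syntactic perspective (e.g. dependency-tree overlap)."""
    return lexical_score(text, hypothesis)  # placeholder for illustration only

def entails(text: str, hypothesis: str, threshold: float = 0.7) -> bool:
    # Each perspective is evaluated independently; the pair is accepted
    # if at least one perspective is confident enough.
    return max(lexical_score(text, hypothesis),
               syntactic_score(text, hypothesis)) >= threshold

print(entails("A man is playing a guitar on stage", "A man plays a guitar"))
```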
Data and Knowledge Engineering | 2007
Zornitsa Kozareva; Óscar Ferrández; Andrés Montoyo; Rafael Muñoz; Armando Suárez; Jaime Gómez
The increasing flow of digital information requires the extraction, filtering and classification of pertinent information from large volumes of texts. All these tasks greatly benefit from involving a Named Entity Recognizer (NER) in the preprocessing stage. This paper proposes a completely automatic NER system. The NER task involves not only the identification of proper names (Named Entities) in natural language text, but also their classification into a set of predefined categories, such as names of persons, organizations (companies, government organizations, committees, etc.), locations (cities, countries, rivers, etc.) and miscellaneous (movie titles, sport events, etc.). Throughout the paper, we examine the differences between language models learned by different data-driven classifiers confronted with the same NLP task, as well as ways to exploit these differences to yield a higher accuracy than the best individual classifier. Three machine learning classifiers (Hidden Markov Model, Maximum Entropy and Memory Based Learning) are trained on the same corpus in order to resolve the NE task. After comparison, their output is combined using voting strategies. A comprehensive study and experimental work on the evaluation of our system, as well as a comparison with other systems, have been carried out within the framework of two specialized scientific competitions for NER, CoNLL-2002 and HAREM-2005. Finally, this paper describes the integration of our NER system in different NLP applications, namely Geographic Information Retrieval and Conceptual Modelling.
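A minimal sketch of the classifier-combination idea described above: per-token tags from several recognizers are merged by majority vote. The tag set and the toy predictions are invented for illustration.

```python
from collections import Counter

def majority_vote(*tagger_outputs):
    """Each argument is one classifier's tag sequence (one tag per token);
    the combined tag for a token is the most frequent prediction."""
    combined = []
    for token_tags in zip(*tagger_outputs):
        tag, _ = Counter(token_tags).most_common(1)[0]
        combined.append(tag)
    return combined

hmm_tags    = ["B-PER", "I-PER", "O", "B-LOC"]
maxent_tags = ["B-PER", "O",     "O", "B-LOC"]
mbl_tags    = ["B-PER", "I-PER", "O", "B-ORG"]

print(majority_vote(hmm_tags, maxent_tags, mbl_tags))
# -> ['B-PER', 'I-PER', 'O', 'B-LOC']
```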
Journal of Web Semantics | 2011
Óscar Ferrández; Christian Spurk; Milen Kouylekov; Iustin Dornescu; Sergio Ferrández; Matteo Negri; Rubén Izquierdo; David Tomás; Constantin Orasan; Guenter Neumann; Bernardo Magnini; José L. Vicedo
This paper presents the QALL-ME Framework, a reusable architecture for building multi- and cross-lingual Question Answering (QA) systems working on structured data modelled by an ontology. It is released as free open source software with a set of demo components and extensive documentation, which makes it easy to use and adapt. The main characteristics of the QALL-ME Framework are: (i) its domain portability, achieved by an ontology modelling the target domain; (ii) the context awareness regarding space and time of the question; (iii) the use of textual entailment engines as the core of the question interpretation; and (iv) an architecture based on Service Oriented Architecture (SOA), which is realized using interchangeable web services for the framework components. Furthermore, we present a running example to clarify how the framework processes questions as well as a case study that shows a QA application built as an instantiation of the QALL-ME Framework for cinema/movie events in the tourism domain.
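The sketch below illustrates, in a very reduced form, the question-interpretation idea: the incoming question is matched against a small set of patterns, each tied to a structured query over the domain ontology, and the best-matching pattern determines the query. The patterns, placeholder queries, and the simple overlap score used here are assumptions, not the framework's actual entailment engines.

```python
def overlap(question: str, pattern: str) -> float:
    """Toy stand-in for a textual entailment score between question and pattern."""
    q, p = set(question.lower().split()), set(pattern.lower().split())
    return len(q & p) / len(p) if p else 0.0

# Hypothetical patterns mapped to placeholder structured queries.
PATTERNS = {
    "which movies are showing in [LOCATION] today": "SELECT ?movie WHERE { ... }",
    "when does [MOVIE] start at [CINEMA]": "SELECT ?time WHERE { ... }",
}

def interpret(question: str) -> str:
    best = max(PATTERNS, key=lambda p: overlap(question, p))
    return PATTERNS[best]

print(interpret("Which movies are showing in Alicante today?"))
# -> "SELECT ?movie WHERE { ... }"
```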
BMC Medical Research Methodology | 2012
Óscar Ferrández; Brett R. South; Shuying Shen; F. Jeffrey Friedlin; Matthew H. Samore; Stéphane M. Meystre
Background: The increased use and adoption of Electronic Health Records (EHR) causes a tremendous growth in digital information useful for clinicians, researchers and many other operational purposes. However, this information is rich in Protected Health Information (PHI), which severely restricts its access and possible uses. A number of investigators have developed methods for automatically de-identifying EHR documents by removing PHI, as specified in the Health Insurance Portability and Accountability Act “Safe Harbor” method. This study focuses on the evaluation of existing automated text de-identification methods and tools, as applied to Veterans Health Administration (VHA) clinical documents, to assess which methods perform better with each category of PHI found in our clinical notes, and when new methods are needed to improve performance. Methods: We installed and evaluated five text de-identification systems “out-of-the-box” using a corpus of VHA clinical documents. The systems based on machine learning methods were trained with the 2006 i2b2 de-identification corpora and evaluated with our VHA corpus, and also evaluated with a ten-fold cross-validation experiment using our VHA corpus. We counted exact, partial, and fully contained matches with reference annotations, considering each PHI type separately, or only one unique ‘PHI’ category. Performance of the systems was assessed using recall (equivalent to sensitivity) and precision (equivalent to positive predictive value) metrics, as well as the F2-measure. Results: Overall, systems based on rules and pattern matching achieved better recall, and precision was always better with systems based on machine learning approaches. The highest “out-of-the-box” F2-measure was 67% for partial matches; the best precision and recall were 95% and 78%, respectively. Finally, the ten-fold cross-validation experiment allowed for an increase of the F2-measure to 79% with partial matches. Conclusions: The “out-of-the-box” evaluation of text de-identification systems provided us with compelling insight about the best methods for de-identification of VHA clinical documents. The error analysis demonstrated an important need for customization to PHI formats specific to VHA documents. This study informed the planning and development of a “best-of-breed” automatic de-identification application for VHA clinical text.
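For reference, a short worked sketch of the metrics mentioned above: recall (sensitivity), precision (positive predictive value), and the F2-measure, which weights recall twice as heavily as precision. The counts are invented, not taken from the study.

```python
def precision(tp: int, fp: int) -> float:
    return tp / (tp + fp)

def recall(tp: int, fn: int) -> float:
    return tp / (tp + fn)

def f_beta(p: float, r: float, beta: float = 2.0) -> float:
    # F-beta combines precision and recall; beta=2 favours recall.
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)

p = precision(tp=80, fp=20)   # 0.80
r = recall(tp=80, fn=40)      # ~0.67
print(round(p, 2), round(r, 2), round(f_beta(p, r), 2))
# -> 0.8 0.67 0.69
```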
Journal of Biomedical Informatics | 2014
Brett R. South; Danielle L. Mowery; Ying Suo; Jianwei Leng; Óscar Ferrández; Stéphane M. Meystre; Wendy W. Chapman
The Health Insurance Portability and Accountability Act (HIPAA) Safe Harbor method requires removal of 18 types of protected health information (PHI) from clinical documents to be considered “de-identified” prior to use for research purposes. Human review of PHI elements from a large corpus of clinical documents can be tedious and error-prone. Indeed, multiple annotators may be required to consistently redact information that represents each PHI class. Automated de-identification has the potential to improve annotation quality and reduce annotation time. One such approach is machine-assisted annotation, which combines de-identification system outputs, used as pre-annotations, with an interactive annotation interface that provides annotators with PHI annotations for “curation” rather than manual annotation from “scratch” on raw clinical documents. In order to assess whether machine-assisted annotation improves the reliability and accuracy of the reference standard and reduces annotation effort, we conducted an annotation experiment. In this annotation study, we assessed the generalizability of the VA Consortium for Healthcare Informatics Research (CHIR) annotation schema and guidelines applied to a corpus of publicly available clinical documents called MTSamples. Specifically, our goals were to (1) characterize a heterogeneous corpus of clinical documents manually annotated for risk-ranked PHI and other annotation types (clinical eponyms and person relations), (2) evaluate how well annotators apply the CHIR schema to the heterogeneous corpus, (3) compare whether machine-assisted annotation (experiment) improves annotation quality and reduces annotation time compared to manual annotation (control), and (4) assess the change in quality of reference standard coverage with each added annotator’s annotations.
Applications of Natural Language to Data Bases | 2007
Sergio Ferrández; Antonio Toral; Óscar Ferrández; Antonio Ferrández; Rafael Muñoz
The application of the multilingual knowledge encoded in Wikipedia to an open-domain Cross-Lingual Question Answering system based on the Inter Lingual Index (ILI) module of EuroWordNet is proposed and evaluated. This strategy overcomes the problems due to ILI's low coverage of proper nouns (Named Entities). Moreover, as these are open class words (highly changing), using a community-based up-to-date resource avoids the tedious maintenance of hand-coded bilingual dictionaries. A study reveals the importance of translating Named Entities in CL-QA and the advantages of relying on Wikipedia over ILI for doing this. Tests on questions from the Cross-Language Evaluation Forum (CLEF) justify our approach (20% of these are correctly answered thanks to Wikipedia's multilingual knowledge).
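As an aside, the interlanguage links that the approach above exploits can be queried programmatically. The hedged sketch below uses the public MediaWiki API to look up a Spanish title for an English entity; the paper itself relied on Wikipedia's multilingual knowledge rather than this exact call, so treat the endpoint usage as illustrative.

```python
import requests

def translate_entity(title: str, target_lang: str = "es"):
    """Return the target-language Wikipedia title linked to an English article,
    or None if no interlanguage link exists."""
    resp = requests.get(
        "https://en.wikipedia.org/w/api.php",
        params={
            "action": "query",
            "titles": title,
            "prop": "langlinks",
            "lllang": target_lang,
            "format": "json",
            "formatversion": "2",
        },
        timeout=10,
    )
    pages = resp.json()["query"]["pages"]
    links = pages[0].get("langlinks", [])
    return links[0]["title"] if links else None

print(translate_entity("United Nations"))  # e.g. "Organización de las Naciones Unidas"
```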
Information Sciences | 2008
Estela Saquete; Óscar Ferrández; Sergio Ferrández; Patricio Martínez-Barco; Rafael Muñoz
This paper presents an improvement in the temporal expression (TE) recognition phase of a knowledge-based system at a multilingual level. For this purpose, the combination of different approaches applied to the recognition of temporal expressions is studied. In this work, for the recognition task, a knowledge-based system that recognizes temporal expressions and had been automatically extended to other languages (the TERSEO system) was combined with a system that recognizes temporal expressions using machine learning techniques. In particular, two different techniques were applied: maximum entropy model (ME) and hidden Markov model (HMM), using two different types of tagging of the training corpus: (1) BIO model tagging of literal temporal expressions and (2) BIO model tagging of simple patterns of temporal expressions. Each system was first evaluated independently and then combined in order to: (a) analyze whether the combination gives better results without increasing the number of erroneous expressions in the same percentage and (b) decide which machine learning approach performs this task better. When the TERSEO system is combined with the maximum entropy approach, the best results for F-measure (89%) are obtained, improving TERSEO recognition by 4.5 points and ME recognition by 7 points.
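A reduced sketch of the combination idea: spans recognized by the knowledge-based system and by the machine-learning tagger (both BIO-encoded) are merged, here simply by taking the union of spans. The merge policy and the toy tag sequences are assumptions for illustration.

```python
def bio_to_spans(tags):
    """Convert a BIO tag sequence into (start, end) token spans."""
    spans, start = [], None
    for i, tag in enumerate(tags):
        if tag == "B":
            if start is not None:
                spans.append((start, i))
            start = i
        elif tag == "O" and start is not None:
            spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, len(tags)))
    return spans

terseo_tags = ["O", "B", "I", "O", "O", "O"]   # knowledge-based recognizer
ml_tags     = ["O", "O", "O", "O", "B", "O"]   # ME/HMM tagger

merged = sorted(set(bio_to_spans(terseo_tags)) | set(bio_to_spans(ml_tags)))
print(merged)  # -> [(1, 3), (4, 5)]
```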
International Conference on Natural Language Processing | 2005
Zornitsa Kozareva; Óscar Ferrández; Andrés Montoyo; Rafael Muñoz; Armando Suárez
The increasing flow of digital information requires the extraction, filtering and classification of pertinent information from large volumes of texts. An important preprocessing step for these tasks is the recognition of named entities, known as the Named Entity Recognition (NER) task. In this paper we propose a completely automatic NER system which involves the identification of proper names in texts and their classification into a set of predefined categories of interest, such as person names, organizations (companies, government organizations, committees, etc.) and locations (cities, countries, rivers, etc.). We examined the differences in language models learned by different data-driven systems performing the same NLP task and how they can be exploited to yield a higher accuracy than the best individual system. Three NE classifiers (Hidden Markov Models, Maximum Entropy and Memory-based learner) are trained on the same corpus data and, after comparison, their outputs are combined using a voting strategy. Results are encouraging, since 98.5% accuracy for recognition and 84.94% accuracy for classification of NEs for the Spanish language were achieved.
Applications of Natural Language to Data Bases | 2007
Óscar Ferrández; Daniel Micol; Rafael Muñoz; Manuel Palomar
This paper discusses the recognition of textual entailment in a text-hypothesis pair by applying a wide variety of lexical measures. We consider that the entailment phenomenon can be tackled from three general levels: lexical, syntactic and semantic. The main goals of this research are to deal with this phenomenon from a lexical point of view, and to achieve high results considering only such kind of knowledge. To accomplish this, the information provided by the lexical measures is used as a set of features for a Support Vector Machine, which decides whether the entailment relation holds. A study of the most relevant features and a comparison with the best state-of-the-art textual entailment systems are presented throughout the paper. Finally, the system has been evaluated using the Second PASCAL Recognising Textual Entailment Challenge data and evaluation methodology, obtaining an accuracy rate of 61.88%.
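A minimal sketch of the lexical-measures-as-features idea: a couple of toy measures are computed for each text-hypothesis pair and fed to an SVM. The measures, the tiny training set and the labels are assumptions, not the paper's actual feature set or data.

```python
from sklearn.svm import SVC

def word_overlap(text: str, hyp: str) -> float:
    t, h = set(text.lower().split()), set(hyp.lower().split())
    return len(t & h) / len(h) if h else 0.0

def length_ratio(text: str, hyp: str) -> float:
    return len(hyp.split()) / max(len(text.split()), 1)

def features(text: str, hyp: str):
    # Each lexical measure contributes one feature value.
    return [word_overlap(text, hyp), length_ratio(text, hyp)]

train_pairs = [
    ("A man is playing a guitar on stage", "A man plays a guitar"),
    ("The cat sat on the mat", "The dog barked loudly"),
]
labels = [1, 0]  # 1 = entailment, 0 = no entailment

clf = SVC(kernel="linear").fit([features(t, h) for t, h in train_pairs], labels)
print(clf.predict([features("A girl is eating an apple", "A girl eats fruit")]))
```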