Gabriela Ferraro
Pompeu Fabra University
Publications
Featured research published by Gabriela Ferraro.
Conference on Computational Natural Language Learning | 2015
Lizhen Qu; Gabriela Ferraro; Liyuan Zhou; Weiwei Hou; Nathan Schneider; Timothy Baldwin
Word embeddings -- distributed word representations that can be learned from unlabelled data -- have been shown to have high utility in many natural language processing applications. In this paper, we perform an extrinsic evaluation of five popular word embedding methods in the context of four sequence labelling tasks: part-of-speech (POS) tagging, syntactic chunking, named entity recognition (NER), and multiword expression (MWE) identification. A particular focus of the paper is analysing the effects of task-based updating of word representations. We show that when using word embeddings as features, as few as several hundred training instances are sufficient to achieve competitive results, and that word embeddings lead to improvements for out-of-vocabulary (OOV) words and out-of-domain data. Perhaps more surprisingly, our results indicate there is little difference between the word embedding methods, and that simple Brown clusters are often competitive with word embeddings across all tasks we consider.
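The central idea, feeding pretrained word vectors to a sequence labeller as per-token features, can be illustrated with a minimal sketch. This is not the authors' code: the toy corpus, tag set, and classifier (a plain logistic regression standing in for a structured labeller) are assumptions for illustration only.

```python
# Minimal sketch: word vectors as per-token features for a labelling task
# such as POS-tagging. All data here is toy data.
import numpy as np
from gensim.models import Word2Vec
from sklearn.linear_model import LogisticRegression

# Train tiny embeddings inline so the example is self-contained; in practice
# these would be large pretrained vectors (word2vec, GloVe, etc.).
sents = [["the", "cat", "sat"], ["a", "dog", "ran"], ["the", "dog", "sat"]]
wv = Word2Vec(sents, vector_size=25, min_count=1, seed=0).wv

def embed(token):
    # OOV tokens fall back to a zero vector; the paper reports embeddings
    # particularly help for OOV words and out-of-domain data.
    return wv[token] if token in wv else np.zeros(wv.vector_size)

tokens = ["the", "cat", "sat"]
labels = ["DET", "NOUN", "VERB"]
X = np.vstack([embed(t) for t in tokens])
clf = LogisticRegression(max_iter=1000).fit(X, labels)
print(clf.predict(embed("dog").reshape(1, -1)))
```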
JMIR Medical Informatics | 2015
Hanna Suominen; Liyuan Zhou; Leif Hanlen; Gabriela Ferraro
Background: Over a tenth of preventable adverse events in health care are caused by failures in information flow. These failures are tangible in clinical handover; regardless of good verbal handover, from two-thirds to all of this information is lost after 3-5 shifts if notes are taken by hand, or not at all. Speech recognition and information extraction provide a way to fill out a handover form for clinical proofing and sign-off.

Objective: The objective of the study was to provide a recorded spoken handover, annotated verbatim transcriptions, and evaluations to support research in spoken and written natural language processing for filling out a clinical handover form. This dataset is based on synthetic patient profiles, thereby avoiding ethical and legal restrictions while maintaining efficacy for research in speech-to-text conversion and information extraction based on realistic clinical scenarios. We also introduce a Web app to demonstrate the system design and workflow.

Methods: We experiment with Dragon Medical 11.0 for speech recognition and CRF++ for information extraction. To compute features for information extraction, we also apply CoreNLP, MetaMap, and Ontoserver. Our evaluation uses cross-validation techniques to measure processing correctness.

Results: The data provided were a simulation of nursing handover, as recorded using a mobile device, built from simulated patient records and handover scripts, and spoken by an Australian registered nurse. Speech recognition correctly recognized 5276 of the 7277 words in our 100 test documents. We considered 50 mutually exclusive categories in information extraction and achieved an F1 (ie, the harmonic mean of precision and recall) of 0.86 in the category for irrelevant text and a macro-averaged F1 of 0.70 over the remaining 35 nonempty categories of the form in our 101 test documents.

Conclusions: The significance of this study hinges on opening our data, together with the related performance benchmarks and some processing software, to the research and development community for studying clinical documentation and language processing. The data are used in the CLEFeHealth 2015 evaluation laboratory for a shared task on speech recognition.
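As a rough illustration of the information-extraction step, the sketch below trains a linear-chain CRF over simple token features, using the sklearn-crfsuite package as a stand-in for the CRF++ toolkit named in the paper. The handover snippet, tag names, and features are invented for illustration and are not the study's 50-category form.

```python
# Minimal CRF sketch for slot filling over a spoken-handover transcript.
# sklearn-crfsuite stands in for CRF++; data and tags are toy examples.
import sklearn_crfsuite

def token_features(sent, i):
    word = sent[i]
    return {
        "lower": word.lower(),
        "is_digit": word.isdigit(),
        "prev": sent[i - 1].lower() if i > 0 else "<BOS>",
        "next": sent[i + 1].lower() if i < len(sent) - 1 else "<EOS>",
    }

train_sents = [["Patient", "in", "bed", "7", "is", "stable"]]
train_labels = [["PATIENT", "O", "O", "BED", "O", "STATUS"]]  # hypothetical tags

X = [[token_features(s, i) for i in range(len(s))] for s in train_sents]
crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit(X, train_labels)
print(crf.predict(X))
```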
International ACM SIGIR Conference on Research and Development in Information Retrieval | 2015
Mona Golestan Far; Scott Sanner; Mohamed Reda Bouadjenek; Gabriela Ferraro; David Hawking
In this paper, we investigate the influence of term selection on retrieval performance on the CLEF-IP prior art test collection, using the Description section of the patent query with Language Model (LM) and BM25 scoring functions. We find that an oracular relevance feedback system that extracts terms from the judged relevant documents far outperforms the baseline and performs twice as well on MAP as the best competitor in CLEF-IP 2010. We also find a very clear term selection value threshold for choosing useful terms. We noticed that most of the useful feedback terms are actually present in the original query and hypothesized that the baseline system could be substantially improved by removing negative query terms. We tried four simple automated approaches to identify negative terms for query reduction, but none of them notably improved on the baseline performance. However, we show that a simple, minimal interactive relevance feedback approach, in which terms are selected from only the first retrieved relevant document, outperforms the best result from CLEF-IP 2010, suggesting the promise of interactive methods for term selection in patent prior art search.
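For reference, the sketch below is a textbook Okapi BM25 scorer of the kind used in the paper's retrieval experiments. The k1 and b values are common defaults, and the toy corpus is an assumption; this is not the study's implementation.

```python
# Self-contained Okapi BM25 scorer over tokenised documents.
import math
from collections import Counter

def bm25_score(query_terms, doc_terms, corpus, k1=1.2, b=0.75):
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N     # average document length
    tf = Counter(doc_terms)
    score = 0.0
    for t in query_terms:
        df = sum(1 for d in corpus if t in d)   # document frequency
        if df == 0:
            continue
        idf = math.log(1 + (N - df + 0.5) / (df + 0.5))
        denom = tf[t] + k1 * (1 - b + b * len(doc_terms) / avgdl)
        score += idf * tf[t] * (k1 + 1) / denom
    return score

corpus = [["rotor", "blade", "assembly"], ["battery", "cell", "anode"]]
print(bm25_score(["rotor", "blade"], corpus[0], corpus))
```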
Knowledge-Based Systems | 2011
Gabriela Ferraro; Leo Wanner
Research on the extraction of content relations from text corpora is a high-priority topic in natural language processing. This is not surprising, since content relations form the backbone of any ontology, and ontologies are increasingly used in knowledge-based applications. However, most work so far has focused on the detection of a restricted number of prominent verbal relations, in particular IS-A, HAS-PART and CAUSE. Our application, which aims to provide comprehensive, easy-to-understand content representations of complex functional objects described in patent claims, needs to derive a large number of content relations that cannot be limited a priori. To cope with this problem, we take advantage of the fact that deep syntactic dependency structures of sentences capture all relevant content relations, albeit without any abstraction. We thus implement a three-step strategy. First, we parse the claims to retrieve the deep syntactic dependency structures, from which we then derive the content relations. Second, we generalize the obtained relations by clustering them according to semantic criteria, with the goal of uniting all sufficiently similar relations. Finally, we identify a suitable name for each generalized relation. To keep the scope of the article within reasonable limits and to allow for a comparison with state-of-the-art techniques, we focus on verbal relations.
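The relation-generalisation step, clustering relations by semantic similarity, can be sketched as follows. The hand-made 2-d vectors are toy stand-ins for real distributional representations, and k-means is only one of several possible clustering choices, not necessarily the paper's.

```python
# Toy sketch of generalising verbal relations by clustering: near-synonymous
# relation labels (attach/connect/fasten vs. rotate/spin) should land in the
# same cluster. The 2-d vectors are hand-made placeholders.
import numpy as np
from sklearn.cluster import KMeans

vectors = {
    "attach":  [0.90, 0.10],
    "connect": [0.80, 0.20],
    "fasten":  [0.85, 0.15],
    "rotate":  [0.10, 0.90],
    "spin":    [0.20, 0.80],
}
relations = list(vectors)
X = np.array([vectors[r] for r in relations])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
for c in sorted(set(labels)):
    print(c, [r for r, l in zip(relations, labels) if l == c])
```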
International Conference on Artificial Intelligence and Law | 2015
Mohamed Reda Bouadjenek; Scott Sanner; Gabriela Ferraro
Patents are used by legal entities to legally protect their inventions and represent a multi-billion dollar industry of licensing and litigation. In 2014, 326,033 patent applications were approved in the US alone -- a number that has doubled in the past 15 years and makes prior art search a daunting but necessary task in the patent application process. In this work, we investigate the efficacy of prior art search strategies from the perspective of an inventor who wishes to assess the patentability of their ideas before writing a full application. While much of the literature inspired by the evaluation framework of the CLEF-IP competition has aimed to assist patent examiners in assessing prior art for complete patent applications, less of this work has focused on patent search with queries representing partial applications. In the (partial) patent search setting, a query is often much longer than in other standard IR tasks; for example, the description section may contain hundreds or even thousands of words. While the length of such queries may suggest query reduction strategies to remove irrelevant terms, the intentional obfuscation and general language used in patents suggest that it may also help to expand queries with additional relevant terms. To assess the trade-offs among these pre-application prior art search strategies, we comparatively evaluate a variety of partial application search and query reformulation methods. Among numerous findings, querying with a full description, perhaps in conjunction with generic (non-patent-specific) query reduction methods, is recommended for best performance. However, we also find that querying with an abstract represents the best trade-off between writing effort and retrieval efficacy (querying with the description section leads to only marginal improvements), and that for such relatively short queries, generic query expansion methods help.
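One generic (non-patent-specific) query reduction strategy of the kind the paper evaluates can be sketched as ranking the description's terms by IDF against a reference corpus and keeping only the most discriminative ones. The corpus, stopword handling, and cut-off below are toy assumptions.

```python
# Minimal sketch of IDF-based query reduction for a long description query.
import math
from collections import Counter

corpus = [
    {"the", "a", "and", "rotor", "blade"},
    {"the", "a", "and", "battery", "anode"},
    {"the", "a", "and", "comprises", "assembly"},
]
N = len(corpus)

def idf(term):
    df = sum(1 for doc in corpus if term in doc)
    return math.log((N + 1) / (df + 1))   # smoothed inverse document frequency

description = "the rotor blade assembly comprises a rotor and a blade".split()
terms = Counter(description)
reduced = sorted(terms, key=idf, reverse=True)[:4]   # keep the top 4 terms
print(reduced)   # frequent function words score 0 and are dropped
```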
Language Resources and Evaluation | 2014
Gabriela Ferraro; Rogelio Nazar; Margarita Alonso Ramos; Leo Wanner
Collocations in the sense of idiosyncratic binary lexical co-occurrences are one of the biggest challenges for any language learner. Even advanced learners make collocation mistakes: they literally translate collocation elements from their native tongue, create new words as collocation elements, choose a wrong subcategorization for one of the elements, etc. Automatic collocation error detection and correction is therefore increasingly in demand. However, while state-of-the-art models predict with reasonable accuracy whether a given co-occurrence is a valid collocation or not, only a few of them manage to suggest appropriate corrections with an acceptable hit rate. Most often, a ranked list of correction options is offered, from which the learner then has to choose. This is clearly unsatisfactory. Our proposal focuses on this critical part of the problem in the context of the acquisition of Spanish as a second language. For collocation error detection, we use a frequency-based technique. To improve collocation error correction, we discuss three different metrics with respect to their capability to select the most appropriate correction of the miscollocations found in our learner corpus.
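The frequency-based detection idea can be sketched in a few lines: a candidate verb-noun pair is flagged as a likely miscollocation when its corpus frequency falls below a threshold. The bigram counts and threshold below are invented for illustration, not the study's learner-corpus statistics.

```python
# Toy sketch of frequency-based collocation error detection for Spanish L2.
from collections import Counter

corpus_bigrams = Counter({
    ("tomar", "decisión"): 120,   # valid Spanish collocation ("take a decision")
    ("hacer", "decisión"): 2,     # literal transfer from English "make a decision"
})

def is_miscollocation(verb, noun, threshold=10):
    # Pairs rarely (or never) attested in the reference corpus are flagged.
    return corpus_bigrams[(verb, noun)] < threshold

print(is_miscollocation("hacer", "decisión"))   # True: flagged as an error
print(is_miscollocation("tomar", "decisión"))   # False: accepted
```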
Proceedings of the 3rd Workshop on Predicting and Improving Text Readability for Target Reader Populations (PITR) | 2014
Gabriela Ferraro; Hanna Suominen; Jaume Nualart
Good readability of text is important to ensure efficient communication and eliminate risks of misunderstanding. Patent claims are an example of text whose readability is often poor. In this paper, we aim to improve claim readability through a clearer presentation of claim content. Our approach consists of segmenting the original claim content at two levels. First, an entire claim is segmented into the components of preamble, transitional phrase and body, using a rule-based approach. Second, a conditional random field is trained to segment the components into clauses. An alternative approach would have been to modify the claim content, which is, however, prone to also changing the meaning of this legal text. For both segmentation levels, we report results from a statistical evaluation of segmentation performance. In addition, a qualitative error analysis was performed to understand the problems underlying the clause segmentation task. Our accuracy in detecting the beginning and end of the preamble text is 1.00 and 0.97, respectively. For the transitional phrase, these numbers are 0.94 and 1.00, and for the body text, 1.00 and 1.00. Our precision and recall in clause segmentation are 0.77 and 0.76, respectively. The results give evidence for the feasibility of automated claim and clause segmentation, which may help not only inventors, researchers, and other laypeople to understand patents, but also patent experts to avoid future legal costs due to litigation.
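The first, rule-based segmentation level can be sketched with a single regular expression over standard transitional terms. Real claims are more varied than this one pattern, so this is a minimal illustration rather than the paper's rule set.

```python
# Minimal rule-based sketch: split a patent claim into preamble,
# transitional phrase, and body at a standard transitional term.
import re

TRANSITIONS = r"(comprising|consisting of|consisting essentially of)"

def segment_claim(claim):
    m = re.search(TRANSITIONS, claim, flags=re.IGNORECASE)
    if not m:
        return None   # no recognised transitional phrase found
    return {
        "preamble": claim[:m.start()].strip(" ,"),
        "transition": m.group(0),
        "body": claim[m.end():].strip(" :,"),
    }

claim = ("A fastening device, comprising: a bolt; and a nut "
         "threadedly engaged with the bolt.")
print(segment_claim(claim))
```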
Asia Information Retrieval Symposium | 2015
Dilesha Seneviratne; Shlomo Geva; Guido Zuccon; Gabriela Ferraro; Timothy Chappell; Magali Rezende Gouvêa Meireles
We propose a document signature approach to patent classification. Automatic patent classification is a challenging task because of the fast-growing number of patent applications filed every year and the complexity, size and nested hierarchical structure of patent taxonomies. In our proposal, the classification of a target patent is achieved through a k-nearest-neighbour search using Hamming distance on signatures generated from patents; the classification labels of the retrieved patents are weighted and combined to produce a patent classification code for the target patent. This method is motivated by the intuition that document signatures are more efficient than previous approaches to this task, which trained classifiers on the whole vocabulary feature set. Our empirical experiments also demonstrate that the combination of document signatures and k-nearest-neighbour search improves classification effectiveness, provided that enough data is used to generate the signatures.
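The pipeline can be sketched as: generate bit signatures from document vectors (here via random hyperplanes), find nearest neighbours by Hamming distance, and combine their labels with distance-weighted voting. The signature width, k, and the random training "patents" are toy assumptions, not the paper's setup.

```python
# Toy sketch: bit signatures + Hamming k-NN + weighted label voting.
import numpy as np

rng = np.random.default_rng(0)
DIM, BITS = 100, 64
planes = rng.normal(size=(BITS, DIM))     # random hyperplanes

def signature(vec):
    return (planes @ vec > 0).astype(np.uint8)   # one bit per hyperplane

def hamming(a, b):
    return int(np.count_nonzero(a != b))

# Hypothetical training patents: (feature vector, IPC-style label).
train = [(rng.normal(size=DIM), "H01L"), (rng.normal(size=DIM), "A61K")]
sigs = [(signature(v), label) for v, label in train]

def classify(vec, k=2):
    q = signature(vec)
    neighbours = sorted(sigs, key=lambda s: hamming(q, s[0]))[:k]
    votes = {}
    for sig, label in neighbours:
        votes[label] = votes.get(label, 0.0) + 1.0 / (1 + hamming(q, sig))
    return max(votes, key=votes.get)      # highest weighted vote wins

print(classify(rng.normal(size=DIM)))
```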
Australasian Document Computing Symposium | 2014
Donglu Wang; Gabriela Ferraro; Hanna Suominen; Osmat Azzam Jefferson
Debates on gene patents have necessitated the analysis of patents that disclose and reference human sequences. In this study, we built an automated classifier that assigns sequences to one of nine predefined categories according to their functional roles in patent claims, applying natural language processing and supervised learning techniques. To improve its correctness, we experimented with various feature mappings, resulting in a maximal accuracy of 79%.
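A minimal sketch of such a setup maps the claim text around a sequence mention to features (here TF-IDF, one of many possible feature mappings) and trains a supervised classifier over the role categories. The example texts and role labels below are invented for illustration, not the study's nine categories.

```python
# Toy sketch: text features + supervised classifier for sequence roles.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = ["an isolated polynucleotide encoding SEQ ID NO: 1",
         "a primer pair for amplifying SEQ ID NO: 2"]
roles = ["encoded_protein", "primer_probe"]   # hypothetical role labels

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                    LogisticRegression(max_iter=1000))
clf.fit(texts, roles)
print(clf.predict(["a probe hybridising to SEQ ID NO: 3"]))
```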
International Conference on Artificial Intelligence and Law | 2009
Nadjet Bouayad-Agha; Gerard Casamayor; Gabriela Ferraro; Simon Mille; Vanesa Vidal; Leo Wanner