Arantza Díaz de Ilarraza
University of the Basque Country
Publications
Featured research published by Arantza Díaz de Ilarraza.
conference on intelligent text processing and computational linguistics | 2004
Itziar Aduriz; Maxux J. Aranzabe; Jose Maria Arriola; Arantza Díaz de Ilarraza; Koldo Gojenola; Maite Oronoz; Larraitz Uria
This article presents a robust syntactic analyser for Basque and the different modules it contains. Each module is structured in different analysis layers, where each layer takes the information provided by the previous layer as its input, thus creating a progressively deeper syntactic analysis in cascade. This analysis is carried out using the Constraint Grammar (CG) formalism. Moreover, the article describes the standardisation of the parsing formats using XML.
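The cascade described above can be sketched in a few lines: each layer is a function that consumes the previous layer's output. The layer names and toy logic below are invented for illustration and are not the actual modules of the Basque analyser.

```python
# Minimal sketch of a cascaded analysis pipeline: each layer takes the
# previous layer's output as its input, deepening the analysis step by step.

def morphology(tokens):
    # toy layer: attach a crude part-of-speech guess to each token
    return [(t, "NOUN" if t[0].isupper() else "WORD") for t in tokens]

def chunking(tagged):
    # toy layer: group the tagged tokens into a single flat chunk
    return [{"chunk": [t for t, _ in tagged], "tags": [p for _, p in tagged]}]

def cascade(tokens, layers):
    result = tokens
    for layer in layers:          # each layer consumes the previous output
        result = layer(result)
    return result

analysis = cascade(["Etxea", "handia", "da"], [morphology, chunking])
```

The point of the pattern is that layers stay independent: a deeper layer never re-reads the raw text, only the structured output of the layer before it.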
international conference on computational linguistics | 2009
Iñaki Alegria; Arantza Díaz de Ilarraza; Gorka Labaka; Mikel Lersundi; Aingeru Mayor; Kepa Sarasola
We present an open architecture designed in a project for rule-based machine translation from Spanish into Basque. The main objective has been the construction of an open, reusable and interoperable framework which can be improved in the near future by combining it with the statistical model. The MT architecture reuses several open tools and is based on a unique XML format for the flow between the different modules, which eases the interaction among different developers of tools and resources. Since Basque is a resource-poor language, this is a key feature in our aim for future improvements and extensions of the engine. The result is open-source software which can be downloaded from matxin.sourceforge.net, and we think it could be adapted to translate between other languages with few resources.
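The single-XML-format idea can be illustrated with a minimal sketch: one XML document flows through the pipeline, and each module reads and enriches the shared tree rather than emitting its own format. The tag and attribute names below are invented for illustration and do not reproduce the actual Matxin interchange format.

```python
# Sketch of modules communicating through one shared XML document.
import xml.etree.ElementTree as ET

doc = ET.fromstring('<S><W form="casa"/></S>')

def lexical_transfer(tree):
    # a module annotates the shared tree instead of producing a new format
    for w in tree.iter("W"):
        w.set("target", {"casa": "etxe"}.get(w.get("form"), w.get("form")))
    return tree

doc = lexical_transfer(doc)
out = ET.tostring(doc, encoding="unicode")
```

Because every module speaks the same format, a module can be swapped out or developed by a different team without touching the rest of the pipeline.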
Machine Translation | 2011
Aingeru Mayor; Iñaki Alegria; Arantza Díaz de Ilarraza; Gorka Labaka; Mikel Lersundi; Kepa Sarasola
We present the first publicly available machine translation (MT) system for Basque. The fact that Basque is both a morphologically rich and less-resourced language makes the use of statistical approaches difficult, and raises the need to develop a rule-based architecture which can be combined in the future with statistical techniques. The MT architecture proposed reuses several open-source tools and is based on a unique XML format to facilitate the flow between the different modules, which eases the interaction among different developers of tools and resources. The result is the rule-based Matxin MT system, an open-source toolkit, whose first implementation translates from Spanish to Basque. We have performed innovative work on the following tasks: construction of a dependency analyser for Spanish, use of rich linguistic information to translate prepositions and syntactic functions (such as subject and object markers), construction of an efficient module for verbal chunk transfer, and design and implementation of modules for ordering words and phrases, independently of the source language.
Journal of Biomedical Informatics | 2015
Maite Oronoz; Koldo Gojenola; Alicia Pérez; Arantza Díaz de Ilarraza; Arantza Casillas
The advances achieved in Natural Language Processing make it possible to automatically mine information from electronically created documents. Many Natural Language Processing methods that extract information from texts make use of annotated corpora, but these are scarce in the clinical domain due to legal and ethical issues. In this paper we present the creation of the IxaMed-GS gold standard composed of real electronic health records written in Spanish and manually annotated by experts in pharmacology and pharmacovigilance. The experts mainly annotated entities related to diseases and drugs, but also relationships between entities indicating adverse drug reaction events. To help the experts in the annotation task, we adapted a general corpus linguistic analyzer to the medical domain. The quality of the annotation process in the IxaMed-GS corpus has been assessed by measuring the inter-annotator agreement, which was 90.53% for entities and 82.86% for events. In addition, the corpus has been used for the automatic extraction of adverse drug reaction events using machine learning.
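One common way to quantify inter-annotator agreement on entity annotations is pairwise F1 over exact (start, end, label) spans; the sketch below illustrates that idea with invented annotations, and the paper's exact agreement metric may differ.

```python
# Illustrative agreement-as-F1 over entity spans from two annotators.

def agreement(ann_a, ann_b):
    a, b = set(ann_a), set(ann_b)
    if not a and not b:
        return 1.0                      # trivially agree on "no entities"
    if not a or not b:
        return 0.0
    precision = len(a & b) / len(a)     # fraction of A's spans matched by B
    recall = len(a & b) / len(b)        # fraction of B's spans matched by A
    return 2 * precision * recall / (precision + recall)

annotator1 = {(0, 4, "DRUG"), (10, 18, "DISEASE")}
annotator2 = {(0, 4, "DRUG"), (10, 17, "DISEASE")}  # off-by-one boundary
score = agreement(annotator1, annotator2)  # 0.5: only one span matches exactly
```

Exact-match agreement is strict: a one-character boundary disagreement, as above, counts the span as a full mismatch.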
intelligent tutoring systems | 2004
Iraide Zipitria; Jon A. Elorriaga; Ana Arruarte; Arantza Díaz de Ilarraza
One of the goals remaining in Intelligent Tutoring Systems is to create applications to evaluate open-ended text in a human-like manner. The aim of this study is to produce the design for a fully automatic summary evaluation system that could stand in for human-like summarisation assessment. To achieve this goal, an empirical study has been carried out to identify the underlying cognitive processes. The sample studied comprises 15 expert raters on summary evaluation with different professional backgrounds in education. Pearson's correlation has been calculated to measure the level of inter-rater agreement, and stepwise linear regression to identify predictor variables and their weights. In addition, interviews with subjects provided qualitative information that could not be acquired numerically. Based on this research, a design for a fully automatic summary evaluation environment has been described.
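The first statistic the study relies on, Pearson's correlation between two raters' scores, is simple enough to write out in full. The rater scores below are invented; the study used 15 expert raters.

```python
# Pearson's r between two raters' summary scores, in pure Python.
from math import sqrt

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

rater_a = [7, 5, 9, 4, 6]   # hypothetical scores for five summaries
rater_b = [8, 5, 9, 3, 6]
r = pearson(rater_a, rater_b)
```

A value of r near 1 indicates the two raters rank and space the summaries similarly, which is the kind of agreement an automatic system would need to reproduce.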
IEEE Transactions on Audio, Speech, and Language Processing | 2009
Xabier Artola; Arantza Díaz de Ilarraza; Aitor Soroa; Aitor Sologaistoa
In this paper we present AWA, a general purpose Annotation Web Architecture for representing, storing, and accessing the information produced by different linguistic processors. The objective of AWA is to establish a coherent and flexible representation scheme that will be the basis for the exchange and use of linguistic information. In morphologically rich languages such as Basque it is necessary to represent and provide easy access to complex phenomena such as intraword structure, declension, derivation and composition features, constituent discontinuity (in multiword expressions), and so on. AWA provides a well-suited schema to deal with these phenomena. The annotation model relies on XML technologies for data representation, storage and retrieval. Typed feature structures are used as a representation schema for linguistic analyses. A consistent underlying data model, which captures the structure and relations contained in the information to be manipulated, has been identified and implemented. AWA is integrated into LPAF, a multilayered Language Processing and Annotation Framework, whose goal is the management and integration of diverse NLP components and resources. Moreover, we introduce EULIA, an annotation tool which exploits and manipulates the data created by the linguistic processors. Two real corpora have been processed and annotated within this framework.
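A feature-structure-style record for intraword structure can be sketched as nested dictionaries: the word form is decomposed into segments, each carrying its own features. The attribute names below are invented and do not reproduce AWA's actual typed feature structure schema.

```python
# Rough sketch of a feature-structure record for a declined Basque word form.
analysis = {
    "type": "word",
    "form": "etxean",                # "in the house"
    "segments": [
        {"type": "lemma", "form": "etxe", "pos": "noun"},
        {"type": "suffix", "form": "-an", "case": "inessive", "number": "sg"},
    ],
}

def case_of(word):
    # easy access to intraword structure: find the case-bearing segment
    return next(s["case"] for s in word["segments"] if "case" in s)
```

The point is that declension lives inside the word record itself, so a query like `case_of` never has to re-tokenize or re-analyse the surface form.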
finite-state methods and natural language processing | 2005
Iñaki Alegria; Arantza Díaz de Ilarraza; Gorka Labaka; Mikel Lersundi; Aingeru Mayor; Kepa Sarasola
We are developing a Spanish-Basque MT system using the traditional transfer model and based on shallow and dependency parsing. The project builds on the previous work of our group but is integrated in the OpenTrad initiative [2]. This abstract summarizes the current status of development of an FST grammar for the structural transfer of verb chains. This task is quite complex due to the large distance between the two languages. In the current implementation we are using the XRCE Finite State Tools [1].
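The core idea of finite-state transfer can be illustrated with a toy transducer written as a state-transition table. The states, symbols and the single verb-chain rule below are invented; the real grammar is written with the XRCE Finite State Tools and handles far richer patterns.

```python
# Toy finite-state transducer for structural transfer of one verb chain:
# Spanish "he comido" ("I have eaten") reordered into Basque "jan dut".
TRANSITIONS = {
    ("q0", "he"): ("q1", ""),            # auxiliary consumed, output deferred
    ("q1", "comido"): ("qf", "jan dut"), # main verb triggers reordered output
}

def transduce(tokens):
    state, out = "q0", []
    for tok in tokens:
        state, emit = TRANSITIONS[(state, tok)]
        if emit:
            out.append(emit)
    return " ".join(out) if state == "qf" else None

result = transduce(["he", "comido"])
```

Deferring output until the whole chain is recognised is what lets a transducer reorder auxiliary and main verb, which is exactly the kind of divergence the Spanish-Basque pair exhibits.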
international conference on computational linguistics | 2005
Arantza Díaz de Ilarraza; Koldo Gojenola; Maite Oronoz
This paper presents the design and development of a system for the detection and correction of syntactic errors in free texts. The system is composed of three main modules: a) a robust syntactic analyser, b) a compiler that will translate error processing rules, and c) a module that coordinates the results of the analyser, applying different combinations of the already compiled error rules. The use of the syntactic analyser (a) and the rule processor (b) is independent and not necessarily sequential. The specification language used for the description of the error detection/correction rules is abstract, general, declarative, and based on linguistic information.
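A declarative rule formalism of the kind described above can be sketched as rules-as-data interpreted by a small engine: the rule states the linguistic condition, and the engine walks the analysed sentence. The rule fields and the number-agreement example below are invented for illustration.

```python
# Minimal sketch of a declarative error-detection rule over analysed tokens.
RULES = [
    {
        "name": "det-noun-number-agreement",
        "pattern": lambda prev, cur: (
            prev["pos"] == "DET" and cur["pos"] == "NOUN"
            and prev["num"] != cur["num"]
        ),
        "message": "determiner and noun disagree in number",
    },
]

def check(tokens):
    errors = []
    for prev, cur in zip(tokens, tokens[1:]):   # adjacent token pairs
        for rule in RULES:
            if rule["pattern"](prev, cur):
                errors.append((rule["name"], cur["form"]))
    return errors

sentence = [
    {"form": "las", "pos": "DET", "num": "pl"},   # plural determiner ...
    {"form": "casa", "pos": "NOUN", "num": "sg"}, # ... singular noun
]
found = check(sentence)
```

Keeping the rules as data is what makes the rule set independent of the analyser, mirroring the paper's separation of the analyser, the rule compiler, and the coordinating module.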
international conference on computational linguistics | 2002
Arantza Díaz de Ilarraza; Aingeru Mayor; Kepa Sarasola
This paper presents the strategy and design of a highly efficient semiautomatic method for labelling the semantic features of common nouns, using semantic relationships between words and based on the information extracted from an electronic monolingual dictionary. The method, which uses genus data, specific relators and synonymy information, obtains an accuracy of over 99% and a coverage of 68.2% with regard to all the common nouns contained in a real corpus of over 1 million words, after the manual labelling of only 100 nouns.
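The genus-based part of such a method can be sketched as label propagation: a handful of nouns are labelled by hand, and every other noun inherits the label of the first labelled noun reached by following its genus (hypernym) chain in the dictionary. The dictionary entries below are invented English stand-ins.

```python
# Sketch of propagating semantic labels through dictionary genus links.
GENUS = {"dog": "animal", "oak": "tree", "tree": "plant"}
SEEDS = {"animal": "ANIMATE", "plant": "INANIMATE"}  # manually labelled nouns

def label(noun):
    seen = set()
    while noun not in SEEDS:
        if noun in seen or noun not in GENUS:
            return None          # cycle or no labelled ancestor: stay unlabelled
        seen.add(noun)
        noun = GENUS[noun]       # climb one genus link
    return SEEDS[noun]

result = label("oak")            # oak -> tree -> plant -> INANIMATE
```

This shape of propagation is what makes a tiny seed set go a long way: one hand-labelled genus term labels every noun whose chain passes through it, consistent with the paper's report of wide coverage from only 100 manually labelled nouns.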
Corpus Linguistics and Linguistic Theory | 2015
Mikel Iruskieta; Arantza Díaz de Ilarraza; Mikel Lersundi
This article presents a discourse annotation methodology based on Rhetorical Structure Theory and an empirical study of annotating a corpus of specialized medical texts in Basque. The annotation process includes two phases: segmentation and annotation of rhetorical relations. Phase one entails an initial study which leads to establishing linguistic criteria for sentence-based segmentation; the second phase focuses on annotation of rhetorical relations. After establishing discourse segments and rhetorical relations, the annotation process is analyzed and evaluated by means of the method commonly used in RST (Marcu 2000). Inconsistencies detected in the evaluation method lead the authors to redefine some of its criteria. As a result of this work, a small annotated Basque-language corpus is provided to the scientific community.