Arantza Díaz de Ilarraza

Publication

Featured research published by Arantza Díaz de Ilarraza.

Conference on Intelligent Text Processing and Computational Linguistics | 2004

A Cascaded Syntactic Analyser for Basque

Itziar Aduriz; Maxux J. Aranzabe; Jose Maria Arriola; Arantza Díaz de Ilarraza; Koldo Gojenola; Maite Oronoz; Larraitz Uria

This article presents a robust syntactic analyser for Basque and the modules it contains. Each module is structured in analysis layers, with each layer taking the information provided by the previous layer as its input, thus creating a gradually deeper syntactic analysis in cascade. This analysis is carried out using the Constraint Grammar (CG) formalism. Moreover, the article describes the standardisation process of the parsing formats using XML.

International Conference on Computational Linguistics | 2009

Transfer-Based MT from Spanish into Basque: Reusability, Standardization and Open Source

Iñaki Alegria; Arantza Díaz de Ilarraza; Gorka Labaka; Mikel Lersundi; Aingeru Mayor; Kepa Sarasola

We present an open architecture that we designed in a project for rule-based machine translation from Spanish into Basque. The main objective has been the construction of an open, reusable and interoperable framework which can be improved in the near future by combining it with a statistical model. The MT architecture reuses several open tools and is based on a unique XML format for the flow between the different modules, which eases the interaction among different developers of tools and resources. Since Basque is a resource-poor language, this is a key feature in our aim for future improvements and extensions of the engine. The result is open-source software which can be downloaded from matxin.sourceforge.net, and we think it could be adapted to translating between other languages with few resources.

Machine Translation | 2011

Matxin, an open-source rule-based machine translation system for Basque

Aingeru Mayor; Iñaki Alegria; Arantza Díaz de Ilarraza; Gorka Labaka; Mikel Lersundi; Kepa Sarasola

We present the first publicly available machine translation (MT) system for Basque. The fact that Basque is both a morphologically rich and less-resourced language makes the use of statistical approaches difficult, and raises the need to develop a rule-based architecture which can be combined in the future with statistical techniques. The MT architecture proposed reuses several open-source tools and is based on a unique XML format to facilitate the flow between the different modules, which eases the interaction among different developers of tools and resources. The result is the rule-based Matxin MT system, an open-source toolkit, whose first implementation translates from Spanish to Basque. We have performed innovative work on the following tasks: construction of a dependency analyser for Spanish, use of rich linguistic information to translate prepositions and syntactic functions (such as subject and object markers), construction of an efficient module for verbal chunk transfer, and design and implementation of modules for ordering words and phrases, independently of the source language.

Journal of Biomedical Informatics | 2015

On the creation of a clinical gold standard corpus in Spanish

Maite Oronoz; Koldo Gojenola; Alicia Pérez; Arantza Díaz de Ilarraza; Arantza Casillas

The advances achieved in Natural Language Processing make it possible to automatically mine information from electronically created documents. Many Natural Language Processing methods that extract information from texts make use of annotated corpora, but these are scarce in the clinical domain due to legal and ethical issues. In this paper we present the creation of the IxaMed-GS gold standard composed of real electronic health records written in Spanish and manually annotated by experts in pharmacology and pharmacovigilance. The experts mainly annotated entities related to diseases and drugs, but also relationships between entities indicating adverse drug reaction events. To help the experts in the annotation task, we adapted a general corpus linguistic analyzer to the medical domain. The quality of the annotation process in the IxaMed-GS corpus has been assessed by measuring the inter-annotator agreement, which was 90.53% for entities and 82.86% for events. In addition, the corpus has been used for the automatic extraction of adverse drug reaction events using machine learning.

Intelligent Tutoring Systems | 2004

From Human to Automatic Summary Evaluation

Iraide Zipitria; Jon A. Elorriaga; Ana Arruarte; Arantza Díaz de Ilarraza

One of the remaining goals in Intelligent Tutoring Systems is to create applications that evaluate open-ended text in a human-like manner. The aim of this study is to produce the design for a fully automatic summary evaluation system that could stand in for human-like summarisation assessment. To achieve this goal, an empirical study has been carried out to identify the underlying cognitive processes. The sample is composed of 15 expert raters of summaries with different professional backgrounds in education. Pearson’s correlation has been calculated to measure the level of inter-rater agreement, and stepwise linear regression has been used to identify the predictor variables and their weights. In addition, interviews with subjects provided qualitative information that could not be acquired numerically. Based on this research, a design of a fully automatic summary evaluation environment has been described.

IEEE Transactions on Audio, Speech, and Language Processing | 2009

Dealing With Complex Linguistic Annotations Within a Language Processing Framework

Xabier Artola; Arantza Díaz de Ilarraza; Aitor Soroa; Aitor Sologaistoa

In this paper we present AWA, a general purpose Annotation Web Architecture for representing, storing, and accessing the information produced by different linguistic processors. The objective of AWA is to establish a coherent and flexible representation scheme that will be the basis for the exchange and use of linguistic information. In morphologically rich languages such as Basque, it is necessary to represent and provide easy access to complex phenomena such as intraword structure, declension, derivation and composition features, constituent discontinuity (in multiword expressions) and so on. AWA provides a well-suited schema to deal with these phenomena. The annotation model relies on XML technologies for data representation, storage and retrieval. Typed feature structures are used as a representation schema for linguistic analyses. A consistent underlying data model, which captures the structure and relations contained in the information to be manipulated, has been identified and implemented. AWA is integrated into LPAF, a multilayered Language Processing and Annotation Framework, whose goal is the management and integration of diverse NLP components and resources. Moreover, we introduce EULIA, an annotation tool which exploits and manipulates the data created by the linguistic processors. Two real corpora have been processed and annotated within this framework.

Finite-State Methods and Natural Language Processing | 2005

An FST grammar for verb chain transfer in a Spanish-Basque MT system

Iñaki Alegria; Arantza Díaz de Ilarraza; Gorka Labaka; Mikel Lersundi; Aingeru Mayor; Kepa Sarasola

We are developing a Spanish-Basque MT system using the traditional transfer model and based on shallow and dependency parsing. The project is based on the previous work of our group but integrated into the OpenTrad initiative [2]. This abstract summarizes the current status of development of an FST grammar for the structural transfer of verb chains. This task is quite complex due to the large distance between the two languages. In the current implementation we are using the XRCE Finite State Tools [1].

International Conference on Computational Linguistics | 2005

Design and development of a system for the detection of agreement errors in Basque

Arantza Díaz de Ilarraza; Koldo Gojenola; Maite Oronoz

This paper presents the design and development of a system for the detection and correction of syntactic errors in free texts. The system is composed of three main modules: a) a robust syntactic analyser, b) a compiler that will translate error processing rules, and c) a module that coordinates the results of the analyser, applying different combinations of the already compiled error rules. The use of the syntactic analyser (a) and the rule processor (b) is independent and not necessarily sequential. The specification language used for the description of the error detection/correction rules is abstract, general, declarative, and based on linguistic information.

International Conference on Computational Linguistics | 2002

Semiautomatic labelling of semantic features

Arantza Díaz de Ilarraza; Aingeru Mayor; Kepa Sarasola

This paper presents the strategy and design of a highly efficient semiautomatic method for labelling the semantic features of common nouns, using semantic relationships between words, and based on the information extracted from an electronic monolingual dictionary. The method, which uses genus data, specific relators and synonymy information, obtains an accuracy of over 99% and a scope of 68.2% with regard to all the common nouns contained in a real corpus of over 1 million words, after the manual labelling of only 100 nouns.

Corpus Linguistics and Linguistic Theory | 2015

Establishing criteria for RST-based discourse segmentation and annotation for texts in Basque

Mikel Iruskieta; Arantza Díaz de Ilarraza; Mikel Lersundi

This article presents a discourse annotation methodology based on Rhetorical Structure Theory and an empirical study of annotating a corpus of specialized medical texts in Basque. The annotation process includes two phases: segmentation and annotation of rhetorical relations. Phase one entails an initial study which leads to establishing linguistic criteria for sentence-based segmentation; a second phase focuses on annotation of rhetorical relations. After establishing discourse segments and rhetorical relations, the annotation process is analyzed and evaluated by means of the method commonly used in RST (Marcu 2000). Inconsistencies detected in the evaluation method lead the authors to redefine some criteria of the evaluation method. As a result of this work, a small annotated Basque-language corpus is provided to the scientific community.
