Nerea Ezeiza
University of the Basque Country
Publications
Featured research published by Nerea Ezeiza.
MWE '04 Proceedings of the Workshop on Multiword Expressions: Integrating Processing | 2004
Iñaki Alegria; Olatz Ansa; Xabier Artola; Nerea Ezeiza; Koldo Gojenola; Ruben Urizar
This paper describes the representation of Basque Multiword Lexical Units and the automatic processing of Multiword Expressions. After stating which kinds of multiword expressions are processed at the current stage of the work, we present the representation schema of the corresponding lexical units in a general-purpose lexical database. Due to its expressive power, the schema can deal not only with fixed expressions but also with morphosyntactically flexible constructions. It also allows us to lemmatize word combinations as a unit and yet parse the components individually if necessary. Moreover, we describe HABIL, a tool for the automatic processing of these expressions, and give some evaluation results. This work should be placed in the general framework of written Basque processing tools, which currently ranges from the tokenization and segmentation of single words up to the syntactic tagging of general texts.
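The idea of lemmatizing a word combination as a unit while keeping its components parseable can be illustrated with a minimal sketch. This is not HABIL or its representation schema; the lexicon entries and field names below are illustrative, and only fixed (contiguous) expressions are handled:

```python
# A toy fixed-MWE lemmatizer: multiword entries are matched greedily,
# longest first, and emitted as one lemma while the component tokens
# remain available for individual parsing.

MWE_LEXICON = {
    ("hala", "ere"): "hala ere",          # illustrative fixed expressions
    ("noizean", "behin"): "noizean behin",
}

def lemmatize(tokens):
    """Greedy left-to-right matching; non-MWE tokens pass through."""
    out, i = [], 0
    while i < len(tokens):
        for span in (3, 2):  # try longer matches first
            cand = tuple(t.lower() for t in tokens[i:i + span])
            if cand in MWE_LEXICON:
                out.append({"lemma": MWE_LEXICON[cand],
                            "components": tokens[i:i + span]})
                i += span
                break
        else:
            out.append({"lemma": tokens[i].lower(),
                        "components": [tokens[i]]})
            i += 1
    return out
```

Flexible constructions, which the paper's schema also covers, would additionally need morphosyntactic constraints on each slot rather than exact string matching.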
iberoamerican congress on pattern recognition | 2003
K. López de Ipiña; Manuel Graña; Nerea Ezeiza; M. Hernández; Ekaitz Zulueta; Aitzol Ezeiza; C. Tovar
The selection of appropriate Lexical Units (LUs) is an important issue in the development of Continuous Speech Recognition (CSR) systems. Words have classically been used as the recognition unit in most of them, but proposals of non-word units are beginning to arise. Basque is an agglutinative language with some structure inside words, for which non-word, morpheme-like units could be an appropriate choice. In this work, a statistical analysis of the units obtained after morphological segmentation has been carried out. This analysis shows a potential increase in confusion rates in CSR systems, due to the growth of the set of acoustically similar, short morphemes. Thus, several proposals of Lexical Units are analysed to deal with the problem. Measures of phonetic perplexity and speech recognition rates have been computed using different sets of units and, based on these measures, a set of alternative non-word units has been selected.
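Perplexity over a candidate unit inventory is a standard way to compare unit sets like those in this paper. The following is a generic sketch (a unigram model with add-one smoothing), not the paper's actual perplexity computation:

```python
import math
from collections import Counter

def unigram_perplexity(train_units, test_units):
    """Perplexity of an add-one-smoothed unigram model over a unit
    inventory -- a rough proxy for comparing lexical-unit sets."""
    counts = Counter(train_units)
    vocab = len(counts) + 1          # +1 slot for unseen units
    total = sum(counts.values())
    log_prob = 0.0
    for u in test_units:
        p = (counts.get(u, 0) + 1) / (total + vocab)
        log_prob += math.log2(p)
    return 2 ** (-log_prob / len(test_units))
```

A unit set that splits words into many short, frequent morphemes typically lowers perplexity but, as the abstract notes, can raise acoustic confusability, so both measures are needed.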
international conference on implementation and application of automata | 2001
Iñaki Alegria; Maxux J. Aranzabe; Nerea Ezeiza; Aitzol Ezeiza; Ruben Urizar
This paper describes the components used in the design and implementation of NLP tools for Basque. These components are based on finite state technology and are devoted to the morphological analysis of Basque, an agglutinative pre-Indo-European language. We think that our design can be interesting for the treatment of other languages. The main components developed are a general and robust morphological analyser/generator and a spelling checker/corrector for Basque named Xuxen. The analyser is a basic tool for current and future work on NLP of Basque, such as the lemmatiser/tagger Euslem, an Intranet search engine or an assistant for verse-making.
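The core operation of a morphological analyser for an agglutinative language is segmenting a surface form into a stem plus licensed suffixes. The sketch below is a drastically simplified stand-in for a finite-state analyser (no two-level rules, only exact concatenation); the stems, suffixes, and tags are illustrative:

```python
# A toy stem+suffix analyser in the spirit of finite-state morphology:
# every split of the surface form is checked against a stem lexicon and
# a suffix lexicon.

STEMS = {"etxe": "NOUN"}                              # illustrative stem
SUFFIXES = {"a": "DET", "an": "INE", "tik": "ABL"}    # illustrative endings

def analyse(word):
    """Return all (stem, stem_tag, suffix, suffix_tag) analyses."""
    analyses = []
    for i in range(1, len(word) + 1):
        stem, suffix = word[:i], word[i:]
        if stem in STEMS and (suffix == "" or suffix in SUFFIXES):
            analyses.append((stem, STEMS[stem], suffix, SUFFIXES.get(suffix, "")))
    return analyses
```

A real two-level analyser additionally applies phonological alternation rules between the lexical and surface forms, and chains suffixes through continuation classes rather than a flat lexicon.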
conference of the european chapter of the association for computational linguistics | 1993
Itziar Aduriz; Eneko Agirre; Iñaki Alegria; Xabier Arregi; Jose Mari Arriola; Xabier Artola; A. Díaz de Ilarraza; Nerea Ezeiza; Montse Maritxalar; Kepa Sarasola; Miriam Urkia
Xuxen is a spelling checker/corrector for Basque which is going to be commercialized next year. The checker recognizes a word-form if a correct morphological breakdown is allowed. The morphological analysis is based on two-level morphology. The correction method distinguishes between orthographic errors and typographical errors. • Typographical errors (or mistypings) are non-cognitive errors which do not follow linguistic criteria. • Orthographic errors are cognitive errors which occur when the writer does not know or has forgotten the correct spelling for a word. They are more persistent because of their cognitive nature, they leave a worse impression and, finally, their treatment is an interesting application for language standardization purposes.
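Typographical errors of the kind described here are classically modelled as single-character slips. The sketch below generates edit-distance-1 candidates and filters them against a lexicon; Xuxen itself accepts a word via morphological breakdown rather than a flat word list, so the `lexicon` set here is a simplifying assumption:

```python
def edits1(word, alphabet="abcdefghijklmnopqrstuvwxyz"):
    """All strings at edit distance 1: deletions, transpositions,
    substitutions, and insertions -- the classic typo model."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    substitutes = [a + c + b[1:] for a, b in splits if b for c in alphabet]
    inserts = [a + c + b for a, b in splits for c in alphabet]
    return set(deletes + transposes + substitutes + inserts)

def correct(word, lexicon):
    """Accept a recognised word; otherwise propose distance-1 candidates."""
    if word in lexicon:
        return [word]
    return sorted(edits1(word) & lexicon)
```

Orthographic errors, by contrast, are handled better with error-specific rules (e.g. common confusions in the standardized spelling), since they are systematic rather than random slips.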
text speech and dialogue | 2011
Izaskun Fernández; Iñaki Alegria; Nerea Ezeiza
Resolving the Named Entity Disambiguation task with a small knowledge base makes the task more challenging. Concretely, we present an evaluation of state-of-the-art methods for Basque NE disambiguation based on the Basque Wikipedia. We have used MFS, VSM, ESA and UKB to link any ambiguous surface NE form occurring in a text with its corresponding entry in the Basque Wikipedia. We have analysed their performance on different corpora and, as expected, most of them perform worse than when using large Wikipedias such as the English version, but we think these results are more realistic for less-resourced languages. We propose a new normalization factor for ESA to minimise the effect of the knowledge base size.
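The VSM baseline among the methods compared here can be sketched compactly: score each candidate Wikipedia entry by the cosine similarity between the mention's context and the entry's article text. This is a generic illustration, not the paper's implementation or its ESA normalization:

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity of two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def link(context_words, candidates):
    """Pick the candidate entry (name -> article words) whose text is
    most similar to the mention context."""
    ctx = Counter(context_words)
    return max(candidates, key=lambda name: cosine(ctx, Counter(candidates[name])))
```

With a small Wikipedia the candidate articles are short and sparse, which is exactly the regime where a size-sensitive normalization, as proposed in the paper for ESA, becomes important.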
Proceedings of 2002 IEEE Workshop on Speech Synthesis, 2002. | 2002
K. López de Ipiña; Nerea Ezeiza; Germán Bordel; Manuel Graña
Morphological information is traditionally used to develop high-quality text-to-speech (TTS) and automatic speech recognition (ASR) systems. The use of this information improves the naturalness and intelligibility of TTS synthesis and provides an appropriate way to select lexical units (LUs) for ASR. Basque is an agglutinative language with a complex structure inside words, and morphological information is essential in both TTS and ASR. In this work, an automatic morphological segmentation tool oriented to TTS and ASR tasks is presented.
iberoamerican congress on pattern recognition | 2003
K. López de Ipiña; Manuel Graña; Nerea Ezeiza; M. Hernández; Ekaitz Zulueta; Aitzol Ezeiza
This paper presents a new methodology, based on classical decision trees, to obtain a suitable set of context-dependent sublexical units for Basque Continuous Speech Recognition (CSR). The original method proposed by Bahl [1] was applied as the benchmark. Then two new features were added: a data-massaging step to emphasise the data, and a fast and efficient Growing and Pruning algorithm for decision tree (DT) construction. In addition, the use of the new context-dependent units to build word models was addressed. The benchmark Bahl approach gave recognition rates clearly outperforming those of context-independent phone-like units. Finally, the new methodology improves over the benchmark DT approach.
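The growing step of such trees repeatedly picks the phonetic-context question that best separates the data, usually by entropy reduction. The sketch below shows only that split-selection step; the question names and toy contexts are illustrative and not taken from the paper:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a label multiset, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_question(samples, questions):
    """Pick the context question (name -> predicate) with the largest
    entropy reduction over (context, label) samples -- the core
    split-selection step when growing a context decision tree."""
    base = entropy([lab for _, lab in samples])
    def gain(item):
        _, q = item
        yes = [lab for ctx, lab in samples if q(ctx)]
        no = [lab for ctx, lab in samples if not q(ctx)]
        if not yes or not no:
            return 0.0
        w = len(yes) / len(samples)
        return base - (w * entropy(yes) + (1 - w) * entropy(no))
    return max(questions.items(), key=gain)[0]
```

Growing splits until a stopping criterion is met and pruning then removes splits that do not generalise, which is where the paper's Growing and Pruning algorithm comes in.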
text speech and dialogue | 2016
Arantxa Otegi; Nerea Ezeiza; Iakes Goenaga; Gorka Labaka
This work describes the initial stage of designing and implementing a modular chain of Natural Language Processing tools for Basque. The main characteristic of this chain is the deep morphosyntactic analysis carried out by the first tool of the chain and the use of these morphologically rich annotations by the subsequent linguistic processing tools. It is designed following a modular approach, making its processors easy to use. Two tools have been adapted and integrated into the chain so far, and are ready to use and freely available: the morphosyntactic analyzer and PoS tagger, and the dependency parser. We have evaluated these tools and obtained competitive results. Furthermore, we have tested the robustness of the tools on an extensive processing of Basque documents in various research projects.
meeting of the association for computational linguistics | 1998
Nerea Ezeiza; Iñaki Alegria; Jose Maria Arriola; Ruben Urizar; Itziar Aduriz
Proceedings of the 7th EURALEX International Congress | 1996
Itziar Aduriz; Izaskun Aldezabal; Iñaki Alegria; Xabier Artola; Nerea Ezeiza; Ruben Urizar