Xabier Artola
University of the Basque Country
Publications
Featured research published by Xabier Artola.
MWE '04 Proceedings of the Workshop on Multiword Expressions: Integrating Processing | 2004
Iñaki Alegria; Olatz Ansa; Xabier Artola; Nerea Ezeiza; Koldo Gojenola; Ruben Urizar
This paper describes the representation of Basque Multiword Lexical Units and the automatic processing of Multiword Expressions. After discussing which kinds of multiword expressions are processed at the current stage of the work, we present the representation schema of the corresponding lexical units in a general-purpose lexical database. Due to its expressive power, the schema can deal not only with fixed expressions but also with morphosyntactically flexible constructions. It also allows us to lemmatize word combinations as a unit and yet parse the components individually if necessary. Moreover, we describe HABIL, a tool for the automatic processing of these expressions, and we give some evaluation results. This work must be placed in a general framework of written Basque processing tools, which currently ranges from the tokenization and segmentation of single words up to the syntactic tagging of general texts.
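A minimal sketch of what such a lexical-database entry could look like: a multiword unit is lemmatized as a single item while each component keeps its own analysis, and flexible constructions are allowed to be non-contiguous. The schema, field names, and the example entry below are illustrative assumptions, not the actual database design or HABIL's internals.

```python
from dataclasses import dataclass, field
from typing import Optional, List

@dataclass
class Component:
    """One word of a multiword lexical unit (illustrative schema, not the real database)."""
    lemma: str
    pos: str
    fixed_form: Optional[str] = None   # frozen surface form, or None if it inflects freely

@dataclass
class MultiwordEntry:
    """A multiword lexical unit that is lemmatized as a single item."""
    mw_lemma: str
    components: List[Component] = field(default_factory=list)
    contiguous: bool = True            # flexible constructions may allow intervening words

# Illustrative entry: a noun-verb combination whose noun is frozen and whose verb inflects.
entry = MultiwordEntry(
    mw_lemma="kontuan hartu",          # roughly "to take into account"
    components=[
        Component(lemma="kontu", pos="NOUN", fixed_form="kontuan"),
        Component(lemma="hartu", pos="VERB"),
    ],
    contiguous=False,
)

# A lemmatizer can emit the unit lemma while each component keeps its own analysis.
print(entry.mw_lemma, "->", [(c.lemma, c.pos) for c in entry.components])
```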
International Conference on Computational Linguistics | 2000
Itziar Aduriz; Eneko Agirre; Izaskun Aldezabal; Iñaki Alegria; Xabier Arregi; Jose Mari Arriola; Xabier Artola; Koldo Gojenola; A. Maritxalar; Kepa Sarasola; Miriam Urkia
Agglutinative languages present rich morphology, and for some applications they need deep analysis at word level. The work presented here proposes a model for designing a full morphological analyzer. The model integrates the two-level formalism and a unification-based formalism. In contrast to other works, we propose to separate the treatment of sequential and non-sequential morphotactic constraints. Sequential constraints are applied in the segmentation phase, and non-sequential ones in the final feature-combination phase. Early application of sequential morphotactic constraints during the segmentation process makes an efficient implementation of the full morphological analyzer feasible. The result of this research has been the design and implementation of a full morphosyntactic analysis procedure for each word in unrestricted Basque texts.
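A toy sketch of the two-phase idea described above: sequential morphotactic constraints restrict which segmentations are produced, and non-sequential constraints are checked afterwards when the features of the accepted morphemes are combined. The lexicon, the continuation table, and the feature check are invented for illustration and are far simpler than the actual analyzer.

```python
# Toy lexicon: surface morpheme -> (category, features). Invented data, not the real Basque lexicon.
LEXICON = {
    "etxe": ("stem",   {"pos": "NOUN"}),
    "ko":   ("suffix", {"case": "LOC_GEN"}),
    "a":    ("suffix", {"det": "SG"}),
}

# Sequential constraint: which category may follow which (applied during segmentation).
MAY_FOLLOW = {"stem": {"suffix"}, "suffix": {"suffix"}}

def segment(word, start=0, prev_cat=None, path=()):
    """Enumerate segmentations that satisfy the sequential morphotactics."""
    if start == len(word):
        yield path
        return
    for morph, (cat, feats) in LEXICON.items():
        if word.startswith(morph, start) and (prev_cat is None or cat in MAY_FOLLOW[prev_cat]):
            yield from segment(word, start + len(morph), cat, path + ((morph, feats),))

def combine(path):
    """Non-sequential phase: combine the features of the whole analysis (toy check: no clashes)."""
    result = {}
    for _, feats in path:
        for key, value in feats.items():
            if key in result and result[key] != value:
                return None            # feature clash -> analysis rejected
            result[key] = value
    return result

for analysis in segment("etxekoa"):    # etxe + ko + a
    feats = combine(analysis)
    if feats is not None:
        print([morph for morph, _ in analysis], feats)
```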
IEEE Transactions on Audio, Speech, and Language Processing | 2009
Xabier Artola; Arantza Díaz de Ilarraza; Aitor Soroa; Aitor Sologaistoa
In this paper we present AWA, a general-purpose Annotation Web Architecture for representing, storing, and accessing the information produced by different linguistic processors. The objective of AWA is to establish a coherent and flexible representation scheme that will be the basis for the exchange and use of linguistic information. In morphologically rich languages such as Basque, it is necessary to represent and provide easy access to complex phenomena such as intra-word structure, declension, derivation and composition features, constituent discontinuity (in multiword expressions), and so on. AWA provides a well-suited schema to deal with these phenomena. The annotation model relies on XML technologies for data representation, storage and retrieval. Typed feature structures are used as a representation schema for linguistic analyses. A consistent underlying data model, which captures the structure and relations contained in the information to be manipulated, has been identified and implemented. AWA is integrated into LPAF, a multilayered Language Processing and Annotation Framework, whose goal is the management and integration of diverse NLP components and resources. Moreover, we introduce EULIA, an annotation tool which exploits and manipulates the data created by the linguistic processors. Two real corpora have been processed and annotated within this framework.
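A minimal sketch of the general idea of stand-off annotation with typed feature structures serialized as XML: a token layer points back into the raw text by offsets, and an analysis layer attaches a feature structure to a token. All element and attribute names below are invented for illustration and are not the actual AWA schema.

```python
import xml.etree.ElementTree as ET

text = "etxean bizi da"                       # raw text the annotations point into

doc = ET.Element("annotation")
tok_layer = ET.SubElement(doc, "tokens")
ana_layer = ET.SubElement(doc, "analyses")

# Token layer: character offsets into the raw text (stand-off, the text itself is untouched).
offset = 0
for i, word in enumerate(text.split()):
    start = text.index(word, offset)
    end = start + len(word)
    offset = end
    ET.SubElement(tok_layer, "token", id=f"w{i}", frm=str(start), to=str(end))

# Analysis layer: one typed feature structure attached to the first token.
fs = ET.SubElement(ana_layer, "fs", type="morph", target="w0")
ET.SubElement(fs, "f", name="lemma").text = "etxe"
ET.SubElement(fs, "f", name="case").text = "INE"   # inessive

print(ET.tostring(doc, encoding="unicode"))
```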
Conference of the European Chapter of the Association for Computational Linguistics | 1993
Itziar Aduriz; Eneko Agirre; Iñaki Alegria; Xabier Arregi; Jose Mari Arriola; Xabier Artola; A. Díaz de Ilarraza; Nerea Ezeiza; Montse Maritxalar; Kepa Sarasola; Miriam Urkia
Xuxen is a spelling checker/corrector for Basque which is going to be commercialized next year. The checker recognizes a word-form if a correct morphological breakdown is allowed. The morphological analysis is based on two-level morphology. The correction method distinguishes between orthographic errors and typographical errors. Typographical errors (or mistypings) are non-cognitive errors which do not follow linguistic criteria. Orthographic errors are cognitive errors which occur when the writer does not know or has forgotten the correct spelling of a word. They are more persistent because of their cognitive nature, they leave a worse impression, and their treatment is an interesting application for language standardization purposes.
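A toy sketch of the distinction drawn above: typographical slips can be handled with generic edit-distance candidates, while orthographic errors call for language-specific rewrite rules that capture known confusions. The vocabulary and the single rewrite rule below are invented for illustration and are not Xuxen's actual rules.

```python
# Invented toy data: a tiny vocabulary and one orthographic rewrite rule (a dropped silent "h").
VOCAB = {"zuhaitz", "etxe", "gizon"}
ORTHOGRAPHIC_RULES = [("u", "uh")]

def edit1(word):
    """All strings at edit distance 1 (deletions, substitutions, insertions)."""
    letters = "abcdefghijklmnopqrstuvwxyz"
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {a + b[1:] for a, b in splits if b}
    subs = {a + c + b[1:] for a, b in splits if b for c in letters}
    inserts = {a + c + b for a, b in splits for c in letters}
    return deletes | subs | inserts

def correct(word):
    if word in VOCAB:
        return []
    # Orthographic pass: apply competence-error rewrite rules first.
    ortho = {word.replace(src, dst) for src, dst in ORTHOGRAPHIC_RULES}
    hits = sorted(ortho & VOCAB)
    if hits:
        return hits
    # Typographical pass: fall back to generic edit-distance candidates.
    return sorted(edit1(word) & VOCAB)

print(correct("zuaitz"))   # orthographic: missing "h" -> ['zuhaitz']
print(correct("etxr"))     # typographical slip      -> ['etxe']
```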
Natural Language Engineering | 1996
Eneko Agirre; Xabier Arregi; Xabier Artola; A. Díaz de Ilarraza; Kepa Sarasola; Aitor Soroa
This paper discusses different issues in the construction and knowledge representation of an intelligent dictionary help system. The Intelligent Dictionary Help System (IDHS) is conceived as a monolingual (explanatory) dictionary system for human use (Artola and Evrard, 1992). The fact that it is intended for people instead of automatic processing distinguishes it from other systems dealing with the acquisition of semantic knowledge from conventional dictionaries. The system provides various ways of accessing the data, making it possible to deduce implicit knowledge from the explicit dictionary information. IDHS provides reasoning mechanisms analogous to those used by humans when they consult a dictionary. The user-level functionality of the system has been specified and a prototype has been implemented (Agirre et al., 1994a). A methodology for the extraction of semantic knowledge from a conventional dictionary is described. The method followed in the construction of the phrasal pattern hierarchies required by the parser (Alshawi, 1989) is based on an empirical study carried out on the structure of definition sentences. The results of its application to a real dictionary have shown that the parsing method is particularly suited to the analysis of short definition sentences, as was the case with the source dictionary. As a result of this process, the characterization of the different lexical-semantic relations between senses is established by means of semantic rules (attached to the patterns); these rules are used for the initial construction of the Dictionary Knowledge Base (DKB). The representation schema proposed for the DKB (Agirre et al., 1994b) is basically a semantic network of frames representing word senses. After construction of the initial DKB, several enrichment processes are performed on it to add new facts; these processes are based on the exploitation of the properties of lexical-semantic relations, and also on specially conceived deduction mechanisms. The results of the enrichment processes show the suitability of the chosen representation schema for deducing implicit knowledge. Erroneous deductions are mainly due to incorrect word sense disambiguation.
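A minimal sketch of the kind of enrichment step described above: implicit facts are added to a small sense network by exploiting a property of a lexical-semantic relation (here, transitivity of hypernymy). The senses and links are invented examples, not the actual DKB contents.

```python
# Tiny sense network: sense -> set of direct hypernym senses (invented examples).
HYPERNYM = {
    "oak_1":   {"tree_1"},
    "tree_1":  {"plant_1"},
    "plant_1": set(),
}

def enrich_transitive(relation):
    """Add the implicit links implied by transitivity until a fixed point is reached."""
    changed = True
    while changed:
        changed = False
        for sense, targets in relation.items():
            for target in list(targets):
                for inherited in relation.get(target, set()):
                    if inherited not in targets:
                        targets.add(inherited)
                        changed = True
    return relation

enrich_transitive(HYPERNYM)
print(sorted(HYPERNYM["oak_1"]))   # ['plant_1', 'tree_1'] after enrichment
```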
Natural Language Engineering | 1999
Eneko Agirre; Xabier Arregi; Xabier Artola; A. Díaz de Ilarraza; Kepa Sarasola; Aitor Soroa
This paper focuses on the design methodology of the MultiLingual Dictionary-System (MLDS), a human-oriented tool for assisting translators in the task of translating lexical units, conceived from studies carried out with translators. We describe the model adopted for the representation of multilingual dictionary-knowledge. Such a model allows an enriched exploitation of the lexical-semantic relations extracted from dictionaries. In addition, MLDS is supplied with knowledge about the use of dictionaries in the process of lexical translation, which was elicited by means of empirical methods and specified in a formal language. The dictionary-knowledge, along with the task-oriented knowledge, is used to offer the translator active, anticipative and intelligent assistance.
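One way to picture how dictionary-knowledge and task-oriented knowledge might combine, as a rough sketch only: bilingual dictionary links propose candidate translations, and an invented task rule (back-translation checking) filters them. The entries and the rule are illustrative assumptions, not MLDS's actual knowledge or behaviour.

```python
# Invented bilingual dictionary links: (source lemma, target lemma).
EU_EN = {("etxe", "house"), ("etxe", "home"), ("zuhaitz", "tree")}
EN_EU = {("house", "etxe"), ("tree", "zuhaitz")}

def candidates(lemma):
    """Dictionary-knowledge: follow translation links from the source lemma."""
    return sorted(tgt for src, tgt in EU_EN if src == lemma)

def assisted_translation(lemma):
    """Illustrative task rule: prefer candidates that back-translate to the query."""
    cands = candidates(lemma)
    confirmed = [c for c in cands if (c, lemma) in EN_EU]
    return confirmed or cands

print(assisted_translation("etxe"))   # ['house'] -- 'home' lacks a back-translation link
```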
Language and Technology Conference | 2009
Iñaki Alegria; Maxux J. Aranzabe; Xabier Arregi; Xabier Artola; Arantza Díaz de Ilarraza; Aingeru Mayor; Kepa Sarasola
We present some Language Technology applications and resources that have proven to be valuable tools to promote the use of Basque, a low-density language. We also present the strategy we have followed for almost twenty years to develop those tools and derived applications on top of an integrated environment of language resources, language tools and other applications. In our opinion, if Basque is now in a fairly good position in Language Technology, it is because those guidelines have been followed.
Natural Language Engineering | 2008
Xabier Artola; Aitor Soroa
The design and construction of lexical resources is a critical issue in Natural Language Processing (NLP). Real-world NLP systems need large-scale lexica, which provide rich information about words and word senses at all levels: morphologic, syntactic, lexical semantics, etc., but the construction of lexical resources is a difficult and costly task. The last decade has been highly influenced by the notion of reusability, that is, the use of the information of existing lexical resources in constructing new ones. It is unrealistic, however, to expect that the great variety of available lexical information resources could be converted into a single and standard representation schema in the near future. The purpose of this article is to present the ELHISA system, a software architecture for the integration of heterogeneous lexical information. We address, from the point of view of the information integration area, the problem of querying very different existing lexical information sources using a unique and common query language. The integration in ELHISA is performed in a logical way, so that the lexical resources do not undergo any modification when they are integrated into the system. ELHISA is primarily defined as a consultation system for accessing structured lexical information, and therefore it does not have the capability to modify or update the underlying information. For this purpose, a General Conceptual Model (GCM) for describing diverse lexical data has been conceived. The GCM establishes a fixed vocabulary describing objects in the lexical information domain, their attributes, and the relationships among them. To integrate the lexical resources into the federation, a Source Conceptual Model (SCM) is built on top of each one, which represents the lexical objects occurring in each particular source. To answer the user queries, ELHISA must access the integrated resources, and, hence, it must translate the query expressed in GCM terms into queries formulated in terms of the SCM of each source. The relation between the GCM and the SCMs is explicitly described by means of mapping rules called Content Description Rules. Data integration at the extensional level is achieved by means of a data cleansing process, which is needed to compare the data arriving from different sources; the object identification step is carried out in this process. Based on this architecture, a prototype named ELHISA has been built, and five resources covering a broad scope have been integrated into it so far for testing purposes. The fact that such heterogeneous resources have been integrated with ease into the system shows, in the opinion of the authors, the suitability of the approach taken.
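A schematic sketch of the query-translation idea described above: a query stated over the common vocabulary (GCM) is rewritten, via per-source mapping rules, into queries over each source's own vocabulary (SCM). The attribute names, value mappings, and sources below are invented for illustration; the actual Content Description Rules are considerably richer.

```python
# A GCM query as attribute constraints over the common vocabulary (invented names).
gcm_query = {"lemma": "etxe", "pos": "NOUN"}

# Per-source mapping rules: GCM attribute -> SCM attribute, plus a value mapping for "pos".
MAPPINGS = {
    "wordlist_A": {"lemma": "headword", "pos": "category"},
    "lexicon_B":  {"lemma": "entry",    "pos": "cat"},
}
POS_VALUES = {"wordlist_A": {"NOUN": "noun"}, "lexicon_B": {"NOUN": "N"}}

def translate(query, source):
    """Rewrite a GCM query into the vocabulary of one federated source."""
    attr_map = MAPPINGS[source]
    scm_query = {}
    for gcm_attr, value in query.items():
        if gcm_attr == "pos":
            value = POS_VALUES[source].get(value, value)
        scm_query[attr_map[gcm_attr]] = value
    return scm_query

for source in MAPPINGS:
    print(source, translate(gcm_query, source))
# wordlist_A {'headword': 'etxe', 'category': 'noun'}
# lexicon_B  {'entry': 'etxe', 'cat': 'N'}
```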
Literary and Linguistic Computing | 1996
Iñaki Alegria; Xabier Artola; Kepa Sarasola; Miriam Urkia
Knowledge-Based Systems | 2015
Rodrigo Agerri; Xabier Artola; Zuhaitz Beloki; German Rigau; Aitor Soroa