Aitor Soroa
University of the Basque Country
Network
Latest external collaboration on country level. Dive into details by clicking on the dots.
Publication
Featured researches published by Aitor Soroa.
north american chapter of the association for computational linguistics | 2009
Eneko Agirre; Enrique Alfonseca; Keith B. Hall; Jana Kravalova; Marius Pasca; Aitor Soroa
This paper presents and compares WordNet-based and distributional similarity approaches. The strengths and weaknesses of each approach regarding similarity and relatedness tasks are discussed, and a combination is presented. Each of our methods independently provide the best results in their class on the RG and WordSim353 datasets, and a supervised combination of them yields the best published results on all datasets. Finally, we pioneer cross-lingual similarity, showing that our methods are easily adapted for a cross-lingual task with minor losses.
meeting of the association for computational linguistics | 2009
Eneko Agirre; Aitor Soroa
In this paper we propose a new graph-based method that uses the knowledge in a LKB (based on WordNet) in order to perform unsupervised Word Sense Disambiguation. Our algorithm uses the full graph of the LKB efficiently, performing better than previous approaches in English all-words datasets. We also show that the algorithm can be easily ported to other languages with good results, with the only requirement of having a wordnet. In addition, we make an analysis of the performance of the algorithm, showing that it is efficient and that it could be tuned to be faster.
meeting of the association for computational linguistics | 2007
Eneko Agirre; Aitor Soroa
The goal of this task is to allow for comparison across sense-induction and discrimination systems, and also to compare these systems to other supervised and knowledge-based systems. In total there were 6 participating systems. We reused the SemEval-2007 English lexical sample subtask of task 17, and set up both clustering-style unsupervised evaluation (using OntoNotes senses as gold-standard) and a supervised evaluation (using the part of the dataset for mapping). We provide a comparison to the results of the systems participating in the lexical sample subtask of task 17.
graph based methods for natural language processing | 2009
Eric Yeh; Daniel Ramage; Christopher D. Manning; Eneko Agirre; Aitor Soroa
Computing semantic relatedness of natural language texts is a key component of tasks such as information retrieval and summarization, and often depends on knowledge of a broad range of real-world concepts and relationships. We address this knowledge integration issue by computing semantic relatedness using personalized PageRank (random walks) on a graph derived from Wikipedia. This paper evaluates methods for building the graph, including link selection strategies, and two methods for representing input texts as distributions over the graph nodes: one based on a dictionary lookup, the other based on Explicit Semantic Analysis. We evaluate our techniques on standard word relatedness and text similarity datasets, finding that they capture similarity information complementary to existing Wikipedia-based relatedness measures, resulting in small improvements on a state-of-the-art measure.
Computational Linguistics | 2014
Eneko Agirre; Oier Lopez de Lacalle; Aitor Soroa
Word Sense Disambiguation (WSD) systems automatically choose the intended meaning of a word in context. In this article we present a WSD algorithm based on random walks over large Lexical Knowledge Bases (LKB). We show that our algorithm performs better than other graph-based methods when run on a graph built from WordNet and eXtended WordNet. Our algorithm and LKB combination compares favorably to other knowledge-based approaches in the literature that use similar knowledge on a variety of English data sets and a data set on Spanish. We include a detailed analysis of the factors that affect the algorithm. The algorithm and the LKBs used are publicly available, and the results easily reproducible.
empirical methods in natural language processing | 2006
Eneko Agirre; David Martinez; Oier Lopez de Lacalle; Aitor Soroa
This paper explores the use of two graph algorithms for unsupervised induction and tagging of nominal word senses based on corpora. Our main contribution is the optimization of the free parameters of those algorithms and its evaluation against publicly available gold standards. We present a thorough evaluation comprising supervised and unsupervised modes, and both lexical-sample and all-words tasks. The results show that, in spite of the information loss inherent to mapping the induced senses to the gold-standard, the optimization of parameters based on a small sample of nouns carries over to all nouns, performing close to supervised systems in the lexical sample task and yielding the second-best WSD systems for the Senseval-3 all-words task.
Bioinformatics | 2010
Eneko Agirre; Aitor Soroa; Mark Stevenson
MOTIVATION Word Sense Disambiguation (WSD), automatically identifying the meaning of ambiguous words in context, is an important stage of text processing. This article presents a graph-based approach to WSD in the biomedical domain. The method is unsupervised and does not require any labeled training data. It makes use of knowledge from the Unified Medical Language System (UMLS) Metathesaurus which is represented as a graph. A state-of-the-art algorithm, Personalized PageRank, is used to perform WSD. RESULTS When evaluated on the NLM-WSD dataset, the algorithm outperforms other methods that rely on the UMLS Metathesaurus alone. AVAILABILITY The WSD system is open source licensed and available from http://ixa2.si.ehu.es/ukb/. The UMLS, MetaMap program and NLM-WSD corpus are available from the National Library of Medicine https://www.nlm.nih.gov/research/umls/, http://mmtx.nlm.nih.gov and http://wsd.nlm.nih.gov. Software to convert the NLM-WSD corpus into a format that can be used by our WSD system is available from http://www.dcs.shef.ac.uk/∼marks/biomedical_wsd under open source license.
conference on information and knowledge management | 2011
Roberto Navigli; Stefano Faralli; Aitor Soroa; Oier Lopez de Lacalle; Eneko Agirre
In this paper we present a novel approach to learning semantic models for multiple domains, which we use to categorize Wikipedia pages and to perform domain Word Sense Disambiguation (WSD). In order to learn a semantic model for each domain we first extract relevant terms from the texts in the domain and then use these terms to initialize a random walk over the WordNet graph. Given an input text, we check the semantic models, choose the appropriate domain for that text and use the best-matching model to perform WSD. Our results show considerable improvements on text categorization and domain WSD tasks.
Journal of Biomedical Informatics | 2014
David Martinez; Arantxa Otegi; Aitor Soroa; Eneko Agirre
OBJECTIVE Most of the information in Electronic Health Records (EHRs) is represented in free textual form. Practitioners searching EHRs need to phrase their queries carefully, as the record might use synonyms or other related words. In this paper we show that an automatic query expansion method based on the Unified Medicine Language System (UMLS) Metathesaurus improves the results of a robust baseline when searching EHRs. MATERIALS AND METHODS The method uses a graph representation of the lexical units, concepts and relations in the UMLS Metathesaurus. It is based on random walks over the graph, which start on the query terms. Random walks are a well-studied discipline in both Web and Knowledge Base datasets. RESULTS Our experiments over the TREC Medical Record track show improvements in both the 2011 and 2012 datasets over a strong baseline. DISCUSSION Our analysis shows that the success of our method is due to the automatic expansion of the query with extra terms, even when they are not directly related in the UMLS Metathesaurus. The terms added in the expansion go beyond simple synonyms, and also add other kinds of topically related terms. CONCLUSIONS Expansion of queries using related terms in the UMLS Metathesaurus beyond synonymy is an effective way to overcome the gap between query and document vocabularies when searching for patient cohorts.
north american chapter of the association for computational linguistics | 2015
Josu Goikoetxea; Aitor Soroa; Eneko Agirre
Random walks over large knowledge bases like WordNet have been successfully used in word similarity, relatedness and disambiguation tasks. Unfortunately, those algorithms are relatively slow for large repositories, with significant memory footprints. In this paper we present a novel algorithm which encodes the structure of a knowledge base in a continuous vector space, combining random walks and neural net language models in order to produce novel word representations. Evaluation in word relatedness and similarity datasets yields equal or better results than those of a random walk algorithm, using a dense representation (300 dimensions instead of 117K). Furthermore, the word representations are complementary to those of the random walk algorithm and to corpus-based continuous representations, improving the stateof-the-art in the similarity dataset. Our technique opens up exciting opportunities to combine distributional and knowledge-based word representations.