Is this you? Create Your Porfile

Aitor Soroa

University of the Basque Country

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Aitor Soroa is active.

Explore More

Publication

Featured researches published by Aitor Soroa.

north american chapter of the association for computational linguistics | 2009

A Study on Similarity and Relatedness Using Distributional and WordNet-based Approaches

Eneko Agirre; Enrique Alfonseca; Keith B. Hall; Jana Kravalova; Marius Pasca; Aitor Soroa

This paper presents and compares WordNet-based and distributional similarity approaches. The strengths and weaknesses of each approach regarding similarity and relatedness tasks are discussed, and a combination is presented. Each of our methods independently provide the best results in their class on the RG and WordSim353 datasets, and a supervised combination of them yields the best published results on all datasets. Finally, we pioneer cross-lingual similarity, showing that our methods are easily adapted for a cross-lingual task with minor losses.

meeting of the association for computational linguistics | 2009

Personalizing PageRank for Word Sense Disambiguation

Eneko Agirre; Aitor Soroa

In this paper we propose a new graph-based method that uses the knowledge in a LKB (based on WordNet) in order to perform unsupervised Word Sense Disambiguation. Our algorithm uses the full graph of the LKB efficiently, performing better than previous approaches in English all-words datasets. We also show that the algorithm can be easily ported to other languages with good results, with the only requirement of having a wordnet. In addition, we make an analysis of the performance of the algorithm, showing that it is efficient and that it could be tuned to be faster.

meeting of the association for computational linguistics | 2007

SemEval-2007 Task 02: Evaluating Word Sense Induction and Discrimination Systems

Eneko Agirre; Aitor Soroa

The goal of this task is to allow for comparison across sense-induction and discrimination systems, and also to compare these systems to other supervised and knowledge-based systems. In total there were 6 participating systems. We reused the SemEval-2007 English lexical sample subtask of task 17, and set up both clustering-style unsupervised evaluation (using OntoNotes senses as gold-standard) and a supervised evaluation (using the part of the dataset for mapping). We provide a comparison to the results of the systems participating in the lexical sample subtask of task 17.

graph based methods for natural language processing | 2009

WikiWalk: Random walks on Wikipedia for Semantic Relatedness

Eric Yeh; Daniel Ramage; Christopher D. Manning; Eneko Agirre; Aitor Soroa

Computing semantic relatedness of natural language texts is a key component of tasks such as information retrieval and summarization, and often depends on knowledge of a broad range of real-world concepts and relationships. We address this knowledge integration issue by computing semantic relatedness using personalized PageRank (random walks) on a graph derived from Wikipedia. This paper evaluates methods for building the graph, including link selection strategies, and two methods for representing input texts as distributions over the graph nodes: one based on a dictionary lookup, the other based on Explicit Semantic Analysis. We evaluate our techniques on standard word relatedness and text similarity datasets, finding that they capture similarity information complementary to existing Wikipedia-based relatedness measures, resulting in small improvements on a state-of-the-art measure.

Computational Linguistics | 2014

Random walks for knowledge-based word sense disambiguation

Eneko Agirre; Oier Lopez de Lacalle; Aitor Soroa

Word Sense Disambiguation (WSD) systems automatically choose the intended meaning of a word in context. In this article we present a WSD algorithm based on random walks over large Lexical Knowledge Bases (LKB). We show that our algorithm performs better than other graph-based methods when run on a graph built from WordNet and eXtended WordNet. Our algorithm and LKB combination compares favorably to other knowledge-based approaches in the literature that use similar knowledge on a variety of English data sets and a data set on Spanish. We include a detailed analysis of the factors that affect the algorithm. The algorithm and the LKBs used are publicly available, and the results easily reproducible.

empirical methods in natural language processing | 2006

Two graph-based algorithms for state-of-the-art WSD

Eneko Agirre; David Martinez; Oier Lopez de Lacalle; Aitor Soroa

This paper explores the use of two graph algorithms for unsupervised induction and tagging of nominal word senses based on corpora. Our main contribution is the optimization of the free parameters of those algorithms and its evaluation against publicly available gold standards. We present a thorough evaluation comprising supervised and unsupervised modes, and both lexical-sample and all-words tasks. The results show that, in spite of the information loss inherent to mapping the induced senses to the gold-standard, the optimization of parameters based on a small sample of nouns carries over to all nouns, performing close to supervised systems in the lexical sample task and yielding the second-best WSD systems for the Senseval-3 all-words task.

Bioinformatics | 2010

Graph-based Word Sense Disambiguation of biomedical documents

Eneko Agirre; Aitor Soroa; Mark Stevenson

MOTIVATION Word Sense Disambiguation (WSD), automatically identifying the meaning of ambiguous words in context, is an important stage of text processing. This article presents a graph-based approach to WSD in the biomedical domain. The method is unsupervised and does not require any labeled training data. It makes use of knowledge from the Unified Medical Language System (UMLS) Metathesaurus which is represented as a graph. A state-of-the-art algorithm, Personalized PageRank, is used to perform WSD. RESULTS When evaluated on the NLM-WSD dataset, the algorithm outperforms other methods that rely on the UMLS Metathesaurus alone. AVAILABILITY The WSD system is open source licensed and available from http://ixa2.si.ehu.es/ukb/. The UMLS, MetaMap program and NLM-WSD corpus are available from the National Library of Medicine https://www.nlm.nih.gov/research/umls/, http://mmtx.nlm.nih.gov and http://wsd.nlm.nih.gov. Software to convert the NLM-WSD corpus into a format that can be used by our WSD system is available from http://www.dcs.shef.ac.uk/∼marks/biomedical_wsd under open source license.

conference on information and knowledge management | 2011

Two birds with one stone: learning semantic models for text categorization and word sense disambiguation

Roberto Navigli; Stefano Faralli; Aitor Soroa; Oier Lopez de Lacalle; Eneko Agirre

In this paper we present a novel approach to learning semantic models for multiple domains, which we use to categorize Wikipedia pages and to perform domain Word Sense Disambiguation (WSD). In order to learn a semantic model for each domain we first extract relevant terms from the texts in the domain and then use these terms to initialize a random walk over the WordNet graph. Given an input text, we check the semantic models, choose the appropriate domain for that text and use the best-matching model to perform WSD. Our results show considerable improvements on text categorization and domain WSD tasks.

Journal of Biomedical Informatics | 2014

Improving search over Electronic Health Records using UMLS-based query expansion through random walks

David Martinez; Arantxa Otegi; Aitor Soroa; Eneko Agirre

OBJECTIVE Most of the information in Electronic Health Records (EHRs) is represented in free textual form. Practitioners searching EHRs need to phrase their queries carefully, as the record might use synonyms or other related words. In this paper we show that an automatic query expansion method based on the Unified Medicine Language System (UMLS) Metathesaurus improves the results of a robust baseline when searching EHRs. MATERIALS AND METHODS The method uses a graph representation of the lexical units, concepts and relations in the UMLS Metathesaurus. It is based on random walks over the graph, which start on the query terms. Random walks are a well-studied discipline in both Web and Knowledge Base datasets. RESULTS Our experiments over the TREC Medical Record track show improvements in both the 2011 and 2012 datasets over a strong baseline. DISCUSSION Our analysis shows that the success of our method is due to the automatic expansion of the query with extra terms, even when they are not directly related in the UMLS Metathesaurus. The terms added in the expansion go beyond simple synonyms, and also add other kinds of topically related terms. CONCLUSIONS Expansion of queries using related terms in the UMLS Metathesaurus beyond synonymy is an effective way to overcome the gap between query and document vocabularies when searching for patient cohorts.

north american chapter of the association for computational linguistics | 2015

Random Walks and Neural Network Language Models on Knowledge Bases

Josu Goikoetxea; Aitor Soroa; Eneko Agirre

Random walks over large knowledge bases like WordNet have been successfully used in word similarity, relatedness and disambiguation tasks. Unfortunately, those algorithms are relatively slow for large repositories, with significant memory footprints. In this paper we present a novel algorithm which encodes the structure of a knowledge base in a continuous vector space, combining random walks and neural net language models in order to produce novel word representations. Evaluation in word relatedness and similarity datasets yields equal or better results than those of a random walk algorithm, using a dense representation (300 dimensions instead of 117K). Furthermore, the word representations are complementary to those of the random walk algorithm and to corpus-based continuous representations, improving the stateof-the-art in the similarity dataset. Our technique opens up exciting opportunities to combine distributional and knowledge-based word representations.

Explore More