Anna Korhonen | Researchain

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Anna Korhonen is active.

Explore More

Publication

Featured researches published by Anna Korhonen.

Computational Linguistics | 2015

Simlex-999: Evaluating semantic models with genuine similarity estimation

Felix Hill; Roi Reichart; Anna Korhonen

We present SimLex-999, a gold standard resource for evaluating distributional semantic models that improves on existing resources in several important ways. First, in contrast to gold standards such as WordSim-353 and MEN, it explicitly quantifies similarity rather than association or relatedness so that pairs of entities that are associated but not actually similar (Freud, psychology) have a low rating. We show that, via this focus on similarity, SimLex-999 incentivizes the development of models with a different, and arguably wider, range of applications than those which reflect conceptual association. Second, SimLex-999 contains a range of concrete and abstract adjective, noun, and verb pairs, together with an independent rating of concreteness and (free) association strength for each pair. This diversity enables fine-grained analyses of the performance of models on concepts of different types, and consequently greater insight into how architectures can be improved. Further, unlike existing gold standard evaluations, for which automatic approaches have reached or surpassed the inter-annotator agreement ceiling, state-of-the-art models perform well below this ceiling on SimLex-999. There is therefore plenty of scope for SimLex-999 to quantify future improvements to distributional semantic models, guiding the development of the next generation of representation-learning architectures.

International Journal of Medical Informatics | 2006

Zone analysis in biology articles as a basis for information extraction

Yoko Mizuta; Anna Korhonen; Tony Mullen; Nigel Collier

In the field of biomedicine, an overwhelming amount of experimental data has become available as a result of the high throughput of research in this domain. The amount of results reported has now grown beyond the limits of what can be managed by manual means. This makes it increasingly difficult for the researchers in this area to keep up with the latest developments. Information extraction (IE) in the biological domain aims to provide an effective automatic means to dynamically manage the information contained in archived journal articles and abstract collections and thus help researchers in their work. However, while considerable advances have been made in certain areas of IE, pinpointing and organizing factual information (such as experimental results) remains a challenge. In this paper we propose tackling this task by incorporating into IE information about rhetorical zones, i.e. classification of spans of text in terms of argumentation and intellectual attribution. As the first step towards this goal, we introduce a scheme for annotating biological texts for rhetorical zones and provide a qualitative and quantitative analysis of the data annotated according to this scheme. We also discuss our preliminary research on automatic zone analysis, and its incorporation into our IE framework.

north american chapter of the association for computational linguistics | 2016

Learning distributed representations of sentences from unlabelled data

Felix Hill; Kyunghyun Cho; Anna Korhonen

Unsupervised methods for learning distributed representations of words are ubiquitous in todays NLP research, but far less is known about the best ways to learn distributed phrase or sentence representations from unlabelled data. This paper is a systematic comparison of models that learn such representations. We find that the optimal approach depends critically on the intended application. Deeper, more complex models are preferable for representations to be used in supervised systems, but shallow log-linear models work best for building representation spaces that can be decoded with simple spatial distance metrics. We also propose two new unsupervised representation-learning objectives designed to optimise the trade-off between training time, domain portability and performance.

meeting of the association for computational linguistics | 2003

Clustering Polysemic Subcategorization Frame Distributions Semantically

Anna Korhonen; Yuval Krymolowski; Zvika Marx

Previous research has demonstrated the utility of clustering in inducing semantic verb classes from undisambiguated corpus data. We describe a new approach which involves clustering subcategorization frame (SCF) distributions using the Information Bottleneck and nearest neighbour methods. In contrast to previous work, we particularly focus on clustering polysemic verbs. A novel evaluation scheme is proposed which accounts for the effect of polysemy on the clusters, offering us a good insight into the potential and limitations of semantically classifying undisambiguated SCF data.

empirical methods in natural language processing | 2009

Improving Verb Clustering with Automatically Acquired Selectional Preferences

Lin Sun; Anna Korhonen

In previous research in automatic verb classification, syntactic features have proved the most useful features, although manual classifications rely heavily on semantic features. We show, in contrast with previous work, that considerable additional improvement can be obtained by using semantic features in automatic classification: verb selectional preferences acquired from corpus data using a fully unsupervised method. We report these promising results using a new framework for verb clustering which incorporates a recent subcategorization acquisition system, rich syntactic-semantic feature sets, and a variation of spectral clustering which performs particularly well in high dimensional feature space.

Computational Linguistics | 2013

Statistical metaphor processing

Ekaterina Shutova; Simone Teufel; Anna Korhonen

Metaphor is highly frequent in language, which makes its computational processing indispensable for real-world NLP applications addressing semantic tasks. Previous approaches to metaphor modeling rely on task-specific hand-coded knowledge and operate on a limited domain or a subset of phenomena. We present the first integrated open-domain statistical model of metaphor processing in unrestricted text. Our method first identifies metaphorical expressions in running text and then paraphrases them with their literal paraphrases. Such a text-to-text model of metaphor interpretation is compatible with other NLP applications that can benefit from metaphor resolution. Our approach is minimally supervised, relies on the state-of-the-art parsing and lexical acquisition technologies (distributional clustering and selectional preference induction), and operates with a high accuracy.

Proceedings of the Workshop on Geometrical Models of Natural Language Semantics | 2009

Unsupervised and Constrained Dirichlet Process Mixture Models for Verb Clustering

Andreas Vlachos; Anna Korhonen; Zoubin Ghahramani

In this work, we apply Dirichlet Process Mixture Models (DPMMs) to a learning task in natural language processing (NLP): lexical-semantic verb clustering. We thoroughly evaluate a method of guiding DPMMs towards a particular clustering solution using pairwise constraints. The quantitative and qualitative evaluation performed highlights the benefits of both standard and constrained DPMMs compared to previously used approaches. In addition, it sheds light on the use of evaluation measures and their practical application.

north american chapter of the association for computational linguistics | 2004

Extended lexical-semantic classification of English verbs

Anna Korhonen; Ted Briscoe

Lexical-semantic verb classifications have proved useful in supporting various natural language processing (NLP) tasks. The largest and the most widely deployed classification in English is Levins (1993) taxonomy of verbs and their classes. While this resource is attractive in being extensive enough for some NLP use, it is not comprehensive. In this paper, we present a substantial extension to Levins taxonomy which incorporates 57 novel classes for verbs not covered (comprehensively) by Levin. We also introduce 106 novel diathesis alternations, created as a side product of constructing the new classes. We demonstrate the utility of our novel classes by using them to support automatic subcategorization acquisition and show that the resulting extended classification has extensive coverage over the English verb lexicon.

meeting of the association for computational linguistics | 2002

Semantically Motivated Subcategorization Acquisition

Anna Korhonen

Automatic acquisition of subcategorization lexicons from textual corpora has become increasingly popular. Although this work has met with some success, resulting lexicons indicate a need for greater accuracy. One significant source of error lies in the process of hypothesis selection which is used for removing noise from automatically acquired subcategorization frames (SCFs). In this paper we describe a more accurate semantically-driven approach to hypothesis selection which can be used to improve large-scale SCF acquisition.

empirical methods in natural language processing | 2000

Statistical Filtering and Subcategorization Frame Acquisition

Anna Korhonen; Genevieve Gorrell; Diana McCarthy

Research into the automatic acquisition of subcategorization frames (SCFs) from corpora is starting to produce large-scale computational lexicons which include valuable frequency information. However, the accuracy of the resulting lexicons shows room for improvement. One significant source of error lies in the statistical filtering used by some researchers to remove noise from automatically acquired subcategorization frames. In this paper, we compare three different approaches to filtering out spurious hypotheses. Two hypothesis tests perform poorly, compared to filtering frames on the basis of relative frequency. We discuss reasons for this and consider directions for future research.

Explore More