Rob Koeling
University of Sussex
Publications
Featured research published by Rob Koeling.
Meeting of the Association for Computational Linguistics | 2004
Diana McCarthy; Rob Koeling; Julie Weeds; John A. Carroll
In word sense disambiguation (WSD), the heuristic of choosing the most common sense is extremely powerful because the distribution of the senses of a word is often skewed. The problem with using the predominant, or first sense heuristic, aside from the fact that it does not take surrounding context into account, is that it assumes some quantity of hand-tagged data. Whilst there are a few hand-tagged corpora available for some languages, one would expect the frequency distribution of the senses of words, particularly topical words, to depend on the genre and domain of the text under consideration. We present work on the use of a thesaurus acquired from raw textual corpora and the WordNet similarity package to find predominant noun senses automatically. The acquired predominant senses give a precision of 64% on the nouns of the SENSEVAL-2 English all-words task. This is a very promising result given that our method does not require any hand-tagged text, such as SemCor. Furthermore, we demonstrate that our method discovers appropriate predominant senses for words from two domain-specific corpora.
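The core ranking step can be sketched in a few lines. The sketch below is a simplification that assumes the distributional thesaurus has already been built from raw text (the neighbour list here is an invented placeholder): it ranks the WordNet senses of a noun by summing each neighbour's distributional similarity, weighted by that sense's normalised WordNet similarity to the neighbour. It requires NLTK with the WordNet data installed.

from nltk.corpus import wordnet as wn

def sense_wn_score(sense, neighbour):
    # Max WordNet similarity between this sense and any sense of the neighbour.
    scores = [sense.path_similarity(ns) or 0.0
              for ns in wn.synsets(neighbour, pos=wn.NOUN)]
    return max(scores, default=0.0)

def predominant_sense(word, neighbours):
    # Rank the senses of `word` by their prevalence over the neighbours.
    senses = wn.synsets(word, pos=wn.NOUN)
    best, best_score = None, -1.0
    for sense in senses:
        prevalence = 0.0
        for neighbour, dss in neighbours:
            norm = sum(sense_wn_score(s, neighbour) for s in senses)
            if norm > 0:
                # distributional similarity weighted by normalised WordNet similarity
                prevalence += dss * sense_wn_score(sense, neighbour) / norm
        if prevalence > best_score:
            best, best_score = sense, prevalence
    return best

# Hypothetical thesaurus entry for "star"; in the paper the neighbours come
# from a thesaurus acquired automatically from raw text.
neighbours = [("celebrity", 0.21), ("sun", 0.18), ("planet", 0.15)]
print(predominant_sense("star", neighbours))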
Computational Linguistics | 2007
Diana McCarthy; Rob Koeling; Julie Weeds; John A. Carroll
There has been a great deal of recent research into word sense disambiguation, particularly since the inception of the Senseval evaluation exercises. Because a word often has more than one meaning, resolving word sense ambiguity could benefit applications that need some level of semantic interpretation of language input. A major problem is that the accuracy of word sense disambiguation systems is strongly dependent on the quantity of manually sense-tagged data available, and even the best systems, when tagging every word token in a document, perform little better than a simple heuristic that guesses the first, or predominant, sense of a word in all contexts. The success of this heuristic is due to the skewed nature of word sense distributions. Data for the heuristic can come from either dictionaries or a sample of sense-tagged data. However, there is a limited supply of the latter, and the sense distributions and predominant sense of a word can depend on the domain or source of a document. (The first sense of star, for example, would be different in the popular press and in scientific journals.) In this article, we expand on a previously proposed method for determining the predominant sense of a word automatically from raw text. We look at a number of different data sources and parameterizations of the method, using evaluation results and error analyses to identify where the method performs well and also where it does not. In particular, we find that the method does not work as well for verbs and adverbs as it does for nouns and adjectives, but produces more accurate predominant sense information than the widely used SemCor corpus for nouns with low coverage in that corpus. We further show that the method is able to adapt successfully to domains when using domain-specific corpora as input, where the input can be either hand-labeled for domain or automatically classified.
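To make the heuristic concrete: WordNet's sense ordering for English is itself derived from SemCor frequencies, which NLTK exposes directly. A minimal illustration (requiring NLTK's WordNet data), showing why the top-ranked sense may be wrong for a domain-specific text:

from nltk.corpus import wordnet as wn

def sense_counts(word, pos=wn.NOUN):
    # SemCor frequency of each sense of `word`, in WordNet sense order.
    counts = []
    for synset in wn.synsets(word, pos=pos):
        lemma = next((l for l in synset.lemmas()
                      if l.name().lower() == word.lower()), None)
        if lemma is not None:
            counts.append((synset.name(), lemma.count()))
    return counts

for name, count in sense_counts("star"):
    print(f"{name:20s} {count}")
# The first-sense heuristic tags every token of "star" with the top-ranked
# sense; for, say, an astronomy corpus that ranking may well be inappropriate.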
Natural Language Engineering | 1999
Gertjan van Noord; Gosse Bouma; Rob Koeling; Mark-Jan Nederhof
We argue that grammatical analysis is a viable alternative to concept spotting for processing spoken input in a practical spoken dialogue system. We discuss the structure of the grammar, and a model for robust parsing which combines linguistic and statistical sources of information. We discuss test results suggesting that grammatical processing allows fast and accurate processing of spoken input.
Conference on Computational Natural Language Learning | 2000
Rob Koeling
In this paper I discuss a first attempt to create a text chunker using a Maximum Entropy model. The first experiments, implementing classifiers that tag every word in a sentence with a phrase tag using very local lexical information, part-of-speech tags, and the phrase tags of surrounding words, give encouraging results.
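A maximum entropy model over such local features is equivalent to multinomial logistic regression, so the setup can be sketched with scikit-learn. This is an illustrative reconstruction, not the paper's implementation, and the one-sentence training set is a placeholder for real chunking data such as the CoNLL-2000 corpus.

from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

def features(sent, i, prev_tag):
    # Very local context: the word, its POS tag, the neighbouring POS tags,
    # and the phrase tag assigned to the previous word.
    word, pos = sent[i]
    return {
        "w": word.lower(), "pos": pos, "prev_tag": prev_tag,
        "pos-1": sent[i - 1][1] if i > 0 else "BOS",
        "pos+1": sent[i + 1][1] if i < len(sent) - 1 else "EOS",
    }

# (word, POS) sentences paired with IOB phrase tags.
train = [([("He", "PRP"), ("saw", "VBD"), ("the", "DT"), ("cat", "NN")],
          ["B-NP", "B-VP", "B-NP", "I-NP"])]

X, y = [], []
for sent, tags in train:
    prev = "O"
    for i, tag in enumerate(tags):
        X.append(features(sent, i, prev))
        y.append(tag)
        prev = tag

vec = DictVectorizer()
clf = LogisticRegression(max_iter=1000).fit(vec.fit_transform(X), y)

# Greedy left-to-right tagging, feeding each prediction back as prev_tag.
test = [("She", "PRP"), ("saw", "VBD"), ("a", "DT"), ("dog", "NN")]
prev = "O"
for i in range(len(test)):
    tag = clf.predict(vec.transform([features(test, i, prev)]))[0]
    print(test[i][0], tag)
    prev = tag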
Empirical Methods in Natural Language Processing | 2005
Rob Koeling; Diana McCarthy; John A. Carroll
Distributions of the senses of words are often highly skewed. This fact is exploited by word sense disambiguation (WSD) systems which back off to the predominant sense of a word when contextual clues are not strong enough. The domain of a document has a strong influence on the sense distribution of words, but it is not feasible to produce large manually annotated corpora for every domain of interest. In this paper we describe the construction of three sense annotated corpora in different domains for a sample of English words. We apply an existing method for acquiring predominant sense information automatically from raw text, and for our sample demonstrate that (1) acquiring such information automatically from a mixed-domain corpus is more accurate than deriving it from SemCor, and (2) acquiring it automatically from text in the same domain as the target domain performs best by a large margin. We also show that for an all-words WSD task this automatic method is best focussed on words that are salient to the domain, and on words with a different acquired predominant sense in that domain compared to that acquired from a balanced corpus.
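The comparisons reduce to scoring each source of predominant-sense information against the hand-annotated sample. A minimal sketch with invented data (the word, sense labels, and counts are placeholders, not the paper's corpora):

def first_sense_accuracy(first_sense, gold_tokens):
    # first_sense: word -> predicted predominant sense
    # gold_tokens: hand-annotated (word, sense) occurrences
    hits = sum(1 for word, sense in gold_tokens
               if first_sense.get(word) == sense)
    return hits / len(gold_tokens)

gold = [("crane", "crane-machine"), ("crane", "crane-machine"),
        ("crane", "crane-bird")]
semcor_first = {"crane": "crane-bird"}      # hypothetical SemCor-derived sense
domain_first = {"crane": "crane-machine"}   # hypothetical in-domain sense
print(first_sense_accuracy(semcor_first, gold))   # ~0.33
print(first_sense_accuracy(domain_first, gold))   # ~0.67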
International Conference on Computational Linguistics | 2000
Erik F. Tjong Kim Sang; Walter Daelemans; Hervé Déjean; Rob Koeling; Yuval Krymolowski; Vasin Punyakanok; Dan Roth
We use seven machine learning algorithms for one task: identifying base noun phrases. The results have been processed by different system combination methods and all of these outperformed the best individual result. We have applied the seven learners with the best combinator, a majority vote of the top five systems, to a standard data set and managed to improve the best published result for this data set.
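The winning combinator is easy to sketch: a token-level majority vote over the aligned tag sequences of the five best systems. The code below is an illustrative reconstruction; the tie-breaking policy (fall back to the earliest-listed, assumed strongest, system) is an assumption, not necessarily what the paper used.

from collections import Counter

def majority_vote(system_outputs):
    # system_outputs: one tag sequence per system, strongest system first.
    combined = []
    for token_tags in zip(*system_outputs):
        counts = Counter(token_tags)
        top = counts.most_common(1)[0][1]
        # break ties in favour of the earliest (best-ranked) system
        combined.append(next(t for t in token_tags if counts[t] == top))
    return combined

outputs = [
    ["B-NP", "I-NP", "O", "B-NP"],
    ["B-NP", "I-NP", "O", "I-NP"],
    ["B-NP", "B-NP", "O", "B-NP"],
    ["B-NP", "I-NP", "O", "B-NP"],
    ["B-NP", "I-NP", "B-NP", "B-NP"],
]
print(majority_vote(outputs))  # ['B-NP', 'I-NP', 'O', 'B-NP']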
Meeting of the Association for Computational Linguistics | 2007
Rob Koeling; Diana McCarthy
In (McCarthy et al., 2004) we introduced a method for discovering the predominant sense of words automatically using raw (unlabelled) text, and we participated with this system in Senseval-3. Since then, we have worked on further developing ideas to improve upon the base method. In the current paper we target two areas where we believe there is potential for improvement. The first is the fine-grained structure of WordNet's (WN) sense inventory (the topic of the task in this particular track); the second is topic domain specialisation of the base method.
Pharmacoepidemiology and Drug Safety | 2011
Amanda Nicholson; Anne Rosemary Tate; Rob Koeling; Jackie Cassell
Electronic health records are increasingly used for research. The definition of cases or endpoints often relies on the use of coded diagnostic data, using a pre-selected group of codes. Validation of these cases, as ‘true’ cases of the disease, is crucial. There are, however, ambiguities in what is meant by validation in the context of electronic records. Validation usually implies comparison of a definition against a gold standard of diagnosis and the ability to identify false negatives (‘true’ cases which were not detected) as well as false positives (detected cases which did not have the condition). We argue that two separate concepts of validation are often conflated in existing studies: firstly, whether the GP thought the patient was suffering from a particular condition (which we term confirmation or internal validation) and, secondly, whether the patient really had the condition (external validation). Few studies have the ability to detect false negatives who have not received a diagnostic code. Natural language processing is likely to open up the use of free text within the electronic record, which will facilitate both the validation of the coded diagnosis and the search for false negatives.
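The distinction matters numerically. With a true gold standard one can fill a 2x2 table and compute both positive predictive value (are detected cases real?) and sensitivity (are real cases detected?); code-list validation without a gold standard can estimate only the former. A small worked example with invented counts:

def validation_metrics(tp, fp, fn):
    ppv = tp / (tp + fp)          # detected cases that truly have the condition
    sensitivity = tp / (tp + fn)  # true cases that received a diagnostic code
    return ppv, sensitivity

# Hypothetical counts: 90 coded cases confirmed, 10 coded cases refuted,
# 30 true cases never given a diagnostic code (found only via free text).
ppv, sens = validation_metrics(tp=90, fp=10, fn=30)
print(f"PPV {ppv:.2f}, sensitivity {sens:.2f}")  # PPV 0.90, sensitivity 0.75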
Meeting of the Association for Computational Linguistics | 1997
Mark-Jan Nederhof; Gosse Bouma; Rob Koeling; Gertjan van Noord
We argue that grammatical processing is a viable alternative to concept spotting for processing spoken input in a practical dialogue system. We discuss the structure of the grammar, the properties of the parser, and a method for achieving robustness. We discuss test results suggesting that grammatical processing allows fast and accurate processing of spoken input.
North American Chapter of the Association for Computational Linguistics | 2009
Peng Jin; Diana McCarthy; Rob Koeling; John A. Carroll
Word sense distributions are usually skewed. Predicting the extent of the skew can help a word sense disambiguation (WSD) system determine whether to consider evidence from the local context or apply the simple yet effective heuristic of using the first (most frequent) sense. In this paper, we propose a method to estimate the entropy of a sense distribution to boost the precision of a first sense heuristic by restricting its application to words with lower entropy. We show on two standard datasets that automatic prediction of entropy can increase the performance of an automatic first sense heuristic.
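The decision rule is simple once an entropy estimate is available. A minimal sketch, with the sense distributions given directly and an arbitrary threshold (in the paper the entropy itself is predicted automatically from raw text):

from math import log2

def entropy(dist):
    # Shannon entropy of a sense distribution, in bits.
    return -sum(p * log2(p) for p in dist if p > 0)

def use_first_sense(dist, threshold=1.0):
    # Back off to the first sense only for sufficiently skewed words.
    return entropy(dist) < threshold

print(use_first_sense([0.9, 0.05, 0.05]))  # True: highly skewed, heuristic applies
print(use_first_sense([0.4, 0.3, 0.3]))    # False: defer to contextual evidence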