Publication


Featured research published by Adam Kilgarriff.


Computers and the Humanities | 1997

I don't believe in word senses

Adam Kilgarriff

Word sense disambiguation assumes word senses. Within the lexicography and linguistics literature, they are known to be very slippery entities. The first part of the paper looks at problems with existing accounts of ‘word sense’ and describes the various kinds of ways in which a word's meaning can deviate from its core meaning. An analysis is presented in which word senses are abstractions from clusters of corpus citations, in accordance with current lexicographic practice. The corpus citations, not the word senses, are the basic objects in the ontology. The corpus citations will be clustered into senses according to the purposes of whoever or whatever does the clustering. In the absence of such purposes, word senses do not exist. Word sense disambiguation also needs a set of word senses to disambiguate between. In most recent work, the set has been taken from a general-purpose lexical resource, with the assumption that the lexical resource describes the word senses of English/French/..., between which NLP applications will need to disambiguate. The implication of the first part of the paper is, by contrast, that word senses exist only relative to a task. The final part of the paper pursues this, exploring, by means of a survey, whether and how word sense ambiguity is in fact a problem for current NLP applications.


Computational Linguistics | 2007

Googleology is Bad Science

Adam Kilgarriff

Lexical Computing Ltd. and University of Sussex

The World Wide Web is enormous, free, immediately available, and largely linguistic. As we discover, on ever more fronts, that language analysis and generation benefit from big data, so it becomes appealing to use the Web as a data source. The question, then, is how. The low-entry-cost way to use the Web is via a commercial search engine. If the goal is to find frequencies or probabilities for some phenomenon of interest, we can use the hit count given in the search engine's hits page to make an estimate. People have been doing this for some time now. Early work using hit counts includes Grefenstette (1999), who identified likely translations for compositional phrases, and Turney (2001), who found synonyms; perhaps the most cited study is Keller and Lapata (2003), who established the validity of frequencies gathered in this way using experiments with human subjects. Leading recent work includes Nakov and Hearst (2005), who build models of noun compound bracketing. The initial-entry cost for this kind of research is zero. Given a computer and an Internet connection, you input the query and get a hit count. But if the work is to proceed beyond the anecdotal, a range of issues must be addressed. First, the commercial search engines do not lemmatize or part-of-speech tag. To take a simple case: to estimate frequencies for the verb-object pair


Computers and the Humanities | 2000

Framework and Results for English SENSEVAL

Adam Kilgarriff; Joseph Rosenzweig

Senseval was the first open, community-based evaluation exercise for Word Sense Disambiguation programs. It adopted the quantitative approach to evaluation developed in MUC and other ARPA evaluation exercises. It took place in 1998. In this paper we describe the structure, organisation and results of the SENSEVAL exercise for English. We present and defend various design choices for the exercise, describe the data and gold-standard preparation, consider issues of scoring strategies and baselines, and present the results for the 18 participating systems. The exercise identifies the state-of-the-art for fine-grained word sense disambiguation, where training data is available, as 74–78% correct, with a number of algorithms approaching this level of performance. For systems that did not assume the availability of training data, performance was markedly lower and also more variable. Human inter-tagger agreement was high, with the gold standard taggings being around 95% replicable.
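The 95% replicability figure is a measure of inter-tagger agreement. One standard way to quantify such agreement is Cohen's kappa, which corrects raw agreement for chance; this is a minimal sketch, not the exercise's exact scoring method:

```python
from collections import Counter

def cohens_kappa(tags_a, tags_b):
    """Agreement between two taggers over the same items, chance-corrected."""
    assert len(tags_a) == len(tags_b) and tags_a
    n = len(tags_a)
    # observed agreement: fraction of items both taggers labelled identically
    po = sum(a == b for a, b in zip(tags_a, tags_b)) / n
    # expected agreement if each tagger assigned labels independently
    ca, cb = Counter(tags_a), Counter(tags_b)
    pe = sum(ca[label] * cb[label] for label in ca) / (n * n)
    return (po - pe) / (1 - pe)
```

Raw percent agreement (`po` alone) is closer to the "replicability" figure quoted in the abstract; kappa additionally discounts the agreement two taggers would reach by chance.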


Conference of the European Chapter of the Association for Computational Linguistics | 2006

Large linguistically-processed web corpora for multiple languages

Marco Baroni; Adam Kilgarriff

The Web contains vast amounts of linguistic data. One key issue for linguists and language technologists is how to access it. Commercial search engines give highly compromised access. An alternative is to crawl the Web ourselves, which also allows us to remove duplicates and near-duplicates, navigational material, and a range of other kinds of non-linguistic matter. We can also tokenize, lemmatise and part-of-speech tag the corpus, and load the data into a corpus query tool which supports sophisticated linguistic queries. We have now done this for German and Italian, with corpus sizes of over 1 billion words in each case. We provide Web access to the corpora in our query tool, the Sketch Engine.
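The duplicate and near-duplicate removal step can be illustrated with shingling: represent each document as its set of word n-grams and compare sets by Jaccard overlap. This is a toy sketch under assumed parameters (5-word shingles, a 0.5 threshold); the authors' actual pipeline is not specified here:

```python
def shingles(text, n=5):
    """Set of word n-grams ('shingles') for a document."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a, b):
    """Set overlap: |intersection| / |union|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def near_duplicates(docs, threshold=0.5):
    """Index pairs of documents whose shingle sets overlap heavily."""
    sh = [shingles(d) for d in docs]
    return [(i, j)
            for i in range(len(docs)) for j in range(i + 1, len(docs))
            if jaccard(sh[i], sh[j]) >= threshold]
```

At Web scale an all-pairs comparison is infeasible, so production systems approximate the same Jaccard comparison with fingerprinting schemes such as minhashing; the set-overlap idea is unchanged.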


Computers and the Humanities | 2000

Introduction to the Special Issue on SENSEVAL

Adam Kilgarriff; Martha Palmer

Senseval was the first open, community-based evaluation exercise for Word Sense Disambiguation programs. It took place in the summer of 1998, with tasks for English, French and Italian. There were participating systems from 23 research groups. This special issue is an account of the exercise. In addition to describing the contents of the volume, this introduction considers how the exercise has shed light on some general questions about word senses and evaluation.


Corpus Linguistics and Linguistic Theory | 2005

Language is never, ever, ever, random

Adam Kilgarriff

Language users never choose words randomly, and language is essentially non-random. Statistical hypothesis testing uses a null hypothesis, which posits randomness. Hence, when we look at linguistic phenomena in corpora, the null hypothesis will never be true. Moreover, where there is enough data, we shall (almost) always be able to establish that it is not true. In corpus studies, we frequently do have enough data, so the fact that a relation between two phenomena is demonstrably non-random does not support the inference that it is not arbitrary. We present experimental evidence of how arbitrary associations between word frequencies and corpora are systematically non-random. We review literature in which hypothesis testing has been used, and show how it has often led to unhelpful or misleading results.
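The point about large samples can be seen with a toy calculation (an illustration of the statistical argument, not an experiment from the paper): a word occurring 101,000 times per 100 million words in one corpus and 99,000 times per 100 million in another differs by only about 1% in relative frequency, yet Pearson's chi-square test comfortably rejects the null hypothesis of randomness:

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    margins = ((a + b, a + c), (a + b, b + d),
               (c + d, a + c), (c + d, b + d))
    chi2 = 0.0
    for observed, (row, col) in zip((a, b, c, d), margins):
        expected = row * col / n
        chi2 += (observed - expected) ** 2 / expected
    return chi2

# word w: 101,000 hits in a 100M-word corpus A, 99,000 in a 100M-word corpus B
chi2 = chi_square_2x2(101_000, 100_000_000 - 101_000,
                      99_000, 100_000_000 - 99_000)
# chi2 comes out around 20, far above the 3.84 critical value (p = 0.05, 1 df)
```

With enough data, even a practically negligible frequency difference is "significant", which is exactly the warning the abstract makes about hypothesis testing in corpus studies.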


Computer Speech & Language | 1998

Gold standard datasets for evaluating word sense disambiguation programs

Adam Kilgarriff

There are now many computer programs for automatically determining the sense in which a word is being used. One would like to be able to say which are better, which worse, and also which words, or varieties of language, present particular problems to which algorithms. An evaluation exercise is required, and such an exercise requires a "gold standard" dataset of correct answers. Producing this proves to be a difficult and challenging task. In this paper I discuss the background, challenges and strategies, and present a detailed methodology for ensuring that the gold standard is not fool's gold.


Computers and the Humanities | 1992

Dictionary word sense distinctions: An enquiry into their nature

Adam Kilgarriff

The word senses in a published dictionary are a valuable resource for natural language processing and textual criticism alike. In order that they can be further exploited, their nature must be better understood. Lexicographers have always had to decide where to say a word has one sense, where two. The two studies described here look into their grounds for making distinctions. The first develops a classification scheme to describe the commonly occurring distinction types. The second examines the task of matching the usages of a word from a corpus with the senses a dictionary provides. Finally, a view of the ontological status of dictionary word senses is presented.


Text, Speech and Dialogue | 2012

Getting to Know Your Corpus

Adam Kilgarriff

Corpora are not easy to get a handle on. The usual way of getting to grips with text is to read it, but corpora are mostly too big to read (and not designed to be read). We show, with examples, how keyword lists (of one corpus vs. another) are a direct, practical and fascinating way to explore the characteristics of corpora, and of text types. Our method is to classify the top one hundred keywords of corpus1 vs. corpus2, and corpus2 vs. corpus1. This promptly reveals a range of contrasts between all the pairs of corpora we apply it to. We also present improved maths for keywords, and briefly discuss quantitative comparisons between corpora. All the methods discussed (and almost all of the corpora) are available in the Sketch Engine, a leading corpus query tool.
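The "improved maths for keywords" is not reproduced in this abstract, but keyword scoring of corpus1 vs. corpus2 can be sketched as a smoothed ratio of normalized frequencies, in the spirit of the "simple maths" Kilgarriff advocated in related work; treat the exact constant `n` here as an assumption:

```python
def keyword_scores(freq1, size1, freq2, size2, n=100.0):
    """Rank words by (fpm1 + n) / (fpm2 + n), where fpm is frequency per
    million words; the smoothing constant n damps the scores of rare words."""
    scores = {}
    for word in set(freq1) | set(freq2):
        fpm1 = freq1.get(word, 0) * 1_000_000 / size1
        fpm2 = freq2.get(word, 0) * 1_000_000 / size2
        scores[word] = (fpm1 + n) / (fpm2 + n)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

Taking the top hundred entries of this ranking, and of the ranking with the two corpora swapped, yields the pair of keyword lists that the paper's method then classifies by hand.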


International Conference on Natural Language Processing | 2003

Thesauruses for natural language processing

Adam Kilgarriff

We argue that manual and automatic thesauruses are alternative resources for the same NLP tasks. This involves the radical step of interpreting manual thesauruses as classifications of words rather than word senses: the case for this is made. The range of roles for thesauruses within NLP is briefly presented and the WASPS thesaurus is introduced. Thesaurus evaluation is now becoming urgent. A range of evaluation strategies, all embedded within NLP tasks, is proposed.

Collaboration


Dive into Adam Kilgarriff's collaboration.

Top Co-Authors


Roger Evans

University of Brighton
