Michael P. Oakes | Researchain

Publication

Featured research published by Michael P. Oakes.

International ACM SIGIR Conference on Research and Development in Information Retrieval | 2003

Word sense disambiguation in information retrieval revisited

Christopher Stokoe; Michael P. Oakes; John Tait

Word sense ambiguity is recognized as having a detrimental effect on the precision of information retrieval systems in general and web search systems in particular, due to the sparse nature of the queries involved. Despite continued research into the application of automated word sense disambiguation, the question remains as to whether automated word sense disambiguation that is less than 90% accurate can lead to improvements in retrieval effectiveness. In this study we explore the development and subsequent evaluation of a statistical word sense disambiguation system which demonstrates increased precision from a sense-based vector space retrieval model over traditional TF*IDF techniques.
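For readers unfamiliar with the baseline named in the abstract, the following is a minimal sketch of word-based TF*IDF weighting with cosine similarity. It is not the authors' system: the weighting variant (raw term frequency times log(N/df)), the whitespace tokenizer, and the function names `tf_idf_vectors` and `cosine` are all assumptions chosen for illustration. A sense-based model would, roughly speaking, index disambiguated senses in place of the surface words used here.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """Return one sparse {term: weight} vector per document.

    Assumed weighting: raw term frequency * log(N / document frequency).
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: number of documents containing each term.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors; 0.0 if either is all zeros."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy collection: "bank" is ambiguous between the river and money senses.
docs = ["bank river water", "bank money loan", "river water fish"]
vecs = tf_idf_vectors(docs)
river_sim = cosine(vecs[0], vecs[2])
money_sim = cosine(vecs[0], vecs[1])
```

In this toy collection the first document is closer to the third (shared "river" and "water") than to the second, which it matches only on the ambiguous word "bank".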

Literary and Linguistic Computing | 2007

Use of the Chi-Squared Test to Examine Vocabulary Differences in English Language Corpora Representing Seven Different Countries

Michael P. Oakes; Malcolm Farrow

The chi-squared test is used to find the vocabulary most typical of seven different ICAME corpora, each representing the English used in a particular country. In a closely related study, Leech and Fallon (1992, Computer corpora - what do they tell us about culture? ICAME Journal, 16: 29-50) found differences in the vocabulary used in the Brown Corpus of American English and the Lancaster-Oslo-Bergen Corpus of British English. They were mainly interested in those vocabulary differences which they assumed to be due to cultural differences between the United States and Britain, but we are equally interested in vocabulary differences which reveal linguistic preferences in the various countries in which English is spoken. Whether vocabulary differences are cultural or linguistic in nature, they can be used for the automatic classification according to variety of English of texts of unknown provenance. The extent to which the vocabulary differences between the corpora represent vocabulary differences between the varieties of English as a whole depends on the extent to which the corpora represent the full range of topics typical of their associated cultures, and thus there is a need for corpora designed to represent the topics and vocabulary of cultures or dialects, rather than stratified across a set range of topics and genres. This will require methods to determine the range of topics addressed in each culture, then methods to sample adequately from each topical domain.
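As an illustration of the kind of calculation the abstract refers to, the sketch below computes a Pearson chi-squared statistic for a single word across several corpora, using a 2 x k contingency table (occurrences of the word versus all other tokens, per corpus). The function name `chi_squared_for_word` and the toy counts are hypothetical, and the paper's exact procedure (thresholds, corrections, word selection) is not reproduced here.

```python
def chi_squared_for_word(word_counts, corpus_sizes):
    """Pearson chi-squared for one word over k corpora.

    word_counts[i]  : occurrences of the word in corpus i
    corpus_sizes[i] : total number of tokens in corpus i
    """
    total_word = sum(word_counts)
    total_size = sum(corpus_sizes)
    total_other = total_size - total_word

    chi2 = 0.0
    for count, size in zip(word_counts, corpus_sizes):
        other = size - count
        # Expected counts if the word were spread evenly in proportion to corpus size.
        expected_word = total_word * size / total_size
        expected_other = total_other * size / total_size
        if expected_word > 0:
            chi2 += (count - expected_word) ** 2 / expected_word
        if expected_other > 0:
            chi2 += (other - expected_other) ** 2 / expected_other
    return chi2


# Toy example: a word seen 20 times in one 100-token corpus and 10 times in another.
score = chi_squared_for_word([20, 10], [100, 100])
```

A larger value indicates that the word's distribution departs further from what the corpus sizes alone would predict; a word used at the same rate in every corpus scores zero.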

Journal of Quantitative Linguistics | 2000

Computer Estimation of Vocabulary in a Protolanguage from Word Lists in Four Daughter Languages

Michael P. Oakes

Even if no written records of a protolanguage remain, it is possible to estimate what some of the words in that language might have been, by comparison of its reflexes in the more recent daughter languages. This method of protolanguage reconstruction is called the ‘comparative method’, and is described by Crowley (1992, Chapter 5). Although long practiced by human linguists, the comparative method is extremely time consuming, and only a few parts of the process have previously been automated. This paper describes a program which attempts to replicate the methodology described by Crowley almost in its entirety.

European Conference on Information Retrieval | 2006

Browsing personal images using episodic memory (time + location)

Chufeng Chen; Michael P. Oakes; John Tait

In this paper we consider episodic memory for system design in image retrieval. Time and location are the main factors in episodic memory, and these types of data were combined for image event clustering. We conducted a user study to compare five image browsing systems using searching time and user satisfaction as criteria for success. Our results showed that the browser which clusters images based on time and location data combined was significantly better than four other more standard browsers. This suggests that episodic memory is potentially useful for improving personal image management.

European Conference on Information Retrieval | 2006

PERC: a personal email classifier

Shih-Wen Ke; Chris Bowerman; Michael P. Oakes

Improving the accuracy of assigning new email messages to small folders can reduce the likelihood of users creating duplicate folders for some topics. In this paper we present a hybrid classification model, PERC, and use the Enron Email Corpus to investigate the performance of kNN, SVM and PERC in a simulation of a real-time situation. Our results show that PERC is significantly better at assigning messages to small folders. The effects of different parameter settings for the classifiers are discussed.

Archive | 2014

Literary Detective Work on the Computer

Michael P. Oakes

Computational linguistics can be used to uncover mysteries in text which are not always obvious to visual inspection. For example, the computer analysis of writing style can show who might be the true author of a text in cases of disputed authorship or suspected plagiarism. The theoretical background to authorship attribution is presented in a step by step manner, and comprehensive reviews of the field are given in two specialist areas, the writings of William Shakespeare and his contemporaries, and the various writing styles seen in religious texts. The final chapter looks at the progress computers have made in the decipherment of lost languages. This book is written for students and researchers of general linguistics, computational and corpus linguistics, and computer forensics. It will inspire future researchers to study these topics for themselves, and gives sufficient details of the methods and resources to get them started.

Polibits | 2009

Cross Language Information Retrieval using Multilingual Ontology as Translation and Query Expansion Base

Mustafa A. Abusalah; John Tait; Michael P. Oakes

Abstract: This paper reports an experiment to evaluate a Cross Language Information Retrieval (CLIR) system that uses a multilingual ontology to improve query translation in the travel domain. The ontology-based approach significantly outperformed the Machine Readable Dictionary translation baseline using Mean Average Precision as a metric in a user-centered experiment. Index terms: ontology, multilingual, cross language information retrieval. I. INTRODUCTION. The growing requirement on the Internet for users to access information expressed in languages other than their own has led to Cross Language Information Retrieval (CLIR) becoming established as a major topic in IR. One approach to CLIR uses different translation approaches to translate queries to documents and indexes in other languages. As queries submitted to search engines suffer from a lack of context, translation approaches have great problems with resolving query ambiguity. In our approach, we built a multilingual ontology to be used as a translation base for CLIR. In this paper we evaluate our proposed query translation methodology and compare it with a baseline system that uses a Machine Readable Dictionary (MRD) as translation base in a user-centered experiment.

International Conference on Applications of Digital Information and Web Technologies | 2008

Comparison between document-based, term-based and hybrid partitioning

Ahmad Abusukhon; Michael P. Oakes; Mohammad Talib; Ayman M. Abdalla

Information retrieval (IR) systems for large-scale data collections must build an index in order to provide efficient retrieval that meets the user's needs. In distributed IR systems, query response time is affected by the way in which the data collection is partitioned across nodes. There are three types of collection partitioning: document-based partitioning (called the local index), term-based partitioning (called the global index) and hybrid partitioning. In this paper, we compare the three types of partitioning in terms of average query response time for a system with one broker and six other nodes. Our results showed that within our distributed IR system, the document-based and hybrid partitioning outperformed the term-based partitioning. However, unlike Xi et al., we did not find that hybrid partitioning was any better than document-based partitioning in terms of average query response time.
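The difference between the first two partitioning schemes can be sketched in a few lines. This is an illustrative toy, not the system evaluated in the paper: the round-robin document assignment, the CRC32-based term assignment, and the names `document_partition`, `term_partition` and `lookup` are all assumptions, and hybrid partitioning is not shown.

```python
import zlib
from collections import defaultdict


def build_index(docs):
    """Build an inverted index {term: set of doc ids} from {doc_id: text}."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return dict(index)


def document_partition(docs, n_nodes):
    """Local index: each node indexes only its own share of the documents."""
    shards = [dict() for _ in range(n_nodes)]
    for i, doc_id in enumerate(sorted(docs)):
        shards[i % n_nodes][doc_id] = docs[doc_id]  # round-robin (assumed)
    return [build_index(shard) for shard in shards]


def term_partition(docs, n_nodes):
    """Global index: one index over all documents, split across nodes by term."""
    nodes = [dict() for _ in range(n_nodes)]
    for term, postings in build_index(docs).items():
        node_id = zlib.crc32(term.encode("utf-8")) % n_nodes  # deterministic hash
        nodes[node_id][term] = postings
    return nodes


def lookup(nodes, term):
    """Broker-style lookup: merge the postings returned by every node."""
    result = set()
    for node in nodes:
        result |= node.get(term, set())
    return result


docs = {
    1: "distributed retrieval index",
    2: "query response time",
    3: "index partitioning retrieval",
}
local_nodes = document_partition(docs, 2)
global_nodes = term_partition(docs, 2)
```

Under document-based partitioning a term's postings may be spread over several nodes, so the broker must merge partial results; under term-based partitioning each term's full posting list lives on exactly one node.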

Content Based Multimedia Indexing | 2008

Real AdaBoost for large vocabulary image classification

Wei-Chao Lin; Michael P. Oakes; John Tait

In this paper, we describe the use of a Boosting algorithm, Real AdaBoost, for content-based image retrieval (CBIR) on a large number (190) of keyword categories. Previous work with Boosting for image orientation detection has involved only a few categories, such as a simple outdoor vs. indoor scene dichotomy. Other work with CBIR has incorporated Boosting into relevance feedback for a form of supervised learning based on end-users' evaluation, but here we use AdaBoost purely as a learning algorithm to reduce noisy and outlier information. For the 190-category classification task, Real AdaBoost with its own final learner model outperformed the k-nearest neighbour (k-NN) classifier in terms of precision.

International Symposium on Neural Networks | 2010

Semantic Subspace Learning with conditional significance vectors

Nandita Tripathi; Stefan Wermter; Chihli Hung; Michael P. Oakes

Subspace detection and processing is receiving more attention nowadays as a method to speed up search and reduce processing overload. Subspace Learning algorithms try to detect low dimensional subspaces in the data which minimize the intra-class separation while maximizing the inter-class separation. In this paper we present a novel technique using the maximum significance value to detect a semantic subspace. We further modify the document vector using conditional significance to represent the subspace. This enhances the distinction between classes within the subspace. We compare our method against TFIDF with PCA and show that it consistently outperforms the baseline by a large margin when tested with a wide variety of learning algorithms. Our results show that the combination of subspace detection and conditional significance vectors improves subspace learning.
