William John Teahan | Researchain

Publication

Featured research published by William John Teahan.

Archive | 2003

Using Compression-Based Language Models for Text Categorization

William John Teahan; David J. Harper

Text compression models are firmly grounded in information theory, and we exploit this theoretical underpinning in applying text compression to text categorization. Category models are constructed using the Prediction by Partial Matching (PPM) text compression scheme, specifically using character-based rather than word-based contexts. Two approaches to compression-based categorization are presented, one based on ranking by document cross entropy (average bits per coded symbol) with respect to a category model, and the other based on document cross entropy difference between category and complement of category models. Formally, we show the equivalence of the latter approach to two-class Bayes classification, and propose a method for performing feature selection within our compression-based categorization framework.
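The first approach in this abstract, ranking by document cross entropy, can be sketched as follows. This is a minimal illustration, not the authors' implementation: it swaps PPM for a fixed-order character model with add-one smoothing, and the names `CharModel`, `cross_entropy`, and `categorize`, the default order of 2, and the 256-symbol alphabet are all assumptions made for the example.

```python
import math
from collections import Counter


class CharModel:
    """Fixed-order character model with add-one smoothing (a stand-in for PPM)."""

    def __init__(self, text, order=2, alphabet_size=256):
        self.order = order
        self.alphabet_size = alphabet_size  # assumed symbol inventory
        self.context_counts = Counter()
        self.ngram_counts = Counter()
        padded = " " * order + text
        for i in range(order, len(padded)):
            context = padded[i - order:i]
            self.context_counts[context] += 1
            self.ngram_counts[context + padded[i]] += 1

    def cross_entropy(self, document):
        """Average bits per character needed to code `document` with this model."""
        padded = " " * self.order + document
        total_bits = 0.0
        for i in range(self.order, len(padded)):
            context = padded[i - self.order:i]
            seen = self.ngram_counts[context + padded[i]]
            prob = (seen + 1) / (self.context_counts[context] + self.alphabet_size)
            total_bits -= math.log2(prob)
        return total_bits / max(len(document), 1)


def categorize(document, models):
    """Return the category whose model codes the document in the fewest bits."""
    return min(models, key=lambda name: models[name].cross_entropy(document))


# Toy training texts, invented for illustration only.
models = {
    "sport": CharModel("the team won the match and the fans cheered the goal"),
    "finance": CharModel("the bank raised interest rates and the market fell"),
}
print(categorize("the team scored a goal", models))
```

A real PPM model blends several context orders through escape probabilities, so the absolute bit counts from this sketch are not comparable with those reported for PPM.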

International ACM SIGIR Conference on Research and Development in Information Retrieval | 2003

A repetition based measure for verification of text collections and for text categorization

Dmitry V. Khmelev; William John Teahan

We suggest a way of locating duplicates and plagiarism in a text collection using an R-measure, which is the normalized sum of the lengths of all suffixes of the text repeated in other documents of the collection. The R-measure can be effectively computed using the suffix array data structure. Additionally, the computation procedure can be improved to locate the sets of duplicate or plagiarised documents. We applied the technique to several standard text collections and found that they contained a significant number of duplicate and plagiarised documents. Another reformulation of the method leads to an algorithm that can be applied to supervised multi-class categorization. We illustrate the approach using the recently available Reuters Corpus Volume 1 (RCV1). The results show that the method outperforms SVM at multi-class categorization, and interestingly, that results correlate strongly with compression-based methods.
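One reading of the R-measure described above can be sketched in a few lines. The sketch assumes that, for each suffix of the text, the "repeated length" is the length of its longest prefix that occurs in some other document, and that the sum is normalized by its maximum possible value, n(n+1)/2 for a text of n characters. It uses naive substring search instead of the suffix array mentioned in the abstract, and the function names are hypothetical.

```python
def longest_repeated_prefix(suffix, others):
    """Length of the longest prefix of `suffix` found in any other document."""
    best = 0
    for doc in others:
        length = 0
        # Grow the match while the next longer prefix still occurs in `doc`.
        while length < len(suffix) and suffix[:length + 1] in doc:
            length += 1
        best = max(best, length)
    return best


def r_measure(text, others):
    """Normalized sum of repeated-prefix lengths over all suffixes (range 0..1)."""
    n = len(text)
    if n == 0:
        return 0.0
    total = sum(longest_repeated_prefix(text[i:], others) for i in range(n))
    return 2.0 * total / (n * (n + 1))


# A document copied verbatim into another scores 1.0; unrelated text scores 0.0.
print(r_measure("abcd", ["xxabcdxx"]))
print(r_measure("abcd", ["xyz"]))
```

The naive search is cubic in the worst case and only suitable for short strings; a suffix array over the collection is what makes the measure practical at corpus scale.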

IEEE Transactions on Computers | 2005

Universal text preprocessing for data compression

Jürgen Abel; William John Teahan

Several preprocessing algorithms for text files are presented which complement each other and which are performed prior to the compression scheme. The algorithms need no external dictionary and are language independent. The compression gain is compared along with the cost in speed for the BWT, PPM, and LZ compression schemes. The average overall compression gain is in the range of 3 to 5 percent for the text files of the Calgary Corpus and between 2 and 9 percent for the text files of the large Canterbury Corpus.

International Joint Conference on Artificial Intelligence | 2011

Constituent grammatical evolution

Loukas Georgiou; William John Teahan

We present Constituent Grammatical Evolution (CGE), a new evolutionary automatic programming algorithm that extends the standard Grammatical Evolution algorithm by incorporating the concepts of constituent genes and conditional behaviour-switching. CGE builds a control program from elementary and more complex building blocks; this program dictates the behaviour of an agent, and CGE is applicable to the class of problems where the subject of search is the behaviour of an agent in a given environment. It takes advantage of the powerful Grammatical Evolution feature of using a BNF grammar definition as a plug-in component to describe the output language to be produced by the system. The main benchmark problem on which CGE is evaluated is the Santa Fe Trail problem, using a BNF grammar definition which defines a search space semantically equivalent to that of the original definition of the problem by Koza. Furthermore, CGE is evaluated on two additional problems, the Los Altos Hills and the Hampton Court Maze. The experimental results demonstrate that Constituent Grammatical Evolution outperforms the standard Grammatical Evolution algorithm in these problems, in terms of both efficiency (percent of solutions found) and effectiveness (number of required steps of solutions found).
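The "BNF grammar as a plug-in component" idea comes from standard Grammatical Evolution, where an integer genome selects grammar productions. The sketch below shows only that standard genotype-to-phenotype mapping, not CGE's constituent genes or behaviour-switching. The toy grammar, its terminal names such as `if_food_ahead(`, and the `max_wraps` limit are assumptions for illustration and are not taken from the paper.

```python
# Hypothetical ant-style grammar: each non-terminal maps to a list of productions.
GRAMMAR = {
    "<code>": [["<line>"], ["<code>", "<line>"]],
    "<line>": [["<op>"], ["if_food_ahead(", "<line>", ",", "<line>", ")"]],
    "<op>": [["left()"], ["right()"], ["move()"]],
}


def map_genotype(codons, grammar=GRAMMAR, start="<code>", max_wraps=2):
    """Expand the leftmost non-terminal with production codon % number_of_choices.

    Codons are reused (wrapped) up to `max_wraps` times; if the derivation is
    still unfinished after that, the mapping fails and None is returned.
    """
    symbols = [start]
    output = []
    used = 0
    limit = len(codons) * (max_wraps + 1)
    while symbols:
        symbol = symbols.pop(0)
        if symbol not in grammar:
            output.append(symbol)  # terminal: emit as-is
            continue
        if used >= limit:
            return None  # ran out of codons
        choices = grammar[symbol]
        choice = choices[codons[used % len(codons)] % len(choices)]
        used += 1
        symbols = list(choice) + symbols
    return "".join(output)


print(map_genotype([0, 0, 2]))
print(map_genotype([0, 1, 0, 2, 0, 0]))
```

Because the grammar is an ordinary data structure, swapping in a different output language means changing `GRAMMAR` only; the mapping function stays the same.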

Natural Language Engineering | 2008

A new PPM variant for Chinese text compression

Peiliang Wu; William John Teahan

Large alphabet languages such as Chinese are very different from English, and therefore present different problems for text compression. In this article, we first examine the characteristics of Chinese, then we introduce a new variant of the Prediction by Partial Match (PPM) model especially for Chinese characters. Unlike the traditional PPM coding schemes, which encode an escape probability if a novel character occurs in the context, the new coding scheme directly encodes the order first before encoding a symbol, without having to output an escape probability. This scheme achieves excellent compression rates in comparison with other schemes on a variety of Chinese text files.

Information Visualization | 2015

Storyboarding for visual analytics

Richard L. Walker; Llyr ap Cenydd; Serban R. Pop; Helen C. Miles; Chris J. Hughes; William John Teahan; Jonathan C. Roberts

Analysts wish to explore different hypotheses, organize their thoughts into visual narratives and present their findings. Some developers have used algorithms to ascertain key events from their data, while others have visualized different states of their exploration and utilized free-form canvases to enable the users to develop their thoughts. What is required is a visual layout strategy that summarizes specific events and allows users to lay out the story in a structured way. We propose the use of the concept of ‘storyboarding’ for visual analytics. In film production, storyboarding techniques enable film directors and those working on the film to pre-visualize the shots and evaluate potential problems. We present six principles of storyboarding for visual analytics: composition, viewpoints, transition, annotability, interactivity and separability. We use these principles to develop epSpread, which we apply to the VAST Challenge 2011 microblogging data set and to Twitter data from the 2012 Olympic Games. We present technical challenges and design decisions for developing the epSpread storyboarding visual analytics tool that demonstrate the effectiveness of our design and discuss lessons learnt with the storyboarding method.

European Conference on Information Retrieval | 2005

Knowing-aboutness: question-answering using a logic-based framework

William John Teahan

We describe the background and motivation for a logic-based framework, based on the theory of “Knowing-Aboutness”, and its specific application to Question-Answering. We present the salient features of our system, and outline the benefits of our framework in terms of a more integrated architecture that is more easily evaluated. Favourable results are presented in the TREC 2004 Question-Answering evaluation.

International ACM SIGIR Conference on Research and Development in Information Retrieval | 2004

Context-based methods for text categorisation

D. S. Hunnisett; William John Teahan

We propose several context-based methods for text categorization. One method, a small modification to the PPM compression-based model which is known to significantly degrade compression performance, counter-intuitively has the opposite effect on categorization performance. Another method, called C-measure, simply counts the presence of higher order character contexts, and outperforms all other approaches investigated.

Visual Analytics Science and Technology | 2011

epSpread - Storyboarding for visual analytics

Llyr ap Cenydd; Richard L. Walker; Serban R. Pop; Helen C. Miles; Chris J. Hughes; William John Teahan; Jonathan C. Roberts

We present epSpread, an analysis and storyboarding tool for geolocated microblogging data. Individual time points and ranges are analysed through queries, heatmaps, word clouds and streamgraphs. The underlying narrative is shown on a storyboard-style timeline for discussion, refinement and presentation. The tool was used to analyse data from the VAST Challenge 2011 Mini-Challenge 1, tracking the spread of an epidemic using microblogging data. In this article we describe how the tool was used to identify the origin and track the spread of the epidemic.

Genetic and Evolutionary Computation Conference | 2013

Template based evolution

Christopher J. Headleand; William John Teahan

This paper describes a novel approach to multi-agent simulation where agents evolve freely within their environment. We present Template Based Evolution (TBE), a genetic evolution algorithm that evolves behaviour for embodied situated agents whose fitness is tested implicitly through repeated trials in an environment. All agents that survive in the environment breed freely, creating new agents based on the average genome of two parents. This paper describes the design of the algorithm and applies it to a model where virtual migratory creatures are evolved to survive the simulated environment. Comparisons made between the evolutionary responses of the artificial creatures and observations of natural systems justify the strength of the methodology for species simulation.
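The breeding rule stated in this abstract, a child built from the average genome of two parents, can be illustrated with a short sketch. It assumes genomes are equal-length lists of numbers and uses a caller-supplied `survives` predicate as a stand-in for the environment's implicit fitness test; both choices, along with the function names, are assumptions for the example and not details from the paper.

```python
import random


def breed(parent_a, parent_b):
    """Child genome is the element-wise average of the two parent genomes."""
    if len(parent_a) != len(parent_b):
        raise ValueError("parents must share the same genome length")
    return [(a + b) / 2.0 for a, b in zip(parent_a, parent_b)]


def next_generation(population, survives, n_children, rng):
    """Keep agents that pass the (hypothetical) survival test and let them breed.

    Pairs of survivors are drawn at random; no explicit fitness score is used.
    """
    survivors = [genome for genome in population if survives(genome)]
    if len(survivors) < 2:
        return survivors  # not enough agents left to breed
    children = [breed(*rng.sample(survivors, 2)) for _ in range(n_children)]
    return survivors + children


population = [[0.0, 1.0], [1.0, 0.0], [0.2, 0.9], [-1.0, 0.5]]
rng = random.Random(0)
# Toy survival rule: the first gene must be non-negative.
print(next_generation(population, lambda g: g[0] >= 0.0, 3, rng))
```

Averaging pulls children toward the centre of the surviving population, so any diversity in such a scheme has to come from the environment or from mutation, which this sketch leaves out.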
