Publication


Featured research published by W. J. Teahan.


Data Compression Conference | 1995

Unbounded length contexts for PPM

John G. Cleary; W. J. Teahan; Ian H. Witten

The prediction by partial matching (PPM) data compression scheme has set the performance standard in lossless compression of text throughout the past decade. The original algorithm was first published in 1984 by Cleary and Witten, and a series of improvements was described by Moffat (1990), culminating in a careful implementation, called PPMC, which has become the benchmark version. This still achieves results superior to virtually all other compression methods, despite many attempts to better it. PPM is a finite-context statistical modeling technique that can be viewed as blending together several fixed-order context models to predict the next character in the input sequence. Prediction probabilities for each context in the model are calculated from frequency counts which are updated adaptively, and the symbol that actually occurs is encoded relative to its predicted distribution using arithmetic coding. The paper describes a new algorithm, PPM*, which exploits contexts of unbounded length. It reliably achieves compression superior to PPMC, although our current implementation uses considerably greater computational resources (both time and space). The basic PPM* compression scheme is described, showing the use of contexts of unbounded length and how it can be implemented using a tree data structure. Some results are given that demonstrate an improvement of about 6% over the old method.
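
As a rough illustration of the blending-and-escape idea described above, the following is a bounded-order sketch in Python. It is not the PPM* algorithm, its unbounded contexts, or its tree data structure; symbol exclusion and arithmetic coding are omitted, and all names are illustrative.

```python
from collections import defaultdict

class SimplePPM:
    """Minimal PPM-style model: counts symbols in contexts of order 0..max_order
    and predicts by escaping from the longest matching context to shorter ones.
    This is an illustrative sketch, not the PPM* algorithm from the paper."""

    def __init__(self, max_order=3):
        self.max_order = max_order
        # counts[context][symbol] = frequency of symbol following context
        self.counts = defaultdict(lambda: defaultdict(int))

    def train(self, text):
        for i, sym in enumerate(text):
            for order in range(self.max_order + 1):
                if i >= order:
                    ctx = text[i - order:i]
                    self.counts[ctx][sym] += 1

    def prob(self, history, sym):
        """P(sym | history) with a PPMC-like escape: the escape gets
        (#distinct symbols) / (total count + #distinct symbols).
        Symbol exclusion is omitted for brevity."""
        p_escape = 1.0
        for order in range(min(self.max_order, len(history)), -1, -1):
            ctx = history[len(history) - order:]
            table = self.counts.get(ctx)
            if not table:
                continue
            total = sum(table.values())
            distinct = len(table)
            if sym in table:
                return p_escape * table[sym] / (total + distinct)
            p_escape *= distinct / (total + distinct)
        # Fall back to a uniform distribution over a nominal 256-symbol alphabet.
        return p_escape / 256

model = SimplePPM(max_order=3)
model.train("the quick brown fox jumps over the lazy dog")
# -> 0.25: 'l' was seen once after the order-3 context "he " (1 of 2 counts, 2 distinct symbols)
print(model.prob("the quick brown fox jumps over the ", "l"))
```

With PPMC-style escapes, a context holding n counts over d distinct symbols assigns d/(n+d) to the escape and c/(n+d) to a symbol seen c times; the sketch multiplies the escape probabilities accumulated while backing off to shorter contexts.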


Data Compression Conference | 1996

The entropy of English using PPM-based models

W. J. Teahan; John G. Cleary

The purpose of this paper is to show that the difference between the best machine models and human models is smaller than might be indicated by the previous results. This follows from a number of observations: firstly, the original human experiments used only 27-character English (letters plus space), whereas most computer experiments used the full 128-character ASCII text; secondly, using large amounts of priming text substantially improves PPM's performance; and thirdly, the PPM algorithm can be modified to perform better on English text. The result is a machine performance down to 1.46 bits per character. The problem of estimating the entropy of English is discussed. The importance of training text for PPM is demonstrated, showing that its performance can be improved by adjusting the alphabet used. The results based on these improvements are then given, with compression down to 1.46 bpc.
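
The figures quoted above are in bits per character (bpc): the average number of bits a model needs to code each character, which upper-bounds the entropy of the source. A minimal sketch of that calculation, assuming a hypothetical prob_fn(history, symbol) hook standing in for a trained PPM model:

```python
import math

def bits_per_character(prob_fn, text):
    """Cross-entropy of a predictive model on `text`, in bits per character.
    prob_fn(history, symbol) must return P(symbol | history) > 0; the result
    upper-bounds the entropy of the source that generated the text."""
    total_bits = 0.0
    for i, sym in enumerate(text):
        p = prob_fn(text[:i], sym)
        total_bits += -math.log2(p)
    return total_bits / len(text)

# Example with a trivial uniform model over 27 symbols (a-z plus space),
# mirroring the 27-character alphabet used in the human experiments:
uniform27 = lambda history, sym: 1.0 / 27
print(bits_per_character(uniform27, "the quick brown fox"))   # log2(27), about 4.75 bpc
```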


Data Compression Conference | 1995

Experiments on the zero frequency problem

John G. Cleary; W. J. Teahan

Summary form only given. A fundamental problem in the construction of statistical techniques for data compression of sequential text is the generation of probabilities from counts of previous occurrences. Each context used in the statistical model accumulates counts of the number of times each symbol has occurred in that context. So in a binary alphabet there will be two counts, C_0 and C_1 (the number of times a 0 or a 1 has occurred). The problem then is to take the counts and generate from them a probability that the next character will be a 0 or a 1. A naive estimate of the probability of character i could be obtained by the ratio p_i = C_i / (C_0 + C_1). A fundamental problem with this is that it will generate a zero probability if C_0 or C_1 is zero. Unfortunately, a zero probability prevents coding from working correctly, as the optimum code length in this case is infinite. Consequently, any estimate of the probabilities must be non-zero even in the presence of zero counts. This problem is called the zero frequency problem. A well-known solution to the problem was formulated by Laplace and is known as Laplace's law of succession. We have investigated the correctness of Laplace's law by experiment.
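
For reference, Laplace's law of succession adds one to every count before normalising, so no symbol is ever assigned zero probability. A one-function sketch (illustrative only, not the paper's experimental setup):

```python
def laplace_probability(counts):
    """Laplace's law of succession: add one to every count so that no symbol
    gets zero probability.  For binary counts (C_0, C_1) this gives
    p_i = (C_i + 1) / (C_0 + C_1 + 2)."""
    total = sum(counts) + len(counts)
    return [(c + 1) / total for c in counts]

print(laplace_probability([0, 5]))   # [1/7, 6/7]: the unseen symbol keeps p > 0
```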


Data Compression Conference | 1998

Correcting English text using PPM models

W. J. Teahan; Stuart J. Inglis; John G. Cleary; Geoffrey Holmes

An essential component of many applications in natural language processing is a language modeler able to correct errors in the text being processed. For optical character recognition (OCR), poor scanning quality or extraneous pixels in the image may cause one or more characters to be mis-recognized, while for spelling correction, two characters may be transposed, or a character may be inadvertently inserted or missed out. This paper describes a method for correcting English text using a PPM model. A method that segments words in English text is introduced and is shown to be a significant improvement over previously used methods. A similar technique is also applied as a post-processing stage after pages have been recognized by a state-of-the-art commercial OCR system. We show that the accuracy of the OCR system can be increased from 96.3% to 96.9%, a decrease of about 14 errors per page.
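
The paper's own correction method is not reproduced in this summary; the sketch below only illustrates the general idea of language-model-based correction: enumerate single-edit candidates (deletion, transposition, substitution, insertion) and keep the one the model scores highest. The log_prob hook and the toy scorer are hypothetical stand-ins for a trained PPM model.

```python
def candidate_edits(word, alphabet="abcdefghijklmnopqrstuvwxyz"):
    """Single-edit candidates: deletions, transpositions, substitutions, insertions."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    substitutes = [a + c + b[1:] for a, b in splits if b for c in alphabet]
    inserts = [a + c + b for a, b in splits for c in alphabet]
    return set([word] + deletes + transposes + substitutes + inserts)

def correct(word, log_prob):
    """Pick the candidate the language model scores highest.  `log_prob` is a
    hypothetical hook returning log P(candidate) under a PPM-like model."""
    return max(candidate_edits(word), key=log_prob)

# Toy scorer standing in for a trained character model:
toy_scores = {"the": -1.0, "teh": -8.0}
print(correct("teh", lambda w: toy_scores.get(w, -20.0)))   # -> "the"
```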


Data Compression Conference | 1997

Models of English text

W. J. Teahan; John G. Cleary

The problem of constructing models of English text is considered. A number of applications of such models, including cryptology, spelling correction and speech recognition, are reviewed. The best current models of English text have been the result of research into compression. Not only is this an important application of such models, but the amount of compression provides a measure of how well such models perform. Three main classes of models are considered: character-based models, word-based models, and models which use auxiliary information in the form of parts of speech. These models are compared in terms of their memory usage and compression.
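
A crude, illustrative way to see the memory trade-off between character-based and word-based models is to count how many distinct (context, symbol) pairs an order-1 model of each kind must store. The helper below is a sketch on a toy sample, not the paper's methodology or measurements.

```python
from collections import Counter

def model_stats(tokens):
    """Order-1 context model over a token stream: report the vocabulary size
    and how many distinct (context, token) pairs it must store, a crude
    proxy for memory usage."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return {"vocabulary": len(set(tokens)), "stored pairs": len(pairs)}

text = "the cat sat on the mat because the cat was tired"
print("character model:", model_stats(list(text)))
print("word model:     ", model_stats(text.split()))
```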


Data Compression Conference | 1998

Tag based models of English text

W. J. Teahan; John G. Cleary

The problem of compressing English text is important both because of the ubiquity of English as a target for compression and because of the light that compression can shed on the structure of English. English text is examined in conjunction with additional information about the part of speech of each word in the text (these are referred to as tags). It is shown that the tags plus the text can be compressed more than the text alone; in effect, the tags come for free, or even yield a small net saving in size. A comparison is made of a number of different ways of integrating compression of tags and text using an escape mechanism similar to PPM. These are also compared with standard word-based and character-based compression programs. The result is that the tag-based and word-based schemes always outperform the character-based schemes. Overall, the tag-based schemes outperform the word-based schemes. We conclude by conjecturing that tags chosen for compression rather than linguistic purposes would perform even better.
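
One simple way to picture the tag-plus-text idea (a sketch only, not the paper's escape-based scheme): code each word as a tag predicted from the previous tag, followed by the word predicted from its tag, and charge the sum of the two code lengths. The tiny tagged sample and counts below are illustrative.

```python
from collections import defaultdict
import math

# Predict the next tag from the previous tag, then the word given the tag.
# The paper's escape mechanism and PPM blending are omitted; counts are toy data.
tag_bigrams = defaultdict(lambda: defaultdict(int))
word_given_tag = defaultdict(lambda: defaultdict(int))

tagged = [("the", "DET"), ("cat", "NOUN"), ("sat", "VERB"), ("on", "PREP"),
          ("the", "DET"), ("mat", "NOUN")]

prev_tag = "<s>"
for word, tag in tagged:
    tag_bigrams[prev_tag][tag] += 1
    word_given_tag[tag][word] += 1
    prev_tag = tag

def bits(word, tag, prev_tag):
    """Code length in bits for (tag, word): -log2 P(tag|prev_tag) - log2 P(word|tag)."""
    p_tag = tag_bigrams[prev_tag][tag] / sum(tag_bigrams[prev_tag].values())
    p_word = word_given_tag[tag][word] / sum(word_given_tag[tag].values())
    return -math.log2(p_tag) - math.log2(p_word)

print(bits("cat", "NOUN", "DET"))   # cost of coding the tag plus the word
```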


Data Compression Conference | 1999

An open interface for probabilistic models of text

John G. Cleary; W. J. Teahan

Summary form only given. An application program interface (API) for modelling sequential text is described. The API is intended to shield the user from details of the modelling and probability estimation process. This should enable different implementations of models to be replaced transparently in application programs. The motivation for this API is work on the use of textual models for applications in addition to strict data compression. The API is probabilistic, that is, it supplies the probability of the next symbol in the sequence. It is general enough to deal accurately with models that include escape probabilities. The concepts abstracted by the API are explained, together with details of the API calls. Such predictive models can be used for a number of applications other than compression. Users of the models do not want to be concerned with the details either of the implementation of the models or of how they were trained and the sources of the training text. The problem considered is how to permit code for different models, and the trained models themselves, to be interchanged easily between users. The fundamental idea is that it should be possible to write application programs independent of the details of particular modelling code, to implement different modelling code independent of the various applications, and to exchange different pre-trained models easily between users. It is hoped that this independence will foster the exchange and use of high-performance modelling code, the construction of sophisticated adaptive systems based on the best available models, the proliferation of high-quality models of standard text types such as English and other natural languages, and easy comparison of different modelling techniques.
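
The paper's actual call signatures are not reproduced in this summary, so the interface below is only a guess at the shape such an API might take: a model exposes a reset/predict/update cycle and nothing else, letting implementations and pre-trained models be swapped transparently behind it. All names are hypothetical.

```python
from abc import ABC, abstractmethod
from typing import Dict

class TextModel(ABC):
    """Hypothetical interface in the spirit of the paper's API: applications see
    only next-symbol probabilities, never the model internals or training data.
    Method names are illustrative, not those defined in the paper."""

    @abstractmethod
    def reset(self) -> None:
        """Return the model to its initial context."""

    @abstractmethod
    def predict(self) -> Dict[str, float]:
        """Probability distribution over the next symbol given the current context."""

    @abstractmethod
    def update(self, symbol: str) -> None:
        """Advance the context by one symbol (and adapt, if the model is adaptive)."""

class UniformModel(TextModel):
    """Trivial implementation used to show that models are interchangeable."""
    def __init__(self, alphabet):
        self.alphabet = list(alphabet)
    def reset(self):
        pass
    def predict(self):
        p = 1.0 / len(self.alphabet)
        return {s: p for s in self.alphabet}
    def update(self, symbol):
        pass

model: TextModel = UniformModel("ab")
model.reset()
print(model.predict())   # {'a': 0.5, 'b': 0.5}
```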


The Computer Journal | 1997

Unbounded Length Contexts for PPM

John G. Cleary; W. J. Teahan


Archive | 1999

Using language models for generic entity extraction

Ian H. Witten; Zane Bray; Malika Mahoui; W. J. Teahan


Archive | 1995

Unbounded context lengths for PPM

John G. Cleary; W. J. Teahan; Ian H. Witten

Collaboration


Dive into W. J. Teahan's collaborations.

Top Co-Authors

Ian H. Witten

University of Waikato

Zane Bray

University of Waikato
