Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Gareth J. Janacek is active.

Publication


Featured research published by Gareth J. Janacek.


Knowledge Discovery and Data Mining | 2004

Clustering time series from ARMA models with clipped data

Anthony J. Bagnall; Gareth J. Janacek

Clustering time series is a problem that has applications in a wide variety of fields, and has recently attracted a large amount of research. In this paper we focus on clustering data derived from Autoregressive Moving Average (ARMA) models using k-means and k-medoids algorithms with the Euclidean distance between estimated model parameters. We justify our choice of clustering technique and distance metric by reproducing results obtained in related research. Our research aim is to assess the effects of discretising data into binary sequences of above and below the median, a process known as clipping, on the clustering of time series. It is known that the fitted AR parameters of clipped data tend asymptotically to the parameters for unclipped data. We exploit this result to demonstrate that for long series the clustering accuracy when using clipped data from the class of ARMA models is not significantly different to that achieved with unclipped data. Next we show that if the data contains outliers then using clipped data produces significantly better clusterings. We then demonstrate that using clipped series requires much less memory, and that operations such as distance calculations can be much faster. Finally, we demonstrate these advantages on three real world data sets.
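The clipping step this abstract describes (discretising a series to above/below its median) can be sketched in a few lines; the function name and sample data below are illustrative, not from the paper:

```python
import numpy as np

def clip_series(x):
    """Clip a real-valued series to a binary sequence:
    1 where the value is above the series median, 0 otherwise."""
    x = np.asarray(x, dtype=float)
    return (x > np.median(x)).astype(np.uint8)

# Illustrative example: the median of [1, 2, 3, 4, 5] is 3
print(clip_series([1.0, 4.0, 2.0, 5.0, 3.0]))  # prints [0 1 0 1 0]
```

The resulting bit sequence is what the k-means and k-medoids experiments in the paper operate on after model fitting.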


Data Mining and Knowledge Discovery | 2006

A Bit Level Representation for Time Series Data Mining with Shape Based Similarity

Anthony J. Bagnall; Chotirat Ann Ratanamahatana; Eamonn J. Keogh; Stefano Lonardi; Gareth J. Janacek

Clipping is the process of transforming a real valued series into a sequence of bits representing whether each data is above or below the average. In this paper, we argue that clipping is a useful and flexible transformation for the exploratory analysis of large time dependent data sets. We demonstrate how time series stored as bits can be very efficiently compressed and manipulated and that, under some assumptions, the discriminatory power with clipped series is asymptotically equivalent to that achieved with the raw data. Unlike other transformations, clipped series can be compared directly to the raw data series. We show that this means we can form a tight lower bounding metric for Euclidean and Dynamic Time Warping distance and hence efficiently query by content. Clipped data can be used in conjunction with a host of algorithms and statistical tests that naturally follow from the binary nature of the data. A series of experiments illustrate how clipped series can be used in increasingly complex ways to achieve better results than other popular representations. The usefulness of the proposed representation is demonstrated by the fact that the results with clipped data are consistently better than those achieved with a Wavelet or Discrete Fourier Transformation at the same compression ratio for both clustering and query by content. The flexibility of the representation is shown by the fact that we can take advantage of a variable Run Length Encoding of clipped series to define an approximation of the Kolmogorov complexity and hence perform Kolmogorov based clustering.
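The variable run-length encoding of clipped series mentioned above can be sketched as follows; the function name is my own, not from the paper:

```python
def run_length_encode(bits):
    """Encode a binary sequence as (value, run_length) pairs.
    Clipped series often contain long runs, so this is compact."""
    runs = []
    for b in bits:
        if runs and runs[-1][0] == b:
            runs[-1][1] += 1  # extend the current run
        else:
            runs.append([b, 1])  # start a new run
    return [(v, n) for v, n in runs]

print(run_length_encode([1, 1, 1, 0, 0, 1]))  # prints [(1, 3), (0, 2), (1, 1)]
```

The lengths of such an encoding are the quantity the paper's Kolmogorov-complexity approximation is built from.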


Machine Learning | 2005

Clustering Time Series with Clipped Data

Anthony J. Bagnall; Gareth J. Janacek

Clustering time series is a problem that has applications in a wide variety of fields, and has recently attracted a large amount of research. Time series data are often large and may contain outliers. We show that the simple procedure of clipping the time series (discretising to above or below the median) reduces memory requirements and significantly speeds up clustering without decreasing clustering accuracy. We also demonstrate that clipping increases clustering accuracy when there are outliers in the data, thus serving as a means of outlier detection and a method of identifying model misspecification. We consider simulated data from polynomial, autoregressive moving average and hidden Markov models and show that the estimated parameters of the clipped data used in clustering tend, asymptotically, to those of the unclipped data. We also demonstrate experimentally that, if the series are long enough, the accuracy on clipped data is not significantly less than the accuracy on unclipped data, and if the series contain outliers then clipping results in significantly better clusterings. We then illustrate how using clipped series can be of practical benefit in detecting model misspecification and outliers on two real world data sets: an electricity generation bid data set and an ECG data set.


Neural Networks | 2007

2007 Special Issue: Predictive uncertainty in environmental modelling

Gavin C. Cawley; Gareth J. Janacek; M. R. Haylock; Stephen Dorling

Artificial neural networks have proved an attractive approach to non-linear regression problems arising in environmental modelling, such as statistical downscaling, short-term forecasting of atmospheric pollutant concentrations and rainfall run-off modelling. However, environmental datasets are frequently very noisy and characterised by a noise process that may be heteroscedastic (having input-dependent variance) and/or non-Gaussian. The aim of this paper is to review an existing methodology for estimating predictive uncertainty in such situations and, more importantly, to illustrate how a model of the predictive distribution may be exploited in assessing the possible impacts of climate change and in improving current decision-making processes. The results of the WCCI-2006 predictive uncertainty in environmental modelling challenge are also reviewed, and some areas are suggested where further research may provide significant benefits.


Fisheries Research | 1999

Sampling trips for measuring discards in commercial fishing based on multilevel modelling of measurements in the North Sea from NE England

Duncan Tamsett; Gareth J. Janacek

This study estimates the effects of variables (fishing, spatial, temporal) on discarding and proposes a scheme for measuring discarding. Data are used from fishing in the English North Sea for cod, haddock and whiting. Analysis of variance of discarding rates (numbers discarded/numbers caught) has provided estimates of components of variance associated with variables. Significant estimates of components of variance have emerged from data spanning 3 years. There is a large component of variance associated with trips (the lowest level at which data are available), reflecting an inherent noisiness in discarding rates. For haddock and whiting, there are significant components of variance associated with years. An effect of years on cod was not found for the years for which data are available. The variables gear type, port of landing, fishing grounds and quarter are crossed at a common level. Of these, gear contributes the greatest variation in discarding rates. Port and grounds are also important contributors. Quarter has a small but significant effect. Multivariate analysis of variance has produced evidence for strong multilevel correlations in discarding rates between haddock and whiting. Discarding rates vary significantly with size of catch (trip) for cod and for haddock. Variables are reduced to strata, and analyses of variance for models incorporating strata carried out. An approach to sampling based on stratification with proportional allocation is discussed. This provides efficient sampling and ensures representative sampling in a problem for which there are relatively few observations and several variables affecting the response variable.


Knowledge Discovery and Data Mining | 2005

A likelihood ratio distance measure for the similarity between the Fourier transform of time series

Gareth J. Janacek; Anthony J. Bagnall; M. Powell

Fast Fourier Transforms (FFTs) have been a popular transformation and compression technique in time series data mining since first being proposed for use in this context in [1]. The Euclidean distance between coefficients has been the most commonly used distance metric with FFTs. However, on many problems it is not the best measure of similarity available. In this paper we describe an alternative distance measure based on the likelihood ratio statistic to test the hypothesis of difference between series. We compare the new distance measure to Euclidean distance on five types of data with varying levels of compression. We show that the likelihood ratio measure is better at discriminating between series from different models and grouping series from the same model.
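The Euclidean-distance-between-FFT-coefficients baseline that this paper compares against can be sketched as follows; the truncation parameter `k` and the function name are my choices, not the paper's:

```python
import numpy as np

def fft_euclidean(x, y, k=4):
    """Euclidean distance between the first k Fourier coefficients
    of two equal-length real series (a standard compressed
    representation in time series data mining)."""
    fx = np.fft.rfft(x)[:k]
    fy = np.fft.rfft(y)[:k]
    return float(np.sqrt(np.sum(np.abs(fx - fy) ** 2)))
```

Keeping only the low-order coefficients gives the varying compression levels the experiments refer to; the likelihood ratio statistic proposed in the paper replaces this distance, not the transform itself.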


Fisheries Research | 1999

Onboard sampling for measuring discards in commercial fishing based on multilevel modelling of measurements in the Irish Sea from NW England and N Wales

Duncan Tamsett; Gareth J. Janacek; Mike Emberton; Bill Lart; Grant Course

A method for onboard sampling of discards developed for the North Sea was found to be unsatisfactory in the Irish Sea. An alternative method for the Irish Sea provided data between August 1993 and September 1994 on discarding rates at the haul and within-haul levels, as well as the trip level (the lowest level for which data are available from the method used in the North Sea). The data are subjected to multilevel analysis of variance to measure components of variance associated with fishing and environmental parameters. Discarding rates at the haul level are affected by distance from the coast and duration of haul. Within-haul estimates of discarding rates are analysed for evidence that estimates from samples of catches for hauls are over-dispersed (relative to the dispersion associated with binomially distributed discarding rates). This has a bearing on the optimal size that samples of catches for hauls should be. No evidence for over-dispersion is found for the Irish Sea, for which the data set with within-haul data is small. However, analysis of a data set from the English Channel, for which there is a greater volume of data at the within-haul level, has indicated significant and sizeable over-dispersion. An approach is outlined for calculating the optimal size that samples of catches for hauls should be for estimating discarding rates as a function of year class, taking account of over-dispersion. Optimal sample sizes are surprisingly small. This might render practicable the acquisition of samples of hauls by fishermen in the absence of an onboard technician, for analysis by technicians post-trip.


Journal of Climate | 2008

Real-Time Extraction of the Madden-Julian Oscillation Using Empirical Mode Decomposition and Statistical Forecasting with a VARMA Model

Barnaby S. Love; Adrian J. Matthews; Gareth J. Janacek

A simple guide to the new technique of empirical mode decomposition (EMD) in a meteorological–climate forecasting context is presented. A single application of EMD to a time series essentially acts as a local high-pass filter. Hence, successive applications can be used to produce a bandpass filter that is highly efficient at extracting a broadband signal such as the Madden–Julian oscillation (MJO). The basic EMD method is adapted to minimize end effects, such that it is suitable for use in real time. The EMD process is then used to efficiently extract the MJO signal from gridded time series of outgoing longwave radiation (OLR) data. A range of statistical models from the general class of vector autoregressive moving average (VARMA) models was then tested for their suitability in forecasting the MJO signal, as isolated by the EMD. A VARMA (5, 1) model was selected and its parameters determined by a maximum likelihood method using 17 yr of OLR data from 1980 to 1996. Forecasts were then made on the...


International Symposium on Neural Networks | 2007

Generalised Kernel Machines

Gavin C. Cawley; Gareth J. Janacek; Nicola L. C. Talbot

The generalised linear model (GLM) is the standard approach in classical statistics for regression tasks where it is appropriate to measure the data misfit using a likelihood drawn from the exponential family of distributions. In this paper, we apply the kernel trick to give a non-linear variant of the GLM, the generalised kernel machine (GKM), in which a regularised GLM is constructed in a fixed feature space implicitly defined by a Mercer kernel. The MATLAB symbolic maths toolbox is used to automatically create a suite of generalised kernel machines, including methods for automated model selection based on approximate leave-one-out cross-validation. In doing so, we provide a common framework encompassing a wide range of existing and novel kernel learning methods, and highlight their connections with earlier techniques from classical statistics. Examples including kernel ridge regression, kernel logistic regression and kernel Poisson regression are given to demonstrate the flexibility and utility of the generalised kernel machine.
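Kernel ridge regression, one of the GKM instances the abstract lists, can be sketched with an RBF kernel as follows; this is a minimal illustration, not the paper's MATLAB implementation, and the hyperparameter values are arbitrary:

```python
import numpy as np

def kernel_ridge_fit(X, y, gamma=1.0, lam=1e-3):
    """Fit kernel ridge regression with an RBF kernel:
    solve (K + lam*I) alpha = y for the dual coefficients."""
    X = np.asarray(X, dtype=float)
    sq_dist = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = np.exp(-gamma * sq_dist)
    return np.linalg.solve(K + lam * np.eye(len(X)), np.asarray(y, dtype=float))

def kernel_ridge_predict(X_train, alpha, X_new, gamma=1.0):
    """Predict as a kernel expansion over the training points."""
    X_train = np.asarray(X_train, dtype=float)
    X_new = np.asarray(X_new, dtype=float)
    sq_dist = np.sum((X_new[:, None, :] - X_train[None, :, :]) ** 2, axis=-1)
    return np.exp(-gamma * sq_dist) @ alpha
```

Swapping the squared-error objective implicit here for another exponential-family likelihood (logistic, Poisson) yields the other GKM members named in the abstract.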


Journal of Classification | 2014

A Run Length Transformation for Discriminating Between Auto Regressive Time Series

Anthony J. Bagnall; Gareth J. Janacek

We describe a simple time series transformation to detect differences in series that can be accurately modelled as stationary autoregressive (AR) processes. The transformation involves forming the histogram of above and below the mean run lengths. The run length (RL) transformation has the benefits of being very fast, compact and updatable for new data in constant time. Furthermore, it can be generated directly from data that has already been highly compressed. We first establish the theoretical asymptotic relationship between run length distributions and AR models through consideration of the zero crossing probability and the distribution of runs. We benchmark our transformation against two alternatives: the truncated Autocorrelation function (ACF) transform and the AR transformation, which involves the standard method of fitting the partial autocorrelation coefficients with the Durbin-Levinson recursions and using the Akaike Information Criterion stopping procedure. Whilst optimal in the idealized scenario, representing the data in these ways is time consuming and the representation cannot be updated online for new data. We show that for classification problems the accuracy obtained through using the run length distribution tends towards that obtained from using the full fitted models. We then propose three alternative distance measures for run length distributions based on Gower’s general similarity coefficient, the likelihood ratio and dynamic time warping (DTW). Through simulated classification experiments we show that a nearest neighbour distance based on DTW converges to the optimal faster than classifiers based on Euclidean distance, Gower’s coefficient and the likelihood ratio. 
We experiment with a variety of classifiers and demonstrate that although the RL transform requires more data than the best performing classifier to achieve the same accuracy as AR or ACF, this factor is at worst non-increasing with the series length, m, whereas the relative time taken to fit AR and ACF increases with m. We conclude that if the data is stationary and can be suitably modelled by an AR series, and if time is an important factor in reaching a discriminatory decision, then the run length distribution transform is a simple and effective transformation to use.
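A minimal sketch of the run-length histogram transform described above; the bin count and the choice to pool long runs into the final bin are my assumptions, not the paper's exact settings:

```python
import numpy as np

def run_length_transform(x, max_len=5):
    """Histogram of above/below-the-mean run lengths, with runs
    longer than max_len pooled into the final bin."""
    x = np.asarray(x, dtype=float)
    bits = x > x.mean()
    hist = np.zeros(max_len, dtype=int)
    run = 1
    for prev, cur in zip(bits, bits[1:]):
        if cur == prev:
            run += 1          # current run continues
        else:
            hist[min(run, max_len) - 1] += 1  # record finished run
            run = 1
    hist[min(run, max_len) - 1] += 1          # record the final run
    return hist
```

The transform is updatable in constant time for new observations (only the current run and one histogram bin change), which is the speed advantage the abstract emphasises.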

Collaboration


Dive into Gareth J. Janacek's collaborations.

Top Co-Authors

Gavin C. Cawley (University of East Anglia)
Duncan Tamsett (University of East Anglia)
L. Swift (University of East Anglia)
Sue Bailey (University of East Anglia)