Publication


Featured research published by Gavin C. Cawley.


Pattern Recognition | 2003

Efficient leave-one-out cross-validation of kernel Fisher discriminant classifiers

Gavin C. Cawley; Nicola L. C. Talbot

Mika et al. (in: Neural Networks for Signal Processing, Vol. IX, IEEE Press, New York, 1999; pp. 41–48) apply the "kernel trick" to obtain a non-linear variant of Fisher's linear discriminant analysis method, demonstrating state-of-the-art performance on a range of benchmark data sets. We show that leave-one-out cross-validation of kernel Fisher discriminant classifiers can be implemented with a computational complexity of only O(l³) operations rather than the O(l⁴) of a naive implementation, where l is the number of training patterns. Leave-one-out cross-validation then becomes an attractive means of model selection in large-scale applications of kernel Fisher discriminant analysis, being significantly faster than the k-fold cross-validation procedures conventionally used.
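The O(l³) saving rests on a standard identity for least-squares-type fits: once the "hat" matrix mapping targets to fitted outputs is computed, every leave-one-out residual follows from the ordinary training residuals and the hat-matrix diagonal, with no retraining. Below is a minimal Python sketch of that identity, using kernel ridge regression as a stand-in for the closely related kernel Fisher discriminant; the function names and toy data are illustrative, not taken from the paper.

```python
import numpy as np

def rbf_kernel(X, Y, gamma=1.0):
    """Gaussian RBF kernel matrix between the rows of X and Y."""
    sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

def loo_residuals(K, y, lam=1e-2):
    """All l leave-one-out residuals from one O(l^3) matrix inversion.

    For a least-squares fit y_hat = H y with hat matrix
    H = K (K + lam I)^{-1}, the LOO residual for pattern i is
    (y_i - y_hat_i) / (1 - H_ii), so no model is ever retrained.
    """
    l = K.shape[0]
    H = K @ np.linalg.inv(K + lam * np.eye(l))  # O(l^3), computed once
    residuals = y - H @ y                       # ordinary training residuals
    return residuals / (1.0 - np.diag(H))       # all LOO residuals in O(l)

# toy usage: score one (kernel, lam) candidate by its LOO mean squared error
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = np.sign(X[:, 0] + 0.3 * rng.normal(size=100))
print("LOO MSE:", np.mean(loo_residuals(rbf_kernel(X, X), y) ** 2))
```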


Atmospheric Environment | 2003

Extensive evaluation of neural network models for the prediction of NO2 and PM10 concentrations, compared with a deterministic modelling system and measurements in central Helsinki

Jaakko Kukkonen; Leena Partanen; Ari Karppinen; Juhani Ruuskanen; Heikki Junninen; Mikko Kolehmainen; Harri Niska; Stephen Dorling; Tim Chatterton; Rob Foxall; Gavin C. Cawley

Five neural network (NN) models, a linear statistical model and a deterministic modelling system (DET) were evaluated for the prediction of urban NO2 and PM10 concentrations. The model evaluation work considered the sequential hourly concentration time series of NO2 and PM10, which were measured at two stations in central Helsinki from 1996 to 1999. The models utilised selected traffic flow and pre-processed meteorological variables as input data. An imputed concentration dataset was also created, in which the missing values were replaced, in order to obtain a harmonised database well suited to the inter-comparison of models. Three statistical criteria were adopted: the index of agreement (IA), the squared correlation coefficient (R2) and the fractional bias. The results obtained with the various non-linear NN models show good agreement with the measured concentration data for NO2; for instance, the annual means of the IA values and their standard deviations range from 0.86±0.02 to 0.91±0.01. In the case of NO2, the non-linear NN models produce a range of model performance values slightly better than those of the DET. NN models generally perform better than the statistical linear model for predicting both NO2 and PM10 concentrations. In the case of PM10, the model performance statistics of the NN models were not as good as those for NO2 over the entire range of models considered. However, the currently available NN models are applicable neither for predicting spatial concentration distributions in urban areas nor for evaluating air pollution abatement scenarios for future years.
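For reference, the three evaluation statistics named above are straightforward to compute. The sketch below uses their standard textbook definitions (Willmott's index of agreement, the squared Pearson correlation and the fractional bias); these definitions are assumptions on my part, not code from the study.

```python
import numpy as np

def index_of_agreement(obs, pred):
    """Willmott's IA in [0, 1]; 1 indicates perfect agreement."""
    obar = obs.mean()
    num = ((pred - obs) ** 2).sum()
    den = ((np.abs(pred - obar) + np.abs(obs - obar)) ** 2).sum()
    return 1.0 - num / den

def r_squared(obs, pred):
    """Squared Pearson correlation coefficient."""
    return np.corrcoef(obs, pred)[0, 1] ** 2

def fractional_bias(obs, pred):
    """FB = 2 (mean_obs - mean_pred) / (mean_obs + mean_pred); 0 is unbiased."""
    return 2.0 * (obs.mean() - pred.mean()) / (obs.mean() + pred.mean())
```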


Bioinformatics | 2006

Gene selection in cancer classification using sparse logistic regression with Bayesian regularization

Gavin C. Cawley; Nicola L. C. Talbot

MOTIVATION Gene selection algorithms for cancer classification, based on the expression of a small number of biomarker genes, have been the subject of considerable research in recent years. Shevade and Keerthi propose a gene selection algorithm based on sparse logistic regression (SLogReg) incorporating a Laplace prior to promote sparsity in the model parameters, and provide a simple but efficient training procedure. The degree of sparsity obtained is determined by the value of a regularization parameter, which must be carefully tuned in order to optimize performance. This normally involves a model selection stage, based on a computationally intensive search for the minimizer of the cross-validation error. In this paper, we demonstrate that a simple Bayesian approach can be taken to eliminate this regularization parameter entirely, by integrating it out analytically using an uninformative Jeffreys prior. The improved algorithm (BLogReg) is then typically two or three orders of magnitude faster than the original algorithm, as there is no longer a need for a model selection step. The BLogReg algorithm is also free from selection bias in performance estimation, a common pitfall in the application of machine learning algorithms in cancer classification. RESULTS The SLogReg, BLogReg and Relevance Vector Machine (RVM) gene selection algorithms are evaluated over the well-studied colon cancer and leukaemia benchmark datasets. The leave-one-out estimates of the probability of test error and cross-entropy of the BLogReg and SLogReg algorithms are very similar; however, the BLogReg algorithm is found to be considerably faster than the original SLogReg algorithm. Using nested cross-validation to avoid selection bias, performance estimation for SLogReg on the leukaemia dataset takes almost 48 h, whereas the corresponding result for BLogReg is obtained in only 1 min 24 s, making BLogReg by far the more practical algorithm. BLogReg also provides better estimates of conditional probability than the RVM, which are of great importance in medical applications, at similar computational expense. AVAILABILITY A MATLAB implementation of the sparse logistic regression algorithm with Bayesian regularization (BLogReg) is available from http://theoval.cmp.uea.ac.uk/~gcc/cbl/blogreg/
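The device of integrating out the regularization parameter has a compact interpretation: marginalising the scale of the Laplace prior under a Jeffreys hyperprior replaces the penalty λ Σ|w_i| with one proportional to log Σ|w_i|, whose gradient behaves like an L1 penalty with an effective weight that adapts to the current weights, leaving nothing to tune. The sketch below illustrates this with a generic proximal gradient loop; it is not the paper's component-wise training procedure, and the step size, iteration count and numerical floor are illustrative assumptions.

```python
import numpy as np

def blogreg_sketch(X, y, n_iter=1000, lr=0.1):
    """Proximal gradient sketch of sparse logistic regression with the
    Bayesian (parameter-free) penalty; y is a 0/1 label vector."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ w))       # logistic predictions
        w = w - lr * X.T @ (p - y) / n         # cross-entropy gradient step
        # effective penalty weight from integrating out the Laplace scale:
        # d / ||w||_1 (floored at 1 as a crude numerical safeguard)
        lam_eff = d / max(np.abs(w).sum(), 1.0)
        # soft-thresholding: proximal step for the effective L1 penalty
        w = np.sign(w) * np.maximum(np.abs(w) - lr * lam_eff / n, 0.0)
    return w
```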


International Joint Conference on Neural Networks | 2006

Leave-One-Out Cross-Validation Based Model Selection Criteria for Weighted LS-SVMs

Gavin C. Cawley

While the model parameters of many kernel learning methods are given by the solution of a convex optimisation problem, the selection of good values for the kernel and regularisation parameters, i.e. model selection, is much less straightforward. This paper describes a simple and efficient approach to model selection for weighted least-squares support vector machines, and compares a variety of model selection criteria based on leave-one-out cross-validation. An external cross-validation procedure is used for performance estimation, with model selection performed independently in each fold to avoid selection bias. The best entry based on these methods was ranked in joint first place in the WCCI-2006 performance prediction challenge, demonstrating the effectiveness of this approach.
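A well-known closed form makes such criteria cheap to evaluate: for an LS-SVM trained by solving the usual linear system, the leave-one-out residual of pattern i can be read off as α_i divided by the i-th diagonal element of the inverse system matrix, so every candidate parameter setting is scored at the cost of one matrix inversion. Below is a minimal sketch using a squared-error (PRESS) criterion; the grid-search fragment and parameter names are illustrative assumptions.

```python
import numpy as np

def lssvm_loo_press(K, y, gamma=1.0):
    """PRESS (sum of squared LOO residuals) for an LS-SVM.

    Solves the standard LS-SVM system R [alpha; b] = [y; 0] once; the LOO
    residual for pattern i is then alpha_i / (R^{-1})_{ii}.
    """
    l = K.shape[0]
    R = np.zeros((l + 1, l + 1))
    R[:l, :l] = K + np.eye(l) / gamma   # kernel block plus ridge term
    R[:l, l] = R[l, :l] = 1.0           # bias (equality-constraint) terms
    Rinv = np.linalg.inv(R)
    alpha = (Rinv @ np.append(y, 0.0))[:l]
    loo_residuals = alpha / np.diag(Rinv)[:l]
    return (loo_residuals ** 2).sum()

# illustrative model selection: pick gamma minimising PRESS for a fixed kernel
# best_gamma = min((2.0 ** k for k in range(-5, 6)),
#                  key=lambda g: lssvm_loo_press(K, y, g))
```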


Atmospheric Environment | 2003

A rigorous inter-comparison of ground-level ozone predictions

Uwe Schlink; Stephen Dorling; Emil Pelikán; Giuseppe Nunnari; Gavin C. Cawley; Heikki Junninen; Alison J. Greig; Rob Foxall; Kryštof Eben; Tim Chatterton; Jiri Vondracek; Matthias Richter; Michal Dostál; L. Bertucco; Mikko Kolehmainen; Martin Doyle

Novel statistical approaches to prediction have recently been shown to perform well in several scientific fields but have not, until now, been comprehensively evaluated for predicting air pollution. In this paper we report on a model inter-comparison exercise in which 15 different statistical techniques for ozone forecasting were applied to ten data sets representing different meteorological and emission conditions throughout Europe. We also attempt to compare the performance of the statistical techniques with a deterministic chemical trajectory model. The exercise further includes comparisons across sites, performance indices, forecasting horizons, etc. The comparative evaluation of forecasting performance (benchmarking) produced 1340 yearly time series of daily predictions, and the results are described in terms of predefined performance indices. By analysing associations between the performance indices, we found that the success index is of outstanding significance. For models that are excellent in predicting threshold exceedances and have a high success index, we also observe high performance in the overall goodness of fit. Forecasts of the 8-h average ozone concentration were found to be more accurate than forecasts of the 1-h mean, which makes the former particularly significant for operational forecasting. The best forecasts were achieved for sites located in rural and suburban areas in Central Europe unaffected by extreme emissions (e.g. from industry). Our results demonstrate that a particular technique is often excellent in some respects but poor in others. For most situations, we recommend neural network and generalised additive models as the best compromise, as these can handle nonlinear associations and can be easily adapted to site-specific conditions. In contrast, nonlinear modelling of the dynamical development of univariate ozone time series was not profitable.
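Exceedance-oriented indices such as the success index are computed from a 2×2 contingency table of forecast versus observed threshold exceedances. The paper's exact definition is not reproduced here; the sketch below uses the hit rate plus the correct-rejection rate minus one (the true skill statistic), which is one common formulation and should be read as an assumption.

```python
import numpy as np

def exceedance_scores(obs, pred, threshold):
    """Hit rate, correct-rejection rate and a TSS-style success index."""
    o = obs > threshold                     # observed exceedances
    p = pred > threshold                    # forecast exceedances
    hits = np.sum(o & p)
    misses = np.sum(o & ~p)
    false_alarms = np.sum(~o & p)
    correct_neg = np.sum(~o & ~p)
    hit_rate = hits / max(hits + misses, 1)
    cr_rate = correct_neg / max(correct_neg + false_alarms, 1)
    return {"hit_rate": hit_rate,
            "correct_rejection_rate": cr_rate,
            "success_index": hit_rate + cr_rate - 1.0}
```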


Environmental Modelling and Software | 2006

Statistical models to assess the health effects and to forecast ground-level ozone

Uwe Schlink; Olf Herbarth; Matthias Richter; Stephen Dorling; Giuseppe Nunnari; Gavin C. Cawley; Emil Pelikán

By means of statistical approaches we attempt to bridge both aspects of the ground-level ozone problem: assessment of health effects, and forecasting and warning. Disagreement has been highlighted in the literature recently regarding the adverse health effects of tropospheric ozone pollution. Based on a panel study of children in Leipzig, we identified a non-linear (quadratic) concentration-response relationship between ozone and respiratory symptoms. Our results indicate that using ozone as a linear covariate might be a misspecification of the model, which might explain the non-uniform results of several field studies on the health effects of ozone. We conclude that there is an urgent demand for forecasts of high-ozone episodes, which may help susceptible persons to avoid high exposure. Novel approaches to statistical modelling and data mining are helpful tools in operational smog forecasting. We present a rigorous assessment of the performance of 15 different statistical techniques in an inter-comparison study based on data sets from 10 European regions. To evaluate the results of the inter-comparison exercise we suggest an integrated assessment procedure, which takes the unbalanced study design into consideration. This procedure is based on estimating a statistical model for the performance indices depending on predefined factors, such as site, forecasting technique, forecasting horizon, etc. We find that the best predictions can be achieved for sites located in rural and suburban areas in Central Europe. For application in operational air pollution forecasting we recommend neural network and generalised additive models, which can handle non-linear associations between atmospheric variables. As an example we demonstrate the application of a Generalised Additive Model (GAM). GAMs are based on smoothing splines for the covariates, i.e., meteorological parameters and concentrations of other pollutants. Finally, it transpired that respiratory symptoms are associated with the daily maximum of the 8-h average ozone concentration, which in turn is best predicted by means of non-linear statistical models. The new air quality directive of the European Commission (Directive 2002/3/EC) accounts for the special relevance of the 8-h mean ozone concentration.
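As a concrete illustration of the GAM structure described above (one smoothing spline per covariate in an additive predictor), here is a minimal sketch using the pygam library; the library choice, covariate names and synthetic data are illustrative assumptions, not the study's actual model specification.

```python
import numpy as np
from pygam import LinearGAM, s  # pip install pygam

rng = np.random.default_rng(0)
# columns stand in for temperature, wind speed and NO2 (illustrative covariates)
X = rng.normal(size=(500, 3))
ozone = 40 + 5 * X[:, 0] - 3 * X[:, 1] + rng.normal(size=500)  # synthetic target

# one smoothing spline per covariate, i.e. the additive model y = sum_j f_j(x_j)
gam = LinearGAM(s(0) + s(1) + s(2)).fit(X, ozone)
gam.summary()
```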


Environmental Modelling and Software | 2004

Modelling SO2 concentration at a point with statistical approaches

Giuseppe Nunnari; Stephen Dorling; Uwe Schlink; Gavin C. Cawley; Robert J. Foxall; T. Chatterton

In this paper, we report the results obtained by inter-comparing several statistical techniques for modelling SO2 concentration at a point, including neural networks, fuzzy logic, generalised additive techniques and other recently proposed statistical approaches. The results of the inter-comparison are the fruit of collaboration between some of the partners of the APPETISE project, funded under the Framework V Information Societies and Technologies (IST) programme. Two case-study areas were selected: the Siracusa industrial area in Italy, where pollution is dominated by industrial emissions, and the Belfast urban area in the UK, where domestic heating makes an important contribution. The different kinds of pollution (industrial/urban) and the different locations of the areas considered make the results more general and interesting. In order to make the inter-comparison more objective, all the modellers considered the same datasets. Missing data in the original time series were filled in using appropriate techniques. The inter-comparison work was carried out on a rigorous basis according to the performance indices recommended by the European Topic Centre on Air and Climate Change (ETC/ACC). The targets for the implemented prediction models were defined according to the EC normative relating to limit values for sulphur dioxide. According to this normative, three different kinds of targets were considered, namely daily mean values, daily maximum values and hourly mean values. The inter-compared models were tested on real cases of poor air quality. In the paper, the inter-compared techniques are ranked in terms of their capability to predict critical episodes; a ranking in terms of their predictability of the three different targets considered is also proposed. Several key issues are illustrated and discussed, such as the role of input variable selection, the use of meteorological data, and the use of interpolated time series. Moreover, a novel technique, balancing the training pattern set, is introduced and successfully applied to improve the capability of ANN models to predict exceedances (see the sketch below). The results show that there is no single modelling approach that generates optimum results in terms of the full range of performance indices considered. In view of the implementation of a warning system for air quality control, approaches that work better in the prediction of critical episodes must be preferred; the artificial neural network prediction models can therefore be recommended for this purpose. The best forecasts were achieved for daily averages of SO2, while daily maximum and hourly mean values are difficult to predict with acceptable accuracy.
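The balancing technique mentioned above can be illustrated generically: rare exceedance patterns are over-sampled so the network sees both classes equally often during training. The following sketch is a generic oversampling implementation under that reading, not the paper's specific procedure.

```python
import numpy as np

def balance_patterns(X, y, threshold, rng=None):
    """Over-sample exceedance patterns until both classes are equally frequent.

    Assumes exceedances (y > threshold) are the minority class.
    """
    rng = rng or np.random.default_rng(0)
    exceed = np.flatnonzero(y > threshold)
    normal = np.flatnonzero(y <= threshold)
    # draw extra minority-class patterns with replacement up to the majority size
    extra = rng.choice(exceed, size=max(len(normal) - len(exceed), 0), replace=True)
    idx = np.concatenate([normal, exceed, extra])
    rng.shuffle(idx)
    return X[idx], y[idx]
```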


Conference on Image and Video Retrieval | 2002

Non-retrieval: Blocking Pornographic Images

Alison Bosson; Gavin C. Cawley; Yi Chan; Richard W. Harvey

We extend earlier work on detecting pornographic images. Our focus is on the classification stage, and we give new results for a variety of classical and modern classifiers. We find that the artificial neural network offers a statistically significant improvement. In all cases the error rate is too high for standalone use unless the system is deployed sensitively, so we show how such a system may be built into a commercial environment.


Neurocomputing | 2002

Improved sparse least-squares support vector machines

Gavin C. Cawley; Nicola L. C. Talbot

Suykens et al. (Neurocomputing (2002), in press) describe a weighted least-squares formulation of the support vector machine for regression problems and present a simple algorithm for sparse approximation of the typically fully dense kernel expansions obtained using this method. In this paper, we present an improved method for achieving sparsity in least-squares support vector machines, which takes into account the residuals for all training patterns, rather than only those incorporated in the sparse kernel expansion. The superiority of this algorithm is demonstrated on the motorcycle and Boston housing data sets.
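The improvement described above can be sketched as a regularised least-squares re-fit of a reduced kernel expansion, scored against the residuals of every training pattern rather than only the retained ones. The basis-selection step is omitted, and the function below is a hedged illustration under that reading, not the paper's exact algorithm.

```python
import numpy as np

def sparse_lssvm_refit(K_full, y, basis_idx, lam=1e-3):
    """Fit expansion coefficients over `basis_idx`, scoring ALL patterns.

    Minimises ||y - K[:, basis] c||^2 + lam * c' K[basis, basis] c, so the
    residuals of every training pattern drive the fit, not just those of
    the patterns retained in the sparse expansion.
    """
    Kb = K_full[:, basis_idx]                        # l x m rectangular block
    A = Kb.T @ Kb + lam * K_full[np.ix_(basis_idx, basis_idx)]
    coef = np.linalg.solve(A, Kb.T @ y)
    residuals = y - Kb @ coef                        # residuals for every pattern
    return coef, (residuals ** 2).mean()
```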


Journal of Immunological Methods | 2011

Machine learning competition in immunology – Prediction of HLA class I binding peptides

Guang Lan Zhang; Hifzur Rahman Ansari; Phil Bradley; Gavin C. Cawley; Tomer Hertz; Xihao Hu; Nebojsa Jojic; Yohan Kim; Oliver Kohlbacher; Ole Lund; Claus Lundegaard; Craig A. Magaret; Morten Nielsen; Harris Papadopoulos; Gajendra P. S. Raghava; Vider-Shalit Tal; Li C. Xue; Chen Yanover; Shanfeng Zhu; Michael T. Rock; James E. Crowe; Christos G. Panayiotou; Marios M. Polycarpou; Włodzisław Duch; Vladimir Brusic

Experimental studies of the immune system and related applications such as characterization of immune responses against pathogens, vaccine design, or optimization of therapies are combinatorially complex, time-consuming and expensive. The main methods for large-scale identification of T-cell epitopes from pathogen or cancer proteomes involve either reverse immunology or high-throughput mass spectrometry (HTMS). Reverse immunology approaches involve pre-screening of proteomes by computational algorithms, followed by experimental validation of selected targets (Mora et al., 2006; De Groot et al., 2008; Larsen et al., 2010). HTMS involves HLA typing, immunoaffinity chromatography of HLA molecules, HLA extraction, and chromatography combined with tandem mass spectrometry, followed by the application of computational algorithms for peptide characterization (Bassani-Sternberg et al., 2010). Hundreds of naturally processed HLA class I associated peptides have been identified in individual studies using HTMS in normal (Escobar et al., 2008), cancer (Antwi et al., 2009; Bassani-Sternberg et al., 2010), autoimmunity-related (Ben Dror et al., 2010), and infected samples (Wahl et al., 2010). Computational algorithms are essential steps in high-throughput identification of T-cell epitope candidates using both reverse immunology and HTMS approaches. Peptide binding to MHC molecules is the single most selective step in defining a T-cell epitope, and the accuracy of computational algorithms for prediction of peptide binding therefore determines the accuracy of the overall method. Computational predictions of peptide binding to HLA, both class I and class II, use a variety of algorithms ranging from binding motifs to advanced machine learning techniques (Brusic et al., 2004; Lafuente and Reche, 2009) and standards for their …
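At the simple end of the algorithmic spectrum mentioned above are binding-motif methods, which score a peptide by summing per-position log-odds values from a position weight matrix. The toy sketch below uses a random placeholder matrix and an invented 9-mer purely for shape; real predictors are trained on measured HLA binding data.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
rng = np.random.default_rng(0)
pwm = rng.normal(size=(9, len(AMINO_ACIDS)))  # placeholder 9-position motif

def score_peptide(peptide, pwm):
    """Sum of per-position log-odds scores for a 9-mer peptide."""
    return sum(pwm[i, AMINO_ACIDS.index(aa)] for i, aa in enumerate(peptide))

print(score_peptide("SIINFEKLV", pwm))  # invented 9-mer, for illustration only
```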

Collaboration


Dive into Gavin C. Cawley's collaboration.

Top Co-Authors

Stephen Dorling

University of East Anglia

Isabelle Guyon

University of California