
Publication


Featured research published by Robert J. May.


Environmental Modelling and Software | 2008

Non-linear variable selection for artificial neural networks using partial mutual information

Robert J. May; Holger R. Maier; Graeme C. Dandy; T.M.K. Gayani Fernando

Artificial neural networks (ANNs) have been widely used to model environmental processes. The ability of ANN models to accurately represent the complex, non-linear behaviour of relatively poorly understood processes makes them highly suited to this task. However, the selection of an appropriate set of input variables during ANN development is important for obtaining high-quality models. This can be a difficult task when considering that many input variable selection (IVS) techniques fail to perform adequately due to an underlying assumption of linearity, or due to redundancy within the available data. This paper focuses on a recently proposed IVS algorithm, based on estimation of partial mutual information (PMI), which can overcome both of these issues and is considered highly suited to the development of ANN models. In particular, this paper addresses the computational efficiency and accuracy of the algorithm via the formulation and evaluation of alternative techniques for determining the significance of PMI values estimated during selection. Furthermore, this paper presents a rigorous assessment of the PMI-based algorithm and clearly demonstrates the superior performance of this non-linear IVS technique in comparison to linear correlation-based techniques.
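The PMI-based selection loop described above can be sketched as a greedy forward search: at each step, regress the output and each remaining candidate on the already-selected inputs, and pick the candidate whose residual shares the most information with the output residual. The sketch below is a minimal illustration under stated assumptions, not the authors' implementation: it substitutes a simple histogram MI estimator and Nadaraya-Watson kernel regression for the paper's kernel density machinery, and all function names, bandwidths, and bin counts are illustrative.

```python
import numpy as np

def mi_binned(x, y, bins=16):
    """Histogram-based estimate of mutual information I(x; y) in nats."""
    pxy, _, _ = np.histogram2d(x, y, bins=bins)
    pxy /= pxy.sum()
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    nz = pxy > 0
    return float((pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])).sum())

def kernel_regression(Z, t, bandwidth=0.5):
    """Nadaraya-Watson estimate of E[t | Z] with a Gaussian kernel."""
    d2 = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(axis=2)
    w = np.exp(-d2 / (2.0 * bandwidth ** 2))
    return (w @ t) / w.sum(axis=1)

def pmi_select(X, y, k=2):
    """Greedy forward input selection by partial mutual information:
    condition out the already-selected inputs via kernel regression,
    then pick the candidate whose residual is most informative."""
    selected, remaining = [], list(range(X.shape[1]))
    for _ in range(k):
        if selected:
            Z = X[:, selected]
            u = y - kernel_regression(Z, y)              # output residual
            resid = {j: X[:, j] - kernel_regression(Z, X[:, j])
                     for j in remaining}
        else:
            u, resid = y, {j: X[:, j] for j in remaining}
        best = max(remaining, key=lambda j: mi_binned(resid[j], u))
        selected.append(best)
        remaining.remove(best)
    return selected

# Five candidate inputs; only columns 0 and 2 drive the (non-linear) output.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = np.sin(X[:, 2]) + 0.5 * X[:, 0] ** 2 + 0.1 * rng.normal(size=1000)
print(pmi_select(X, y, k=2))   # expected to identify columns 0 and 2
```

Note that the non-linear terms here (a sine and a square) would defeat a linear correlation-based selector, which is the comparison the paper draws.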


Environmental Modelling and Software | 2008

Application of partial mutual information variable selection to ANN forecasting of water quality in water distribution systems

Robert J. May; Graeme C. Dandy; Holger R. Maier; John B. Nixon

Recent trends in the management of water supply have increased the need for modelling techniques that can provide reliable, efficient, and accurate representation of the complex, non-linear dynamics of water quality within water distribution systems. Statistical models based on artificial neural networks (ANNs) have been found to be highly suited to this application, and offer distinct advantages over more conventional modelling techniques. However, many practitioners utilise somewhat heuristic or ad hoc methods for input variable selection (IVS) during ANN development. This paper describes the application of a newly proposed non-linear IVS algorithm to the development of ANN models to forecast water quality within two water distribution systems. The intention is to reduce the need for arbitrary judgement and extensive trial-and-error during model development. The algorithm utilises the concept of partial mutual information (PMI) to select inputs based on the analysis of relationship strength between inputs and outputs, and between redundant inputs. In comparison with an existing approach, the ANN models developed using the IVS algorithm are found to provide optimal prediction with significantly greater parsimony. Furthermore, the results obtained from the IVS procedure are useful for developing additional insight into the important relationships that exist between water distribution system variables.


Archive | 2011

Review of Input Variable Selection Methods for Artificial Neural Networks

Robert J. May; Graeme C. Dandy; Holger R. Maier

The choice of input variables is a fundamental, and yet crucial consideration in identifying the optimal functional form of statistical models. The task of selecting input variables is common to the development of all statistical models, and is largely dependent on the discovery of relationships within the available data to identify suitable predictors of the model output. In the case of parametric, or semi-parametric empirical models, the difficulty of the input variable selection task is somewhat alleviated by the a priori assumption of the functional form of the model, which is based on some physical interpretation of the underlying system or process being modelled. However, in the case of artificial neural networks (ANNs), and other similarly data-driven statistical modelling approaches, there is no such assumption made regarding the structure of the model. Instead, the input variables are selected from the available data, and the model is developed subsequently. The difficulty of selecting input variables arises due to (i) the number of available variables, which may be very large; (ii) correlations between potential input variables, which creates redundancy; and (iii) variables that have little or no predictive power. Variable subset selection has been a longstanding issue in fields of applied statistics dealing with inference and linear regression (Miller, 1984), and the advent of ANN models has only served to create new challenges in this field. The non-linearity, inherent complexity and non-parametric nature of ANN regression make it difficult to apply many existing analytical variable selection methods. The difficulty of selecting input variables is further exacerbated during ANN development, since the task of selecting inputs is often delegated to the ANN during the learning phase of development. 
A popular notion is that an ANN is adequately capable of identifying redundant and noise variables during training, and that the trained network will use only the salient input variables. ANN architectures can be built with arbitrary flexibility and can be successfully trained using any combination of input variables (assuming they are good predictors). Consequently, allowances are often made for a large number of input variables, with the belief that the ability to incorporate such flexibility and redundancy creates a more robust model. Such pragmatism is perhaps symptomatic of the popularisation of ANN models through machine learning, rather than statistical learning theory. ANN models are too often developed without due consideration given to the effect that the choice of input variables has on model complexity, learning difficulty, and performance of the subsequently trained ANN.


Neural Networks | 2010

Data splitting for artificial neural networks using SOM-based stratified sampling

Robert J. May; Holger R. Maier; Graeme C. Dandy

Data splitting is an important consideration during artificial neural network (ANN) development where hold-out cross-validation is commonly employed to ensure generalization. Even for a moderate sample size, the sampling methodology used for data splitting can have a significant effect on the quality of the subsets used for training, testing and validating an ANN. Poor data splitting can result in inaccurate and highly variable model performance; however, the choice of sampling methodology is rarely given due consideration by ANN modellers. Increased confidence in the sampling is of paramount importance, since the hold-out sampling is generally performed only once during ANN development. This paper considers the variability in the quality of subsets that are obtained using different data splitting approaches. A novel approach to stratified sampling, based on Neyman sampling of the self-organizing map (SOM), is developed, with several guidelines identified for setting the SOM size and sample allocation in order to minimize the bias and variance in the datasets. Using an example ANN function approximation task, the SOM-based approach is evaluated in comparison to random sampling, DUPLEX, systematic stratified sampling, and trial-and-error sampling to minimize the statistical differences between data sets. Of these approaches, DUPLEX is found to provide benchmark performance with good model performance, with no variability. The results show that the SOM-based approach also reliably generates high-quality samples and can therefore be used with greater confidence than other approaches, especially in the case of non-uniform datasets, with the benefit of scalability to perform data splitting on large datasets.
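The Neyman allocation step at the heart of the sampling scheme above can be sketched in a few lines: each stratum receives a share of the sample proportional to its size times its within-stratum standard deviation, so variable regions of the data space are sampled more heavily. This sketch omits the SOM itself; the precomputed stratum labels stand in for SOM best-matching units, and the function names are illustrative assumptions.

```python
import numpy as np

def neyman_allocation(labels, values, n_total):
    """Neyman allocation: draw n_h ∝ N_h * S_h samples from stratum h,
    where N_h is the stratum size and S_h its standard deviation."""
    strata = np.unique(labels)
    N = np.array([np.sum(labels == h) for h in strata], dtype=float)
    S = np.array([np.std(values[labels == h]) for h in strata])
    w = N * S
    n_h = np.rint(n_total * w / w.sum()).astype(int)
    return dict(zip(strata.tolist(), n_h.tolist()))

def stratified_sample(labels, allocation, rng):
    """Draw record indices stratum by stratum according to the allocation."""
    idx = []
    for h, n_h in allocation.items():
        pool = np.flatnonzero(labels == h)
        idx.extend(rng.choice(pool, size=n_h, replace=False))
    return np.array(idx)

# Two synthetic strata of equal size; stratum 1 is three times as variable,
# so Neyman allocation gives it three times as many samples.
labels = np.repeat([0, 1], 100)
values = np.concatenate([np.tile([-1.0, 1.0], 50), np.tile([-3.0, 3.0], 50)])
alloc = neyman_allocation(labels, values, n_total=40)
print(alloc)   # {0: 10, 1: 30}
```

Applying the same allocation to training, testing, and validation draws is what keeps the three subsets statistically similar, which is the property the paper evaluates.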


International Joint Conference on Neural Networks | 2006

Critical Values of a Kernel Density-based Mutual Information Estimator

Robert J. May; Graeme C. Dandy; Holger R. Maier; T.M.K.G. Fernando

Recently, mutual information (MI) has become widely recognized as a statistical measure of dependence that is suitable for applications where data are non-Gaussian, or where the dependency between variables is non-linear. However, a significant disadvantage of this measure is the inability to define an analytical expression for the distribution of MI estimators, which are based upon a finite dataset. This paper deals specifically with a popular kernel density based estimator, for which the distribution is determined empirically using Monte Carlo simulation. The application of the critical values of MI derived from this distribution to a test for independence is demonstrated within the context of a benchmark input variable selection problem.
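The Monte Carlo construction of critical values can be sketched directly: permute one variable to destroy any dependence, re-estimate MI many times to trace out the null distribution of the estimator, and take an upper quantile as the critical value for the independence test. This is a minimal sketch under stated assumptions, not the paper's procedure: it swaps in a simple histogram MI estimator for the kernel density estimator studied in the paper, and the sample sizes and bin count are illustrative.

```python
import numpy as np

def mi_binned(x, y, bins=12):
    """Histogram MI estimate in nats (a stand-in for the kernel
    density-based estimator analysed in the paper)."""
    pxy, _, _ = np.histogram2d(x, y, bins=bins)
    pxy /= pxy.sum()
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    nz = pxy > 0
    return float((pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])).sum())

def mi_critical_value(x, y, alpha=0.05, n_perm=500, seed=0):
    """(1 - alpha) critical value of the MI estimate under the null of
    independence, built by Monte Carlo: permuting y breaks any dependence
    while preserving both marginal distributions."""
    rng = np.random.default_rng(seed)
    null = [mi_binned(x, rng.permutation(y)) for _ in range(n_perm)]
    return float(np.quantile(null, 1.0 - alpha))

# A clearly dependent (but non-linear, non-monotonic) pair.
rng = np.random.default_rng(1)
x = rng.normal(size=500)
y = np.sin(x) + 0.3 * rng.normal(size=500)
crit = mi_critical_value(x, y)
print(mi_binned(x, y) > crit)   # True: the dependence is detected
```

The point of the construction is that finite-sample MI estimates are positively biased even for independent data, so comparing against zero is meaningless; the permutation null supplies the correct reference distribution.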


World Water and Environmental Resources Congress 2004 | 2004

General Regression Neural Networks for Modeling Disinfection Residual in Water Distribution Systems

Robert J. May; Holger R. Maier; Graeme C. Dandy; John B. Nixon

Water treatment plant (WTP) operators set disinfectant levels such that a balance is maintained between achieving adequate disinfection and minimising the undesirable effects of excessive disinfection residuals. Control systems for the optimal maintenance of disinfection residuals are based upon a model that attempts to describe the non-linear dynamics of the water distribution system (WDS). A system identification approach, based on artificial neural networks (ANNs), offers an expedient methodology for the development of control-oriented models. An advantage of ANNs is their ability to describe non-linear systems with greater accuracy than linear empirical models that are traditionally used for system identification. In this paper, the parallel development of a general regression neural network (GRNN) model and an autoregressive model with exogenous inputs (ARX) is described for the Myponga WDS in South Australia. The results indicate the superiority of the GRNN model and support further investigation of WDS control systems that incorporate ANN identification models.
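The GRNN at the centre of this comparison is structurally simple: it is equivalent to Nadaraya-Watson kernel regression with a single smoothing width, so it memorises the training patterns and predicts by kernel-weighted averaging. The sketch below shows that idea on a toy function; it is an illustration under stated assumptions, not the paper's model, and the class name, sigma value, and test function are all illustrative.

```python
import numpy as np

class GRNN:
    """General regression neural network (Specht, 1991): a kernel-weighted
    average of stored targets, with one smoothing parameter sigma."""
    def __init__(self, sigma=0.3):
        self.sigma = sigma

    def fit(self, X, y):
        # A GRNN has no iterative training phase: it stores the patterns.
        self.X = np.asarray(X, dtype=float)
        self.y = np.asarray(y, dtype=float)
        return self

    def predict(self, Xq):
        Xq = np.asarray(Xq, dtype=float)
        d2 = ((Xq[:, None, :] - self.X[None, :, :]) ** 2).sum(axis=2)
        w = np.exp(-d2 / (2.0 * self.sigma ** 2))
        return (w @ self.y) / w.sum(axis=1)

# Fit a noisy sine curve and check predictions on a grid inside the data range.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=300)
model = GRNN(sigma=0.3).fit(X, y)
grid = np.linspace(-2, 2, 9)[:, None]
err = np.max(np.abs(model.predict(grid) - np.sin(grid[:, 0])))
print(round(err, 3))
```

The one-shot, non-iterative fit is what makes the GRNN an "expedient" choice for control-oriented system identification, at the cost of keeping all training patterns in memory at prediction time.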


Water Resources Research | 2013

A benchmarking approach for comparing data splitting methods for modeling water resources parameters using artificial neural networks

Wenyan Wu; Robert J. May; Holger R. Maier; Graeme C. Dandy


Archive | 2005

Efficient selection of inputs for artificial neural network models

T.M.K.G. Fernando; Holger R. Maier; Graeme C. Dandy; Robert J. May


Archive | 2012

A method for comparing data splitting approaches for developing hydrological ANN models

Wenyan Wu; Robert J. May; Graeme C. Dandy; Holger R. Maier


Archive | 2009

Developing artificial neural networks for water quality modelling and analysis

Robert J. May; Holger R. Maier; Graeme C. Dandy

Collaboration


Dive into Robert J. May's collaborations.

Top Co-Authors

Wenyan Wu

Staffordshire University


Christopher W.K. Chow

University of South Australia
