Nicole H. Augustin
University of Bath
Publications
Featured research published by Nicole H. Augustin.
Biometrics | 1997
Stephen T. Buckland; K. P. Burnham; Nicole H. Augustin
We argue that model selection uncertainty should be fully incorporated into statistical inference whenever estimation is sensitive to model choice and that choice is made with reference to the data. We consider different philosophies for achieving this goal and suggest strategies for data analysis. We illustrate our methods through three examples. The first is a Poisson regression of bird counts in which a choice is to be made between inclusion of one or both of two covariates. The second is a line transect data set for which different models yield substantially different estimates of abundance. The third is a simulated example in which truth is known.
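As a rough illustration of how model selection uncertainty can be folded into inference, the sketch below averages over candidate Poisson regression models using Akaike weights proportional to exp(-AIC/2); the simulated data and covariate names (habitat, rainfall) are assumptions for illustration only, not the paper's data.

```r
# Minimal sketch of Akaike-weight model averaging in the spirit of the first
# example above (Poisson regression of bird counts, choice between one or
# both of two covariates). Data and names are simulated and illustrative.
set.seed(1)
n <- 200
habitat  <- runif(n)
rainfall <- runif(n)
counts   <- rpois(n, exp(0.5 + 1.2 * habitat))   # truth uses habitat only

# Candidate models: one or both covariates
m1 <- glm(counts ~ habitat,            family = poisson)
m2 <- glm(counts ~ rainfall,           family = poisson)
m3 <- glm(counts ~ habitat + rainfall, family = poisson)

# Akaike weights: w_i proportional to exp(-AIC_i / 2), rescaled to sum to 1
aic <- c(AIC(m1), AIC(m2), AIC(m3))
w   <- exp(-(aic - min(aic)) / 2)
w   <- w / sum(w)
round(w, 3)

# Model-averaged fitted mean count for each observation
mu_avg <- w[1] * fitted(m1) + w[2] * fitted(m2) + w[3] * fitted(m3)
head(mu_avg)
```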
Ecological Modelling | 2002
Simon N. Wood; Nicole H. Augustin
Generalized Additive Models (GAMs) have been popularized by the work of Hastie and Tibshirani (1990) and the availability of user-friendly GAM software in Splus. However, whilst it is flexible and efficient, the GAM framework based on backfitting with linear smoothers presents some difficulties when it comes to model selection and …
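The alternative framework pursued in this line of work replaces backfitting by penalized regression splines, with smoothness chosen by a criterion such as GCV. A minimal sketch using the R package mgcv (a later implementation of this approach) and its built-in gamSim simulator:

```r
# GAM fitted by penalized regression splines with GCV-based smoothness
# selection, rather than backfitting. Simulated example data from mgcv.
library(mgcv)
set.seed(2)
dat <- gamSim(1, n = 400, dist = "poisson", scale = 0.25)
fit <- gam(y ~ s(x0) + s(x1) + s(x2) + s(x3),
           family = poisson, data = dat, method = "GCV.Cp")
summary(fit)          # effective degrees of freedom chosen by GCV
plot(fit, pages = 1)  # estimated smooth terms
```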
Cancer Research | 2006
Olga Vasiljeva; Anna Papazoglou; Achim Krüger; Harald Brodoefel; Matvey Korovin; Jan M. Deussing; Nicole H. Augustin; Boye Schnack Nielsen; Kasper Almholt; Matthew Bogyo; Christoph Peters; Thomas Reinheckel
Proteolysis in close vicinity of tumor cells is a hallmark of cancer invasion and metastasis. We show here that mouse mammary tumor virus-polyoma middle T antigen (PyMT) transgenic mice deficient for the cysteine protease cathepsin B (CTSB) exhibited a significantly delayed onset and reduced growth rate of mammary cancers compared with wild-type PyMT mice. Lung metastasis volumes were significantly reduced in PyMT;ctsb(+/-) mice, an effect that was not further enhanced in PyMT;ctsb(-/-) mice. Furthermore, lung colonization studies of PyMT cells with different CTSB genotypes injected into congenic wild-type mice and in vitro Matrigel invasion assays confirmed a specific role for tumor-derived CTSB in invasion and metastasis. Interestingly, cell surface labeling of cysteine cathepsins by the active site probe DCG-04 detected up-regulation of cathepsin X on PyMT;ctsb(-/-) cells. Treatment of cells with a neutralizing anti-cathepsin X antibody significantly reduced Matrigel invasion of PyMT;ctsb(-/-) cells but did not affect invasion of PyMT;ctsb(+/+) or PyMT;ctsb(+/-) cells, indicating a compensatory function of cathepsin X in CTSB-deficient tumor cells. Finally, an adoptive transfer model, in which ctsb(+/+), ctsb(+/-), and ctsb(-/-) recipient mice were challenged with PyMT;ctsb(+/+) cells, was used to address the role of stroma-derived CTSB in lung metastasis formation. Notably, ctsb(-/-) mice showed a reduced number and volume of lung colonies, and infiltrating macrophages showed strongly up-regulated expression of CTSB within metastatic cell populations. These results indicate that both cancer cell-derived and stroma cell-derived (i.e., macrophages) CTSB play an important role in tumor progression and metastasis.
Journal of the American Statistical Association | 2009
Nicole H. Augustin; Monica Musio; Klaus von Wilpert; Edgar Kublin; Simon N. Wood; Martin Schumacher
Forest health monitoring schemes were set up across Europe in the 1980s in response to concerns about air pollution-related forest dieback (Waldsterben) and have continued since then. Recent threats to forest health are climatic extremes likely due to global climate change and increased ground ozone levels and nitrogen deposition. We model yearly data on tree crown defoliation, an indicator of tree health, from a monitoring survey carried out in Baden-Württemberg, Germany since 1983. On a changing irregular grid, defoliation and other site-specific variables are recorded. In Baden-Württemberg, the temporal trend of defoliation differs among areas because of site characteristics and pollution levels, making it necessary to allow for space–time interaction in the model. For this purpose, we propose using generalized additive mixed models (GAMMs) incorporating scale-invariant tensor product smooths of the space–time dimensions. The space–time smoother allows separate smoothing parameters and penalties for the space and time dimensions and thus avoids the need to make arbitrary or ad hoc choices about the relative scaling of space and time. The approach of using a space–time smoother has intuitive appeal, making it easy to explain and interpret when communicating the results to nonstatisticians, such as environmental policy makers. The model incorporates a nonlinear effect for mean tree age, the most important predictor, allowing the separation of trends in time, which may be pollution-related, from trends that relate purely to the aging of the survey population. In addition to a temporal trend due to site characteristics and other conditions modeled with the space–time smooth, we account for random temporal correlation at site level by an autoregressive moving average (ARMA) process. Model selection is carried out using the Bayes information criterion (BIC), and the adequacy of the assumed spatial and temporal error structure is investigated with the empirical semivariogram and the empirical autocorrelation function.
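A minimal sketch of a model of this general form, using mgcv's gamm(): a tensor product smooth of space and time with separate marginal penalties, a smooth age effect, and autoregressive errors at site level. The simulated data, the variable names and the AR(1) term (standing in for the more general ARMA structure) are illustrative assumptions, not the paper's data or exact specification.

```r
library(mgcv)   # gamm() uses nlme for the correlation structure
set.seed(3)
n_site <- 40; n_year <- 20
xy  <- cbind(x = runif(n_site), y = runif(n_site))        # site coordinates
dat <- expand.grid(site = 1:n_site, year = 1:n_year)
dat$x <- xy[dat$site, "x"]; dat$y <- xy[dat$site, "y"]
dat$site <- factor(dat$site)
dat$age  <- 30 + dat$year + rnorm(nrow(dat), 0, 3)
dat <- dat[order(dat$site, dat$year), ]
ar1 <- unlist(lapply(1:n_site, function(i)                # AR(1) noise within site
  as.numeric(arima.sim(list(ar = 0.6), n = n_year, sd = 0.05))))
dat$defol <- 0.2 + 0.1 * dat$x * dat$year / n_year + 0.003 * dat$age + ar1

# Scale-invariant tensor product smooth of space and time (separate penalties
# for the 2-d space margin and the 1-d time margin), a smooth age effect, and
# an AR(1) error process at site level (corARMA(p, q) gives the general case).
fit <- gamm(defol ~ te(x, y, year, d = c(2, 1), k = c(15, 5)) + s(age),
            data = dat,
            correlation = nlme::corAR1(form = ~ year | site))
summary(fit$gam)   # smooth terms
summary(fit$lme)   # estimated autocorrelation parameter
```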
Statistical Modelling | 2005
Nicole H. Augustin; Willi Sauerbrei; Martin Schumacher
Predictions of disease outcome in prognostic factor models are usually based on one selected model. However, often several models fit the data equally well, but these models might differ substantially in terms of included explanatory variables and might lead to different predictions for individual patients. For survival data, we discuss two approaches to account for model selection uncertainty in two data examples, with the main emphasis on variable selection in a proportional hazard Cox model. The main aim of our investigation is to establish the ways in which either of the two approaches is useful in such prognostic models. The first approach is Bayesian model averaging (BMA) adapted for the proportional hazard model, termed ‘approx. BMA’ here. As a new approach, we propose a method which averages over a set of possible models using weights estimated from bootstrap resampling as proposed by Buckland et al., but in addition, we perform an initial screening of variables based on the inclusion frequency of each variable to reduce the set of variables and corresponding models. For some necessary parameters of the procedure, investigations concerning sensible choices are still required. The main objective of prognostic models is prediction, but the interpretation of single effects is also important and models should be general enough to ensure transportability to other clinical centres. In the data examples, we compare predictions of our new approach with approx. BMA, with ‘conventional’ predictions from one selected model and with predictions from the full model. Confidence intervals are compared in one example. Comparisons are based on the partial predictive score and the Brier score. We conclude that the two model averaging methods yield similar results and are especially useful when there is a high number of potential prognostic factors, most likely some of them without influence in a multivariable context. Although the method based on bootstrap resampling lacks formal justification and requires some ad hoc decisions, it has the additional positive effect of achieving model parsimony by reducing the number of explanatory variables and dealing with correlated variables in an automatic fashion.
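The bootstrap ingredient of the new approach can be sketched roughly as follows: backward selection of a Cox model is repeated on bootstrap resamples, the inclusion frequency of each candidate variable is recorded, and rarely selected variables are screened out before averaging. The simulated survival data, the AIC-based step() selection and the 30% screening threshold below are illustrative assumptions rather than the paper's exact procedure.

```r
library(survival)
set.seed(4)
n <- 300
X  <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n),
                 x4 = rnorm(n), x5 = rnorm(n))
lp   <- 0.8 * X$x1 + 0.5 * X$x2                  # only x1 and x2 are truly prognostic
ev   <- rexp(n, rate = exp(lp))                  # event times
cn   <- rexp(n, rate = 0.2)                      # censoring times
dat  <- data.frame(time = pmin(ev, cn), status = as.integer(ev <= cn), X)
vars <- paste0("x", 1:5)
form <- as.formula(paste("Surv(time, status) ~", paste(vars, collapse = " + ")))

B <- 100
incl <- matrix(0, B, length(vars), dimnames = list(NULL, vars))
for (b in 1:B) {
  boot <- dat[sample(n, replace = TRUE), ]
  sel  <- step(coxph(form, data = boot), trace = 0)   # AIC-based selection
  incl[b, ] <- as.integer(vars %in% attr(terms(sel), "term.labels"))
}
colMeans(incl)                         # bootstrap inclusion frequency per variable
keep <- vars[colMeans(incl) >= 0.3]    # initial screening; threshold illustrative
keep
```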
Journal of Agricultural Biological and Environmental Statistics | 2006
Nicole H. Augustin; James W. McNicol; Carol A. Marriott
With vegetation data there are often physical reasons for believing that the response of neighbors has a direct influence on the response at a particular location. In terms of modeling such scenarios the family of auto-models or Markov random fields is a useful choice. If the observed responses are counts, the auto-Poisson model can be used. There are different ways to formulate the auto-Poisson model, depending on the biological context. A drawback of this model is that for positive autocorrelation the likelihood of the auto-Poisson model is not available in closed form. We investigate how this restriction can be avoided by right truncating the distribution. We review different parameter estimation techniques which apply to auto-models in general and compare them in a simulation study. Results suggest that the method which is most easily implemented via standard statistics software, maximum pseudo-likelihood, gives unbiased point estimates, but its variance estimates are biased. An alternative method, Monte Carlo maximum likelihood, works well but is computer-intensive and not available in standard software. We illustrate the methodology and techniques for model checking with clover leaf counts and seed count data from an agricultural experiment.
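Maximum pseudo-likelihood, the most easily implemented of the estimation techniques compared here, can be sketched directly: conditionally on its neighbours, each count is Poisson with log mean linear in the sum of neighbouring counts, so the pseudo-likelihood is maximized by an ordinary Poisson GLM with the neighbour sum as a covariate. The grid size, rook neighbourhood and stand-in data below are illustrative assumptions.

```r
set.seed(5)
nr <- 20; nc <- 20
y  <- matrix(rpois(nr * nc, 3), nr, nc)     # stand-in counts, not a true auto-Poisson draw

neighbour_sum <- function(m) {              # sum of rook neighbours, zero beyond the edge
  pad <- rbind(0, cbind(0, m, 0), 0)
  pad[1:nr, 2:(nc + 1)] + pad[3:(nr + 2), 2:(nc + 1)] +
    pad[2:(nr + 1), 1:nc] + pad[2:(nr + 1), 3:(nc + 2)]
}

dat <- data.frame(y = as.vector(y), nb = as.vector(neighbour_sum(y)))
mpl <- glm(y ~ nb, family = poisson, data = dat)   # pseudo-likelihood point estimates
summary(mpl)   # note the caveat above: MPL variance estimates are biased
```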
Computational Statistics & Data Analysis | 2012
Nicole H. Augustin; Erik-André Sauleau; Simon N. Wood
The distributional assumption for a generalized linear model is often checked by plotting the ordered deviance residuals against the quantiles of a standard normal distribution. Such plots can be difficult to interpret, because even when the model is correct, the plot often deviates substantially from a straight line. To rectify this problem, Ben and Yohai (2004) proposed plotting the deviance residuals against their theoretical quantiles, under the assumption that the model is correct. Such plots are closer to a straight line when the model is correct, making them much more useful for model checking. However, the quantile computation proposed by Ben and Yohai is, in general, relatively complicated to implement and computationally expensive, so that general-purpose software for these plots is only available for the Poisson and binary cases in the R package robust. As an alternative, the theoretical quantiles can efficiently and simply be estimated by repeatedly simulating new response data from the fitted model and computing the corresponding residuals. This method also provides reference bands for judging the significance of departures of QQ-plots from the ideal straight-line form. A second alternative is to estimate the quantiles using quantiles of the response variable distribution according to the estimated model. This latter alternative generally has lower computational cost than the first, but does not yield QQ-plot reference bands. In simulations the quantiles produced by the new methods give results indistinguishable from the original Ben and Yohai quantile computations, but the scaling of computational cost with sample size is much improved, so that a 500-fold reduction in computation time was observed at sample size 50,000. Application of the methods to generalized linear models fitted to prostate cancer incidence data suggests that they are particularly useful in large-dataset cases that might otherwise be incorrectly viewed as zero-inflated. The new approaches are simple enough to implement for any exponential family distribution and for several alternative types of residual, and this has been done for all the families available for use with generalized linear models in the basic distribution of R.
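A hand-rolled version of the simulation-based alternative is straightforward for, say, a Poisson GLM: simulate new responses from the fitted model, recompute the deviance residuals (here the model is refitted to each simulated data set, one of several reasonable variants), and average the ordered residuals to obtain reference quantiles, with pointwise envelopes as reference bands. mgcv's qq.gam() offers a packaged implementation of simulation-based quantiles of this kind.

```r
set.seed(6)
n <- 500
x <- runif(n)
y <- rpois(n, exp(0.5 + 1.5 * x))
fit <- glm(y ~ x, family = poisson)

nrep <- 100
sim_sorted <- replicate(nrep, {
  ysim <- rpois(n, fitted(fit))                 # new responses from the fitted model
  sort(residuals(glm(ysim ~ x, family = poisson), type = "deviance"))
})
theo <- rowMeans(sim_sorted)                    # simulated reference quantiles
band <- apply(sim_sorted, 1, quantile, probs = c(0.025, 0.975))

obs <- sort(residuals(fit, type = "deviance"))
plot(theo, obs, xlab = "simulated theoretical quantiles",
     ylab = "ordered deviance residuals")
abline(0, 1)
matlines(theo, t(band), lty = 2, col = "grey")  # rough pointwise reference band
```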
European Journal of Forest Research | 2007
Monica Musio; Klaus von Wilpert; Nicole H. Augustin
One of the aims of this work is to describe how the target variable "tree vitality", in terms of needle loss, is affected by other explanatory variables. To describe such a relationship in a realistic way, we use generalized additive mixed models (GAMMs), which allow spatial correlation of the data to be taken into account and, in addition, allow explanatory variables to be included as predictors with possibly non-linear effects. The GAMMs are fitted in a Bayesian framework using Markov chain Monte Carlo techniques. Data are available for two years, 1988 and 1994. We select a set of best explanatory variables from a large set of variables including tree-specific variables, such as species, age and nutrients in the needles, and site-specific variables, such as altitude, relief type, soil depth and content of different nutrients in the top soil. In the two models for 1988 and 1994, different sets of explanatory variables were selected as best predictors. In both models, the effects of explanatory variables allowed a plausible interpretation. For example, the site-specific variables relief and soil depth were significant predictors, since these factors determine how well water and nutrient supply is balanced at a specific site. The selected sets of explanatory variables differed between 1988 and 1994, giving an indication of a possible change in the main causes of forest deterioration between those years. From the set of nutrient variables measured in the soil and in the needles, in 1988 altitude a.s.l. and magnesium supply were among the explanatory variables, whereas in 1994 a combination of Al in the soil and the N/K ratio in the needles was selected in the model. In 1988 altitude a.s.l. was among the most important predictors, in contrast to 1994, where altitude was not selected. This may reflect the fact that in the early phase of forest health monitoring (1988) one of the main causes of forest deterioration was magnesium deficiency; later on this may have changed to a combination of soil acidification and nitrogen eutrophication. Thus, by using an adequate model such as the GAMM, sets of explanatory variables for needle loss may be identified. By fitting two GAMMs, with different sets of "best" predictors, at two time points, 1988 and 1994, we can detect changes in these sets of "best" predictors over time. This allows us to use the monitoring data, with the tree vitality indicator crown condition/needle loss, as a tool for forest health management, which may involve decisions about concrete countermeasures such as forest liming.
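As a rough frequentist analogue of the Bayesian GAMMs used here (fitted with mgcv rather than by MCMC), smooth covariate effects can be combined with a spatial surface and shrinkage-based term selection. The variable names, the simulated needle-loss proportions and the quasibinomial specification below are illustrative assumptions only.

```r
library(mgcv)
set.seed(7)
n <- 400
dat <- data.frame(east = runif(n), north = runif(n),
                  age = runif(n, 20, 160), alt = runif(n, 200, 1200),
                  mg  = runif(n, 0.3, 1.5))
eta <- -1 + 0.01 * dat$age + 0.001 * dat$alt - 0.8 * dat$mg
dat$needle_loss <- rbinom(n, 20, plogis(eta)) / 20    # needle loss as a proportion

# Smooth (possibly non-linear) covariate effects plus a spatial surface;
# select = TRUE adds a shrinkage penalty so uninformative terms can drop out.
fit <- gam(needle_loss ~ s(age) + s(alt) + s(mg) + s(east, north),
           family = quasibinomial, weights = rep(20, n),
           data = dat, select = TRUE)
summary(fit)
```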
European Journal of Forest Research | 2008
Edgar Kublin; Nicole H. Augustin; Juha Lappi
We present a functional regression model for diameter prediction. Usually stem form is estimated from a regression model using dbh and height of the sample tree as predictors. With our model, additional diameter observations measured at arbitrary locations within the sample tree can be incorporated in the estimation in order to calibrate a standard prediction based on dbh and height. For this purpose, the stem form of a sample tree is modelled as a smooth random function. The observed diameters are treated as independent realizations from a sample of possible trajectories of the stem contour. The population average of the stem form within a given dbh and height class is estimated with the taper curves applied in the national forest inventory in Germany. Tree deviation from the population average is modelled with the help of a Karhunen–Loève expansion for the random part of the trajectory. Eigenfunctions and scores of the Karhunen–Loève expansion are estimated through conditional expectations within the methodological framework of functional principal component analysis (FPCA). In addition to a calibrated estimation of the stem form, FPCA provides asymptotic pointwise or simultaneous confidence intervals for the calibrated diameter predictions. For the application of functional principal component analysis, modelling the covariance function of the random process is crucial. The main features of the functional regression model are discussed informally and demonstrated by means of practical examples.
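The calibration machinery can be illustrated compactly: estimate the covariance of centred stem-form curves on a grid, use its eigendecomposition as an empirical Karhunen–Loève basis, and predict a sparsely measured tree's scores by conditional expectation, which shifts the population-average profile towards the observed diameters. The simulated curves below are illustrative; this is not the paper's taper-curve implementation.

```r
set.seed(8)
grid <- seq(0, 1, length.out = 50)                 # relative height along the stem
N <- 200
phi1 <- sqrt(2) * sin(pi * grid); phi2 <- sqrt(2) * cos(pi * grid)
scores <- cbind(rnorm(N, 0, 0.6), rnorm(N, 0, 0.3))
mu <- 1 - 0.7 * grid                               # stand-in population-average profile
Y  <- matrix(mu, N, length(grid), byrow = TRUE) + scores %*% rbind(phi1, phi2) +
      matrix(rnorm(N * length(grid), 0, 0.05), N, length(grid))

C   <- cov(Y)                                      # estimated covariance on the grid
eig <- eigen(C, symmetric = TRUE)
K   <- 2
Phi <- eig$vectors[, 1:K]; lam <- eig$values[1:K]
sig2 <- mean(eig$values[-(1:K)])                   # crude noise-variance estimate

# A new tree measured only at a few heights (dbh plus two extra diameters, say)
obs <- c(5, 20, 40)
y_obs <- mu[obs] + 0.8 * phi1[obs] - 0.2 * phi2[obs] + rnorm(3, 0, 0.05)

Phi_o   <- Phi[obs, , drop = FALSE]
Sigma_o <- Phi_o %*% diag(lam) %*% t(Phi_o) + diag(sig2, length(obs))
xi_hat  <- diag(lam) %*% t(Phi_o) %*% solve(Sigma_o, y_obs - mu[obs])  # E(scores | data)
calibrated <- mu + as.vector(Phi %*% xi_hat)       # calibrated diameter profile
```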
Journal of the American Statistical Association | 2017
Simon N. Wood; Zheyuan Li; Gavin Shaddick; Nicole H. Augustin
We develop scalable methods for fitting penalized regression spline based generalized additive models with of the order of 10^4 coefficients to up to 10^8 data points. Computational feasibility rests on: (i) a new iteration scheme for estimation of model coefficients and smoothing parameters, avoiding poorly scaling matrix operations; (ii) parallelization of the iteration's pivoted block Cholesky and basic matrix operations; (iii) the marginal discretization of model covariates to reduce memory footprint, with efficient scalable methods for computing required crossproducts directly from the discrete representation. Marginal discretization enables much finer discretization than joint discretization would permit. We were motivated by the need to model four decades' worth of daily particulate data from the U.K. Black Smoke and Sulphur Dioxide Monitoring Network. Although reduced in size recently, over 2000 stations have at some time been part of the network, resulting in some 10 million measurements. Modeling at a daily scale is desirable for accurate trend estimation and mapping, and to provide daily exposure estimates for epidemiological cohort studies. Because of the dataset size, previous work has focused on modeling time- or space-averaged pollution levels, but this is unsatisfactory from a health perspective, since it is often acute exposure, locally and on the time scale of days, that is most important in driving adverse health outcomes. If computed by conventional means, our black smoke model would require half a terabyte of storage just for the model matrix, whereas we are able to compute with it on a desktop workstation. The best previously available reduced-memory-footprint method would have required three orders of magnitude more computing time than our new method. Supplementary materials for this article are available online.
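Methods of this kind are available through mgcv's bam() with covariate discretization; the small-scale sketch below uses simulated data, whereas the intended use case is data sets with tens of millions of rows.

```r
library(mgcv)
set.seed(9)
dat <- gamSim(1, n = 100000, dist = "normal", scale = 2)   # simulated stand-in data
# discrete = TRUE discretizes the covariates and uses the scalable fitting
# iteration; nthreads runs the block Cholesky and crossproducts in parallel.
fit <- bam(y ~ s(x0) + s(x1) + s(x2) + te(x1, x3),
           data = dat, discrete = TRUE, nthreads = 2)
summary(fit)
```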