William Fithian | Researchain

Publication

Featured research published by William Fithian.

Methods in Ecology and Evolution | 2015

Point process models for presence-only analysis

Ian W. Renner; Jane Elith; Adrian Baddeley; William Fithian; Trevor Hastie; Steven J. Phillips; Gordana C. Popovic; David I. Warton

Summary: Presence-only data are widely used for species distribution modelling, and point process regression models are a flexible tool that has considerable potential for this problem, when data arise as point events. In this paper, we review point process models, some of their advantages and some common methods of fitting them to presence-only data. Advantages include (and are not limited to) clarification of what the response variable is that is modelled; a framework for choosing the number and location of quadrature points (commonly referred to as pseudo-absences or ‘background points’) objectively; clarity of model assumptions and tools for checking them; models to handle spatial dependence between points when it is present; and ways forward regarding difficult issues such as accounting for sampling bias. Point process models are related to some common approaches to presence-only species distribution modelling, which means that a variety of different software tools can be used to fit these models, including maxent or generalised linear modelling software.

Biometrika | 2015

Effective degrees of freedom: a flawed metaphor

Lucas Janson; William Fithian; Trevor Hastie

To most applied statisticians, a fitting procedure's degrees of freedom is synonymous with its model complexity, or its capacity for overfitting to data. In particular, it is often used to parameterize the bias-variance tradeoff in model selection. We argue that, on the contrary, model complexity and degrees of freedom may correspond very poorly. We exhibit and theoretically explore various fitting procedures for which degrees of freedom is not monotonic in the model complexity parameter, and can exceed the total dimension of the ambient space even in very simple settings. We show that the degrees of freedom for any non-convex projection method can be unbounded.
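
For orientation, the quantity under discussion is usually defined through a covariance formula. The abstract does not restate the definition, so the block below is the standard textbook form, given here as an assumption about what the authors mean; the symbols y, \hat{y}, \mu, \sigma^2 and H are notation introduced for this note.

```latex
% Assumed setting: y = \mu + \varepsilon with \varepsilon \sim N(0, \sigma^2 I_n),
% and \hat{y} = \hat{y}(y) is the fitted vector returned by the procedure.
\[
  \mathrm{df}(\hat{y}) \;=\; \frac{1}{\sigma^{2}} \sum_{i=1}^{n} \operatorname{Cov}\bigl(y_i, \hat{y}_i\bigr).
\]
% Special case: for a linear fit \hat{y} = H y with H fixed,
% \operatorname{Cov}(y_i, \hat{y}_i) = \sigma^2 H_{ii}, so
\[
  \mathrm{df}(\hat{y}) \;=\; \operatorname{tr}(H),
\]
% which equals the number of regressors p when H projects onto a p-dimensional subspace.
```

In the linear projection case the definition agrees with a simple parameter count. Nothing in the covariance formula itself forces that agreement for adaptive or non-convex procedures, which is the gap the abstract points to.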

Annals of Statistics | 2014

Local case-control sampling: efficient subsampling in imbalanced data sets

William Fithian; Trevor Hastie

For classification problems with significant class imbalance, subsampling can reduce computational costs at the price of inflated variance in estimating model parameters. We propose a method for subsampling efficiently for logistic regression by adjusting the class balance locally in feature space via an accept-reject scheme. Our method generalizes standard case-control sampling, using a pilot estimate to preferentially select examples whose responses are conditionally rare given their features. The biased subsampling is corrected by a post-hoc analytic adjustment to the parameters. The method is simple and requires one parallelizable scan over the full data set. Standard case-control sampling is inconsistent under model misspecification for the population risk-minimizing coefficients θ*. By contrast, our estimator is consistent for θ* provided that the pilot estimate is. Moreover, under correct specification and with a consistent, independent pilot estimate, our estimator has exactly twice the asymptotic variance of the full-sample MLE, even if the selected subsample comprises a minuscule fraction of the full data set, as happens when the original data are severely imbalanced. The factor of two improves to [Formula: see text] if we multiply the baseline acceptance probabilities by c > 1 (and weight points with acceptance probability greater than 1), taking roughly [Formula: see text] times as many data points into the subsample. Experiments on simulated and real data show that our method can substantially outperform standard case-control subsampling.
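
The accept-reject scheme described above can be sketched in a few lines of NumPy. This is an illustrative reading, not the authors' code. The abstract does not spell out the acceptance rule or the correction, so two details are assumptions: each point is kept with probability |y - p_pilot(x)|, and the correction adds the pilot coefficients to the subsample fit. The helper names (`fit_logistic`, `local_case_control`, `simulate`) are hypothetical.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def fit_logistic(X, y, n_iter=25, ridge=1e-8):
    """Newton-Raphson logistic regression; X must already contain an intercept column."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = sigmoid(X @ theta)
        w = p * (1.0 - p)
        grad = X.T @ (y - p)
        # Tiny ridge term keeps the Hessian invertible; it is a numerical safeguard only.
        hess = (X * w[:, None]).T @ X + ridge * np.eye(X.shape[1])
        theta = theta + np.linalg.solve(hess, grad)
    return theta


def local_case_control(X, y, theta_pilot, rng):
    """One pass over the data: accept-reject subsample, fit, then correct.

    Assumed rule: keep (x, y) with probability |y - p_pilot(x)|, so points whose
    label is surprising under the pilot model are kept preferentially.
    Assumed correction: add the pilot coefficients to the subsample fit.
    """
    p_pilot = sigmoid(X @ theta_pilot)
    accept_prob = np.abs(y - p_pilot)
    keep = rng.random(len(y)) < accept_prob
    theta_sub = fit_logistic(X[keep], y[keep])
    return theta_sub + theta_pilot, keep


def simulate(n, theta, rng):
    """Draw an imbalanced logistic data set with one standard normal feature."""
    X = np.column_stack([np.ones(n), rng.standard_normal(n)])
    y = (rng.random(n) < sigmoid(X @ theta)).astype(float)
    return X, y


rng = np.random.default_rng(0)
theta_true = np.array([-4.0, 1.0])            # positives are rare
X, y = simulate(200_000, theta_true, rng)
X_pilot, y_pilot = simulate(20_000, theta_true, rng)   # independent pilot sample
theta_pilot = fit_logistic(X_pilot, y_pilot)
theta_hat, keep = local_case_control(X, y, theta_pilot, rng)
print("fraction kept:", keep.mean())
print("estimate:", theta_hat)
```

The pilot is fitted on an independent draw here because the variance statement in the abstract is made for an independent pilot estimate. The acceptance step touches each row once and needs no communication between rows, which matches the single parallelizable scan mentioned above.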

Journal of the American Statistical Association | 2013

Selection Adjusted Confidence Intervals With More Power to Determine the Sign

Asaf Weinstein; William Fithian; Yoav Benjamini

In many current large-scale problems, confidence intervals (CIs) are constructed only for the parameters that are large, as indicated by their estimators, ignoring the smaller parameters. Such selective inference poses a problem to the usual marginal CIs that no longer offer the right level of coverage, not even on average over the selected parameters. We address this problem by developing three methods to construct short and valid CIs for the location parameter of a symmetric unimodal distribution, while conditioning on its estimator being larger than some constant threshold. In two of these methods, the CI is further required to offer early sign determination, that is, to avoid including parameters of both signs for relatively small values of the estimator. One of the two, the Conditional Quasi-Conventional CI, offers a good balance between length and sign determination while protecting from the effect of selection. The CI is not symmetric, extending more toward 0 than away from it, nor is it of constant shape. However, when the estimator is far away from the threshold, the proposed CI tends to the usual marginal one. In spite of its complexity, it is specified by closed form expressions, up to a small set of constants that are each the solution of a single variable equation. When multiple testing procedures are used to control the false discovery rate or other error rates, the resulting threshold for selecting may be data dependent. We show that conditioning the above CIs on the data-dependent threshold still controls the false coverage-statement rate (FCR) for many widely used testing procedures. For these reasons, the conditional CIs for the parameters selected this way are an attractive alternative to the available general FCR adjusted intervals. We demonstrate the use of the method in the analysis of some 14,000 correlations between hormone change and brain activity change in response to the subjects being exposed to stressful movie clips.
Supplementary materials for this article are available online.

Journal of Multivariate Analysis | 2017

Multiple correspondence analysis and the multilogit bilinear model

William Fithian; Julie Josse

Multiple correspondence analysis is a dimension reduction technique which plays a large role in the analysis of tables with categorical nominal variables, such as survey data. Though it is usually motivated and derived using geometric considerations, we prove that in fact, it can be seen as a single proximal Newton step of a natural bilinear exponential family model for categorical data: the multinomial logit bilinear model. We compare and contrast the behavior of multiple correspondence analysis with that of this model on simulated data, and discuss new insights into both approaches and their cognate models. Consequently, multiple correspondence analysis can be used to approximate the parameters of the multilogit model. Indeed, estimating the model’s parameters is non-trivial, whereas multiple correspondence analysis has the advantage of being easily solved by a singular value decomposition, and scalable to large data sets. We illustrate the methods on a survey of the drinking habits in France in the context of European policies against the harmful effects of alcohol on society.
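
The remark that multiple correspondence analysis reduces to a singular value decomposition can be made concrete with a short NumPy sketch. It uses the textbook construction (correspondence analysis of the one-hot indicator matrix), which is an assumption about the variant meant here. It does not implement the multilogit bilinear model or the proximal Newton argument, and the function names are hypothetical.

```python
import numpy as np


def indicator_matrix(categories):
    """One-hot encode an (n, Q) integer array, one block of columns per variable."""
    blocks = []
    for q in range(categories.shape[1]):
        levels = np.unique(categories[:, q])   # only levels that actually occur
        blocks.append((categories[:, q][:, None] == levels[None, :]).astype(float))
    return np.hstack(blocks)


def mca(categories, n_components=2):
    """Textbook MCA: SVD of the standardized residuals of the indicator matrix."""
    Z = indicator_matrix(categories)
    P = Z / Z.sum()                  # correspondence matrix
    r = P.sum(axis=1)                # row masses
    c = P.sum(axis=0)                # column masses
    expected = np.outer(r, c)
    S = (P - expected) / np.sqrt(expected)
    U, s, Vt = np.linalg.svd(S, full_matrices=False)
    k = n_components
    row_coords = (U[:, :k] * s[:k]) / np.sqrt(r)[:, None]    # individuals
    col_coords = (Vt[:k].T * s[:k]) / np.sqrt(c)[:, None]    # categories
    return row_coords, col_coords, s[:k]


rng = np.random.default_rng(0)
survey = rng.integers(0, 3, size=(100, 4))     # synthetic data: 100 respondents, 4 questions
rows, cols, sing_vals = mca(survey)
print(rows.shape, cols.shape, sing_vals)
```

The whole computation is one dense SVD of an n by (number of categories) matrix, which is what makes the method cheap compared with iteratively fitting a likelihood-based model.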

Statistical Science | 2018

Flexible Low-Rank Statistical Modeling with Missing Data and Side Information

William Fithian; Rahul Mazumder

We explore a general statistical framework for low-rank modeling of matrix-valued data, based on convex optimization with a generalized nuclear norm penalty. We study several related problems: the usual low-rank matrix completion problem with flexible loss functions arising from generalized linear models; reduced-rank regression and multi-task learning; and generalizations of both problems where side information about rows and columns is available, in the form of features or smoothing kernels. We show that our approach encompasses maximum a posteriori estimation arising from Bayesian hierarchical modeling with latent factors, and discuss ramifications of the missing-data mechanism in the context of matrix completion. While the above problems can be naturally posed as rank-constrained optimization problems, which are nonconvex and computationally difficult, we show how to relax them via generalized nuclear norm regularization to obtain convex optimization problems. We discuss algorithms drawing inspiration from modern convex optimization methods to address these large-scale convex optimization tasks. Finally, we illustrate our flexible approach in problems arising in functional data reconstruction and ecological species distribution modeling.
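
As a rough illustration of the convex relaxation mentioned above, the sketch below solves the simplest special case: matrix completion with squared-error loss and the ordinary (not generalized) nuclear norm, using iterative singular value soft-thresholding in the style of the Soft-Impute algorithm. It is a deliberately simplified stand-in, with no side information and no generalized linear model loss, and the function names are hypothetical.

```python
import numpy as np


def svd_soft_threshold(Z, lam):
    """Proximal operator of lam * (nuclear norm): shrink every singular value by lam."""
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    s_shrunk = np.maximum(s - lam, 0.0)
    return (U * s_shrunk) @ Vt


def soft_impute(Y, observed, lam, n_iter=300):
    """Approximately minimize 0.5 * ||P_obs(Y - M)||_F^2 + lam * ||M||_*.

    `observed` is a boolean mask; entries of Y where it is False are ignored.
    """
    M = np.zeros_like(Y, dtype=float)
    for _ in range(n_iter):
        # Fill missing entries with the current estimate, then shrink.
        filled = np.where(observed, Y, M)
        M = svd_soft_threshold(filled, lam)
    return M


rng = np.random.default_rng(0)
truth = rng.standard_normal((50, 2)) @ rng.standard_normal((2, 40))   # rank-2 matrix
observed = rng.random(truth.shape) < 0.6                               # 60% of entries seen
estimate = soft_impute(truth, observed, lam=1.0)
missing_error = np.linalg.norm((estimate - truth)[~observed]) / np.linalg.norm(truth[~observed])
print("relative error on unobserved entries:", missing_error)
```

The soft-thresholding step is what replaces the hard rank constraint: small singular values are set exactly to zero, so the estimate is typically low rank even though no rank is fixed in advance.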

Ecography | 2013