Featured Research

Other Statistics

Opening practice: supporting reproducibility and critical spatial data science

This paper reflects on a number of trends towards a more open and reproducible approach to geographic and spatial data science in recent years. In particular, it considers trends towards Big Data and the impacts these are having on spatial data analysis and modelling. It identifies a turn in academia towards coding as a core analytic tool, and away from proprietary software tools offering 'black boxes' in which the internal workings of the analysis are not revealed. It argues that such closed-form software is problematic, and considers a number of ways in which well-known issues in spatial data analysis (such as the modifiable areal unit problem, MAUP) can be overlooked when working with closed tools, leading to problems of interpretation and possibly to inappropriate actions and policies based on them. The paper then considers the role that reproducible and open spatial science may play in addressing these issues. It highlights the dangers of failing to account for the geographical properties of data now that all data are spatial (they are collected somewhere), the problems created by the desire for n = all observations in data science, and the need for a critical approach: one in which openness, transparency, sharing and reproducibility provide a mantra for defensible and robust spatial data science.
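To make the MAUP concern concrete, here is a minimal, self-contained sketch (synthetic data and hypothetical variable names, not code from the paper) showing how the estimated relationship between two variables can shift as point observations are aggregated to coarser areal units:

```python
# Minimal sketch of the Modifiable Areal Unit Problem (MAUP): the correlation
# between two variables changes as point data are aggregated to coarser
# spatial units. Synthetic data; illustrative only.
import numpy as np

rng = np.random.default_rng(42)
n = 5000
x_coord = rng.uniform(0, 100, n)          # point locations on a 100 x 100 region
y_coord = rng.uniform(0, 100, n)
income = 20 + 0.3 * x_coord + rng.normal(0, 10, n)   # spatially trending variable
health = 50 + 0.5 * income + rng.normal(0, 25, n)    # noisily related outcome

def aggregated_correlation(cell_size):
    """Correlation of cell means when points are binned into square cells."""
    col = (x_coord // cell_size).astype(int)
    row = (y_coord // cell_size).astype(int)
    cell_id = row * 1000 + col
    cells = {}
    for cid, inc, hlth in zip(cell_id, income, health):
        cells.setdefault(cid, []).append((inc, hlth))
    means = np.array([np.mean(v, axis=0) for v in cells.values()])
    return np.corrcoef(means[:, 0], means[:, 1])[0, 1]

print("point-level correlation:", round(np.corrcoef(income, health)[0, 1], 3))
for size in (5, 20, 50):
    print(f"correlation at {size} x {size} cells:", round(aggregated_correlation(size), 3))
```

The point-level and aggregated correlations typically diverge, and the aggregated value depends on the (arbitrary) cell size, which is exactly the kind of scale effect a closed analysis pipeline can hide.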

Read more
Other Statistics

Openness and Reproducibility: Insights from a Model-Centric Approach

This paper investigates the conceptual relationship between openness and reproducibility using a model-centric approach, heavily informed by probability theory and statistics. We first clarify the concepts of reliability, auditability, replicability, and reproducibility, each of which denotes a potential scientific objective. Then we advance a conceptual analysis to delineate the relationship between open scientific practices and these objectives. Using the notion of an idealized experiment, we identify which components of an experiment need to be reported and which need to be repeated to achieve the relevant objective. The model-centric framework we propose aims to contribute precision and clarity to the discussions surrounding the so-called reproducibility crisis.

Read more
Other Statistics

Optimal spectral shrinkage and PCA with heteroscedastic noise

This paper studies the related problems of prediction, covariance estimation, and principal component analysis for the spiked covariance model with heteroscedastic noise. We consider an estimator of the principal components based on whitening the noise, and we derive optimal singular value and eigenvalue shrinkers for use with these estimated principal components. Underlying these methods are new asymptotic results for the high-dimensional spiked model with heteroscedastic noise, and consistent estimators for the relevant population parameters. We extend previous analysis on out-of-sample prediction to the setting of predictors with whitening. We demonstrate certain advantages of noise whitening. Specifically, we show that in a certain asymptotic regime, optimal singular value shrinkage with whitening converges to the best linear predictor, whereas without whitening it converges to a suboptimal linear predictor. We prove that for generic signals, whitening improves estimation of the principal components, and increases a natural signal-to-noise ratio of the observations. We also show that for rank one signals, our estimated principal components achieve the asymptotic minimax rate.
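The whiten-then-shrink recipe the abstract describes can be sketched as follows. This is a generic illustration that assumes the per-coordinate noise levels are known and uses a simple soft-threshold shrinker at the Marchenko-Pastur bulk edge; it is not the paper's optimal shrinkers or its estimators of the population parameters:

```python
# Sketch of the general "whiten, shrink, un-whiten" recipe for a spiked model
# with heteroscedastic noise. The shrinkage rule is a generic soft threshold
# at the Marchenko-Pastur bulk edge, NOT the paper's optimal shrinkers.
import numpy as np

rng = np.random.default_rng(0)
n, p, rank = 400, 200, 3

# Synthetic spiked data: low-rank signal plus heteroscedastic noise.
signal = rng.normal(size=(n, rank)) @ rng.normal(size=(rank, p)) * 0.5
noise_std = rng.uniform(0.5, 2.0, size=p)          # per-coordinate noise levels
X = signal + rng.normal(size=(n, p)) * noise_std

# 1. Whiten: rescale each coordinate so the noise is (approximately) isotropic.
#    Noise levels are assumed known here; in practice they must be estimated.
Xw = X / noise_std

# 2. SVD of the whitened matrix and a generic shrinkage of its singular values.
U, s, Vt = np.linalg.svd(Xw / np.sqrt(n), full_matrices=False)
bulk_edge = 1 + np.sqrt(p / n)                      # Marchenko-Pastur upper edge for unit noise
s_shrunk = np.where(s > bulk_edge, s - bulk_edge, 0.0)

# 3. Reconstruct and un-whiten to obtain a denoised estimate of the signal.
X_denoised = (U * s_shrunk) @ Vt * np.sqrt(n) * noise_std

err_raw = np.linalg.norm(X - signal) / np.linalg.norm(signal)
err_den = np.linalg.norm(X_denoised - signal) / np.linalg.norm(signal)
print(f"relative error, raw data: {err_raw:.2f}   denoised: {err_den:.2f}")
```

The denoised error is typically much smaller than the raw error, which conveys the basic benefit of whitening before applying spectral shrinkage.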

Read more
Other Statistics

Organic fiducial inference

A substantial generalisation is put forward of the theory of subjective fiducial inference as it was outlined in earlier papers. In particular, this theory is extended to deal with cases where the data are discrete or categorical rather than continuous, and cases where there is important pre-data knowledge about some or all of the model parameters. The system for directly expressing and then handling this pre-data knowledge, via what are referred to as global and local pre-data functions for the parameters concerned, is distinct from the approach of attempting to represent this knowledge directly as a prior distribution function over these parameters and then applying Bayes' theorem. In this regard, the individual attributes of what are identified as three separate types of fiducial argument, namely the strong, moderate and weak fiducial arguments, form an integral part of the theory that is developed. Various practical examples of the application of this theory are presented, including examples involving binomial, Poisson and multinomial data. The fiducial distribution functions for the parameters of the models in these examples are interpreted in terms of a generalised definition of subjective probability that was set out previously.

Read more
Other Statistics

PUMA criterion = MODE criterion

We show that the recently proposed (enhanced) PUMA estimator for array processing minimizes the same criterion function as the well-established MODE estimator. (PUMA = principal-singular-vector utilization for modal analysis, MODE = method of direction estimation.)

Read more
Other Statistics

Perceptive Statistical Variability Indicators

The concepts of variability and uncertainty, both epistemic and aleatory, arise from experience and coexist with different connotations. This article therefore attempts to express their relation by analytic means, focusing first on their differences and then on their common characteristics. Inspired by the entropy-based definition of the average number of equally probable events in probability theory, the article introduces two related perceptive statistical measures that indicate the same variability as the underlying probability distribution. The first is the equivalent number of a hypothetical distribution with one sure outcome and all other outcomes impossible, which indicates variability. The second is the corresponding equivalent number of a hypothetical distribution with all outcomes equally probable, which indicates invariability. The article interprets the common properties of variability and uncertainty on theoretical distributions and on ocean-wide wind-wave directional properties, using the long-term observations compiled in the Global Wave Statistics.
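The entropy-based "average number of equally probable events" that motivates these indicators can be sketched as the exponential of the Shannon entropy. The code below illustrates that quantity only, not the paper's two perceptive indicators themselves:

```python
# Exponential of the Shannon entropy (natural log) as the "average number of
# equally probable events": it equals k for a uniform distribution over k
# outcomes and 1 for a distribution with one sure outcome. Illustrative only.
import numpy as np

def effective_number(p):
    """exp(Shannon entropy) of a discrete distribution p (zero terms ignored)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(np.exp(-np.sum(p * np.log(p))))

print(effective_number([1.0, 0.0, 0.0]))   # sure outcome          -> 1.0 (no variability)
print(effective_number([0.25] * 4))        # uniform over 4 events -> 4.0 (maximal variability)
print(effective_number([0.7, 0.2, 0.1]))   # in between            -> about 2.2
```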

Read more
Other Statistics

Perspective from the Literature on the Role of Expert Judgment in Scientific and Statistical Research and Practice

This article, produced as a result of the Symposium on Statistical Inference, is an introduction to the literature on the function of expertise, judgment, and choice in the practice of statistics and scientific research. In particular, expert judgment plays a critical role in conducting frequentist hypothesis tests and in specifying Bayesian models, especially in the selection of appropriate prior distributions for model parameters. The subtlety of interpreting results is also discussed. Finally, recommendations are collected from external sources on how to more effectively encourage the proper use of judgment in statistics. The paper synthesizes the literature for the purpose of creating a single reference and inciting more productive discussions on how to improve the future of statistics and science.

Read more
Other Statistics

Peter Hall's work on high-dimensional data and classification

In this article, I summarise Peter Hall's contributions to high-dimensional data, including their geometric representations and variable selection methods based on ranking. I also discuss his work on classification problems, concluding with some personal reflections on my own interactions with him.

Read more
Other Statistics

Picking Winners in Daily Fantasy Sports Using Integer Programming

We consider the problem of selecting a portfolio of entries of fixed cardinality for contests with top-heavy payoff structures, i.e., most of the winnings go to the top-ranked entries. This framework is general and can be used to model a variety of problems, such as movie studios selecting movies to produce, venture capital firms picking start-up companies to invest in, or individuals selecting lineups for daily fantasy sports contests, which is the example we focus on here. We model the portfolio selection task as a combinatorial optimization problem with a submodular objective function, which is given by the probability of at least one entry winning. We then show that this probability can be approximated using only pairwise marginal probabilities of the entries winning when there is a certain structure on their joint distribution. We consider a model where the entries are jointly Gaussian random variables and present a closed-form approximation to the objective function. Building on this, we then consider a scenario where the entries are given by sums of constrained resources and present an integer programming formulation to construct the entries. Our formulation uses principles based on our theoretical analysis to construct entries: we maximize the expected score of an entry subject to a lower bound on its variance and an upper bound on its correlation with previously constructed entries. To demonstrate the effectiveness of our integer programming approach, we apply it to daily fantasy sports contests that have top-heavy payoff structures. We find that our approach performs well in practice. Using our integer programming approach, we are able to rank in the top ten multiple times in hockey and baseball contests with thousands of competing entries. Our approach can easily be extended to other problems with constrained resources and a top-heavy payoff structure.
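As a rough illustration of the construction principle described in the abstract (maximize an entry's expected score subject to a variance lower bound and a cap on its correlation with previously built entries), here is a greedy, brute-force sketch on toy data. The paper itself solves an integer program rather than enumerating lineups, and the player projections, covariance matrix, and thresholds below are hypothetical:

```python
# Greedy sketch of the portfolio-construction principle: each new entry
# maximizes expected score subject to a lower bound on its variance and an
# upper bound on its correlation with entries already chosen. Toy data and
# brute-force enumeration stand in for the paper's integer program.
import itertools
import numpy as np

rng = np.random.default_rng(1)
n_players, lineup_size, n_entries = 12, 3, 4
mu = rng.uniform(5, 15, n_players)                 # hypothetical projected points per player
A = rng.normal(size=(n_players, n_players))
sigma = A @ A.T / n_players                        # toy covariance of player scores

min_variance, max_corr = 3.0, 0.6                  # hypothetical tuning constants

def lineup_stats(players):
    idx = list(players)
    mean = mu[idx].sum()
    var = sigma[np.ix_(idx, idx)].sum()            # variance of the lineup's total score
    return mean, var

def corr_between(a, b):
    cov = sigma[np.ix_(list(a), list(b))].sum()
    _, va = lineup_stats(a)
    _, vb = lineup_stats(b)
    return cov / np.sqrt(va * vb)

portfolio = []
for _ in range(n_entries):
    best = None
    for cand in itertools.combinations(range(n_players), lineup_size):
        mean, var = lineup_stats(cand)
        if var < min_variance:                     # enforce the variance lower bound
            continue
        if any(corr_between(cand, prev) > max_corr for prev in portfolio):
            continue                               # enforce the correlation cap
        if best is None or mean > best[0]:
            best = (mean, cand)
    if best is not None:
        portfolio.append(best[1])

print("selected lineups (player indices):", portfolio)
```

The variance floor and correlation cap are what push the portfolio towards high-upside, mutually diversified entries, which is the behaviour a top-heavy payoff structure rewards.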

Read more
Other Statistics

Popper's falsification and corroboration from the statistical perspectives

The role of probability appears unchallenged as the key measure of uncertainty, used among other things for practical induction in the empirical sciences. Yet Popper was emphatic in his rejection of inductive probability and of the logical probability of hypotheses; furthermore, for him the degree of corroboration cannot be a probability. Instead he proposed a deductive method of testing. This dialectic tension has many parallels in statistics, with the Bayesians on the logico-inductive side and the non-Bayesians, or frequentists, on the other. Simplistically, Popper seems to be on the frequentist side, but recent syntheses on the non-Bayesian side might direct the Popperian view to a more nuanced destination. Logical probability seems perfectly suited to measuring partial evidence or support, so what can we use if we are to reject it? For the past 100 years, statisticians have developed a related concept called likelihood, which has played a central role in statistical modelling and inference. Remarkably, this Fisherian concept of uncertainty is largely unknown, or at least severely under-appreciated, in the non-statistical literature. As a measure of corroboration, the likelihood satisfies the Popperian requirement that it not be a probability. Our aim is to introduce the likelihood and its recent extension via a discussion of two well-known logical fallacies, in order to highlight that its lack of recognition may have led to unnecessary confusion in our discourse about the falsification and corroboration of hypotheses. We highlight the 100 years of development of likelihood concepts. The year 2021 will mark the 100-year anniversary of the likelihood, so with this paper we wish it a long life and increased appreciation in the non-statistical literature.
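As a small numerical companion to the point that likelihood is not a probability, the sketch below (a binomial example with hypothetical numbers, not taken from the paper) shows that the likelihood function of the parameter does not integrate to one, while likelihood ratios still compare the support that the data give to two hypotheses:

```python
# For y successes in n Bernoulli trials, L(theta) = C(n, y) * theta**y *
# (1 - theta)**(n - y), viewed as a function of theta, does not integrate to 1
# over [0, 1]; it is a measure of relative support, not a probability density.
from math import comb

def binomial_likelihood(theta, y=7, n=10):
    return comb(n, y) * theta**y * (1 - theta)**(n - y)

# Crude midpoint-rule integral of the likelihood over theta in [0, 1].
steps = 10_000
area = sum(binomial_likelihood((i + 0.5) / steps) for i in range(steps)) / steps
print(f"integral of L(theta) over [0, 1]: {area:.3f}")   # about 0.091, not 1

# Likelihood ratio comparing two hypotheses about theta.
print(binomial_likelihood(0.7) / binomial_likelihood(0.5))  # about 2.3, mild support for 0.7
```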

Read more
