Featured Research

Other Statistics

A Role for Symmetry in the Bayesian Solution of Differential Equations

The interpretation of numerical methods, such as finite difference methods for differential equations, as point estimators suggests that formal uncertainty quantification can also be performed in this context. Competing statistical paradigms can be considered, and Bayesian probabilistic numerical methods (PNMs) are obtained when Bayesian statistical principles are deployed. Bayesian PNMs have the appealing property of being closed under composition, such that uncertainty due to different sources of discretisation in a numerical method can be jointly modelled and rigorously propagated. Despite recent attention, no exact Bayesian PNM for the numerical solution of ordinary differential equations (ODEs) has been proposed. This raises the fundamental question of whether exact Bayesian methods for (in general nonlinear) ODEs even exist. The purpose of this paper is to provide a positive answer for a limited class of ODEs. To this end, we work at a foundational level, where a novel Bayesian PNM is proposed as a proof of concept. Our proposal is a synthesis of classical Lie group methods, to exploit underlying symmetries in the gradient field, and non-parametric regression in a transformed solution space for the ODE. The procedure is presented in detail for first- and second-order ODEs and relies on a strong technical condition being satisfied, namely the existence of a solvable Lie algebra. Numerical illustrations are provided.
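As background only, here is a textbook illustration of the classical Lie group machinery the abstract invokes (a scaling symmetry reducing a first-order ODE to quadrature); it is not the paper's Bayesian construction.

```latex
% Textbook illustration of a Lie (scaling) symmetry, not the paper's
% Bayesian construction: any first-order ODE of the form
\[
\frac{dy}{dx} = F\!\left(\frac{y}{x}\right)
\]
% is invariant under (x, y) -> (lambda x, lambda y).  The canonical
% coordinate v = y/x, with y = vx and dy/dx = v + x dv/dx, reduces it to
% the separable equation
\[
x\,\frac{dv}{dx} = F(v) - v ,
\qquad\text{so}\qquad
\int \frac{dv}{F(v) - v} = \ln\lvert x\rvert + C .
\]
```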

Read more
Other Statistics

A Statistical Significance Simulation Study for the General Scientist

When a scientist performs an experiment, they normally acquire a set of measurements and are expected to demonstrate that their results are "statistically significant", thus confirming whatever hypothesis they are testing. The main method for establishing statistical significance involves demonstrating that there is a low probability that the observed experimental results were the product of random chance. This is typically defined as p < 0.05, which indicates there is less than a 5% chance that the observed results occurred randomly. This research study visually demonstrates that the commonly used definition of "statistical significance" can erroneously imply a significant finding. This is demonstrated by generating random Gaussian noise data and analyzing that data with the established two-sample t-test. The study shows that insignificant yet "statistically significant" findings are possible at moderately large sample sizes, which are very common in many fields of modern science.
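A minimal sketch of the kind of simulation described above; the sample sizes, number of repetitions and significance threshold below are illustrative assumptions rather than the study's settings.

```python
# Minimal sketch of the simulation described above; sample sizes, effect
# size (zero) and number of repetitions are illustrative assumptions.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
n_repeats, n_per_group, alpha = 10_000, 100, 0.05

false_positives = 0
for _ in range(n_repeats):
    a = rng.normal(loc=0.0, scale=1.0, size=n_per_group)  # pure noise
    b = rng.normal(loc=0.0, scale=1.0, size=n_per_group)  # pure noise
    _, p_value = ttest_ind(a, b)
    if p_value < alpha:
        false_positives += 1  # "statistically significant" by chance alone

print(f"Fraction of significant results under pure noise: "
      f"{false_positives / n_repeats:.3f}")  # close to 0.05 by construction
```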

Read more
Other Statistics

A Step by Step Mathematical Derivation and Tutorial on Kalman Filters

We present a step-by-step mathematical derivation of the Kalman filter using two different approaches. First, we consider the orthogonal projection method by means of vector-space optimization. Second, we derive the Kalman filter using Bayesian optimal filtering. We provide detailed proofs for both methods, and each equation is expanded in detail.
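For orientation before the derivation, here is a minimal sketch of the standard linear Kalman filter predict/update recursion; the model matrices are generic placeholders, not an example taken from the paper.

```python
# Minimal sketch of the standard linear Kalman filter recursion; the model
# matrices below are generic placeholders, not an example from the paper.
import numpy as np

def kalman_step(x, P, z, F, H, Q, R):
    """One predict/update cycle for a state x ~ N(x, P) given measurement z."""
    # Predict: propagate state estimate and covariance through the dynamics.
    x_pred = F @ x
    P_pred = F @ P @ F.T + Q
    # Update: fold in the measurement via the Kalman gain.
    y = z - H @ x_pred                      # innovation
    S = H @ P_pred @ H.T + R                # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)     # Kalman gain
    x_new = x_pred + K @ y
    P_new = (np.eye(len(x)) - K @ H) @ P_pred
    return x_new, P_new

# Example usage: 1-D constant-velocity model with a noisy position sensor.
F = np.array([[1.0, 1.0], [0.0, 1.0]]); H = np.array([[1.0, 0.0]])
Q = 0.01 * np.eye(2); R = np.array([[0.5]])
x, P = np.zeros(2), np.eye(2)
x, P = kalman_step(x, P, z=np.array([1.2]), F=F, H=H, Q=Q, R=R)
```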

Read more
Other Statistics

A Text Mining Discovery of Similarities and Dissimilarities Among Sacred Scriptures

The careful examination of sacred texts gives valuable insights into human psychology and different ideas regarding the organization of societies, as well as into concepts such as truth and God. To improve and deepen our understanding of sacred texts, their comparison and separation is crucial. For this purpose, we use a data set of nine sacred scriptures. This work deals with the separation of the Quran; the Asian scriptures, namely the Tao-Te-Ching, Buddhist texts, the Yogasutras, and the Upanishads; and four books from the Bible, namely the Book of Proverbs, the Book of Ecclesiastes, the Book of Ecclesiasticus, and the Book of Wisdom. These scriptures are analyzed using natural language processing (NLP), creating a mathematical representation of the corpus in terms of word frequencies called a document-term matrix (DTM). After this step, supervised and unsupervised machine learning methods are applied to perform classification. Here we use Multinomial Naive Bayes (MNB), the Support Vector Machine (SVM), Random Forest (RF), and K-nearest Neighbors (KNN). We find that, among these methods, MNB is able to predict the class of a sacred text with an accuracy of about 85.84%.
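A minimal sketch of the DTM-plus-MNB pipeline described above, using scikit-learn; the toy passages and labels are placeholders, since the nine-scripture corpus itself is not reproduced here.

```python
# Minimal sketch of the document-term-matrix + Multinomial Naive Bayes
# pipeline; the tiny corpus below is a placeholder, not the paper's data.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

train_texts = [
    "the way that can be told is not the eternal way",   # placeholder passage
    "a wise son makes a glad father",                    # placeholder passage
]
train_labels = ["Tao-Te-Ching", "Proverbs"]

vectorizer = CountVectorizer()            # builds the document-term matrix
X_train = vectorizer.fit_transform(train_texts)

classifier = MultinomialNB()
classifier.fit(X_train, train_labels)

X_new = vectorizer.transform(["train up a child in the way he should go"])
print(classifier.predict(X_new))          # predicted scripture label
```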

Read more
Other Statistics

A brief history of long memory: Hurst, Mandelbrot and the road to ARFIMA

Long memory plays an important role in many fields, such as climate, hydrology, finance, networks and DNA sequencing, by determining the behaviour and predictability of systems. In particular, it is important to test whether a process exhibits long memory, since that impacts the accuracy and confidence with which one may predict future events on the basis of a small amount of historical data. A major force in the development and study of long memory was the late Benoit B. Mandelbrot. Here we discuss the original motivation for the development of long memory and Mandelbrot's influence on this fascinating field. We also elucidate the sometimes contrasting approaches to long memory taken in different scientific communities.
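As a concrete illustration of testing for long memory, drawn from the standard toolbox rather than from this paper, the aggregated variance method estimates the Hurst exponent H from how the variance of block means scales with block size.

```python
# Minimal sketch of one classical check for long memory: the aggregated
# variance estimate of the Hurst exponent H (block sizes are illustrative).
import numpy as np

def hurst_aggregated_variance(x, block_sizes=(8, 16, 32, 64, 128)):
    """Estimate H from the scaling Var(block means) ~ m**(2H - 2)."""
    x = np.asarray(x, dtype=float)
    log_m, log_var = [], []
    for m in block_sizes:
        n_blocks = len(x) // m
        if n_blocks < 2:
            continue
        means = x[: n_blocks * m].reshape(n_blocks, m).mean(axis=1)
        log_m.append(np.log(m))
        log_var.append(np.log(means.var()))
    slope, _ = np.polyfit(log_m, log_var, 1)
    return 1.0 + slope / 2.0   # slope = 2H - 2 under the scaling law

# For i.i.d. noise (no long memory) the estimate should be close to 0.5.
print(hurst_aggregated_variance(np.random.default_rng(1).standard_normal(10_000)))
```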

Read more
Other Statistics

A brief history of the Fail Safe Number in Applied Research

Rosenthal's (1979) Fail-Safe Number (FSN) is probably one of the best-known statistics in the context of meta-analysis. It is intended to estimate the number of unpublished studies that would be required to bring the meta-analytic mean effect size down to a statistically insignificant level. Even before Scargle's (2000) and Schonemann and Scargle's (2008) fundamental critique of the claimed stability of the basic rationale of the FSN approach, objections to its central assumption, namely that the unpublished studies average a null effect, were expressed throughout the history of the FSN by different authors (Elashoff, 1978; Iyengar & Greenhouse, 1988a, 1988b; see also Scargle, 2000). Elashoff's objection, in particular, appears to be important because it was the very first critique pointing directly to the central problem of the FSN: "R & R claim that the number of studies hidden in the drawers would have to be 65,000 to achieve a mean effect size of zero when combined with the 345 studies reviewed here. But surely, if we allowed the hidden studies to be negative, on the average no more than 345 hidden studies would be necessary to obtain a zero mean effect size" (p. 392). Thus, users of meta-analysis could have been aware from the beginning that something was wrong with the statistical reasoning behind the FSN. From an applied research perspective, it is therefore of interest whether any of the fundamental objections to the FSN are reflected in standard handbooks on meta-analysis and, even more importantly, in meta-analytic studies themselves.
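For context, and quoting the standard meta-analytic literature rather than this abstract, Rosenthal's fail-safe number is usually written as follows; the critique above targets its assumption that the hidden studies have a zero mean effect.

```latex
% Standard textbook form of Rosenthal's fail-safe number, stated for
% context; the critique above targets its zero-mean assumption.
\[
N_{fs} \;=\; \frac{\left(\sum_{i=1}^{k} Z_i\right)^{2}}{z_{\alpha}^{2}} \;-\; k ,
\]
% where the k observed studies contribute standard normal scores Z_i and
% z_alpha is the one-tailed critical value (1.645 for alpha = .05).
% Elashoff's counterpoint in the quoted passage is simple arithmetic: if
% the hidden studies instead average the negative of the observed mean
% effect dbar, then k * dbar + k * (-dbar) = 0, so k hidden studies
% (here 345) already suffice to pull the combined mean effect to zero.
```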

Read more
Other Statistics

A comparative study of scoring systems by simulations

Scoring rules aggregate individual rankings by assigning some points to each position in each ranking, such that the total sum of points provides the overall ranking of the alternatives. They are widely used in sports competitions consisting of multiple contests. We study the tradeoff between two risks in this setting: (1) the threat of an early clinch, when the title has been secured before the last contest(s) of the competition take place; (2) the danger of winning the competition without finishing first in any contest. In particular, four historical points scoring systems of the Formula One World Championship are compared with the family of geometric scoring rules, which have favourable axiomatic properties. The former are found to be competitive or even better. The current scheme seems to be a reasonable compromise between the two goals above. Our results shed more light on the evolution of the Formula One points scoring systems and contribute to the issue of choosing the set of point values.
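A minimal sketch of the aggregation step defined in the first sentence; the point vectors are illustrative (the 25-18-15 values mirror the familiar modern Formula One scheme, and the base-2 geometric vector is just one member of that family, not necessarily the parametrisation used in the paper).

```python
# Minimal sketch of a points scoring system: each contest ranking awards
# fixed points per position, and the season total orders the competitors.
# The point vectors below are illustrative, not taken from the paper.
from collections import defaultdict

def season_totals(contest_rankings, points_per_position):
    """contest_rankings: list of rankings, each a list of names, best first."""
    totals = defaultdict(float)
    for ranking in contest_rankings:
        for position, name in enumerate(ranking):
            if position < len(points_per_position):
                totals[name] += points_per_position[position]
    return sorted(totals.items(), key=lambda kv: -kv[1])

rankings = [["A", "B", "C"], ["B", "A", "C"], ["B", "C", "A"]]
f1_style = [25, 18, 15]          # familiar modern F1-style top positions
geometric = [3, 1, 0]            # base-2 geometric-type rule: 2**2-1, 2**1-1, 2**0-1
print(season_totals(rankings, f1_style))
print(season_totals(rankings, geometric))
```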

Read more
Other Statistics

A divergence formula for regularization methods with an L2 constraint

We derive a divergence formula for a group of regularization methods with an L2 constraint. The formula is useful for regularization parameter selection because it provides an unbiased estimate of the number of degrees of freedom. We begin by deriving the formula for smoothing splines and then extend it to other settings such as penalized splines, ridge regression, and functional linear regression.
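As a concrete special case quoted from the standard literature rather than from the paper itself, for ridge regression the divergence of the fitted values reduces to the familiar trace formula for effective degrees of freedom.

```latex
% Ridge-regression special case (standard result, quoted for context): the
% divergence of the fit equals the trace of the smoother matrix.
\[
\hat{y}_\lambda = S_\lambda y, \qquad
S_\lambda = X\,(X^{\top}X + \lambda I)^{-1}X^{\top},
\]
\[
\mathrm{df}(\lambda)
  = \sum_{i=1}^{n} \frac{\partial \hat{y}_{\lambda,i}}{\partial y_i}
  = \operatorname{tr}(S_\lambda)
  = \sum_{j} \frac{d_j^{2}}{d_j^{2} + \lambda},
\]
% where the d_j are the singular values of X.
```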

Read more
Other Statistics

A few statistical principles for data science

In any other circumstance, it might make sense to define the extent of the terrain (Data Science) first, and then locate and describe the landmarks (Principles). But this data revolution we are experiencing defies a cadastral survey. Areas are continually being annexed into Data Science. For example, biometrics was traditionally statistics for agriculture in all its forms but now, in Data Science, it means the study of characteristics that can be used to identify an individual. Examples of non-intrusive measurements include height, weight, fingerprints, retina scan, voice, photograph/video (facial landmarks and facial expressions), and gait. A multivariate analysis of such data would be a complex project for a statistician, but a software engineer might appear to have no trouble with it at all. In any applied-statistics project, the statistician worries about uncertainty and quantifies it by modelling data as realisations generated from a probability space. Another approach to uncertainty quantification is to find similar data sets, and then use the variability of results between these data sets to capture the uncertainty. Both approaches allow 'error bars' to be put on estimates obtained from the original data set, although the interpretations are different. A third approach, that concentrates on giving a single answer and gives up on uncertainty quantification, could be considered as Data Engineering, although it has staked a claim in the Data Science terrain. This article presents a few (actually nine) statistical principles for data scientists that have helped me, and continue to help me, when I work on complex interdisciplinary projects.
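A minimal sketch contrasting the two routes to 'error bars' mentioned above, with a sample mean as the estimate; the data, and the use of the ordinary bootstrap as a stand-in for "similar data sets", are illustrative assumptions rather than the article's own example.

```python
# Minimal sketch: two routes to an error bar for a sample mean.  The data
# and the use of the bootstrap as the "similar data sets" device are
# illustrative assumptions, not the article's own example.
import numpy as np

rng = np.random.default_rng(42)
data = rng.normal(loc=10.0, scale=2.0, size=200)   # placeholder data set

# Route 1: model-based.  Treat the data as i.i.d. draws from a probability
# model; the standard error of the mean is s / sqrt(n).
model_se = data.std(ddof=1) / np.sqrt(len(data))

# Route 2: variability across "similar data sets", here bootstrap resamples.
boot_means = [rng.choice(data, size=len(data), replace=True).mean()
              for _ in range(2000)]
boot_se = np.std(boot_means, ddof=1)

print(f"estimate {data.mean():.3f}, model-based SE {model_se:.3f}, "
      f"resampling SE {boot_se:.3f}")
```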

Read more
Other Statistics

A flexible observed factor model with separate dynamics for the factor volatilities and their correlation matrix

Our article considers a regression model with observed factors. The observed factors have a flexible stochastic volatility structure, with separate dynamics for the volatilities and the correlation matrix. The correlation matrix of the factors is time-varying, and its evolution is described by an inverse Wishart process. The model specifies the evolution of the observed volatilities flexibly and is particularly attractive when the dimension of the observations is high. A Markov chain Monte Carlo algorithm is developed to estimate the model. It is straightforward to use this algorithm to obtain the predictive distributions of future observations and to carry out model selection. The model is illustrated and compared to other Wishart-type factor multivariate stochastic volatility models using various empirical data sets, including monthly stock returns and portfolio-weighted returns. The evidence suggests that our model has better predictive performance. The paper also allows the idiosyncratic errors to follow individual stochastic volatility processes in order to deal with more volatile data such as daily or weekly stock returns.
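A much-simplified simulation of the model structure described above: observed factors with separate log-AR(1) volatility dynamics, a factor correlation matrix (held constant here, whereas the paper lets it evolve via an inverse Wishart process), and a regression of the observations on the factors. All dimensions and parameter values below are illustrative assumptions.

```python
# Much-simplified simulation of an observed-factor model with stochastic
# factor volatilities.  The factor correlation matrix is held constant here
# (the paper lets it evolve via an inverse Wishart process); all dimensions
# and parameter values are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
T, n_factors, n_series = 500, 2, 5

R = np.array([[1.0, 0.3],            # constant factor correlation matrix
              [0.3, 1.0]])
beta = rng.normal(size=(n_series, n_factors))   # factor loadings

h = np.zeros(n_factors)              # log-volatilities with AR(1) dynamics
factors = np.zeros((T, n_factors))
returns = np.zeros((T, n_series))
for t in range(T):
    h = 0.95 * h + 0.2 * rng.normal(size=n_factors)   # separate vol dynamics
    D = np.diag(np.exp(h / 2.0))                      # factor std deviations
    cov_t = D @ R @ D                                 # time-varying factor cov
    factors[t] = rng.multivariate_normal(np.zeros(n_factors), cov_t)
    returns[t] = beta @ factors[t] + 0.1 * rng.normal(size=n_series)

print(returns.shape, factors.shape)   # simulated panel ready for estimation
```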

Read more
