Featured Researches

Other Statistics

Hyak Mortality Monitoring System: Innovative Sampling and Estimation Methods - Proof of Concept by Simulation

Traditionally, health statistics are derived from civil and/or vital registration. Civil registration in low-income countries varies from partial coverage to essentially nothing at all. Consequently, the state of the art for public health information in low-income countries consists of efforts to combine or triangulate data from different sources to produce a more complete picture across both time and space - data amalgamation. Data sources amenable to this approach include sample surveys, sample registration systems, health and demographic surveillance systems (HDSS), administrative records, census records, health facility records and others. We propose a new statistical framework for gathering health and population data - Hyak - that leverages the benefits of sampling and longitudinal, prospective surveillance to create a cheap, accurate, sustainable monitoring platform. Hyak has three fundamental components: 1) Data Amalgamation: a sampling and surveillance component that organizes two or more data collection systems to work together: a) data from HDSS sites with frequent, intense, linked, prospective follow-up and b) data from sample surveys conducted in large areas surrounding the HDSS sites using informed sampling so as to capture as many events as possible; 2) Cause of Death: verbal autopsy to characterize the distribution of deaths by cause at the population level; and 3) SES: measurement of socioeconomic status in order to characterize poverty and wealth. We conduct a simulation study of the informed sampling component of Hyak based on the Agincourt HDSS site in South Africa. Compared to traditional cluster sampling, Hyak's informed sampling captures more deaths and, when combined with an estimation model that includes spatial smoothing, produces estimates of mortality that have lower variance and small bias.
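As a rough sketch of the informed-sampling idea only (the estimation model with spatial smoothing is not reproduced here), the following Python snippet compares uniform cluster sampling with sampling clusters in proportion to predicted deaths; the number of clusters, mortality rates, and noisy predictions are all made-up assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical study area: 200 clusters with heterogeneous mortality rates
n_clusters, pop_per_cluster = 200, 500
death_rate = rng.gamma(shape=2.0, scale=0.005, size=n_clusters)  # assumed rates
deaths = rng.poisson(death_rate * pop_per_cluster)               # deaths per cluster

budget = 20  # number of clusters the field budget allows us to visit

# Traditional cluster sampling: choose clusters uniformly at random
uniform_pick = rng.choice(n_clusters, size=budget, replace=False)

# Informed sampling: favour clusters with higher *predicted* deaths
# (a noisy version of the truth stands in for model-based predictions)
predicted = deaths + rng.normal(0, 2, size=n_clusters)
weights = np.clip(predicted, 0.1, None)
informed_pick = rng.choice(n_clusters, size=budget, replace=False,
                           p=weights / weights.sum())

print("deaths captured, uniform :", deaths[uniform_pick].sum())
print("deaths captured, informed:", deaths[informed_pick].sum())
```

The abstract's claim is that this kind of targeting, combined with spatial smoothing at the estimation stage, yields mortality estimates with lower variance.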

Read more
Other Statistics

HyperTools: A Python toolbox for visualizing and manipulating high-dimensional data

Data visualizations can reveal trends and patterns that are not otherwise obvious from the raw data or summary statistics. While visualizing low-dimensional data is relatively straightforward (for example, plotting the change in a variable over time as (x,y) coordinates on a graph), it is not always obvious how to visualize high-dimensional datasets in a similarly intuitive way. Here we present HyperTools, a Python toolbox for visualizing and manipulating large, high-dimensional datasets. Our primary approach is to use dimensionality reduction techniques (Pearson, 1901; Tipping & Bishop, 1999) to embed high-dimensional datasets in a lower-dimensional space, and plot the data using a simple (yet powerful) API with many options for data manipulation [e.g. hyperalignment (Haxby et al., 2011), clustering, normalizing, etc.] and plot styling. The toolbox is designed around the notion of data trajectories and point clouds. Just as the position of an object moving through space can be visualized as a 3D trajectory, HyperTools uses dimensionality reduction algorithms to create similar 2D and 3D trajectories for time series of high-dimensional observations. The trajectories may be plotted as interactive static plots or visualized as animations. These same dimensionality reduction and alignment algorithms can also reveal structure in static datasets (e.g. collections of observations or attributes). We present several examples showcasing how using our toolbox to explore data through trajectories and low-dimensional embeddings can reveal deep insights into datasets across a wide variety of domains.
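A minimal usage sketch of the toolbox's plotting interface follows; it assumes the hypertools package is installed, and the keyword arguments shown (ndims, animate) reflect its documented options at the time of writing and may differ across versions.

```python
import numpy as np
import hypertools as hyp

# Two synthetic "high-dimensional time series": 100 time points x 20 features each
data = [np.random.randn(100, 20), np.random.randn(100, 20)]

# Static 3D trajectories: each array is reduced to three dimensions and drawn as a line
hyp.plot(data, ndims=3)

# The same observations rendered as an animated trajectory
hyp.plot(data, animate=True)
```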

Read more
Other Statistics

Hyperspectral Data Analysis in R: the hsdar Package

Hyperspectral remote sensing is a promising tool for a variety of applications including ecology, geology, analytical chemistry and medical research. This article presents the new hsdar package for R statistical software, which performs a variety of analysis steps taken during a typical hyperspectral remote sensing approach. The package introduces a new class for efficiently storing large hyperspectral datasets such as hyperspectral cubes within R. The package includes several important hyperspectral analysis tools, such as continuum removal and normalized ratio indices, and integrates two widely used radiative transfer models. In addition, the package provides methods to directly use the functionality of the caret package for machine learning tasks. Two case studies demonstrate the package's range of functionality: first, plant leaf chlorophyll content is estimated and, second, cancer in the human larynx is detected from hyperspectral data.
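As a language-agnostic illustration of one of the tools named above, a normalized ratio index contrasts reflectance at two wavelengths; the sketch below computes it with NumPy rather than through hsdar's R interface, and the band choices and reflectance values are arbitrary.

```python
import numpy as np

def normalized_ratio_index(r_a: np.ndarray, r_b: np.ndarray) -> np.ndarray:
    """NRI = (R_a - R_b) / (R_a + R_b), computed per spectrum."""
    return (r_a - r_b) / (r_a + r_b)

# Hypothetical reflectance values at two bands for five spectra
r_750 = np.array([0.48, 0.52, 0.45, 0.50, 0.47])  # near-infrared band
r_705 = np.array([0.20, 0.18, 0.25, 0.22, 0.21])  # red-edge band
print(normalized_ratio_index(r_750, r_705))
```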

Read more
Other Statistics

I can see clearly now: reinterpreting statistical significance

Null hypothesis significance testing remains popular despite decades of concern about misuse and misinterpretation. We believe that much of the problem is due to language: significance testing has little to do with other meanings of the word "significance". Despite the limitations of null-hypothesis tests, we argue here that they remain useful in many contexts as a guide to whether a certain effect can be seen clearly in that context (e.g. whether we can clearly see that a correlation or between-group difference is positive or negative). We therefore suggest that researchers describe the conclusions of null-hypothesis tests in terms of statistical "clarity" rather than statistical "significance". This simple semantic change could substantially enhance clarity in statistical communication.
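A small worked example of the proposed wording (the data, test, and 0.05 threshold are hypothetical choices, not prescriptions from the paper):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
group_a = rng.normal(10.0, 2.0, size=40)  # hypothetical measurements
group_b = rng.normal(11.0, 2.0, size=40)

t_stat, p_value = stats.ttest_ind(group_a, group_b)
diff = group_b.mean() - group_a.mean()

if p_value < 0.05:
    print(f"The between-group difference ({diff:+.2f}) is statistically clear "
          f"(p = {p_value:.3f}): its sign can be seen clearly in these data.")
else:
    print(f"The sign of the between-group difference ({diff:+.2f}) is "
          f"statistically unclear (p = {p_value:.3f}).")
```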

Read more
Other Statistics

I hear, I forget. I do, I understand: a modified Moore-method mathematical statistics course

Moore introduced a method for graduate mathematics instruction that consisted primarily of individual student work on challenging proofs (Jones, 1977). Cohen (1982) described an adaptation with less explicit competition suitable for undergraduate students at a liberal arts college. This paper details an adaptation of this modified Moore-method to teach mathematical statistics, and describes ways that such an approach helps engage students and foster the teaching of statistics. Groups of students worked through a set of three difficult problems (some theoretical, some applied) every two weeks. Class time was devoted to coaching sessions with the instructor, group meeting time, and class presentations. R was used to estimate solutions empirically where analytic results were intractable, as well as to provide an environment to undertake simulation studies with the aim of deepening understanding and complementing analytic solutions. Each group prepared comprehensive solutions to complement its oral presentations. Development of parallel techniques for empirical and analytic problem solving was an explicit goal of the course, which also attempted to communicate ways that statistics can be used to tackle interesting problems. The group problem solving component and use of technology allowed students to attempt much more challenging questions than they could otherwise solve.
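To illustrate the kind of empirical-plus-analytic pairing described above (the course used R; this Python sketch of the classic birthday problem is only a stand-in, not one of the course's problems):

```python
import numpy as np

rng = np.random.default_rng(2)
n_people, n_sims = 25, 100_000

# Empirical estimate: probability that at least two of 25 people share a birthday
draws = rng.integers(0, 365, size=(n_sims, n_people))
sorted_draws = np.sort(draws, axis=1)
empirical = (sorted_draws[:, 1:] == sorted_draws[:, :-1]).any(axis=1).mean()

# Analytic solution for comparison
analytic = 1 - np.prod((365 - np.arange(n_people)) / 365)

print(f"simulation: {empirical:.4f}   analytic: {analytic:.4f}")
```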

Read more
Other Statistics

Identifiability and testability in GRT with Individual Differences

Silbert and Thomas (2013) showed that failures of decisional separability are not, in general, identifiable in fully parameterized 2×2 Gaussian GRT models. A recent extension of 2×2 GRT models (GRTwIND) was developed to solve this problem and a conceptually similar problem with the simultaneous identifiability of means and marginal variances in GRT models. Central to the ability of GRTwIND to solve these problems is the assumption of universal perception, which consists of shared perceptual distributions modified by attentional and global scaling parameters (Soto et al., 2015). If universal perception is valid, GRTwIND solves both issues. In this paper, we show that GRTwIND with universal perception and subject-specific failures of decisional separability is mathematically, and thereby empirically, equivalent to a model with decisional separability and failure of universal perception. We then provide a formal proof of the fact that means and marginal variances are not, in general, simultaneously identifiable in 2×2 GRT models, including GRTwIND. These results can be taken to delineate precisely what the assumption of universal perception must consist of. Based on these results and related recent mathematical developments in the GRT framework, we propose that, in addition to requiring a fixed subset of parameters to determine the location and scale of any given GRT model, some subset of parameters must be set in GRT models to fix the orthogonality of the modeled perceptual dimensions, a central conceptual underpinning of the GRT framework. We conclude with a discussion of perceptual primacy and its relationship to universal perception.
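As an informal, one-dimensional caricature of the scaling issue (not the paper's proof for the full 2×2 Gaussian model), note that the marginal response probability along a single dimension depends only on a standardized distance between the perceptual mean and the decision criterion:

\[
\Pr(\text{respond ``high''} \mid \text{stimulus}) \;=\; \Phi\!\left(\frac{\mu - c}{\sigma}\right),
\]

which is unchanged under the rescaling \((\mu, \sigma, c) \mapsto (a\mu, a\sigma, ac)\) for any \(a > 0\); the mean and marginal variance can therefore trade off against the criterion unless location and scale are fixed by convention.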

Read more
Other Statistics

Implementation of the Bin Hierarchy Method for restoring a smooth function from a sampled histogram

We present BHM, a tool for restoring a smooth function from a sampled histogram using the bin hierarchy method. The theoretical background of the method is presented in [arXiv:1707.07625]. The code automatically generates a smooth polynomial spline with the minimal acceptable number of knots from the input data. It works universally for any sufficiently regularly shaped distribution and any level of data quality, requiring almost no external parameter specification. It is particularly useful for large-scale numerical data analysis. This paper explains the details of the implementation and the use of the program.
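For orientation only, the snippet below fits a generic SciPy smoothing spline to histogram bin centres. It is not the bin hierarchy method (BHM selects the minimal acceptable number of knots automatically, whereas the smoothing factor here is picked by hand), and the histogram is synthetic.

```python
import numpy as np
from scipy.interpolate import UnivariateSpline

rng = np.random.default_rng(3)

# Synthetic sampled histogram of a smooth underlying distribution
samples = rng.normal(0.0, 1.0, size=10_000)
counts, edges = np.histogram(samples, bins=40, density=True)
centres = 0.5 * (edges[:-1] + edges[1:])

# Generic cubic smoothing spline through the bin densities;
# the smoothing factor s is a hand-tuned assumption
spline = UnivariateSpline(centres, counts, k=3, s=len(centres) * 1e-4)

grid = np.linspace(edges[0], edges[-1], 200)
smooth_density = spline(grid)  # smooth estimate evaluated on a fine grid
```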

Read more
Other Statistics

Improving non-deterministic uncertainty modelling in Industry 4.0 scheduling

The latest industrial revolution has helped industries achieve very high rates of productivity and efficiency. It has introduced data aggregation and cyber-physical systems to optimize planning and scheduling. However, uncertainty in the environment and the imprecise nature of human operators are not accurately accounted for in the decision-making process. This leads to delays in consignments and imprecise budget estimations. This widespread practice in industrial models is flawed and requires rectification. Other articles have approached this problem through stochastic or fuzzy set methods. This paper presents a comprehensive method to logically and realistically quantify non-deterministic uncertainty through probabilistic uncertainty modelling. The method is applicable to virtually all industrial data sets, as the model is self-adjusting and uses epsilon-contamination to cater to limited or incomplete data sets. The results are numerically validated on an industrial data set from Flanders, Belgium. The data-driven results achieved through this robust scheduling method illustrate the improvement in performance.
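As a generic sketch of how epsilon-contamination can bound a quantity when data are limited (the task data, epsilon, and duration range below are assumptions, and this is not the paper's full scheduling model):

```python
import numpy as np

def contaminated_mean_bounds(samples, eps, lo, hi):
    """Bounds on E[X] under the mixture (1 - eps) * P_empirical + eps * Q,
    where Q is any distribution supported on [lo, hi]."""
    nominal = np.mean(samples)
    return ((1 - eps) * nominal + eps * lo,
            (1 - eps) * nominal + eps * hi)

# Hypothetical processing times (minutes) recorded for one task
times = np.array([42.0, 38.5, 45.2, 40.1, 43.7, 39.9])
low, high = contaminated_mean_bounds(times, eps=0.1, lo=30.0, hi=90.0)
print(f"expected task duration between {low:.1f} and {high:.1f} minutes")
```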

Read more
Other Statistics

Impugning Randomness, Convincingly

John organized a state lottery and his wife won the main prize. You may feel that the event of her winning wasn't particularly random, but how would you argue that in a fair court of law? Traditional probability theory does not even have the notion of random events. Algorithmic information theory does, but it is not applicable to real-world scenarios like the lottery one. We attempt to rectify that.

Read more
Other Statistics

In praise of the referee

There has been a lively debate in many fields, including statistics and related applied fields such as psychology and biomedical research, on possible reforms of the scholarly publishing system. Currently, referees contribute a great deal to improving scientific papers, both directly through constructive criticism and indirectly through the threat of rejection. We discuss ways in which new approaches to journal publication could continue to make use of the valuable efforts of peer reviewers.

Read more
