Featured Research

Methodology

Partially observed Markov processes with spatial structure via the R package spatPomp

We address inference for a partially observed nonlinear non-Gaussian latent stochastic system composed of interacting units. Each unit has a state, which may be discrete or continuous, scalar or vector-valued. In biological applications, the state may represent a structured population or the abundances of a collection of species at a single location. Units can have spatial locations, allowing the description of spatially distributed interacting populations arising in ecology, epidemiology and elsewhere. We consider models where the collection of states is a latent Markov process, and a time series of noisy or incomplete measurements is made on each unit. A model of this form is called a spatiotemporal partially observed Markov process (SpatPOMP). The R package spatPomp provides an environment for implementing SpatPOMP models, analyzing data, and developing new inference approaches. We describe the spatPomp implementations of some methods with scaling properties suited to SpatPOMP models. We demonstrate the package on a simple Gaussian system and on a nontrivial epidemiological model for measles transmission within and between cities. We show how to construct user-specified SpatPOMP models within spatPomp.
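
For readers unfamiliar with the setup, a minimal sketch of the model class in generic notation (ours, not reproduced from the paper or the package) is:

```latex
% SpatPOMP structure in schematic form (generic notation).
% X_{u,n} is the latent state of unit u at observation time t_n, and the
% full latent state stacks all units: X_n = (X_{1,n}, ..., X_{U,n}).
\begin{align*}
  X_n \mid X_{n-1} &\sim f_{X_n \mid X_{n-1}}(\,\cdot \mid X_{n-1};\theta)
      && \text{(latent Markov process over all interacting units)}\\
  Y_{u,n} \mid X_{u,n} &\sim f_{Y_{u,n} \mid X_{u,n}}(\,\cdot \mid X_{u,n};\theta)
      && \text{(noisy or incomplete measurement on unit } u\text{)}
\end{align*}
```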

Methodology

Penalized Maximum Likelihood Estimator for Mixture of von Mises-Fisher Distributions

The von Mises-Fisher distribution is one of the most widely used probability distributions for describing directional data. Finite mixtures of von Mises-Fisher distributions have found numerous applications. However, the likelihood function of a finite mixture of von Mises-Fisher distributions is unbounded, and consequently the maximum likelihood estimator is not well defined. To address this likelihood degeneracy, we consider a penalized maximum likelihood approach in which a penalty function is incorporated. We prove strong consistency of the resulting estimator. An Expectation-Maximization algorithm for the penalized likelihood function is developed, and simulation studies are performed to examine its performance.
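
To fix notation, the von Mises-Fisher density and a generic penalized mixture log-likelihood can be written as below; the specific penalty function used in the paper is not reproduced here, so p_n should be read as a placeholder.

```latex
% von Mises-Fisher density on the unit sphere S^{p-1} (I_\nu is the modified
% Bessel function of the first kind) and a generic penalized mixture
% log-likelihood.  The unpenalized likelihood diverges as some kappa_k grows
% without bound; a penalty p_n increasing in the concentrations restores
% boundedness.
\begin{align*}
  f(x;\mu,\kappa) &= C_p(\kappa)\,\exp\!\big(\kappa\,\mu^{\top}x\big),
  \qquad
  C_p(\kappa) = \frac{\kappa^{p/2-1}}{(2\pi)^{p/2}\, I_{p/2-1}(\kappa)},\\
  \tilde{\ell}_n(\theta) &=
  \sum_{i=1}^{n}\log\!\Big(\sum_{k=1}^{K}\pi_k\, f(x_i;\mu_k,\kappa_k)\Big)
  \;-\; p_n(\kappa_1,\dots,\kappa_K).
\end{align*}
```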

Methodology

Performance and Application of Estimators for the Value of an Optimal Dynamic Treatment Rule

Given an (optimal) dynamic treatment rule, it may be of interest to evaluate that rule -- that is, to ask the causal question: what is the expected outcome had every subject received treatment according to that rule? In this paper, we study the performance of estimators that approximate the true value of: 1) an a priori known dynamic treatment rule; 2) the true, unknown optimal dynamic treatment rule (ODTR); and 3) an estimated ODTR, a so-called "data-adaptive parameter," whose true value depends on the sample. Using simulations of point-treatment data, we specifically investigate: 1) the impact of increasingly data-adaptive estimation of nuisance parameters and/or of the ODTR on performance; 2) the potential for improved efficiency and bias reduction through the use of semiparametric efficient estimators; and 3) the importance of sample splitting, as in CV-TMLE, for accurate inference. In the simulations considered, there was very little cost and many benefits to using the cross-validated targeted maximum likelihood estimator (CV-TMLE) to estimate the value of the true and estimated ODTR; importantly, and in contrast to non-cross-validated estimators, the performance of CV-TMLE was maintained even when highly data-adaptive algorithms were used to estimate both the nuisance parameters and the ODTR. In addition, we apply these estimators of the value of a rule to the "Interventions" Study, an ongoing randomized controlled trial, to identify whether assigning cognitive behavioral therapy (CBT) to criminal justice-involved adults with mental illness using an ODTR significantly reduces the probability of recidivism, compared to assigning CBT in a non-individualized way.
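
For context, the target parameter is the value of a rule d, shown below in generic notation under the usual identification assumptions for point-treatment data (this is background, not the paper's specific estimator).

```latex
% Value of a treatment rule d for point-treatment data with covariates W,
% treatment A, outcome Y, and treatment mechanism g(a | w) = P(A = a | W = w).
% Under randomization (or no unmeasured confounding) and positivity:
\begin{align*}
  V(d) \;=\; \mathbb{E}\big[Y_{d(W)}\big]
       \;=\; \mathbb{E}\Big[\,\mathbb{E}\big[Y \mid A = d(W),\, W\big]\Big]
       \;=\; \mathbb{E}\!\left[\frac{\mathbf{1}\{A = d(W)\}}{g(A \mid W)}\,Y\right],
\end{align*}
% and the optimal rule is d^{*} = \arg\max_{d} V(d).
```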

Methodology

Perturbations and Causality in Gaussian Latent Variable Models

Causal inference is a challenging problem with observational data alone. The task becomes easier with access to data from perturbations of the underlying system, even when those perturbations are non-randomized: this is the setting we consider, which also encompasses latent confounding variables. To identify causal relations among a collection of covariates and a response variable, existing procedures rely on at least one of the following assumptions: i) the response variable remains unperturbed, ii) the latent variables remain unperturbed, or iii) the latent effects are dense. In this paper, we examine a perturbation model for interventional data, which can be viewed as a mixed-effects linear structural causal model over a collection of Gaussian variables, that does not satisfy any of these conditions. We propose a maximum-likelihood estimator -- dubbed DirectLikelihood -- that exploits system-wide invariances to uniquely identify the population causal structure from unspecific perturbation data, and our results carry over to linear structural causal models without requiring Gaussianity. We illustrate the utility of our framework on synthetic data as well as real data involving California reservoirs and protein expressions.
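
A schematic version of this kind of perturbation model is sketched below in generic notation; the paper's exact specification may differ, so this should be read only as an illustration of a linear structural causal model with latent variables and environment-specific perturbations.

```latex
% Schematic linear structural causal model with latent confounders under
% perturbation environments e (generic notation, for illustration only).
% Z^{e} = (X^{e}, Y^{e}) stacks covariates and response, H^{e} are latent
% variables, and b^{e} collects environment-specific perturbation effects
% that may act on X, Y, and H.
\begin{equation*}
  Z^{e} \;=\; B\,Z^{e} \;+\; \Gamma\,H^{e} \;+\; b^{e} \;+\; \varepsilon^{e},
  \qquad \varepsilon^{e} \sim \mathcal{N}(0,\Omega).
\end{equation*}
% The structural parameters (B, \Gamma, \Omega) are shared across
% environments while the perturbations b^{e} vary; invariances of this kind
% are what a likelihood-based estimator can exploit.
```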

Methodology

Posterior Averaging Information Criterion

We propose a new model selection method, the posterior averaging information criterion, for Bayesian model assessment from a predictive perspective. The theoretical foundation is built on the Kullback-Leibler divergence, which quantifies the similarity between a candidate model and the underlying true model. From a Bayesian perspective, our method evaluates candidate models over the entire posterior distribution in terms of predicting a future independent observation. Without assuming that the true distribution is contained in the candidate models, the new criterion is developed by correcting the asymptotic bias of the posterior mean of the log-likelihood as an estimate of the expected log-likelihood. It can be applied even to Bayesian models with degenerate non-informative priors. Simulations in both normal and binomial settings demonstrate decent small-sample performance.
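
As background, the two quantities the abstract refers to can be written generically as below; the paper's specific bias-correction term is not reproduced here, so b_n is only a placeholder.

```latex
% Kullback-Leibler discrepancy between the true density g and a candidate
% f_\theta, and the generic form of a posterior-averaged, bias-corrected
% criterion.  y is the observed sample, y_new a future independent
% observation, and \pi(\theta | y) the posterior.
\begin{align*}
  \mathrm{KL}\big(g \,\|\, f_\theta\big)
    &= \mathbb{E}_{g}\!\big[\log g(y_{\mathrm{new}})\big]
     - \mathbb{E}_{g}\!\big[\log f_\theta(y_{\mathrm{new}})\big],\\
  \widehat{\mathrm{PAIC}}
    &= \int \log f_\theta(y)\, \pi(\theta \mid y)\, d\theta \;-\; \widehat{b}_n,
\end{align*}
% where \widehat{b}_n estimates the asymptotic bias of the posterior mean of
% the log-likelihood relative to the expected log-likelihood.
```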

Methodology

Principles for Covariate Adjustment in Analyzing Randomized Clinical Trials

In randomized clinical trials, adjustment for baseline covariates at both the design and analysis stages is strongly encouraged by regulatory agencies. A recent trend is to use a model-assisted approach for covariate adjustment, which gains credibility and efficiency while producing asymptotically valid inference even when the model is incorrect. In this article, we present three principles for model-assisted inference in trials with simple or covariate-adaptive randomization: (1) the guaranteed efficiency-gain principle: a model-assisted method should often gain, and never hurt, efficiency; (2) the validity and universality principle: a valid procedure should be universally applicable to all commonly used randomization schemes; and (3) the robust standard error principle: variance estimation should be heteroscedasticity-robust. To fulfill these principles, we recommend a working model that includes all covariates utilized in randomization and all treatment-by-covariate interaction terms. Our conclusions are based on an asymptotic theory whose generality has not appeared in the literature: most existing results concern linear contrasts of outcomes rather than the joint distribution, and most existing inference results under covariate-adaptive randomization are special cases of our theory. The theory also reveals distinct results for two-arm and multi-arm trials.
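
A minimal sketch of the recommended recipe, in the spirit of the abstract, is given below: a working regression with centered covariates, all treatment-by-covariate interactions, and heteroscedasticity-robust standard errors. Column names and the helper function are hypothetical, and this is an illustration of the general approach rather than the authors' exact estimator.

```python
# Sketch: linear working model with treatment-by-covariate interactions and
# heteroscedasticity-robust (HC3) standard errors.  Column names are
# hypothetical; this illustrates the recipe, not the paper's exact method.
import pandas as pd
import statsmodels.formula.api as smf


def adjusted_fit(df: pd.DataFrame, outcome: str, arm: str, covariates: list[str]):
    # Center covariates so the arm coefficients compare adjusted arm means
    # rather than predictions at covariate value zero.
    df = df.copy()
    for c in covariates:
        df[c] = df[c] - df[c].mean()
    rhs = " + ".join(covariates)
    # Main effects of arm and covariates plus all arm-by-covariate interactions.
    formula = f"{outcome} ~ C({arm}) * ({rhs})"
    # Heteroscedasticity-robust variance estimation (robust standard error principle).
    return smf.ols(formula, data=df).fit(cov_type="HC3")


# Usage on a hypothetical trial data frame `trial` with outcome y,
# treatment indicator treat, and covariates x1, x2:
# res = adjusted_fit(trial, outcome="y", arm="treat", covariates=["x1", "x2"])
# print(res.summary())
```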

Methodology

Private Tabular Survey Data Products through Synthetic Microdata Generation

We propose three synthetic microdata approaches to generate private tabular survey data products for public release. We adapt a disclosure risk-based, weighted pseudo posterior mechanism to survey data, with a focus on producing tabular products under a formal privacy guarantee. Two of our approaches synthesize the observed sample distribution of the outcome and the survey weights jointly, such that both quantities together possess a probabilistic differential privacy guarantee. The privacy-protected outcome and sampling weights are used to construct tabular cell estimates and associated standard errors that correct for survey sampling bias. The third approach synthesizes the population distribution from the observed sample under a pseudo posterior construction that treats the survey sampling weights as fixed, correcting the sample likelihood so that it approximates the population likelihood. Each by-record sampling weight in the pseudo posterior is, in turn, multiplied by the associated privacy risk-based weight for that record to create a composite pseudo posterior mechanism that both corrects for survey bias and provides a privacy guarantee for the observed sample. Through a simulation study and a real data application to the Survey of Doctorate Recipients public use file, we demonstrate that our three microdata synthesis approaches to constructing tabular products provide superior utility preservation compared to the additive-noise approach of the Laplace Mechanism. Moreover, all of our approaches allow the release of microdata to the public, enabling additional analyses at no extra privacy cost.
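
The composite weighting idea in the third approach can be written schematically as follows (generic notation; the paper's exact construction may include additional details).

```latex
% Schematic composite pseudo posterior.  w_i is the survey sampling weight of
% record i and alpha_i in [0,1] its disclosure-risk-based weight; the product
% alpha_i * w_i simultaneously downweights high-risk records (privacy) and
% corrects the sample likelihood toward the population (survey design).
\begin{equation*}
  \pi\big(\theta \mid y, w, \alpha\big) \;\propto\;
  \Bigg[\prod_{i=1}^{n} p\big(y_i \mid \theta\big)^{\alpha_i\, w_i}\Bigg]\,
  \pi(\theta).
\end{equation*}
```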

Methodology

Probabilistic Forecasting for Daily Electricity Loads and Quantiles for Curve-to-Curve Regression

Probabilistic forecasting of electricity load curves is of fundamental importance for effective scheduling and decision making in increasingly volatile and competitive energy markets. We propose a novel approach to constructing probabilistic predictors for curves (PPC), which leads to a natural and new definition of quantiles in the context of curve-to-curve linear regression. There are three types of PPC: a predictive set, a predictive band and a predictive quantile, all defined at a pre-specified nominal probability level. In a simulation study, the PPC achieve promising coverage probabilities under a variety of data-generating mechanisms. When applied to one-day-ahead forecasting of the French daily electricity load curves, the PPC outperform several state-of-the-art predictive methods in terms of forecasting accuracy, coverage rate and average length of the predictive bands. The predictive quantile curves provide insightful information that is highly relevant to hedging risks in electricity supply management.
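
The common requirement shared by the three types of PPC is a nominal coverage level; in generic notation (the paper's specific constructions of the set, band and quantile are not reproduced here), this reads:

```latex
% Generic coverage requirement for a probabilistic predictor of a curve at
% nominal level 1 - alpha.  Y_{t+1}(.) is the next day's load curve and
% \widehat{C}_{t,\alpha} a predictive set or band of curves built from data
% observed up to day t.
\begin{equation*}
  \mathbb{P}\big(Y_{t+1} \in \widehat{C}_{t,\alpha}\big) \;\ge\; 1 - \alpha .
\end{equation*}
```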

Methodology

Probabilistic Learning on Manifolds (PLoM) with Partition

Probabilistic learning on manifolds (PLoM), introduced in 2016, has solved difficult supervised problems in the "small data" limit, where the number N of points in the training set is small. Many extensions have since been proposed, making it possible to deal with increasingly complex cases. However, a performance limit has been observed and explained for applications in which N is very small (for example, 50) and the dimension of the diffusion-map basis is close to N. For these cases, we propose a novel extension based on the introduction of a partition into independent random vectors. We take advantage of this development to present further improvements of PLoM, including a simplified algorithm for constructing the diffusion-map basis and a new mathematical result quantifying the concentration of the probability measure in terms of a probability upper bound. The efficiency of this novel extension is analyzed through two applications.
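
In schematic form (generic notation, for illustration only), the partition splits the underlying random vector into groups that are treated as mutually independent, so that the learning problem decomposes into lower-dimensional blocks.

```latex
% Partition into independent random vectors: the random vector H underlying
% the training set is split into m groups whose joint density factorizes.
\begin{equation*}
  H = \big(H^{1}, \dots, H^{m}\big), \qquad
  p_{H}(\eta) \;=\; \prod_{j=1}^{m} p_{H^{j}}\big(\eta^{j}\big).
\end{equation*}
```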

Methodology

Projected Statistical Methods for Distributional Data on the Real Line with the Wasserstein Metric

We present a novel class of projected methods to perform statistical analysis on a data set of probability distributions on the real line, equipped with the 2-Wasserstein metric. We focus in particular on Principal Component Analysis (PCA) and regression. To define these models, we exploit a representation of the Wasserstein space closely related to its weak Riemannian structure: the data are mapped to a suitable linear space, and a metric projection operator constrains the results to lie in the Wasserstein space. By carefully choosing the tangent point, we are able to derive fast empirical methods that exploit a constrained B-spline approximation. As a byproduct of our approach, we also derive faster routines for previous work on PCA for distributions. By means of simulation studies, we compare our approaches to previously proposed methods, showing that our projected PCA has similar performance at a fraction of the computational cost and that the projected regression is extremely flexible even under misspecification. Several theoretical properties of the models are investigated and asymptotic consistency is proven. Two real-world applications, to COVID-19 mortality in the US and to wind speed forecasting, are discussed.
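
The representation such projected methods build on uses standard facts for distributions on the real line, summarized below in generic notation.

```latex
% For distributions on the real line, the 2-Wasserstein distance equals the
% L^2 distance between quantile functions, and the log map at a tangent point
% \bar{\mu} sends a distribution to a tangent-space element.  F_\mu is the cdf
% of \mu and F_\mu^{-1} its quantile function.
\begin{align*}
  W_2(\mu,\nu) &= \big\lVert F_\mu^{-1} - F_\nu^{-1} \big\rVert_{L^2(0,1)},\\
  \log_{\bar{\mu}}(\mu) &= F_\mu^{-1}\!\circ F_{\bar{\mu}} \;-\; \mathrm{id},
\end{align*}
% so data can be mapped to a linear (tangent) space where PCA and regression
% are carried out, with a metric projection returning results to valid
% quantile functions (non-decreasing maps).
```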

