Featured Research

Methodology

STR: A Seasonal-Trend Decomposition Procedure Based on Regression

We propose two new general methods for decomposing seasonal time series data: STR (a Seasonal-Trend decomposition procedure based on Regression) and Robust STR. In some ways, STR is similar to Ridge Regression, and Robust STR is related to the LASSO. These new methods are more general than any existing alternative time series decomposition method; they allow for multiple seasonal and cyclic components, as well as multiple linear covariates with constant, flexible, seasonal, and cyclic influences. The seasonal patterns (for both seasonal components and seasonal covariates) can be fractional and flexible over time; moreover, they can either be strictly periodic or have a more complex topology. We also provide confidence intervals for the estimated components and discuss how STR can be used for forecasting.
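
As a rough illustration of the regression view of decomposition, the sketch below estimates a trend and a single seasonal component jointly with a ridge penalty, using a polynomial trend basis and Fourier seasonal terms. This is a minimal toy in Python, not the authors' STR implementation; the data, bases, and penalty are invented for illustration.

# Toy regression-based seasonal-trend decomposition, loosely in the spirit
# of STR: trend and one seasonal component are represented by basis
# functions and estimated jointly with a ridge penalty.
import numpy as np

rng = np.random.default_rng(0)
n, period = 240, 12                                  # 20 years of monthly data
t = np.arange(n)
y = 0.05 * t + 2 * np.sin(2 * np.pi * t / period) + rng.normal(0, 0.5, n)

# Trend basis: low-order polynomial; seasonal basis: Fourier harmonics.
trend_basis = np.vander(t / n, 4, increasing=True)                   # n x 4
k = np.arange(1, 4)[None, :]                                          # 3 harmonics
seas_basis = np.hstack([np.sin(2 * np.pi * k * t[:, None] / period),
                        np.cos(2 * np.pi * k * t[:, None] / period)])

X = np.hstack([trend_basis, seas_basis])
lam = 1.0                                             # ridge penalty
beta = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

trend = trend_basis @ beta[:4]
seasonal = seas_basis @ beta[4:]
remainder = y - trend - seasonal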

Read more
Methodology

Scalable Multiple Changepoint Detection for Functional Data Sequences

We propose the Multiple Changepoint Isolation (MCI) method for detecting multiple changes in the mean and covariance of a functional process. We first introduce a pair of projections to represent the high and low frequency features of the data. We then apply total variation denoising and introduce a new regionalization procedure to split the projections into multiple regions. Denoising and regionalizing act to isolate each changepoint into its own region, so that the classical univariate CUSUM statistic can be applied region-wise to find all changepoints. Simulations show that our method accurately detects the number and locations of changepoints under many different scenarios. These include light and heavy tailed data, data with symmetric and skewed distributions, sparsely and densely sampled changepoints, and both mean and covariance changes. We show that our method outperforms a recent multiple functional changepoint detector and several univariate changepoint detectors applied to our proposed projections. We also show that the MCI is more robust than existing approaches, and scales linearly with sample size. Finally, we demonstrate our method on a large time series of water vapor mixing ratio profiles from atmospheric emitted radiance interferometer measurements.
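
The building block that MCI applies region-wise is the classical univariate CUSUM statistic for a single mean change. The sketch below locates one changepoint in a simulated series; the projections, total variation denoising, and regionalization steps of MCI are omitted, and the data are synthetic.

# Classical univariate CUSUM statistic for a single change in mean,
# applied to a toy series with a shift halfway through.
import numpy as np

def cusum_changepoint(x):
    """Return the index maximizing the standardized CUSUM statistic."""
    n = len(x)
    s = np.cumsum(x - x.mean())[:-1]                  # partial sums, k = 1..n-1
    k = np.arange(1, n)
    scale = np.sqrt(k * (n - k) / n)
    stat = np.abs(s) / scale / x.std(ddof=1)          # crude global scale estimate
    return int(np.argmax(stat)) + 1, stat.max()

rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(0, 1, 100), rng.normal(1.5, 1, 100)])
khat, stat = cusum_changepoint(x)
print(khat, round(stat, 2))                            # changepoint near 100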

Read more
Methodology

Selection of Regression Models under Linear Restrictions for Fixed and Random Designs

Many important modeling tasks in linear regression, including variable selection (in which slopes of some predictors are set equal to zero) and simplified models based on sums or differences of predictors (in which slopes of those predictors are set equal to each other, or the negative of each other, respectively), can be viewed as being based on imposing linear restrictions on regression parameters. In this paper, we discuss how such models can be compared using information criteria designed to estimate predictive measures like squared error and Kullback-Leibler (KL) discrepancy, in the presence of either deterministic predictors (fixed-X) or random predictors (random-X). We extend the justifications for existing fixed-X criteria Cp, FPE and AICc, and random-X criteria Sp and RCp, to general linear restrictions. We further propose and justify a KL-based criterion, RAICc, under random-X for variable selection and general linear restrictions. We show in simulations that the use of the KL-based criteria AICc and RAICc results in better predictive performance and sparser solutions than the use of squared error-based criteria, including cross-validation.
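
To make the comparison concrete, the sketch below scores an unrestricted model against a model with the linear restriction that two slopes are equal (the predictors entered as a sum), using one common fixed-X form of AICc (Hurvich and Tsai, 1989). The paper's RAICc and the general restriction machinery are not reproduced here, and the simulated data are invented.

# Compare an unrestricted linear model with a restricted one (equal slopes,
# encoded by entering the sum of the two predictors) via a fixed-X AICc.
import numpy as np

def aicc(y, X):
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ beta) ** 2)
    return n * np.log(rss / n) + n * (n + p) / (n - p - 2)

rng = np.random.default_rng(2)
n = 100
x1, x2 = rng.normal(size=(2, n))
y = 1.0 + 2.0 * x1 + 2.0 * x2 + rng.normal(size=n)     # true slopes are equal

X_full = np.column_stack([np.ones(n), x1, x2])          # unrestricted model
X_sum = np.column_stack([np.ones(n), x1 + x2])          # restriction b1 = b2
print(aicc(y, X_full), aicc(y, X_sum))                  # restricted model typically wins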

Read more
Methodology

Selection of Summary Statistics for Network Model Choice with Approximate Bayesian Computation

Approximate Bayesian Computation (ABC) now serves as one of the major strategies to perform model choice and parameter inference on models with intractable likelihoods. An essential component of ABC involves comparing a large amount of simulated data with the observed data through summary statistics. To avoid the curse of dimensionality, summary statistic selection is of prime importance, and becomes even more critical when applying ABC to mechanistic network models. Indeed, while many summary statistics can be used to encode network structures, their computational complexity can be highly variable. For large networks, computation of summary statistics can quickly create a bottleneck, making the use of ABC difficult. To reduce this computational burden and make the analysis of mechanistic network models more practical, we investigated two questions in a model choice framework. First, we studied the utility of cost-based filter selection methods to account for different summary costs during the selection process. Second, we performed selection using networks generated with a smaller number of nodes to reduce the time required for the selection step. Our findings show that computationally inexpensive summary statistics can be efficiently selected with minimal impact on classification accuracy. Furthermore, we found that networks with a smaller number of nodes can only be employed to eliminate a moderate number of summaries. While this latter finding is network specific, the former is general and can be adapted to any ABC application.
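
A stripped-down version of ABC model choice for networks is sketched below: two candidate mechanistic models are compared with rejection ABC using only computationally cheap summary statistics (mean degree, degree spread, average clustering). The models, summaries, and tolerance are illustrative assumptions; the cost-aware filter selection studied in the paper is not implemented.

# Toy rejection-ABC model choice between two network models using cheap
# summary statistics only.
import numpy as np
import networkx as nx

rng = np.random.default_rng(3)

def summaries(g):
    deg = np.array([d for _, d in g.degree()])
    return np.array([deg.mean(), deg.std(), nx.average_clustering(g)])

def simulate(model, n=200):
    if model == 0:                                      # Erdos-Renyi
        return nx.gnp_random_graph(n, 0.03, seed=int(rng.integers(1_000_000)))
    return nx.barabasi_albert_graph(n, 3, seed=int(rng.integers(1_000_000)))

observed = summaries(simulate(1))                       # pretend this is the data

draws = []
for _ in range(200):
    m = int(rng.integers(2))
    dist = np.linalg.norm((summaries(simulate(m)) - observed) / observed)
    draws.append((dist, m))

accepted = [m for dist, m in sorted(draws)[:20]]        # keep the closest 10%
print("posterior P(model = BA) approx:", np.mean(accepted))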

Read more
Methodology

Semi-parametric estimation of biomarker age trends with endogenous medication use in longitudinal data

In cohort studies, non-random medication use can pose barriers to estimation of the natural history trend in a mean biomarker value (namely, the association between a predictor of interest and a biomarker outcome that would be observed in the absence of biomarker-specific treatment). Common causes of treatment and outcomes are often unmeasured, obscuring our ability to easily account for medication use with commonly invoked assumptions such as ignorability. Further, absent some variable satisfying the exclusion restriction, use of instrumental variable approaches may be difficult to justify. Heckman's hybrid model with structural shift (sometimes referred to less specifically as the treatment effects model) can be used to correct endogeneity bias via a homogeneity assumption (i.e., that average treatment effects do not vary across covariates) and parametric specification of a joint model for the outcome and treatment. In recent work, we relaxed the homogeneity assumption by allowing observed covariates to serve as treatment effect modifiers. While this method has been shown to be reasonably robust in settings of cross-sectional data, application of this methodology to settings of longitudinal data remains unexplored. We demonstrate how the assumptions of the treatment effects model can be extended to accommodate clustered data arising from longitudinal studies. Our proposed approach is semi-parametric in nature in that valid inference can be obtained without the need to specify the longitudinal correlation structure. As an illustrative example, we use data from the Multi-Ethnic Study of Atherosclerosis to evaluate trends in low-density lipoprotein by age and gender. We confirm that our generalization of the treatment effects model can serve as a useful tool to uncover natural history trends in longitudinal data that are obscured by endogenous treatment.
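
For orientation, the sketch below shows a textbook two-step, cross-sectional version of the treatment effects model: a probit for the endogenous treatment followed by an outcome regression that includes the generalized residual as a control function. It only conveys the basic endogeneity correction; the paper's semi-parametric longitudinal extension with covariate-dependent treatment effects is not reproduced, and the simulated data are invented.

# Two-step control-function sketch of Heckman's treatment effects model
# with correlated errors between the treatment and outcome equations.
import numpy as np
from scipy.stats import norm
import statsmodels.api as sm

rng = np.random.default_rng(4)
n = 2000
x = rng.normal(size=n)
u, e = rng.multivariate_normal([0, 0], [[1, 0.6], [0.6, 1]], size=n).T
t = (0.5 * x + u > 0).astype(float)                     # endogenous treatment
y = 1.0 + 0.8 * x + 1.5 * t + e                         # true treatment effect 1.5

# Step 1: probit for treatment, then the generalized residual.
Z = sm.add_constant(x)
probit = sm.Probit(t, Z).fit(disp=0)
xb = Z @ probit.params
gres = np.where(t == 1, norm.pdf(xb) / norm.cdf(xb),
                -norm.pdf(xb) / (1 - norm.cdf(xb)))

# Step 2: outcome regression with the control function included.
X = sm.add_constant(np.column_stack([x, t, gres]))
ols = sm.OLS(y, X).fit()
print(ols.params[2])                                    # roughly recovers 1.5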

Read more
Methodology

Semi-supervised learning and the question of true versus estimated propensity scores

A straightforward application of semi-supervised machine learning to the problem of treatment effect estimation would be to consider data as "unlabeled" if treatment assignment and covariates are observed but outcomes are unobserved. According to this formulation, large unlabeled data sets could be used to estimate a high dimensional propensity function, and causal inference using a much smaller labeled data set could then proceed via weighted estimators based on the learned propensity scores. In the limiting case of infinite unlabeled data, one may estimate the high dimensional propensity function exactly. However, longstanding advice in the causal inference community suggests that estimated propensity scores (from labeled data alone) are actually preferable to true propensity scores, implying that the unlabeled data is useless in this context. In this paper we examine this paradox and propose a simple procedure that reconciles the strong intuition that a known propensity function should be useful for estimating treatment effects with the previous literature suggesting otherwise. Further, simulation studies suggest that direct regression may be preferable to inverse-propensity weighting estimators in many circumstances.
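
The flavor of the paradox can be reproduced in a few lines: in the simulation sketch below, inverse-propensity weighting (IPW) estimates of the average treatment effect built from the estimated propensity score typically show smaller variance than those built from the true score. The data-generating process is an invented toy, and the paper's reconciling procedure is not implemented.

# Compare IPW estimates of the ATE using the true versus the estimated
# propensity score over repeated simulated data sets.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(5)

def one_replication(n=500):
    x = rng.normal(size=(n, 2))
    p_true = 1 / (1 + np.exp(-(x[:, 0] - 0.5 * x[:, 1])))
    t = rng.binomial(1, p_true)
    y = x[:, 0] + 2.0 * t + rng.normal(size=n)          # true ATE = 2

    p_hat = LogisticRegression().fit(x, t).predict_proba(x)[:, 1]
    ipw = lambda p: np.mean(t * y / p - (1 - t) * y / (1 - p))
    return ipw(p_true), ipw(p_hat)

est = np.array([one_replication() for _ in range(500)])
print("variance, true propensity:     ", est[:, 0].var())
print("variance, estimated propensity:", est[:, 1].var())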

Read more
Methodology

Semiparametric counterfactual density estimation

Causal effects are often characterized with averages, which can give an incomplete picture of the underlying counterfactual distributions. Here we consider estimating the entire counterfactual density and generic functionals thereof. We focus on two kinds of target parameters. The first is a density approximation, defined by a projection onto a finite-dimensional model using a generalized distance metric, which includes f-divergences as well as L_p norms. The second is the distance between counterfactual densities, which can be used as a more nuanced effect measure than the mean difference, and as a tool for model selection. We study nonparametric efficiency bounds for these targets, giving results for smooth but otherwise generic models and distances. Importantly, we show how these bounds connect to means of particular non-trivial functions of counterfactuals, linking the problems of density and mean estimation. We go on to propose doubly robust-style estimators for the density approximations and distances, and study their rates of convergence, showing they can be optimally efficient in large nonparametric models. We also give analogous methods for model selection and aggregation, when many models may be available and of interest. Our results all hold for generic models and distances, but throughout we highlight what happens for particular choices, such as L_2 projections on linear models, and KL projections on exponential families. Finally we illustrate by estimating the density of CD4 count among patients with HIV, had all been treated with combination therapy versus zidovudine alone, as well as a density effect. Our results suggest combination therapy may have increased CD4 count most for high-risk patients. Our methods are implemented in the freely available R package npcausal on GitHub.
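
As a crude point of reference, the sketch below computes a plug-in, inverse-propensity-weighted kernel density estimate of the counterfactual outcome density under treatment, assuming no unmeasured confounding. It is not the doubly robust projection estimator of the paper, nor the npcausal implementation; the simulated data are invented.

# Plug-in counterfactual density of the outcome under treatment via an
# IPW-weighted kernel density estimate among treated units.
import numpy as np
from scipy.stats import gaussian_kde
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(6)
n = 2000
x = rng.normal(size=(n, 1))
p = 1 / (1 + np.exp(-x[:, 0]))
a = rng.binomial(1, p)                                  # treatment
y = x[:, 0] + 1.0 * a + rng.normal(size=n)              # outcome

pi_hat = LogisticRegression().fit(x, a).predict_proba(x)[:, 1]
w = a / pi_hat                                          # IPW weights for A = 1
kde = gaussian_kde(y[a == 1], weights=w[a == 1] / w[a == 1].sum())

grid = np.linspace(-4, 5, 100)
density_y1 = kde(grid)                                  # estimate of the density of Y^1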

Read more
Methodology

Sensitivity Analysis for Unmeasured Confounding via Effect Extrapolation

Inferring the causal effect of a non-randomly assigned exposure on an outcome requires adjusting for common causes of the exposure and outcome to avoid biased conclusions. Notwithstanding the efforts investigators routinely make to measure and adjust for such common causes (or confounders), some confounders typically remain unmeasured, raising the prospect of biased inference in observational studies. Therefore, it is crucial that investigators can practically assess their substantive conclusions' relative (in)sensitivity to potential unmeasured confounding. In this article, we propose a sensitivity analysis strategy that is informed by the stability of the exposure effect over different, well-chosen subsets of the measured confounders. The proposal entails first approximating the process for recording confounders to learn about how the effect is potentially affected by varying amounts of unmeasured confounding, then extrapolating to the effect had hypothetical unmeasured confounders been additionally adjusted for. A large set of measured confounders can thus be exploited to provide insight into the likely presence of unmeasured confounding bias, albeit under an assumption about how data on the confounders are recorded. The proposal's ability to reveal the true effect and ensure valid inference after extrapolation is empirically compared with existing methods using simulation studies. We demonstrate the procedure using two different publicly available datasets commonly used for causal inference.
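
The extrapolation idea can be sketched as follows: re-estimate the adjusted exposure effect over nested subsets of the measured confounders and extrapolate the resulting trend beyond the number of confounders actually measured. The naive linear extrapolation and the simulated data below are illustrative assumptions; the paper's model for how confounders are recorded is not reproduced.

# Adjust for 0, 1, ..., 10 measured confounders, then extrapolate the trend
# of effect estimates to the 10 additional confounders that went unmeasured.
import numpy as np

rng = np.random.default_rng(7)
n, p_meas, p_all = 5000, 10, 20
c = rng.normal(size=(n, p_all))                         # only the first 10 are "measured"
exposure = 0.1 * c.sum(axis=1) + rng.normal(size=n)
outcome = 1.0 * exposure + 0.5 * c.sum(axis=1) + rng.normal(size=n)   # true effect 1.0

def adjusted_effect(k):
    """OLS exposure effect adjusting for the first k measured confounders."""
    X = np.column_stack([np.ones(n), exposure, c[:, :k]])
    beta, *_ = np.linalg.lstsq(X, outcome, rcond=None)
    return beta[1]

ks = np.arange(p_meas + 1)
effects = np.array([adjusted_effect(k) for k in ks])

# Naive linear extrapolation of the trend out to all 20 confounders.
slope, intercept = np.polyfit(ks, effects, 1)
print("adjusted for measured only:", round(effects[-1], 2))           # still biased
print("extrapolated to k = 20:    ", round(intercept + slope * p_all, 2))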

Read more
Methodology

Sequential Bayesian Risk Set Inference for Robust Discrete Optimization via Simulation

Optimization via simulation (OvS) procedures that assume the simulation inputs are generated from the real-world distributions are subject to the risk of selecting a suboptimal solution when the distributions are substituted with input models estimated from finite real-world data -- known as input model risk. Focusing on discrete OvS, this paper proposes a new Bayesian framework for analyzing input model risk of implementing an arbitrary solution, x, where uncertainty about the input models is captured by a posterior distribution. We define the α-level risk set of solution x as the set of solutions whose expected performance is better than x by a practically meaningful margin (> δ) given common input models with significant probability (> α) under the posterior distribution. The user-specified parameters, δ and α, control the robustness of the procedure to the desired level as well as guard against unnecessary conservatism. An empty risk set implies that there is no practically better solution than x with significant probability even though the real-world input distributions are unknown. For efficient estimation of the risk set, the conditional mean performance of a solution given a set of input distributions is modeled as a Gaussian process (GP) that takes the solution-distributions pair as an input. In particular, our GP model allows both parametric and nonparametric input models. We propose the sequential risk set inference procedure that estimates the risk set and selects the next solution-distributions pair to simulate using the posterior GP at each iteration. We show that simulating the pair expected to change the risk set estimate the most in the next iteration is the asymptotic one-step optimal sampling rule that minimizes the number of incorrectly classified solutions, if the procedure runs without stopping.
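
The risk set definition itself can be illustrated with a brute-force Monte Carlo sketch: for each candidate solution, estimate the posterior probability that it beats the incumbent by more than δ, and keep it if that probability exceeds α. The toy performance function and posterior below are invented, and the paper's GP model over solution-distribution pairs and its sequential sampling rule are not reproduced.

# Brute-force estimate of the alpha-level risk set of an incumbent solution x0
# under posterior uncertainty about the input-model parameter (smaller is better).
import numpy as np

rng = np.random.default_rng(8)
solutions = np.arange(5)                                # candidate solutions
x0 = 0                                                  # incumbent solution
delta, alpha = 0.1, 0.05

def conditional_mean(x, theta):
    """Toy expected performance of solution x under input parameter theta;
    stands in for an expensive simulation output."""
    return (x - 3 * theta) ** 2

theta_draws = rng.normal(0.7, 0.2, size=1000)           # toy posterior draws

better = np.array([[conditional_mean(x, th) < conditional_mean(x0, th) - delta
                    for x in solutions] for th in theta_draws])
prob_better = better.mean(axis=0)
risk_set = solutions[prob_better > alpha]
print("risk set of x0:", risk_set)                      # non-empty: x0 is likely suboptimal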

Read more
Methodology

Sequential Bayesian experimental design for estimation of extreme-event probability in stochastic dynamical systems

We consider a dynamical system with two sources of uncertainties: (1) parameterized input with a known probability distribution and (2) stochastic input-to-response (ItR) function with heteroscedastic randomness. Our purpose is to efficiently quantify the extreme response probability when the ItR function is expensive to evaluate. The problem setup arises often in physics and engineering problems, with randomness in ItR coming from either intrinsic uncertainties (say, as a solution to a stochastic equation) or additional (critical) uncertainties that are not incorporated in the input parameter space. To reduce the required number of samples, we develop a sequential Bayesian experimental design method leveraging variational heteroscedastic Gaussian process regression (VHGPR) to account for the stochastic ItR, along with a new criterion to select the next-best samples sequentially. The validity of our new method is first tested in two synthetic problems with artificially defined stochastic ItR functions. Finally, we demonstrate the application of our method to an engineering problem of estimating the extreme ship motion probability in an ensemble of wave groups, where the uncertainty in ItR naturally originates from the uncertain initial condition of ship motion in each wave group.
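
A much-simplified version of the sequential loop is sketched below: a standard Gaussian process stands in for VHGPR, the extreme-response probability is approximated by integrating the predictive exceedance probability against the input density, and the next sample is chosen by an uncertainty-times-density score rather than the paper's criterion. The ItR function, threshold, and acquisition rule are all illustrative assumptions.

# Sequential design for extreme-event probability with a GP surrogate of a
# stochastic input-to-response (ItR) function.
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(9)
itr = lambda x: np.sin(3 * x) + 0.3 * rng.normal(size=np.shape(x))   # stochastic ItR
threshold = 1.1                                          # "extreme" response level
grid = np.linspace(-3, 3, 200).reshape(-1, 1)
input_pdf = norm.pdf(grid[:, 0])                         # known input distribution

X = rng.uniform(-3, 3, 10).reshape(-1, 1)                # initial design
y = itr(X[:, 0])

for _ in range(20):
    gp = GaussianProcessRegressor(RBF(1.0) + WhiteKernel(0.1)).fit(X, y)
    mu, sd = gp.predict(grid, return_std=True)
    # Current estimate of the extreme-event probability.
    p_extreme = np.sum(norm.sf(threshold, mu, sd) * input_pdf) / input_pdf.sum()
    # Sample next where predictive uncertainty matters most for the estimate.
    score = sd * input_pdf
    x_next = grid[np.argmax(score)]
    X = np.vstack([X, x_next.reshape(1, -1)])
    y = np.append(y, itr(x_next[0]))

print("estimated P(response > threshold):", round(p_extreme, 4))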

Read more
