Featured Research

Methodology

A Basis Approach to Surface Clustering

This paper presents a novel method for clustering surfaces. The proposal involves first using basis functions in a tensor product to smooth the data and thus reduce the dimension to a finite number of coefficients, and then using these estimated coefficients to cluster the surfaces via the k-means algorithm. An extension of the algorithm to clustering tensors is also discussed. We show that the proposed algorithm exhibits the property of strong consistency, with or without measurement errors, in correctly clustering the data as the sample size increases. Simulation studies suggest that the proposed method outperforms the benchmark k-means algorithm, which uses the original vectorized data. In addition, an EEG real-data example is considered to illustrate the practical application of the proposal.
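
As a rough illustration of the pipeline described above, the sketch below smooths simulated surfaces with a tensor-product cubic B-spline basis, collects the estimated coefficients, and clusters them with k-means. The grid size, number of interior knots, and cluster count are illustrative assumptions, not choices from the paper.

```python
# Minimal sketch (not the paper's implementation): tensor-product B-spline
# smoothing of each surface, then k-means on the estimated coefficients.
import numpy as np
from scipy.interpolate import BSpline
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Common evaluation grid for every surface.
x = np.linspace(0.0, 1.0, 40)
y = np.linspace(0.0, 1.0, 40)
X, Y = np.meshgrid(x, y, indexing="ij")

# Two illustrative groups of noisy surfaces (10 per group).
surfaces = [np.sin(2 * np.pi * X) * np.cos(2 * np.pi * Y)
            + 0.2 * rng.standard_normal(X.shape) for _ in range(10)]
surfaces += [4.0 * X * Y + 0.2 * rng.standard_normal(X.shape)
             for _ in range(10)]

def bspline_design(grid, n_interior=6, degree=3):
    """Cubic B-spline design matrix with equally spaced interior knots."""
    interior = np.linspace(grid[0], grid[-1], n_interior + 2)[1:-1]
    knots = np.concatenate(([grid[0]] * (degree + 1), interior,
                            [grid[-1]] * (degree + 1)))
    return BSpline.design_matrix(grid, knots, degree).toarray()

Bx, By = bspline_design(x), bspline_design(y)     # each (40, 10)

def tensor_coefs(Z):
    """Least-squares coefficients of the fit Z ~ Bx @ C @ By.T, flattened."""
    C, *_ = np.linalg.lstsq(Bx, Z, rcond=None)    # fit along x
    C, *_ = np.linalg.lstsq(By, C.T, rcond=None)  # fit along y
    return C.ravel()                              # 100 coefficients

coefs = np.array([tensor_coefs(Z) for Z in surfaces])   # (20, 100)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(coefs)
print(labels)   # the two simulated groups separate in this toy example
```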

Read more
Methodology

A Bayesian Approach to Block-Term Tensor Decomposition Model Selection and Computation

The so-called block-term decomposition (BTD) tensor model, especially in its rank-(L_r, L_r, 1) version, has recently been receiving increasing attention due to its enhanced ability to represent systems and signals composed of blocks of rank higher than one, a scenario encountered in numerous and diverse applications. Its uniqueness and approximation have thus been thoroughly studied. Nevertheless, the challenging problem of estimating the BTD model structure, namely the number of block terms and their individual ranks, has only recently started to attract significant attention. In this work, a Bayesian approach is taken to the problem of rank-(L_r, L_r, 1) BTD model selection and computation, based on the idea of imposing column sparsity jointly on the factors and in a hierarchical manner, and estimating the ranks as the numbers of factor columns of non-negligible energy. Using variational inference in the proposed probabilistic model results in an iterative algorithm that comprises closed-form updates. Its Bayesian nature completely avoids the hyper-parameter tuning task that is ubiquitous in regularization-based methods. Simulation results with synthetic data are reported, which demonstrate the effectiveness of the proposed scheme in terms of both rank estimation and model fitting.
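
The toy snippet below builds a tensor with the rank-(L_r, L_r, 1) BTD structure and illustrates the column-energy rank-reading rule in isolation; it is not the paper's variational inference algorithm, and the dimensions, ranks, and threshold are arbitrary assumptions.

```python
# Toy construction of a rank-(L_r, L_r, 1) BTD tensor plus the column-energy
# rank-reading rule; NOT the paper's variational algorithm.
import numpy as np

rng = np.random.default_rng(1)
I, J, K = 20, 20, 15
L_true = [2, 3]                            # per-block ranks L_r

# X = sum_r (A_r @ B_r.T) outer c_r
X = np.zeros((I, J, K))
for Lr in L_true:
    A = rng.standard_normal((I, Lr))
    B = rng.standard_normal((J, Lr))
    c = rng.standard_normal(K)
    X += np.einsum("il,jl,k->ijk", A, B, c)

# In the Bayesian scheme, hierarchical column sparsity shrinks redundant
# factor columns towards zero; here we mimic that outcome to show how the
# ranks are then read off as the number of high-energy columns.
A_hat = rng.standard_normal((I, 8))        # over-parameterised factor
A_hat[:, 5:] *= 1e-6                       # columns "killed" by shrinkage
energy = (A_hat ** 2).sum(axis=0)
active = int((energy > 1e-6 * energy.max()).sum())
print("columns of non-negligible energy:", active)   # 5 = sum of L_r
```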

Read more
Methodology

A Bayesian Framework for Generation of Fully Synthetic Mixed Datasets

Much of the micro data used for epidemiological studies contains sensitive measurements on real individuals. As a result, such micro data cannot be published due to privacy concerns, rendering any published statistical analyses of them nearly impossible to reproduce. To promote the dissemination of key datasets for analysis without jeopardizing the privacy of individuals, we introduce a cohesive Bayesian framework for the generation of fully synthetic, high-dimensional micro datasets of mixed categorical, binary, count, and continuous variables. This process centers around a joint Bayesian model that is simultaneously compatible with all of these data types, enabling the creation of mixed synthetic datasets through posterior predictive sampling. Furthermore, a focal point of epidemiological data analysis is the study of conditional relationships between various exposures and key outcome variables through regression analysis. We design a modified data synthesis strategy to target and preserve these conditional relationships, including both nonlinearities and interactions. The proposed techniques are deployed to create a synthetic version of a confidential dataset containing dozens of health, cognitive, and social measurements on nearly 20,000 North Carolina children.
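
To make the posterior predictive synthesis step concrete, here is a minimal sketch for a single continuous variable under a conjugate Bayesian linear model; the paper's joint model for mixed categorical, binary, count, and continuous data is far richer, and the prior settings below are assumptions for illustration.

```python
# Minimal posterior predictive synthesis for one continuous variable under a
# conjugate Bayesian linear model (vague normal-inverse-gamma prior assumed).
import numpy as np

rng = np.random.default_rng(2)
n, p = 500, 3
X = rng.standard_normal((n, p))                  # confidential covariates
y = X @ np.array([1.0, -2.0, 0.5]) + rng.standard_normal(n)

# Posterior under a near-flat prior (ridge epsilon for numerical stability).
Lam = X.T @ X + 1e-6 * np.eye(p)
V = np.linalg.inv(Lam)                           # posterior covariance scale
m = V @ X.T @ y                                  # posterior mean of beta
a, b = n / 2.0, 0.5 * (y @ y - m @ Lam @ m)      # inverse-gamma parameters

# One synthetic dataset: draw (sigma^2, beta), then draw fresh outcomes.
sigma2 = b / rng.gamma(a)                        # sigma^2 ~ IG(a, b)
beta = rng.multivariate_normal(m, sigma2 * V)
y_synth = X @ beta + np.sqrt(sigma2) * rng.standard_normal(n)
```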

Read more
Methodology

A Bayesian Spatial Modeling Approach to Mortality Forecasting

This paper extends Bayesian mortality projection models for multiple populations by accounting for the stochastic structure and the effect of spatial autocorrelation among the observations. High levels of overdispersion are explained through adjacent locations using a conditional autoregressive model. In an empirical study, we compare different hierarchical projection models for analyzing geographical diversity in mortality across Japanese counties, years, and age groups. Results obtained via Markov chain Monte Carlo (MCMC) computation demonstrate the flexibility and predictive performance of the proposed model.
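
The snippet below sketches the conditional autoregressive (CAR) building block referenced above: spatial random effects drawn with precision matrix tau * (D - rho * W) for a toy adjacency structure. The graph, tau, and rho are invented for illustration, not the paper's specification.

```python
# Proper CAR sketch: spatial effects with precision tau * (D - rho * W) on a
# toy chain of six regions; tau and rho are invented values.
import numpy as np

rng = np.random.default_rng(3)
n = 6
W = np.zeros((n, n))                  # adjacency: region i borders i+1
for i in range(n - 1):
    W[i, i + 1] = W[i + 1, i] = 1.0
D = np.diag(W.sum(axis=1))

tau, rho = 2.0, 0.9                   # precision and spatial dependence
Q = tau * (D - rho * W)               # positive definite for |rho| < 1
phi = rng.multivariate_normal(np.zeros(n), np.linalg.inv(Q))

# A mortality model could then use, e.g., log m_{i,a} = alpha_a + phi_i
# plus an overdispersion term, with phi capturing spatial autocorrelation.
print(phi)
```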

Read more
Methodology

A Bayesian Time-Varying Effect Model for Behavioral mHealth Data

The integration of mobile health (mHealth) devices into behavioral health research has fundamentally changed the way researchers and interventionists are able to collect data as well as deploy and evaluate intervention strategies. In these studies, researchers often collect intensive longitudinal data (ILD) using ecological momentary assessment methods, which aim to capture psychological, emotional, and environmental factors that may relate to a behavioral outcome in near real-time. In order to investigate ILD collected in a novel, smartphone-based smoking cessation study, we propose a Bayesian variable selection approach for time-varying effect models, designed to identify dynamic relations between potential risk factors and smoking behaviors in the critical moments around a quit attempt. We use parameter-expansion and data-augmentation techniques to efficiently explore how the underlying structure of these relations varies over time and across subjects. We achieve deeper insights into these relations by introducing nonparametric priors for regression coefficients that cluster similar effects for risk factors while simultaneously determining their inclusion. Results indicate that our approach is well-positioned to help researchers effectively evaluate, design, and deliver tailored intervention strategies in the critical moments surrounding a quit attempt.
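
As a simplified stand-in for the Bayesian variable-selection machinery, the sketch below represents a single time-varying coefficient beta(t) in a B-spline basis and estimates it by penalised least squares on simulated EMA-style data; the basis size and ridge penalty are assumptions.

```python
# Time-varying effect sketch: beta(t) in a cubic B-spline basis, fitted by
# penalised least squares (a stand-in for the paper's Bayesian approach).
import numpy as np
from scipy.interpolate import BSpline

rng = np.random.default_rng(4)
n = 400
t = np.sort(rng.uniform(0.0, 1.0, n))   # EMA times around a quit attempt
x = rng.standard_normal(n)              # a momentary risk factor
y = np.sin(2 * np.pi * t) * x + 0.3 * rng.standard_normal(n)

degree, n_interior = 3, 8
interior = np.linspace(0.0, 1.0, n_interior + 2)[1:-1]
knots = np.concatenate(([0.0] * (degree + 1), interior,
                        [1.0] * (degree + 1)))
B = BSpline.design_matrix(t, knots, degree).toarray()   # (n, 12)

Z = B * x[:, None]                      # varying-coefficient design
lam = 1.0                               # ridge penalty (assumed)
coef = np.linalg.solve(Z.T @ Z + lam * np.eye(Z.shape[1]), Z.T @ y)
beta_hat = B @ coef                     # estimate of beta(t) at the EMA times
```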

Read more
Methodology

A Bayesian perspective on sampling of alternatives

In this paper, we apply a Bayesian perspective to sampling of alternatives for multinomial logit (MNL) and mixed multinomial logit (MMNL) models. We establish three theoretical results: i) McFadden's correction factor under the uniform sampling protocol can be transferred to the Bayesian context in MNL; ii) the uniform sampling protocol minimises the loss of information on the parameters of interest (i.e. the kernel of the posterior density) and thereby has desirable small sample properties in MNL; and iii) our theoretical results extend to Bayesian MMNL models using data augmentation. Notably, sampling of alternatives in Bayesian MMNL models does not require the inclusion of the additional correction factor identified by Guevara and Ben-Akiva (2013a) in classical settings. Accordingly, owing to its desirable small and large sample properties, uniform sampling is the recommended protocol in both MNL and MMNL, irrespective of the estimation framework selected.
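
The toy example below illustrates sampling of alternatives in an MNL model under the uniform protocol: because every alternative in the sampled set receives the same McFadden correction term, the terms cancel and the plain conditional logit over the sampled set applies. The choice-set sizes and utility specification are illustrative.

```python
# Sampling of alternatives in MNL under the uniform protocol: the chosen
# alternative plus a uniform draw of non-chosen ones; McFadden's correction
# terms are equal across the sampled set here, so they cancel.
import numpy as np

rng = np.random.default_rng(5)
J, S = 100, 10                       # full and sampled choice-set sizes
beta = 1.5
x = rng.standard_normal(J)           # one attribute per alternative
u = beta * x + rng.gumbel(size=J)    # random utilities
chosen = int(np.argmax(u))

others = rng.choice([j for j in range(J) if j != chosen],
                    size=S - 1, replace=False)
D = np.concatenate(([chosen], others))

def sampled_loglik(b):
    """Conditional logit over the sampled set; corrections cancel."""
    v = b * x[D]
    return v[0] - np.log(np.exp(v).sum())

print(sampled_loglik(beta))
```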

Read more
Methodology

A Bayesian spatio-temporal abundance model for surveillance of the opioid epidemic

Opioid misuse is a national epidemic and a significant drug-related threat to the United States. While the scale of the problem is undeniable, estimates of the local prevalence of opioid misuse are lacking, despite their importance to policy-making and resource allocation. This is due, in part, to the challenge of directly measuring opioid misuse at a local level. In this paper, we develop a Bayesian hierarchical spatio-temporal abundance model that integrates indirect county-level data on opioid overdose deaths and treatment admissions with state-level survey estimates of the prevalence of opioid misuse to estimate the latent county-level prevalence and counts of people who misuse opioids. A simulation study shows that our joint model accurately recovers the latent counts and prevalence and thus overcomes known limitations with identifiability in abundance models with non-replicated observations. We apply our model to county-level surveillance data from the state of Ohio. Our proposed framework can be applied to other small area estimation problems for hard-to-reach populations, a common occurrence with many health conditions, such as those related to illicit behaviors.
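
A schematic of the abundance structure described above, with all rates invented for illustration: a latent count of people who misuse opioids in each county, with overdose deaths and treatment admissions observed as binomially thinned counts, and a state-level prevalence summary aiding identification. This is not the fitted model.

```python
# Schematic abundance structure with invented rates: latent misuse counts,
# binomially thinned into observed deaths and treatment admissions.
import numpy as np

rng = np.random.default_rng(6)
n_county = 8
pop = rng.integers(20_000, 500_000, n_county)
prev = rng.beta(2, 60, n_county)          # latent county-level prevalence
N = rng.poisson(prev * pop)               # latent counts of people who misuse

p_death, p_treat = 0.003, 0.05            # assumed detection probabilities
deaths = rng.binomial(N, p_death)         # observed overdose deaths
admits = rng.binomial(N, p_treat)         # observed treatment admissions

# A state-level survey estimate of overall prevalence is what lets the joint
# hierarchical model identify the detection probabilities and latent counts.
state_prev = N.sum() / pop.sum()
print(deaths, admits, round(state_prev, 4))
```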

Read more
Methodology

A Change-Point Based Control Chart for Detecting Sparse Changes in High-Dimensional Heteroscedastic Data

Because of the curse of dimensionality, high-dimensional processes present challenges to traditional multivariate statistical process monitoring (SPM) techniques. In addition, the unknown underlying distribution and complicated dependencies among variables, such as heteroscedasticity, increase the uncertainty of estimated parameters and decrease the effectiveness of control charts. Moreover, the requirement of sufficient reference samples limits the application of traditional charts in high-dimension, low-sample-size scenarios (small n, large p). Further difficulties arise in detecting and diagnosing abnormal behaviors caused by a small set of variables, i.e., sparse changes. In this article, we propose a changepoint-based control chart to detect sparse shifts in the mean vector of high-dimensional heteroscedastic processes. The proposed method can begin monitoring when the number of observations is much smaller than the dimensionality. Simulation results show its robustness to non-normality and heteroscedasticity. A real-data example illustrates the effectiveness of the proposed control chart in high-dimensional applications. Supplementary material and code are provided online.
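
A bare-bones sketch of a changepoint-style scan for sparse mean shifts: for each candidate split, take the largest squared standardised mean difference across coordinates. This is a generic illustration of the idea, not the chart proposed in the article; the trimming and data setup are assumptions.

```python
# Generic changepoint-style scan for a sparse mean shift: for each candidate
# split, take the largest squared standardised mean difference over the p
# coordinates. Illustrative only; not the article's control chart.
import numpy as np

rng = np.random.default_rng(7)
n, p = 60, 200                        # far fewer observations than variables
X = rng.standard_normal((n, p))
X[40:, :3] += 1.5                     # sparse shift in 3 of 200 variables

def scan(X, trim=5):
    n = X.shape[0]
    best = (None, -np.inf)
    for k in range(trim, n - trim):   # candidate changepoints
        d = X[:k].mean(axis=0) - X[k:].mean(axis=0)
        se = X.std(axis=0, ddof=1) * np.sqrt(1.0 / k + 1.0 / (n - k))
        stat = np.max((d / se) ** 2)
        if stat > best[1]:
            best = (k, stat)
    return best

print(scan(X))                        # split found near 40, large statistic
```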

Read more
Methodology

A Comparison of Single and Multiple Changepoint Techniques for Time Series Data

This paper describes and compares several prominent single and multiple changepoint techniques for time series data. Due to their importance in inferential matters, changepoint research on correlated data has accelerated recently. Unfortunately, small perturbations in model assumptions can drastically alter changepoint conclusions; for example, heavy positive correlation in a time series can be misattributed to a mean shift should the correlation be ignored. This paper considers both single and multiple changepoint techniques. It begins by examining cumulative sum (CUSUM) and likelihood ratio tests and their variants for the single changepoint problem; here, various statistics, boundary cropping scenarios, and scaling methods (e.g., scaling to an extreme value or Brownian bridge limit) are compared. A recently developed test based on summing squared CUSUM statistics over all times is shown to have realistic Type I errors and superior detection power. The paper then turns to the multiple changepoint setting. Here, penalized likelihoods drive the discourse, with AIC, BIC, mBIC, and MDL penalties being considered. Binary and wild binary segmentation techniques are also compared. We introduce a new distance metric specifically designed to compare two multiple-changepoint segmentations. Algorithmic and computational concerns are discussed, and simulations are provided to support all conclusions. In the end, the multiple changepoint setting admits no clear methodological winner, with performance depending on the particular scenario. Nonetheless, some practical guidance emerges.
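
To fix notation, the sketch below computes the standardised CUSUM process on toy data, together with both the classic maximum-type statistic and the sum-of-squared-CUSUMs statistic mentioned above; the shift location and variance estimate are illustrative.

```python
# Standardised CUSUM process on toy data, with the classic max-type statistic
# and the sum-of-squared-CUSUMs statistic discussed above.
import numpy as np

rng = np.random.default_rng(8)
n = 200
x = rng.standard_normal(n)
x[120:] += 1.0                        # mean shift at time 120

sigma = x.std(ddof=1)                 # crude scale estimate (illustrative)
S = np.cumsum(x - x.mean())[:-1]      # centred partial sums, k = 1..n-1
cusum = S / (sigma * np.sqrt(n))      # standardised CUSUM process

khat = int(np.argmax(np.abs(cusum))) + 1   # changepoint estimate near 120
print(khat, np.abs(cusum).max(), np.sum(cusum ** 2))
```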

Read more
Methodology

A Family of Mixture Models for Biclustering

Biclustering is used for simultaneous clustering of the observations and variables when there is no group structure known a priori. It is increasingly used in bioinformatics, text analytics, and other fields. Previously, biclustering has been introduced in a model-based clustering framework by utilizing a structure similar to a mixture of factor analyzers. In such models, the observed variables X are modelled using a latent variable U that is assumed to be from N(0, I). Clustering of variables is introduced by constraining the entries of the factor loading matrix to be 0 or 1, which results in block-diagonal covariance matrices. However, this approach is overly restrictive, as off-diagonal elements in the blocks of the covariance matrices can only be 1, which can lead to unsatisfactory model fit on complex data. Here, the latent variable U is assumed to be from N(0, T), where T is a diagonal matrix. This ensures that the off-diagonal terms in the block matrices within the covariance matrices are non-zero and not restricted to be 1, leading to a superior model fit on complex data. A family of models is developed by imposing constraints on the components of the covariance matrix. For parameter estimation, an alternating expectation-conditional maximization (AECM) algorithm is used. Finally, the proposed method is illustrated using simulated and real datasets.
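
The snippet below illustrates the covariance structure under discussion: a binary loading matrix assigning variables to variable-clusters and a diagonal T for U ~ N(0, T), so that off-diagonal entries within blocks need not equal 1. It shows the structural point only, not the AECM estimation; all numbers are invented.

```python
# Structural sketch: binary loadings Lambda assign variables to clusters and
# U ~ N(0, T) with diagonal T, so within-block covariances need not equal 1.
import numpy as np

Lambda = np.array([[1, 0],            # variables 1-3 in variable-cluster 1
                   [1, 0],
                   [1, 0],
                   [0, 1],            # variables 4-5 in variable-cluster 2
                   [0, 1]], dtype=float)
T = np.diag([2.5, 0.7])               # diagonal latent variances (not I)
Psi = 0.3 * np.eye(5)                 # unique error variances

Sigma = Lambda @ T @ Lambda.T + Psi   # block-diagonal covariance matrix
print(Sigma)                          # off-diagonal entries are 2.5 or 0.7
```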

Read more
