Featured Research

Methodology

Community models for networks observed through edge nominations

Communities are a common and widely studied structure in networks, typically under the assumption that the network is fully and correctly observed. In practice, network data are often collected by querying nodes about their connections. In some settings, all edges of a sampled node will be recorded, and in others, a node may be asked to name its connections. These sampling mechanisms introduce noise and bias which can obscure the community structure and invalidate assumptions underlying standard community detection methods. We propose a general model for a class of network sampling mechanisms based on recording edges via querying nodes, designed to improve community detection for network data collected in this fashion. We model edge sampling probabilities as a function of both individual preferences and community parameters, and show community detection can be performed by spectral clustering under this general class of models. We also propose, as a special case of the general framework, a parametric model for directed networks we call the nomination stochastic block model, which allows for meaningful parameter interpretations and can be fitted by the method of moments. Both spectral clustering and the method of moments in this case are computationally efficient and come with theoretical guarantees of consistency. We evaluate the proposed model in simulation studies on both unweighted and weighted networks and apply it to a faculty hiring dataset, discovering a meaningful hierarchy of communities among US business schools.
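
To make the spectral-clustering step concrete, here is a minimal Python sketch for a directed nomination network: embed nodes via the leading singular vectors of the adjacency matrix and run k-means on the rows. This is a generic recipe under assumed block-wise nomination probabilities, not the paper's exact algorithm, and all variable names and parameter values are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

def spectral_communities(A, k):
    """Generic spectral clustering for a directed (nomination) network.

    A : (n, n) adjacency matrix, A[i, j] = 1 if node i nominated node j.
    k : number of communities to recover.
    """
    # For directed networks, cluster on the leading left and right singular vectors.
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    embedding = np.hstack([U[:, :k], Vt[:k, :].T])  # combine out- and in-nomination profiles
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(embedding)

# Toy example: two planted communities with asymmetric nomination rates.
rng = np.random.default_rng(0)
n, k = 100, 2
z = np.repeat([0, 1], n // 2)
P = np.array([[0.20, 0.04],
              [0.04, 0.15]])                 # block-wise nomination probabilities (made up)
A = (rng.random((n, n)) < P[z][:, z]).astype(float)
np.fill_diagonal(A, 0)
print(spectral_communities(A, k)[:10])
```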

Read more
Methodology

Comparing methods addressing multi-collinearity when developing prediction models

Clinical prediction models are developed widely across medical disciplines. When predictors in such models are highly collinear, unexpected or spurious predictor-outcome associations may occur, thereby potentially reducing the face validity and explainability of the prediction model. Collinearity can be dealt with by exclusion of collinear predictors, but when there is no a priori motivation (besides collinearity) to include or exclude specific predictors, such an approach is arbitrary and possibly inappropriate. We compare different methods to address collinearity, including shrinkage, dimensionality reduction, and constrained optimization. The effectiveness of these methods is illustrated via simulations. In the conducted simulations, no effect of collinearity was observed on predictive outcomes. However, a negative effect of collinearity on the stability of predictor selection was found, affecting all compared methods, but in particular methods that perform strong predictor selection (e.g., Lasso).
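
As a hedged illustration of the kind of comparison described above, the snippet below fits off-the-shelf Python implementations of shrinkage (ridge, lasso) and dimensionality reduction (principal-component regression) to simulated collinear predictors. It is not the authors' simulation design; the data-generating values are invented for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(1)
n, p = 200, 6
# Two highly collinear predictors plus independent ones.
x1 = rng.normal(size=n)
X = np.column_stack([x1, x1 + 0.05 * rng.normal(size=n),
                     rng.normal(size=(n, p - 2))])
y = 1.0 * X[:, 0] + 0.5 * X[:, 2] + rng.normal(size=n)

models = {
    "OLS": LinearRegression(),
    "Ridge (shrinkage)": Ridge(alpha=1.0),
    "Lasso (selection)": Lasso(alpha=0.05),
    "PC regression": make_pipeline(PCA(n_components=4), LinearRegression()),
}
for name, model in models.items():
    model.fit(X, y)
    # Coefficients of the collinear pair are unstable for OLS, shrunk or selected otherwise.
    coefs = getattr(model, "coef_", None)
    print(name, np.round(coefs[:2], 2) if coefs is not None else "(coefficients on PC scale)")
```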

Read more
Methodology

Composite Estimation for Quantile Regression Kink Models with Longitudinal Data

The kink model is developed to analyze data where the regression function is two-stage linear, with the two segments intersecting at an unknown threshold. In quantile regression with longitudinal data, previous work assumed that the unknown threshold parameters, or kink points, are heterogeneous across different quantiles. However, the location where the kink effect occurs tends to be the same across different quantiles, especially in a region of neighboring quantile levels. Ignoring such homogeneity information may lead to a loss of estimation efficiency. In view of this, we propose a composite estimator for the common kink point that absorbs information from multiple quantiles. In addition, we develop a sup-likelihood-ratio test to check the kink effect at a given quantile level. A test-inversion confidence interval for the common kink point is also developed based on the quantile rank score test. The simulation study shows that the proposed composite kink estimator compares favorably with the least squares estimator and the single-quantile estimator. We illustrate the practical value of this work through the analysis of a body mass index and blood pressure data set.
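
A rough sketch of the idea of a common kink point shared across quantiles: profile a candidate kink location over a grid and sum the quantile (check) losses from regressions fitted at several quantile levels. This uses statsmodels' QuantReg as a stand-in and is a simplified illustration, not the paper's composite estimator or its inference procedures; the data are simulated.

```python
import numpy as np
import statsmodels.api as sm

def check_loss(u, tau):
    """Quantile (check) loss rho_tau(u) = u * (tau - 1{u < 0})."""
    return u * (tau - (u < 0))

def composite_kink(x, y, taus, grid):
    """Profile a common kink point t over a grid, summing check losses
    from quantile regressions fitted at several quantile levels."""
    best_t, best_obj = None, np.inf
    for t in grid:
        Z = np.column_stack([np.ones_like(x), x, np.maximum(x - t, 0.0)])
        obj = 0.0
        for tau in taus:
            res = sm.QuantReg(y, Z).fit(q=tau)
            obj += check_loss(y - Z @ res.params, tau).sum()
        if obj < best_obj:
            best_t, best_obj = t, obj
    return best_t

# Toy data with a kink at x = 1.
rng = np.random.default_rng(2)
x = rng.uniform(-2, 3, size=500)
y = 0.5 * x + 1.5 * np.maximum(x - 1.0, 0) + rng.normal(scale=0.3, size=500)
print(composite_kink(x, y, taus=[0.3, 0.5, 0.7], grid=np.linspace(-1, 2.5, 36)))
```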

Read more
Methodology

Compositionally-warped additive mixed modeling for a wide variety of non-Gaussian spatial data

With the advancement of geographical information systems, non-Gaussian spatial data sets are becoming larger and more diverse. This study develops a general framework for fast and flexible non-Gaussian regression, especially for spatial/spatiotemporal modeling. The developed model, termed the compositionally-warped additive mixed model (CAMM), combines an additive mixed model (AMM) with the compositionally-warped Gaussian process to model a wide variety of non-Gaussian continuous data, including spatial and other effects. A specific advantage of the proposed CAMM is that, unlike existing AMMs, it requires no explicit assumption about the data distribution. Monte Carlo experiments show the estimation accuracy and computational efficiency of CAMM for modeling non-Gaussian data, including fat-tailed and/or skewed distributions. Finally, the model is applied to crime data to examine the empirical performance of the regression analysis and prediction. The results show that CAMM provides intuitively reasonable coefficient estimates and outperforms AMM in terms of prediction accuracy. CAMM is verified to be a fast and flexible model that potentially covers a wide variety of non-Gaussian data. The proposed approach is implemented in the R package spmoran.
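
A minimal numpy sketch of the compositional-warping idea, assuming a simple stack of invertible transformations (a sinh-arcsinh layer followed by an affine layer): a composition of warps maps a non-Gaussian response to a Gaussian working scale, and predictions are mapped back by inverting the stack. This is only meant to convey the mechanism and is unrelated to the spmoran implementation; the layer choices and parameter values are made up.

```python
import numpy as np

# Each layer is an (apply, invert) pair; composing them warps y toward Gaussianity.
def sinh_arcsinh(delta, eps):
    fwd = lambda y: np.sinh(delta * np.arcsinh(y) - eps)
    inv = lambda z: np.sinh((np.arcsinh(z) + eps) / delta)
    return fwd, inv

def affine(a, b):
    fwd = lambda y: a * y + b
    inv = lambda z: (z - b) / a
    return fwd, inv

layers = [sinh_arcsinh(delta=0.7, eps=0.2), affine(a=1.5, b=0.0)]

def warp(y):
    for fwd, _ in layers:
        y = fwd(y)
    return y

def unwarp(z):
    for _, inv in reversed(layers):
        z = inv(z)
    return z

# Skewed, fat-tailed response; model warp(y) with any Gaussian additive model,
# then back-transform predictions with unwarp().
rng = np.random.default_rng(3)
y = np.exp(rng.normal(size=1000))           # log-normal (non-Gaussian) response
z = warp(y)
assert np.allclose(unwarp(z), y)            # the composition is invertible
print(round(z.mean(), 3), round(z.std(), 3))
```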

Read more
Methodology

Computational methods for Bayesian semiparametric Item Response Theory models

Item response theory (IRT) models are widely used to obtain interpretable inference when analyzing data from questionnaires, scaling binary responses into continuous constructs. Typically, these models rely on a normality assumption for the latent trait characterizing individuals in the population under study. However, this assumption can be unrealistic and lead to biased results. We relax the normality assumption by considering a flexible Dirichlet Process mixture model as a nonparametric prior on the distribution of the individual latent traits. Although this approach has been considered in the literature before, there is a lack of comprehensive studies of such models or general software tools. To fill this gap, we show how the NIMBLE framework for hierarchical statistical modeling enables the use of flexible priors on the latent trait distribution, specifically illustrating the use of Dirichlet Process mixtures in two-parameter logistic (2PL) IRT models. We study how different sets of constraints can lead to model identifiability and give guidance on eliciting prior distributions. Using both simulated and real-world data, we conduct an in-depth study of Markov chain Monte Carlo posterior sampling efficiency for several sampling strategies. We conclude that having access to semiparametric models can be broadly useful, as it allows inference on the entire underlying ability distribution and its functionals, with NIMBLE being a flexible framework for estimation of such models.
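
To make the model structure concrete, below is a small simulation sketch of the two-parameter logistic (2PL) response function with latent traits drawn from a truncated stick-breaking Dirichlet Process mixture. It is illustrative Python rather than the NIMBLE code discussed in the paper, and all parameter values (alpha, atom scales, item parameters) are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(4)

def stick_breaking_traits(n, alpha=1.0, n_atoms=20):
    """Draw latent traits from a truncated DP mixture of normals."""
    v = rng.beta(1.0, alpha, size=n_atoms)
    w = v * np.concatenate([[1.0], np.cumprod(1 - v)[:-1]])   # stick-breaking weights
    w = w / w.sum()                                           # renormalise after truncation
    mu = rng.normal(0.0, 2.0, size=n_atoms)                   # cluster locations
    comp = rng.choice(n_atoms, size=n, p=w)
    return rng.normal(mu[comp], 0.5)

def two_pl_prob(theta, a, b):
    """2PL item response probability: P(y = 1) = logistic(a * (theta - b))."""
    return 1.0 / (1.0 + np.exp(-a * (theta[:, None] - b[None, :])))

n_persons, n_items = 500, 10
theta = stick_breaking_traits(n_persons)          # possibly multimodal ability distribution
a = rng.uniform(0.5, 2.0, size=n_items)           # discriminations
b = rng.normal(0.0, 1.0, size=n_items)            # difficulties
Y = (rng.random((n_persons, n_items)) < two_pl_prob(theta, a, b)).astype(int)
print(Y.mean(axis=0).round(2))                    # observed item proportions
```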

Read more
Methodology

Computationally Efficient Bayesian Unit-Level Models for Non-Gaussian Data Under Informative Sampling

Statistical estimates from survey samples have traditionally been obtained via design-based estimators. In many cases, these estimators tend to work well for quantities such as population totals or means, but can fall short as sample sizes become small. In today's "information age," there is a strong demand for more granular estimates. To meet this demand, using a Bayesian pseudo-likelihood, we propose a computationally efficient unit-level modeling approach for non-Gaussian data collected under informative sampling designs. Specifically, we focus on binary and multinomial data. Our approach is both multivariate and multiscale, incorporating spatial dependence at the area level. We illustrate our approach through an empirical simulation study and through a motivating application to health insurance estimates using the American Community Survey.
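
As a rough illustration of the pseudo-likelihood idea for binary data under informative sampling, the snippet below maximizes a survey-weighted Bernoulli log-likelihood. It is a generic sketch, not the authors' multivariate, multiscale Bayesian model; the data, covariate, and weights are simulated for the example.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(5)
n = 1000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([-0.5, 1.0])
p = 1.0 / (1.0 + np.exp(-X @ beta_true))
y = (rng.random(n) < p).astype(float)
w = rng.uniform(0.5, 3.0, size=n)      # survey weights (e.g., inverse inclusion probabilities)

def neg_pseudo_loglik(beta):
    """Weighted Bernoulli log-likelihood: sum_i w_i [y_i log p_i + (1 - y_i) log(1 - p_i)]."""
    eta = X @ beta
    return -np.sum(w * (y * eta - np.log1p(np.exp(eta))))

fit = minimize(neg_pseudo_loglik, x0=np.zeros(2), method="BFGS")
print(fit.x.round(3))                  # pseudo-MLE of the regression coefficients
```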

Read more
Methodology

Computationally Efficient Deep Bayesian Unit-Level Modeling of Survey Data under Informative Sampling for Small Area Estimation

The topic of deep learning has seen a surge of interest in recent years, both within and outside of the field of statistics. Deep models leverage both nonlinearity and interaction effects to provide superior predictions in many cases when compared to linear or generalized linear models. However, one of the main challenges with deep modeling approaches is quantification of uncertainty. The use of random weight models, such as the popularized "Extreme Learning Machine," offers a potential solution in this regard. In addition to enabling uncertainty quantification, these models are extremely computationally efficient because they do not require optimization through stochastic gradient descent, as is typically done for deep learning. We show how the use of random weights in a deep model can fit into a likelihood-based framework to allow for uncertainty quantification of the model parameters and any desired estimates. Furthermore, we show how this approach can be used to account for informative sampling of survey data through the use of a pseudo-likelihood. We illustrate the effectiveness of this methodology through simulation and with a real survey data application involving American National Election Studies data.
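
A brief sketch of the random-weight ("Extreme Learning Machine"-style) construction mentioned above: hidden-layer weights are drawn at random and only the output layer is estimated, here by ridge-penalized least squares. The tanh activation and ridge solve are generic choices for illustration; the paper's Bayesian and pseudo-likelihood machinery is not shown.

```python
import numpy as np

rng = np.random.default_rng(6)

def elm_fit(X, y, n_hidden=200, ridge=1e-2):
    """Random-weight single-hidden-layer model: only the output weights are estimated."""
    W = rng.normal(size=(X.shape[1], n_hidden))      # random, untrained hidden weights
    b = rng.normal(size=n_hidden)
    H = np.tanh(X @ W + b)                           # random feature expansion
    beta = np.linalg.solve(H.T @ H + ridge * np.eye(n_hidden), H.T @ y)
    return W, b, beta

def elm_predict(X, W, b, beta):
    return np.tanh(X @ W + b) @ beta

# Toy nonlinear regression problem.
X = rng.uniform(-3, 3, size=(500, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=500)
W, b, beta = elm_fit(X, y)
print(round(np.mean((elm_predict(X, W, b, beta) - y) ** 2), 4))  # in-sample MSE
```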

Read more
Methodology

Computing Accurate Probabilistic Estimates of One-D Entropy from Equiprobable Random Samples

We develop a simple Quantile Spacing (QS) method for accurate probabilistic estimation of one-dimensional entropy from equiprobable random samples, and compare it with the popular Bin-Counting (BC) method. In contrast to BC, which uses equal-width bins with varying probability mass, the QS method uses estimates of the quantiles that divide the support of the data generating probability density function (pdf) into equal-probability-mass intervals. Whereas BC requires optimal tuning of a bin-width hyperparameter whose value varies with sample size and shape of the pdf, QS requires specification of the number of quantiles to be used. Results indicate, for the class of distributions tested, that the optimal number of quantile-spacings is a fixed fraction of the sample size (empirically determined to be ~0.25-0.35), and that this value is relatively insensitive to distributional form or sample size, providing a clear advantage over BC since hyperparameter tuning is not required. Bootstrapping is used to approximate the sampling variability distribution of the resulting entropy estimate, and is shown to accurately reflect the true uncertainty. For the four distributional forms studied (Gaussian, Log-Normal, Exponential and Bimodal Gaussian Mixture), expected estimation bias is less than 1% and uncertainty is relatively low even for very small sample sizes. We speculate that estimating quantile locations, rather than bin-probabilities, results in more efficient use of the information in the data to approximate the underlying shape of an unknown data generating pdf.
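
The quantile-spacing idea can be sketched in a few lines: approximate the pdf as piecewise constant over equal-probability-mass intervals, so the entropy is roughly the average of log(N * spacing) over the N spacings. The code below is a simplified illustration with a bootstrap for uncertainty, not the authors' exact estimator; the 0.25 fraction follows the reported rule of thumb.

```python
import numpy as np

def qs_entropy(sample, frac=0.25):
    """Quantile-spacing entropy estimate (nats): treat the pdf as constant over
    equal-probability-mass intervals, so H ~ mean of log(N * spacing)."""
    n_spacings = max(2, int(frac * len(sample)))
    probs = np.linspace(0.0, 1.0, n_spacings + 1)
    spacings = np.diff(np.quantile(sample, probs))
    return np.mean(np.log(n_spacings * spacings))

def bootstrap_entropy(sample, n_boot=500, frac=0.25, seed=0):
    """Bootstrap the sampling variability of the QS entropy estimate."""
    rng = np.random.default_rng(seed)
    reps = [qs_entropy(rng.choice(sample, size=len(sample), replace=True), frac)
            for _ in range(n_boot)]
    return np.mean(reps), np.std(reps)

rng = np.random.default_rng(7)
x = rng.normal(size=2000)
true_H = 0.5 * np.log(2 * np.pi * np.e)      # differential entropy of N(0, 1)
print(round(qs_entropy(x), 3), round(true_H, 3))
print(tuple(round(v, 3) for v in bootstrap_entropy(x)))
```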

Read more
Methodology

Conceptualising Natural and Quasi Experiments in Public Health

Background: Natural or quasi experiments are appealing for public health research because they enable the evaluation of events or interventions that are difficult or impossible to manipulate experimentally, such as many policy and health system reforms. However, there remains ambiguity in the literature about their definition and how they differ from randomised controlled experiments and from other observational designs. Methods: We conceptualise natural experiments in the context of public health evaluations, align the study design to the Target Trial Framework, and provide recommendations for improving their design and reporting. Results: Natural experiment studies combine features of experiments and non-experiments. They differ from randomised controlled trials (RCTs) in that exposure allocation is not controlled by researchers, while they differ from other observational designs in that they evaluate the impact of event or exposure changes. As a result, they are, in theory, less susceptible to bias than other observational study designs. Importantly, the strength of causal inferences relies on the plausibility that the exposure allocation can be considered "as-if randomised". The Target Trial Framework provides a systematic basis for assessing the plausibility of such claims, and enables a structured method for assessing other design elements. Conclusions: Natural experiment studies should be considered a distinct study design rather than a set of tools for analyses of non-randomised interventions. Alignment of natural experiments to the Target Trial Framework will clarify the strength of evidence underpinning claims about the effectiveness of public health interventions.

Read more
Methodology

Conditional As-If Analyses in Randomized Experiments

The injunction to "analyze the way you randomize" has been well known to statisticians since Fisher advocated for randomization as the basis of inference. Yet even those convinced by the merits of randomization-based inference seldom follow this injunction to the letter. Bernoulli randomized experiments are often analyzed as completely randomized experiments, and completely randomized experiments are analyzed as if they had been stratified; more generally, it is not uncommon to analyze an experiment as if it had been randomized differently. This paper examines the theoretical foundation behind this practice within a randomization-based framework. Specifically, we ask when it is legitimate to analyze an experiment randomized according to one design as if it had been randomized according to some other design. We show that a sufficient condition for this type of analysis to be valid is that the design used for analysis be derived from the original design by an appropriate form of conditioning. We use our theory to justify certain existing methods, question others, and finally suggest new methodological insights such as conditioning on approximate covariate balance.
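
A small simulation sketch of the "conditioning on approximate covariate balance" idea: the reference distribution of the difference in means is computed only over re-randomized assignments that satisfy a balance criterion containing the observed assignment. It is a toy illustration on simulated data, not the paper's formal framework, and the tolerance and effect size are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(8)
n = 100
x = rng.normal(size=n)                      # baseline covariate
z_obs = rng.binomial(1, 0.5, size=n)        # Bernoulli-randomized treatment
y = 0.5 * z_obs + x + rng.normal(size=n)    # outcome with a true effect of 0.5

def diff_in_means(v, z):
    return v[z == 1].mean() - v[z == 0].mean()

# Condition on an approximate-balance event that contains the observed assignment.
tol = max(0.1, abs(diff_in_means(x, z_obs)) * 1.001)
t_obs = diff_in_means(y, z_obs)

# Analyze the Bernoulli experiment "as if" completely randomized (permutations),
# keeping only assignments that satisfy the balance condition.
ref = []
while len(ref) < 2000:
    z = rng.permutation(z_obs)
    if abs(diff_in_means(x, z)) < tol:
        ref.append(diff_in_means(y, z))

p_value = np.mean(np.abs(ref) >= abs(t_obs))
print(round(t_obs, 3), round(p_value, 3))
```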

Read more
