Featured Research

Methodology

Divide-and-Conquer MCMC for Multivariate Binary Data

The analysis of large-scale medical claims data has the potential to improve quality of care by generating insights which can be used to create tailored medical programs. In particular, the multivariate probit model can be used to investigate the correlation between multiple binary responses of interest in such data, e.g. the presence of multiple chronic conditions. Bayesian modeling is well suited to such analyses because of the automatic uncertainty quantification provided by the posterior distribution. A complicating factor is that large medical claims datasets often do not fit in memory, which renders the estimation of the posterior using traditional Markov Chain Monte Carlo (MCMC) methods computationally infeasible. To address this challenge, we extend existing divide-and-conquer MCMC algorithms to the multivariate probit model, demonstrating, via simulation, that they should be preferred over mean-field variational inference when the estimation of the latent correlation structure between binary responses is of primary interest. We apply this algorithm to a large database of de-identified Medicare Advantage claims from a single large US health insurance provider, where we find medically meaningful groupings of common chronic conditions and assess the impact of the urban-rural health gap by identifying underutilized provider specialties in rural areas.
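
The combination step in this family of algorithms can be sketched concisely. Below is a minimal, hypothetical illustration of the consensus Monte Carlo rule of Scott et al. (2016), one standard way of merging sub-posterior draws, applied to a generic parameter vector; it is not the paper's probit-specific algorithm, and for a latent correlation matrix the averaging would be done on an unconstrained reparameterization.

```python
import numpy as np

def consensus_combine(shard_draws):
    """Combine sub-posterior MCMC draws from K shards by precision-weighted
    averaging (consensus Monte Carlo, Scott et al. 2016).

    shard_draws : list of arrays, each of shape (n_draws, n_params), drawn on
                  one shard with the prior raised to the power 1/K.
    Returns an array of shape (n_draws, n_params) of combined draws.
    """
    # Weight each shard by the inverse of its sub-posterior covariance.
    weights = [np.linalg.inv(np.cov(d, rowvar=False)) for d in shard_draws]
    total = np.linalg.inv(np.sum(weights, axis=0))
    combined = np.zeros_like(shard_draws[0])
    for w, d in zip(weights, shard_draws):
        combined += d @ w          # weights are symmetric
    return combined @ total

# Toy usage: pretend each shard's sampler returned roughly Gaussian draws.
rng = np.random.default_rng(0)
draws = [rng.normal(loc=m, scale=0.3, size=(2000, 3)) for m in (0.9, 1.0, 1.1)]
print(consensus_combine(draws).mean(axis=0))   # pooled posterior mean
```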

Read more
Methodology

Docs are ROCs: A simple off-the-shelf approach for estimating average human performance in diagnostic studies

Average human performance has been estimated inconsistently in research in diagnostic medicine. This has been particularly apparent in the field of medical artificial intelligence, where humans are often compared against AI models in multi-reader multi-case studies, and commonly reported metrics such as the pooled or average human sensitivity and specificity will systematically underestimate the performance of human experts. We present the use of summary receiver operating characteristic (SROC) curve analysis, a technique commonly used in the meta-analysis of diagnostic test accuracy studies, as a sensible and methodologically robust alternative. We describe the motivation for using these methods and present results from applying these meta-analytic techniques to a handful of prominent medical AI studies.
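
As a rough, hypothetical illustration of summary ROC analysis, the sketch below fits the classical Moses-Littenberg linear model to made-up reader operating points; the paper's preferred meta-analytic model may well differ (e.g. a bivariate or hierarchical SROC model).

```python
import numpy as np

def logit(p):
    return np.log(p / (1 - p))

def expit(x):
    return 1 / (1 + np.exp(-x))

# Hypothetical per-reader operating points (sensitivity, specificity).
sens = np.array([0.82, 0.74, 0.90, 0.68, 0.79])
spec = np.array([0.88, 0.93, 0.80, 0.95, 0.90])

# Moses-Littenberg: regress D = logit(TPR) - logit(FPR) on
# S = logit(TPR) + logit(FPR), then back out TPR as a function of FPR.
tpr, fpr = sens, 1 - spec
D = logit(tpr) - logit(fpr)
S = logit(tpr) + logit(fpr)
b, a = np.polyfit(S, D, deg=1)           # D ~ a + b * S

fpr_grid = np.linspace(0.01, 0.99, 99)
sroc_tpr = expit((a + (1 + b) * logit(fpr_grid)) / (1 - b))
print(sroc_tpr[[9, 49, 89]])             # summary TPR at FPR = 0.1, 0.5, 0.9
```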

Read more
Methodology

Double bootstrapping for visualising the distribution of descriptive statistics of functional data

We propose a double bootstrap procedure for reducing coverage error in the confidence intervals of descriptive statistics for independent and identically distributed functional data. Through a series of Monte Carlo simulations, we compare the finite-sample performance of single and double bootstrap procedures for estimating the distribution of descriptive statistics for such data. At the cost of longer computational time, the double bootstrap with the same bootstrap method reduces confidence level error and provides better coverage accuracy than the single bootstrap. Illustrated on a Canadian weather station data set, the double bootstrap procedure offers a tool for visualising the distribution of descriptive statistics for functional data.
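
The calibration idea behind the double bootstrap can be sketched as follows. This is a simplified, hypothetical version for pointwise percentile intervals of the mean function: an inner bootstrap estimates the actual coverage of a nominal-level interval, and the nominal level is then adjusted until the estimated coverage matches the target. All settings (grid, replicate counts, toy data) are made up.

```python
import numpy as np

rng = np.random.default_rng(1)

def percentile_ci(curves, level):
    """Pointwise percentile interval from bootstrap mean curves."""
    alpha = 1 - level
    return (np.quantile(curves, alpha / 2, axis=0),
            np.quantile(curves, 1 - alpha / 2, axis=0))

def boot_means(data, n_boot):
    n = data.shape[0]
    idx = rng.integers(0, n, size=(n_boot, n))
    return data[idx].mean(axis=1)            # shape (n_boot, n_grid)

def double_bootstrap_level(data, target=0.95, n_outer=200, n_inner=100):
    """Calibrate the nominal level so that the estimated pointwise coverage
    of the percentile interval for the mean function matches `target`."""
    mean_hat = data.mean(axis=0)              # plays the role of the 'truth'
    n = data.shape[0]
    levels = np.linspace(0.80, 0.999, 40)
    coverage = np.zeros_like(levels)
    for _ in range(n_outer):
        outer = data[rng.integers(0, n, size=n)]
        inner_means = boot_means(outer, n_inner)
        for j, lev in enumerate(levels):
            lo, hi = percentile_ci(inner_means, lev)
            coverage[j] += np.mean((lo <= mean_hat) & (mean_hat <= hi))
    coverage /= n_outer
    return levels[np.argmin(np.abs(coverage - target))]

# Toy functional data: 60 noisy sine curves observed on a common grid.
grid = np.linspace(0, 1, 50)
data = np.sin(2 * np.pi * grid) + rng.normal(0, 0.3, size=(60, 50))

level = double_bootstrap_level(data)
lo, hi = percentile_ci(boot_means(data, 1000), level)  # calibrated interval
print(round(level, 3), float((hi - lo).mean()))
```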

Read more
Methodology

Double/Debiased Machine Learning for Logistic Partially Linear Model

We propose double/debiased machine learning approaches to infer, at the parametric rate, the parametric component of a logistic partially linear model, in which the binary response follows a conditional logistic model with a low-dimensional linear parametric function of some key (exposure) covariates and a nonparametric function adjusting for the confounding effect of other covariates. We consider a Neyman orthogonal (doubly robust) score equation involving two nuisance functions: the nonparametric component in the logistic model and the conditional mean of the exposure given the other covariates with the response fixed. To estimate the nuisance models, we separately consider the use of high-dimensional (HD) sparse parametric models and more general (typically nonparametric) machine learning (ML) methods. In the HD case, we derive certain moment equations to calibrate the first-order bias of the nuisance models and grant our method a model double robustness property, in the sense that our estimator achieves the desirable rate when at least one of the nuisance models is correctly specified and both of them are ultra-sparse. In the ML case, the non-linearity of the logit link makes it substantially harder than in the partially linear setting to use an arbitrary conditional mean learning algorithm to estimate the nuisance component of the logistic model. We handle this obstacle through a novel full model refitting procedure that is easy to implement and facilitates the use of nonparametric ML algorithms in our framework. Our ML estimator is rate doubly robust in the same sense as Chernozhukov et al. (2018a). We evaluate our methods through simulation studies and apply them to assess the effect of the emergency contraceptive (EC) pill on early gestation foetal outcomes, exploiting a policy reform in Chile in 2008 (Bentancor and Clarke, 2017).
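
The sample-splitting and orthogonal-score machinery underlying this kind of estimator is easiest to see in the standard partially linear model of Chernozhukov et al. (2018); the sketch below shows that generic cross-fitted version with random-forest nuisances, not the paper's logistic-link score or its full model refitting step.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold

def dml_plr(y, d, X, n_folds=5, seed=0):
    """Cross-fitted partially linear DML (Chernozhukov et al., 2018):
    theta solves sum (y - m_hat(X) - theta (d - e_hat(X))) (d - e_hat(X)) = 0."""
    y_res = np.zeros_like(y, dtype=float)
    d_res = np.zeros_like(d, dtype=float)
    for train, test in KFold(n_folds, shuffle=True, random_state=seed).split(X):
        m_hat = RandomForestRegressor(random_state=seed).fit(X[train], y[train])
        e_hat = RandomForestRegressor(random_state=seed).fit(X[train], d[train])
        y_res[test] = y[test] - m_hat.predict(X[test])
        d_res[test] = d[test] - e_hat.predict(X[test])
    theta = np.sum(d_res * y_res) / np.sum(d_res * d_res)
    se = np.sqrt(np.mean((y_res - theta * d_res) ** 2 * d_res ** 2)) / (
        np.mean(d_res ** 2) * np.sqrt(len(y)))
    return theta, se

# Toy data with a nonlinear confounder effect; true exposure effect is 0.5.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 5))
d = np.sin(X[:, 0]) + rng.normal(scale=0.5, size=2000)
y = 0.5 * d + np.cos(X[:, 0]) + rng.normal(scale=0.5, size=2000)
print(dml_plr(y, d, X))
```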

Read more
Methodology

Doubly Robust Semiparametric Inference Using Regularized Calibrated Estimation with High-dimensional Data

Consider semiparametric estimation where a doubly robust estimating function for a low-dimensional parameter is available, depending on two working models. With high-dimensional data, we develop regularized calibrated estimation as a general method for estimating the parameters in the two working models, such that valid Wald confidence intervals can be obtained for the parameter of interest under suitable sparsity conditions if either of the two working models is correctly specified. We propose a computationally tractable two-step algorithm and provide rigorous theoretical analysis which justifies sufficiently fast rates of convergence for the regularized calibrated estimators in spite of sequential construction, and establishes a desired asymptotic expansion for the doubly robust estimator. As concrete examples, we discuss applications to partially linear, log-linear, and logistic models and to estimation of average treatment effects. Numerical studies in the first three examples demonstrate the superior performance of our method compared with the debiased Lasso.
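
A minimal sketch of the calibration idea for the propensity-score working model is given below, assuming a logistic propensity model and replacing the paper's lasso-type regularization with a small ridge term for simplicity; the outcome working model and the doubly robust step are omitted, so this is an illustration rather than the paper's two-step algorithm.

```python
import numpy as np
from scipy.optimize import minimize

def fit_calibrated_propensity(T, X, ridge=1e-4):
    """Fit pi(x) = expit(gamma' x) by minimizing a calibration loss:
        (1/n) sum_i [ T_i * exp(-gamma' x_i) + (1 - T_i) * gamma' x_i ].
    Without the ridge term, the minimizer satisfies
        sum_i T_i / pi_hat(x_i) * x_i = sum_i x_i,
    so inverse-probability weights exactly calibrate the covariate means."""
    Xd = np.column_stack([np.ones(len(T)), X])      # add intercept

    def loss(g):
        lin = Xd @ g
        return np.mean(T * np.exp(-lin) + (1 - T) * lin) + ridge * np.sum(g ** 2)

    def grad(g):
        lin = Xd @ g
        w = -T * np.exp(-lin) + (1 - T)
        return Xd.T @ w / len(T) + 2 * ridge * g

    res = minimize(loss, np.zeros(Xd.shape[1]), jac=grad, method="BFGS")
    return 1 / (1 + np.exp(-Xd @ res.x))             # fitted propensity scores

# Toy check: IPW-weighted treated covariate means ~ full-sample means.
rng = np.random.default_rng(2)
X = rng.normal(size=(5000, 3))
T = rng.binomial(1, 1 / (1 + np.exp(-(0.5 * X[:, 0] - 0.3 * X[:, 1]))))
pi = fit_calibrated_propensity(T, X)
print(X.mean(axis=0))
print(np.average(X[T == 1], axis=0, weights=1 / pi[T == 1]))
```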

Read more
Methodology

Dynamic sparsity on dynamic regression models

In the present work, we consider variable selection and shrinkage for the Gaussian dynamic linear regression model within a Bayesian framework. In particular, we propose a novel method that allows for time-varying sparsity, based on an extension of spike-and-slab priors to dynamic models. This is done by assigning appropriate Markov switching priors to the time-varying coefficients' variances, extending the previous work of Ishwaran and Rao (2005). Furthermore, we investigate different priors, including the common inverse-gamma prior for the process variances, as well as mixture prior distributions with Gamma priors for both the spike and the slab, which lead to a mixture of Normal-Gamma priors (Griffin and Brown, 2010) for the coefficients. In this sense, our prior can be viewed as a dynamic variable selection prior which induces either smoothness (through the slab) or shrinkage towards zero (through the spike) at each time point. The MCMC method used for posterior computation relies on Markov latent variables that can assume binary regimes at each time point to generate the coefficients' variances. In this way, our model is a dynamic mixture model, so we can use the algorithm of Gerlach et al. (2000) to generate the latent processes without conditioning on the states. Finally, our approach is illustrated through simulated examples and a real data application.
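
One way to see what such a prior does is to simulate coefficient paths from a stylized version of it: a two-state Markov chain switches each coefficient between a spike regime (shrinkage towards zero) and a slab regime (smooth random-walk evolution). The variances and transition probability below are illustrative choices, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(3)

def simulate_dynamic_spike_slab(T=200, p_stay=0.95,
                                spike_var=1e-4, slab_var=0.25):
    """Draw one time-varying coefficient path from a stylized Markov-switching
    spike-and-slab prior: a binary regime s_t selects, at each time point,
    between shrinkage towards zero and smooth random-walk evolution."""
    s = np.zeros(T, dtype=int)
    beta = np.zeros(T)
    for t in range(1, T):
        # Two-state Markov chain: stay in the current regime w.p. p_stay.
        s[t] = s[t - 1] if rng.random() < p_stay else 1 - s[t - 1]
        if s[t] == 1:                       # slab: smooth evolution
            beta[t] = beta[t - 1] + rng.normal(0, np.sqrt(slab_var))
        else:                               # spike: coefficient shrunk to ~0
            beta[t] = rng.normal(0, np.sqrt(spike_var))
    return s, beta

s, beta = simulate_dynamic_spike_slab()
print("fraction of time in the slab regime:", s.mean())
print("coefficient range:", beta.min(), beta.max())
```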

Read more
Methodology

Efficient Estimation of General Treatment Effects using Neural Networks with A Diverging Number of Confounders

The estimation of causal effects is a primary goal of the behavioral, social, economic and biomedical sciences. Under the unconfounded treatment assignment condition, adjustment for confounders requires estimating the nuisance functions relating the outcome and/or treatment to the confounders. Conventional approaches rely on either a parametric or a nonparametric modeling strategy to approximate the nuisance functions. Parametric methods can introduce serious bias into causal effect estimation due to possible misspecification, while nonparametric estimation suffers from the "curse of dimensionality". This paper proposes a new unified approach for efficient estimation of treatment effects using feedforward artificial neural networks when the number of covariates is allowed to increase with the sample size. We consider a general optimization framework that includes the average, quantile and asymmetric least squares treatment effects as special cases. Under this unified setup, we develop a generalized optimization estimator for the treatment effect with the nuisance functions estimated by neural networks. We further establish the consistency and asymptotic normality of the proposed estimator and show that it attains the semiparametric efficiency bound. The proposed methods are illustrated via simulation studies and a real data application.
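
For intuition, the familiar average treatment effect special case can be sketched with feedforward networks as nuisance estimators plugged into the augmented IPW (doubly robust) moment. This hypothetical sketch uses scikit-learn MLPs and omits cross-fitting and the paper's general optimization framework, so it is an illustration rather than the authors' estimator.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier, MLPRegressor

def aipw_ate_nn(y, d, X, seed=0):
    """ATE via the augmented IPW moment, with feedforward neural networks
    for the propensity score and the two outcome regressions."""
    ps = MLPClassifier(hidden_layer_sizes=(32, 32), max_iter=2000,
                       random_state=seed).fit(X, d).predict_proba(X)[:, 1]
    ps = np.clip(ps, 0.01, 0.99)
    mu1 = MLPRegressor(hidden_layer_sizes=(32, 32), max_iter=2000,
                       random_state=seed).fit(X[d == 1], y[d == 1]).predict(X)
    mu0 = MLPRegressor(hidden_layer_sizes=(32, 32), max_iter=2000,
                       random_state=seed).fit(X[d == 0], y[d == 0]).predict(X)
    psi = mu1 - mu0 + d * (y - mu1) / ps - (1 - d) * (y - mu0) / (1 - ps)
    return psi.mean(), psi.std(ddof=1) / np.sqrt(len(y))

# Toy data with a nonlinear outcome model; true ATE is 1.0.
rng = np.random.default_rng(4)
X = rng.normal(size=(3000, 10))
d = rng.binomial(1, 1 / (1 + np.exp(-X[:, 0])))
y = 1.0 * d + X[:, 1] ** 2 + rng.normal(size=3000)
print(aipw_ate_nn(y, d, X))
```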

Read more
Methodology

Efficient Study Design with Multiple Measurement Instruments

Outcomes in studies assessing exposure are often based on multiple measurements. In previous work, using a model first proposed by Buonaccorsi (1991), we showed that combining direct (e.g. biomarkers) and indirect (e.g. self-report) measurements provides a more accurate picture of true exposure than estimates obtained using a single type of measurement. In this article, we propose a valuable tool for the efficient design of studies that include both direct and indirect measurements of a relevant outcome. Based on data from a pilot or preliminary study, the tool, which is available online as a Shiny app, can be used to compute: (1) the sample size required for a statistical power analysis, while optimizing the percentage of participants who should provide direct measures of exposure (biomarkers) in addition to the indirect (self-report) measures provided by all participants; (2) the ideal number of replicates; and (3) the allocation of resources to intervention and control arms. In addition, we show how to examine the sensitivity of results to underlying assumptions. We illustrate our analysis using studies of tobacco smoke exposure and nutrition. In these examples, a near-optimal allocation of resources can be found even if the assumptions are not precise.
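
A generic, simulation-based stand-in for this kind of design calculation is sketched below; it is not the authors' Shiny app or their formulas. For a fixed hypothetical budget, the fraction of participants providing biomarkers is varied, self-reports are calibrated against the biomarker subsample, and the Monte Carlo power of the arm comparison is compared across allocations. All costs, variances and effect sizes are made up, and the analysis of the mixed measurements is deliberately simplified.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)

def simulate_trial_power(n_per_arm, frac_biomarker, effect=0.30,
                         sd_true=1.0, sr_bias=0.5, sr_sd=0.8, bio_sd=0.3,
                         n_sim=500, alpha=0.05):
    """Monte Carlo power of a two-arm comparison of mean exposure when all
    participants self-report and a fraction also provides a biomarker.
    Self-reports are calibrated against the biomarker subsample
    (simple regression calibration) before the arms are compared."""
    m = int(frac_biomarker * n_per_arm)        # biomarker subsample per arm
    rejections = 0
    for _ in range(n_sim):
        est = []
        for arm_effect in (0.0, effect):
            true = rng.normal(arm_effect, sd_true, n_per_arm)
            sr = true + sr_bias + rng.normal(0, sr_sd, n_per_arm)
            bio = true[:m] + rng.normal(0, bio_sd, m)
            slope, intercept = np.polyfit(sr[:m], bio, 1)  # calibration fit
            exposure = np.concatenate([bio, intercept + slope * sr[m:]])
            est.append(exposure)
        rejections += stats.ttest_ind(est[0], est[1]).pvalue < alpha
    return rejections / n_sim

# Designs with the same hypothetical per-arm budget: a biomarker costs 4x
# a self-report, so a larger biomarker fraction means fewer participants.
budget = 4000
for frac in (0.1, 0.25, 0.5):
    n = int(budget / (1 + 4 * frac))
    print(frac, n, simulate_trial_power(n, frac))
```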

Read more
Methodology

Efficient calibration for imperfect epidemic models with applications to the analysis of COVID-19

The estimation of unknown parameters in simulations, also known as calibration, is crucial for practical management of epidemics and prediction of pandemic risk. A simple yet widely used approach is to estimate the parameters by minimizing the sum of squared distances between actual observations and simulation outputs. It is shown in this paper that this method is inefficient, particularly when the epidemic models are developed based on certain simplifications of reality, also known as imperfect models, which are commonly used in practice. To address this issue, a new estimator is introduced that is asymptotically consistent, has a smaller estimation variance than the least squares estimator, and achieves semiparametric efficiency. Numerical studies are performed to examine the finite-sample performance. The proposed method is applied to the analysis of the COVID-19 pandemic for 20 countries based on the SEIR (Susceptible-Exposed-Infectious-Recovered) model with both deterministic and stochastic simulations. The estimates of the parameters, including the basic reproduction number and the average incubation period, reveal the risk of disease outbreaks in each country and provide insights into the design of public health interventions.
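
The baseline least squares calibration that the paper argues is inefficient is easy to sketch for a deterministic SEIR model: simulate the model and choose the transmission and incubation parameters minimizing the sum of squared distances to the observed series. The sketch below uses synthetic data and illustrative parameter values; the paper's more efficient estimator is not reproduced here.

```python
import numpy as np
from scipy.integrate import solve_ivp
from scipy.optimize import minimize

def seir(t, y, beta, sigma, gamma):
    S, E, I, R = y
    N = S + E + I + R
    return [-beta * S * I / N,
            beta * S * I / N - sigma * E,
            sigma * E - gamma * I,
            gamma * I]

def simulate_infectious(params, days, y0, gamma=0.2):
    beta, sigma = params
    sol = solve_ivp(seir, (0, days), y0, args=(beta, sigma, gamma),
                    t_eval=np.arange(days + 1))
    return sol.y[2]                         # infectious compartment over time

# Synthetic "observed" daily infectious counts with noise.
y0 = [9990.0, 5.0, 5.0, 0.0]
days = 60
truth = simulate_infectious((0.5, 0.25), days, y0)
rng = np.random.default_rng(6)
observed = truth + rng.normal(0, 20, size=truth.shape)

# Ordinary least squares calibration (the simple, widely used estimator).
def sse(params):
    return np.sum((observed - simulate_infectious(params, days, y0)) ** 2)

fit = minimize(sse, x0=[0.3, 0.2], bounds=[(0.05, 2.0), (0.05, 1.0)],
               method="L-BFGS-B")
beta_hat, sigma_hat = fit.x
print("beta:", beta_hat, "incubation days:", 1 / sigma_hat,
      "R0:", beta_hat / 0.2)
```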

Read more
Methodology

Efficient, Doubly Robust Estimation of the Effect of Dose Switching for Switchers in a Randomised Clinical Trial

Motivated by a clinical trial conducted by Janssen Pharmaceuticals in which a flexible dosing regimen is compared to placebo, we evaluate how switchers in the treatment arm (i.e., patients who were switched to the higher dose) would have fared had they been kept on the low dose, in order to understand whether flexible dosing is potentially beneficial for them. Simply comparing these patients' responses with those of patients who stayed on the low dose is unsatisfactory because the latter patients are usually in better health. Because the available information in the considered trial is too scarce to enable a reliable adjustment, we instead transport data from a fixed dosing trial that has been conducted concurrently on the same target, albeit not in an identical patient population. In particular, we propose an estimator which relies on an outcome model and a propensity score model for the association between study and patient characteristics. The proposed estimator is asymptotically unbiased if at least one of the two models is correctly specified, and efficient (under the model defined by the restrictions on the propensity score) when both models are correctly specified. We show that the proposed method for using results from an external study is generically applicable in studies where a classical confounding adjustment is not possible due to positivity violation (e.g., studies where switching takes place in a deterministic manner). Monte Carlo simulations and application to the motivating study demonstrate adequate performance.
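
A generic doubly robust "transport" estimator of this flavor can be sketched as follows, assuming the external trial provides outcomes under the low dose and a model for study membership links the two patient populations. The nuisance models, covariates and data are hypothetical, and this is not the paper's exact estimator.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

def dr_transported_mean(y_ext, X_ext, X_target):
    """Doubly robust estimate of the mean outcome the target group would have
    had under the external study's regimen, combining an outcome model fit in
    the external study with a study-membership (propensity-type) model."""
    n_ext, n_tgt = len(X_ext), len(X_target)
    X_all = np.vstack([X_ext, X_target])
    s = np.concatenate([np.zeros(n_ext), np.ones(n_tgt)])   # 1 = target study

    outcome = LinearRegression().fit(X_ext, y_ext)            # outcome model
    member = LogisticRegression(max_iter=1000).fit(X_all, s)  # membership model
    p_ext = np.clip(member.predict_proba(X_ext)[:, 1], 0.01, 0.99)

    # Outcome-model prediction in the target group, plus an inverse-odds
    # weighted residual correction from the external study.
    odds = p_ext / (1 - p_ext)
    aug = np.sum(odds * (y_ext - outcome.predict(X_ext))) / n_tgt
    return outcome.predict(X_target).mean() + aug

# Toy example with hypothetical baseline covariates.
rng = np.random.default_rng(7)
X_ext = rng.normal(0.0, 1.0, size=(800, 3))
y_ext = 2.0 + X_ext[:, 0] + rng.normal(0, 1, 800)
X_target = rng.normal(0.5, 1.0, size=(300, 3))   # shifted target population
print(dr_transported_mean(y_ext, X_ext, X_target))  # roughly 2.5
```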

Read more
