Featured Research

Methodology

Matrix-free Penalized Spline Smoothing with Multiple Covariates

This paper motivates high-dimensional smoothing with penalized splines and its efficient numerical computation. If smoothing is carried out over three or more covariates, the classical tensor product spline bases explode in dimension, pushing the estimation to its numerical limits. A recent approach by Siebenborn and Wagner (2019) circumvents storage-expensive implementations by proposing matrix-free calculations that allow smoothing over several covariates. We extend their approach by linking penalized smoothing to its Bayesian formulation as a mixed model, which provides a matrix-free calculation of the smoothing parameter and avoids computationally expensive cross-validation. Further, we show how to extend these ideas to generalized regression models. The extended approach is applied to remote sensing satellite data in combination with spatial smoothing.
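As a hedged illustration of the matrix-free idea (a sketch, not the authors' implementation), the snippet below fits a two-covariate penalized smoother on gridded data with a conjugate gradient solver that only ever applies the marginal basis and penalty matrices; the full tensor-product normal-equations matrix is never formed. The Gaussian bump basis, grid sizes, and fixed smoothing parameter `lam` are illustrative assumptions; the paper instead estimates the smoothing parameter through the mixed-model link.

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

rng = np.random.default_rng(0)

# toy gridded response over two covariates (a stand-in for the satellite setting)
n1, n2, k1, k2 = 60, 50, 12, 10
x1, x2 = np.linspace(0, 1, n1), np.linspace(0, 1, n2)
y = (np.sin(2 * np.pi * x1)[:, None] + np.cos(2 * np.pi * x2)[None, :]
     + 0.1 * rng.standard_normal((n1, n2)))

def basis(x, k):
    # well-conditioned Gaussian bump basis standing in for B-splines
    centers = np.linspace(0, 1, k)
    return np.exp(-0.5 * ((x[:, None] - centers[None, :]) * k / 1.5) ** 2)

B1, B2 = basis(x1, k1), basis(x2, k2)
D1, D2 = np.diff(np.eye(k1), 2, axis=0), np.diff(np.eye(k2), 2, axis=0)
G1, G2 = B1.T @ B1, B2.T @ B2   # small marginal Gram matrices
P1, P2 = D1.T @ D1, D2.T @ D2   # second-difference penalties
lam = 1.0                       # fixed here; matrix-free mixed-model estimate in the paper

def matvec(c):
    # applies the penalized normal-equations operator to vec(C) without ever
    # building the (k1*k2) x (k1*k2) Kronecker-product matrix
    C = c.reshape(k1, k2)
    return (G1 @ C @ G2 + lam * (P1 @ C + C @ P2)).ravel()

A = LinearOperator((k1 * k2, k1 * k2), matvec=matvec)
rhs = (B1.T @ y @ B2).ravel()
coef, info = cg(A, rhs, maxiter=2000)   # info == 0 signals convergence
fit = B1 @ coef.reshape(k1, k2) @ B2.T  # fitted smooth surface
```

Only `k1 x k1` and `k2 x k2` matrices are stored, which is the point of the matrix-free approach: with more covariates the same trick keeps memory linear in the marginal basis sizes.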

Read more
Methodology

Maximum Entropy classification for record linkage

Record linkage joins records residing in separate files that are believed to relate to the same entity. In this paper we approach record linkage as a classification problem and adapt the maximum entropy classification method from text mining to record linkage, in both the supervised and unsupervised settings of machine learning. The set of links is chosen according to the associated uncertainty. On the one hand, our framework overcomes some persistent theoretical flaws of the classical approach pioneered by Fellegi and Sunter (1969); on the other hand, the proposed algorithm is scalable and fully automatic, unlike the classical approach, which generally requires clerical review to resolve undecided cases.
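As a rough sketch of the supervised setting only (not the paper's algorithm), binary maximum entropy classification reduces to logistic regression on record-pair comparison vectors; the fields, agreement probabilities, and 0.5 link threshold below are assumptions made for this toy example.

```python
import numpy as np

rng = np.random.default_rng(1)

# toy comparison vectors: agreement indicators for (name, birth date, address);
# true matches agree on most fields, non-matches mostly disagree
n = 400
match = rng.random(n) < 0.3
p_agree = np.where(match[:, None], [0.95, 0.9, 0.8], [0.1, 0.05, 0.2])
X = (rng.random((n, 3)) < p_agree).astype(float)
y = match.astype(float)

# maximum entropy (logistic) model fitted by gradient ascent on the log-likelihood
w, b = np.zeros(3), 0.0
for _ in range(2000):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    w += 0.5 * (X.T @ (y - p) / n)
    b += 0.5 * float(np.mean(y - p))

scores = 1.0 / (1.0 + np.exp(-(X @ w + b)))
links = scores > 0.5   # declare links; borderline scores flag the uncertain cases
acc = float(np.mean(links == match))
```

The fitted weights are the per-field log-odds contributions of an agreement, which is what makes the rule interpretable and fully automatic once the threshold is set.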

Read more
Methodology

Mixed-type multivariate response regression with covariance estimation

We propose a new method for multivariate response regressions where the elements of the response vector can be of mixed types, for example some continuous and some discrete. Our method is based on a model which assumes the observable mixed-type response vector is connected to a latent multivariate normal response linear regression through a link function. We explore the properties of this model and show its parameters are identifiable under reasonable conditions. We impose no parametric restrictions on the covariance of the latent normal other than positive definiteness, thereby avoiding assumptions about unobservable variables that can be difficult to verify. To accommodate this generality, we propose a novel algorithm for approximate maximum likelihood estimation that works "off-the-shelf" with many different combinations of response types, and which scales well in the dimension of the response vector. Our method typically gives better predictions and parameter estimates than fitting separate models for the different response types, and allows for approximate likelihood ratio testing of relevant hypotheses such as independence of responses. The usefulness of the proposed method is illustrated in simulations and in one biomedical and one genomic data example.
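A minimal simulation sketch of this kind of data-generating model (the dimensions, links, and coefficients are all assumed for illustration): a latent bivariate normal regression whose first component is observed directly and whose second is observed only through a threshold link, so the latent correlation induces dependence between a continuous and a binary response.

```python
import numpy as np

rng = np.random.default_rng(2)

# latent bivariate normal response with correlated components
n, rho = 5000, 0.7
Sigma = np.array([[1.0, rho], [rho, 1.0]])
x = rng.standard_normal(n)              # a single covariate
beta = np.array([0.5, -0.8])            # latent regression coefficients
Z = x[:, None] * beta + rng.multivariate_normal(np.zeros(2), Sigma, size=n)

# link functions: identity for the continuous element,
# a probit-style threshold for the discrete element
y_cont = Z[:, 0]
y_bin = (Z[:, 1] > 0).astype(int)
```

Fitting separate univariate models to `y_cont` and `y_bin` would discard the latent correlation `rho`, which is exactly the information a joint mixed-type model with unrestricted latent covariance retains.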

Read more
Methodology

Mode Hunting Using Pettiest Components Analysis

Principal component analysis has long been used to reduce the dimensionality of datasets. In this paper, we demonstrate that for mode detection the components of smallest variance, the pettiest components, are more important. We prove that when the data follow a multivariate normal distribution, "pettiest component analysis" yields boxes of optimal size, in the sense that their size is minimal over all possible boxes with the same number of dimensions and given probability content. We illustrate our result with a simulation showing that pettiest component analysis outperforms its competitors.

Read more
Methodology

Model Calibration via Distributionally Robust Optimization: On the NASA Langley Uncertainty Quantification Challenge

We study a methodology to tackle the NASA Langley Uncertainty Quantification Challenge, a model calibration problem under both aleatory and epistemic uncertainties. Our methodology is based on an integration of robust optimization, more specifically a recent line of research known as distributionally robust optimization, with importance sampling in Monte Carlo simulation. The main computational machinery in this integrated methodology amounts to solving sampled linear programs. We present theoretical statistical guarantees for our approach via connections to nonparametric hypothesis testing, and numerical performance on parameter calibration and downstream decision and risk evaluation tasks.
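As a toy illustration of the "sampled linear program" ingredient (the box-shaped ambiguity set, sample size, and radius below are assumptions made for the sketch, not the challenge formulation), the worst-case mean of a Monte Carlo sample over reweightings near the empirical distribution is itself a small LP:

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(4)

# Monte Carlo sample of an output quantity and its empirical weights
c = rng.normal(1.0, 0.5, size=50)   # sampled losses
p = np.full(50, 1.0 / 50)           # empirical distribution

# distributionally robust bound: maximize the weighted mean over probability
# vectors constrained to a band around the empirical weights
rho = 0.5
res = linprog(-c,                                   # linprog minimizes, so negate
              A_eq=np.ones((1, 50)), b_eq=[1.0],    # weights sum to one
              bounds=[((1 - rho) * w, (1 + rho) * w) for w in p])
worst_case_mean = -res.fun
```

The solver shifts weight toward the largest sampled losses, so the robust bound sits above the plain Monte Carlo estimate `c @ p`; richer ambiguity sets (e.g., divergence balls) change the constraints but keep the problem a sampled LP.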

Read more
Methodology

Model detection and variable selection for mode varying coefficient model

Varying coefficient models are often used in statistical modeling since they are more flexible than parametric models. However, model detection and variable selection for varying coefficient models are poorly understood in mode regression. Existing methods for these problems are often based on mean regression or quantile regression. In this paper, we propose a novel method to solve these problems for the mode varying coefficient model based on B-spline approximation and the SCAD penalty. Moreover, we present a new algorithm to estimate the parameters of interest, and discuss the selection of the tuning parameters and the bandwidth. We also establish the asymptotic properties of the estimated coefficients under some regularity conditions. Finally, we illustrate the proposed method with simulation studies and an empirical example.
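For reference, the SCAD penalty of Fan and Li (2001) used above can be written down in a few lines (a generic sketch of the penalty itself, not the paper's estimation code):

```python
import numpy as np

def scad(t, lam, a=3.7):
    """SCAD penalty, applied elementwise to |t| (a = 3.7 is the usual default)."""
    t = np.abs(np.asarray(t, dtype=float))
    small = lam * t                                          # L1-like near zero
    mid = (2 * a * lam * t - t**2 - lam**2) / (2 * (a - 1))  # quadratic taper
    large = lam**2 * (a + 1) / 2                             # flat for |t| > a*lam
    return np.where(t <= lam, small, np.where(t <= a * lam, mid, large))
```

The penalty behaves like the lasso near zero (shrinking small spline coefficients exactly to zero, which drives model detection and variable selection) but flattens out for large coefficients, avoiding the bias that a pure L1 penalty puts on strong signals.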

Read more
Methodology

Model selection for estimation of causal parameters

A popular technique for selecting and tuning machine learning estimators is cross-validation. Cross-validation evaluates overall model fit, usually in terms of predictive accuracy. This may lead to models that exhibit good overall predictive accuracy, but can be suboptimal for estimating causal quantities such as the average treatment effect. We propose a model selection procedure that estimates the mean-squared error of a one-dimensional estimator. The procedure relies on knowing an asymptotically unbiased estimator of the parameter of interest. Under regularity conditions, we show that the proposed criterion has asymptotically equal or lower variance than competing procedures based on sample splitting. In the literature, model selection is often used to choose among models for nuisance parameters but the identification strategy is usually fixed across models. Here, we use model selection to select among estimators that correspond to different estimands. More specifically, we use model selection to shrink between methods such as augmented inverse probability weighting, regression adjustment, the instrumental variables approach, and difference-in-means. The performance of the approach for estimation and inference for average treatment effects is evaluated on simulated data sets, including experimental data, instrumental variables settings, and observational data with selection on observables.
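To make the estimand-level comparison concrete, here is a hedged toy contrast (the data-generating process and sample size are assumed) between two of the estimators mentioned, difference-in-means and regression adjustment, on a simulated randomized experiment with a prognostic covariate:

```python
import numpy as np

rng = np.random.default_rng(5)

# simulated randomized experiment: true average treatment effect tau = 2
n, tau = 2000, 2.0
x = rng.standard_normal(n)                    # prognostic covariate
t = rng.integers(0, 2, n)                     # randomized treatment
y = tau * t + 3.0 * x + rng.standard_normal(n)

# estimator 1: difference in means between arms
dim = float(y[t == 1].mean() - y[t == 0].mean())

# estimator 2: regression adjustment, OLS of y on (1, t, x)
Z = np.column_stack([np.ones(n), t, x])
reg_adj = float(np.linalg.lstsq(Z, y, rcond=None)[0][1])
```

Both estimators target the same causal parameter here, but regression adjustment removes the covariate noise and is typically far less variable; a criterion that estimates each estimator's mean-squared error, as proposed above, would accordingly put most of its weight on the adjusted estimator.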

Read more
Methodology

Model structures and structural identifiability: What? Why? How?

We may attempt to encapsulate what we know about a physical system by a model structure, S. This collection of related models is defined by parametric relationships between system features; say observables (outputs), unobservable variables (states), and applied inputs. Each parameter vector in some parameter space is associated with a completely specified model in S. Before choosing a model in S to predict system behaviour, we must estimate its parameters from system observations. Inconveniently, multiple models (associated with distinct parameter estimates) may approximate data equally well. Yet, if these equally valid alternatives produce dissimilar predictions of unobserved quantities, then we cannot confidently make predictions. Thus, our study may not yield any useful result. We may anticipate the non-uniqueness of parameter estimates ahead of data collection by testing S for structural global identifiability (SGI). Here we provide an overview of the importance of SGI, some essential theory and distinctions, and demonstrate these in testing some examples.
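A small symbolic sketch of the non-uniqueness at issue (the exponential-decay structures are illustrative examples of my own, not drawn from the text): in an overparameterized structure the product a*b can be reshuffled without changing the output for any time t, whereas matching the outputs of the reduced structure at two time points pins its parameters down uniquely.

```python
import sympy as sp

t, a, b, k, c = sp.symbols('t a b k c', positive=True)

# structure S1: y(t; a, b, k) = a*b*exp(-k*t)
y = a * b * sp.exp(-k * t)

# the distinct parameter vector (a*c, b/c, k) gives the identical output for
# every c > 0, so (a, b) is not structurally globally identifiable in S1
y_alt = (a * c) * (b / c) * sp.exp(-k * t)
assert sp.simplify(y - y_alt) == 0

# reduced structure S2: y(t; A, k) = A*exp(-k*t); equating outputs at
# t = 0 and t = 1 forces the alternative parameters to coincide
A, A2, k2 = sp.symbols('A A2 k2', positive=True)
sol = sp.solve([sp.Eq(A, A2),
                sp.Eq(A * sp.exp(-k), A2 * sp.exp(-k2))],
               [A2, k2], dict=True)
```

Dedicated SGI software uses more systematic machinery (e.g., differential algebra or transfer function comparisons), but the underlying question is the one set up symbolically here.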

Read more
Methodology

Modeling Spatial Dependence with Cauchy Convolution Processes

We study the class of dependence models for spatial data obtained from Cauchy convolution processes based on different types of kernel functions. We show that the resulting spatial processes have appealing tail dependence properties, such as tail dependence at short distances and independence at long distances with suitable kernel functions. We derive the extreme-value limits of these processes, study their smoothness properties, and detail some interesting special cases. To get higher flexibility at sub-asymptotic levels and separately control the bulk and the tail dependence properties, we further propose spatial models constructed by mixing a Cauchy convolution process with a Gaussian process. We demonstrate that this framework indeed provides a rich class of models for the joint modeling of the bulk and the tail behaviors. Our proposed inference approach relies on matching model-based and empirical summary statistics, and an extensive simulation study shows that it yields accurate estimates. We demonstrate our new methodology by application to a temperature dataset measured at 97 monitoring stations in the state of Oklahoma, US. Our results indicate that our proposed model provides a very good fit to the data, and that it captures both the bulk and the tail dependence structures accurately.
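A minimal one-dimensional simulation sketch of a Cauchy convolution process (the kernel shape, bandwidth, and grids are assumptions for illustration): the process is a kernel-weighted sum of i.i.d. heavy-tailed Cauchy noise, and a short-range kernel makes nearby locations share almost the same noise while distant locations share essentially none.

```python
import numpy as np

rng = np.random.default_rng(6)

# latent Cauchy noise at m sites on [0, 1]
m = 500
sites = np.linspace(0, 1, m)
w = rng.standard_cauchy(m)                 # heavy-tailed driving noise

def kernel(d, bw=0.05):
    # short-range Gaussian kernel: rapid decay gives tail dependence at
    # short distances and near-independence at long distances
    return np.exp(-0.5 * (d / bw) ** 2)

# Cauchy convolution process observed at 200 locations
s = np.linspace(0, 1, 200)
Z = kernel(s[:, None] - sites[None, :]) @ w
```

Mixing such a process with an independent Gaussian process, as proposed above, lets the Gaussian part govern the bulk while the Cauchy convolution part governs the tails.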

Read more
Methodology

Modeling Treatment Effect Modification in Multidrug-Resistant Tuberculosis in an Individual Patient Data Meta-Analysis

Effect modification occurs when the effect of the treatment is not homogeneous across different strata of patient characteristics. When the effect of treatment may vary from individual to individual, precision medicine can be improved by identifying patient covariates used to estimate the size and direction of the effect at the individual level. However, this task is statistically challenging and typically requires large amounts of data. Investigators may be interested in using individual patient data (IPD) from multiple studies to estimate these treatment effect models. Our data arise from a systematic review of observational studies contrasting different treatments for multidrug-resistant tuberculosis (MDR-TB), where multiple antimicrobial agents are taken concurrently to cure the infection. We propose a marginal structural model (MSM) for effect modification by different patient characteristics and co-medications in a meta-analysis of observational IPD. We develop, evaluate, and apply a targeted maximum likelihood estimator (TMLE) for the doubly robust estimation of the parameters of the proposed MSM in this context. In particular, we allow for differential availability of treatments across studies, measured confounding within and across studies, and random effects by study.

Read more
