Featured Research

Methodology

Moving sum data segmentation for stochastic processes based on invariance

The segmentation of data into stationary stretches, also known as the multiple change point problem, is important for many applications in time series analysis as well as in signal processing. Based on strong invariance principles, we analyse data segmentation methodology using moving sum (MOSUM) statistics for a class of regime-switching multivariate processes where each switch results in a change in the drift. In particular, this framework includes the data segmentation of multivariate partial sum, integrated diffusion and renewal processes even if the distance between change points is sublinear. We study the asymptotic behaviour of the corresponding change point estimators, show consistency and derive the corresponding localisation rates, which are minimax optimal in a variety of situations including an unbounded number of changes in Wiener processes with drift. Furthermore, we derive the limit distribution of the change point estimators for local changes, a result that can in principle be used to derive confidence intervals for the change points.
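The MOSUM idea can be sketched in a minimal univariate form: compare sums over adjacent windows of bandwidth G and flag locations where the scaled difference of means is large. This is an illustrative simplification of the multivariate framework above; the bandwidth and threshold below are hypothetical choices, not the paper's calibrated ones.

```python
import numpy as np

def mosum_statistic(x, G):
    """Moving-sum statistic: scaled difference of sums over two
    adjacent windows of bandwidth G (simplified, univariate)."""
    n = len(x)
    stat = np.zeros(n)
    for k in range(G, n - G):
        left = x[k - G:k].sum()
        right = x[k:k + G].sum()
        stat[k] = abs(right - left) / np.sqrt(2 * G)
    return stat

def estimate_change_points(x, G, threshold):
    """Scan the MOSUM statistic; each exceedance of the threshold
    yields a change point estimate at the local argmax."""
    stat = mosum_statistic(x, G)
    cps, k, n = [], G, len(x)
    while k < n - G:
        if stat[k] > threshold:
            j = k + np.argmax(stat[k:min(k + G, n - G)])
            cps.append(j)
            k = j + G  # skip past the detected change
        else:
            k += 1
    return cps
```

In practice the threshold would come from the limit theory (via the invariance principle) rather than being fixed by hand.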

Read more
Methodology

Multi-Block Sparse Functional Principal Components Analysis for Longitudinal Microbiome Multi-Omics Data

Microbiome researchers often need to model the temporal dynamics of multiple complex, nonlinear outcome trajectories simultaneously. This motivates our development of multivariate Sparse Functional Principal Components Analysis (mSFPCA), extending existing SFPCA methods to simultaneously characterize multiple temporal trajectories and their inter-relationships. As with existing SFPCA methods, the mSFPCA algorithm characterizes each trajectory as a smooth mean plus a weighted combination of the smooth major modes of variation about the mean, where the weights are given by the component scores for each subject. Unlike existing SFPCA methods, the mSFPCA algorithm allows estimation of multiple trajectories simultaneously, such that the component scores, which are constrained to be independent within a particular outcome for identifiability, may be arbitrarily correlated with component scores for other outcomes. A Cholesky decomposition is used to estimate the component score covariance matrix efficiently and guarantee positive semi-definiteness given these constraints. Mutual information is used to assess the strength of marginal and conditional temporal associations across outcome trajectories. Importantly, we implement mSFPCA as a Bayesian algorithm using R and Stan, enabling easy use of tools such as PSIS-LOO for model selection and graphical posterior predictive checks to assess the validity of mSFPCA models. Although we focus on application of mSFPCA to microbiome data in this paper, the mSFPCA model is of general utility and can be used in a wide range of real-world applications.
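Two building blocks mentioned above, the "smooth mean plus weighted modes" representation and the Cholesky parameterization that guarantees a valid score covariance, can be sketched as follows. This is illustrative only; the function names and shapes are our own, not the mSFPCA implementation.

```python
import numpy as np

def reconstruct_trajectory(mean_curve, modes, scores):
    """FPCA-style representation: smooth mean plus a weighted
    combination of the major modes of variation, where the weights
    are a subject's component scores.
    mean_curve: (T,), modes: (T, K), scores: (K,)."""
    return mean_curve + modes @ scores

def score_covariance(chol_free):
    """Parameterize the component-score covariance matrix through a
    lower-triangular Cholesky factor, so the resulting matrix is
    positive semi-definite by construction."""
    L = np.tril(chol_free)
    return L @ L.T
```

The Cholesky trick is a standard way to let an optimizer or sampler move through unconstrained factor entries while every implied covariance stays valid.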

Read more
Methodology

Multi-Regularization Reconstruction of One-Dimensional T2 Distributions in Magnetic Resonance Relaxometry with a Gaussian Basis

We consider the inverse problem of recovering the probability distribution function of T2 relaxation times from NMR transverse relaxometry experiments. This problem is a variant of the inverse Laplace transform and hence ill-posed. We cast it within the framework of a Gaussian mixture model to obtain a least-squares problem with an L2 regularization term. We propose a new method for incorporating regularization into the solution: rather than seeking to replace the native problem with a mathematically close, regularized version, we instead augment the native formulation with regularization. We term this new approach 'multi-regularization'; it avoids the treacherous process of selecting a single best regularization parameter λ and instead permits incorporation of several degrees of regularization into the solution. We illustrate the method with extensive simulation results as well as application to real experimental data.
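For concreteness, the single-λ building block here is a Tikhonov-regularized least-squares solve; multi-regularization draws on several such solves across degrees of regularization rather than committing to one λ. A hedged sketch of that building block (our own helper, not the authors' code):

```python
import numpy as np

def ridge_solutions(A, b, lambdas):
    """Solve min ||A x - b||^2 + lam ||x||^2 via the normal
    equations (A'A + lam I) x = A'b, for several regularization
    levels lam. Larger lam shrinks the solution toward zero."""
    n = A.shape[1]
    return [np.linalg.solve(A.T @ A + lam * np.eye(n), A.T @ b)
            for lam in lambdas]
```

For an ill-posed kernel such as the discretized Laplace transform, the family of solutions over a λ grid is exactly the information a multi-regularization scheme would combine instead of picking one member.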

Read more
Methodology

Multi-output calibration of a honeycomb seal via on-site surrogates

We consider large-scale industrial computer model calibration, combining multi-output simulation with limited physical observation, involved in the development of a honeycomb seal. Toward that end, we adopt a localized sampling and emulation strategy called "on-site surrogates (OSSs)", designed to cope with the amalgamated challenges of high-dimensional inputs, large-scale simulation campaigns, and nonstationary response surfaces. In previous applications, OSSs were one-at-a-time affairs for multiple outputs. We demonstrate that this leads to dissonance in calibration efforts for a common parameter set across outputs for the honeycomb. Instead, a conceptually straightforward, but implementationally intricate, principal-components representation, adapted from ordinary Gaussian process surrogate modeling to the OSS setting, can resolve this tension. With a two-pronged approach, one optimization-based and one fully Bayesian, we show how pooled information across outputs can reduce uncertainty and enhance (statistical and computational) efficiency in calibrated parameters for the honeycomb relative to the previous, "data-poor" univariate analog.

Read more
Methodology

Multilevel calibration weighting for survey data

A pressing challenge in modern survey research is to find calibration weights when covariates are high dimensional and especially when interactions between variables are important. Traditional approaches like raking typically fail to balance higher-order interactions, and post-stratification, which exactly balances all interactions, is only feasible for a small number of variables. In this paper, we propose multilevel calibration weighting, which enforces tight balance constraints for marginal balance and looser constraints for higher-order interactions. This incorporates some of the benefits of post-stratification while retaining the guarantees of raking. We then correct for the bias due to the relaxed constraints via a flexible outcome model; we call this approach Double Regression with Post-stratification (DRP). We characterize the asymptotic properties of these estimators and show that the proposed calibration approach has a dual representation as a multilevel model for survey response. We assess the performance of this method via an extensive simulation study and show how it can reduce bias in a case study of a large-scale survey of voter intention in the 2016 U.S. presidential election. The approach is available in the multical R package.
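As a point of reference for the calibration methods discussed above, classical raking (iterative proportional fitting) adjusts weights until each variable's weighted margins match known population totals, without touching interactions. A minimal sketch assuming integer-coded categorical variables with every level observed (illustrative, not the multical implementation):

```python
import numpy as np

def rake(base_weights, categories, targets, n_iter=50):
    """Classical raking via iterative proportional fitting.
    categories: list of integer-coded variables (one array per
    variable, one entry per respondent); targets: list of arrays
    of population totals per level. Assumes every level appears
    in the sample, so no weighted count is zero."""
    w = base_weights.astype(float).copy()
    for _ in range(n_iter):
        for cats, tot in zip(categories, targets):
            cur = np.bincount(cats, weights=w, minlength=len(tot))
            w *= (tot / cur)[cats]  # rescale within each level
    return w
```

Multilevel calibration generalizes this by balancing such margins tightly while only loosely constraining the higher-order cells that raking ignores.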

Read more
Methodology

Multiple-trait Adaptive Fisher's Method for Genome-wide Association Studies

In genome-wide association studies (GWASs), there is an increasing need for detecting the associations between a genetic variant and multiple traits. In studies of complex diseases, it is common to measure several potentially correlated traits in a single GWAS. Despite the multivariate nature of the studies, single-trait-based methods remain the most widely adopted analysis procedure, owing to their simplicity for studies with multiple traits as their outcome. However, the association between a genetic variant and a single trait can sometimes be weak, and ignoring the correlation among traits can result in a loss of power. In contrast, multiple-trait analysis, which analyzes a group of traits simultaneously, has been shown to be more powerful because it incorporates information from the correlated traits. Although existing methods have been developed for multiple traits, several drawbacks limit their wide application in GWASs. In this paper, we propose a multiple-trait adaptive Fisher's (MTAF) method to test associations between a genetic variant and multiple traits at once, by adaptively aggregating evidence from each trait. The proposed method can accommodate both continuous and binary traits and has reliable performance under various scenarios. Using a simulation study, we compared our proposed method with several existing methods and demonstrated its competitiveness in terms of type I error control and statistical power. By applying the method to the Study of Addiction: Genetics and Environment (SAGE) dataset, we successfully identified several genes associated with substance dependence.
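The adaptive-aggregation idea can be illustrated with a generic adaptive Fisher-type combination of per-trait p-values: for each k, sum the k largest values of -2 log p, then take the most significant partial sum and calibrate that minimum by Monte Carlo under the global null. This sketch is a simplified stand-in, not the authors' exact MTAF procedure.

```python
import numpy as np

def adaptive_fisher_pvalue(pvals, n_null=2000, seed=0):
    """Adaptive Fisher-type combination test (illustrative sketch).
    For each k, sum the k largest -2*log(p) terms; adaptively take
    the smallest per-k Monte Carlo p-value, then calibrate that
    minimum against its own null distribution."""
    rng = np.random.default_rng(seed)
    p = np.asarray(pvals, dtype=float)
    m = p.size

    def partial_sums(q):
        # partial sums over the k strongest signals, k = 1..m
        return np.cumsum(np.sort(-2.0 * np.log(q))[::-1])

    obs = partial_sums(p)                                   # (m,)
    null = np.array([partial_sums(rng.uniform(size=m))
                     for _ in range(n_null)])               # (n_null, m)
    pk_obs = (null >= obs).mean(axis=0)                     # per-k p-values
    pk_null = np.array([(null >= row).mean(axis=0) for row in null])
    # p-value of the adaptive (minimum over k) statistic
    return float((pk_null.min(axis=1) <= pk_obs.min()).mean())
```

Adapting over k is what lets the test stay powerful whether one trait or many traits carry the signal.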

Read more
Methodology

Multipopulation mortality modelling and forecasting: The multivariate functional principal component with time weightings approaches

Human mortality patterns and trajectories in closely related populations are likely linked together and share similarities, so it is desirable to model them simultaneously while taking their heterogeneity into account. This paper introduces two new models for jointly modelling and forecasting the mortality of multiple subpopulations, both adaptations of multivariate functional principal component analysis techniques. The first model extends the independent functional data model to a multi-population modelling setting. In the second, we propose a novel multivariate functional principal component method for coherent modelling. Its design reflects the idea that when several subpopulation groups share similar socio-economic conditions or common biological characteristics, their mortality trajectories are expected to evolve in a non-diverging fashion. We demonstrate the proposed methods using sex-specific mortality data. Their forecast performances are further compared with those of several existing models, including the independent functional data model and the Product-Ratio model, on mortality data from ten developed countries. Our results show that the first proposed model achieves forecast accuracy comparable to the existing methods, while the second outperforms both the first model and the existing models in terms of forecast accuracy, in addition to possessing several desirable properties.

Read more
Methodology

Multivariate binary probability distribution in the Grassmann formalism

We propose a probability distribution for multivariate binary random variables. For this purpose, we use the Grassmann number, an anti-commuting number. In our model, the partition function, the central moments, and the marginal and conditional distributions are expressed analytically through a matrix of parameters analogous to the covariance matrix of the multivariate Gaussian distribution. That is, summation over all possible states is not necessary for obtaining the partition function and various expected values, which is a problem with the conventional multivariate Bernoulli distribution. The proposed model has many similarities to the multivariate Gaussian distribution. For example, the marginal and conditional distributions are expressed by the parameter matrix and its inverse matrix, respectively; that is, the inverse matrix expresses a sort of partial correlation. The analytical expressions for the marginal and conditional distributions are also useful for generating random numbers for multivariate binary variables. We validated the proposed method using synthetic datasets and observed that the sampling distributions of various statistics are consistent with the theoretical predictions, and that the estimates are consistent and asymptotically normal.

Read more
Methodology

Multivariate phase-type theory for the site frequency spectrum

Linear functions of the site frequency spectrum (SFS) play a major role for understanding and investigating genetic diversity. Estimators of the mutation rate (e.g. based on the total number of segregating sites or average of the pairwise differences) and tests for neutrality (e.g. Tajima's D) are perhaps the most well-known examples. The distribution of linear functions of the SFS is important for constructing confidence intervals for the estimators, and to determine significance thresholds for neutrality tests. These distributions are often approximated using simulation procedures. In this paper we use multivariate phase-type theory to specify, characterize and calculate the distribution of linear functions of the site frequency spectrum. In particular, we show that many of the classical estimators of the mutation rate are distributed according to a discrete phase-type distribution. Neutrality tests, however, are generally not discrete phase-type distributed. For neutrality tests we derive the probability generating function using continuous multivariate phase-type theory, and numerically invert the function to obtain the distribution. A main result is an analytically tractable formula for the probability generating function of the SFS. Software implementation of the phase-type methodology is available in the R package phasty, and R code for the reproduction of our results is available as an accompanying vignette.
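The phase-type machinery rests on the fact that moments are matrix expressions in the sub-intensity matrix; for a continuous phase-type distribution with initial vector alpha and sub-intensity S, the mean is -alpha S^{-1} 1. As a small illustration under standard Kingman coalescent assumptions (a simple special case, not the paper's full block-counting construction), the time to the most recent common ancestor of n samples is phase-type with states k = n, ..., 2 and exit rates k(k-1)/2:

```python
import numpy as np

def ph_mean(alpha, S):
    """Mean of a continuous phase-type distribution:
    E[tau] = -alpha @ inv(S) @ 1 (computed via a linear solve)."""
    return float(-alpha @ np.linalg.solve(S, np.ones(S.shape[0])))

def coalescent_tmrca_mean(n):
    """Time to MRCA of n samples as a phase-type distribution:
    states k = n, n-1, ..., 2, each moving to k-1 at rate
    k*(k-1)/2; the last state (k=2) exits to absorption."""
    ks = np.arange(n, 1, -1)
    rates = ks * (ks - 1) / 2.0
    S = np.diag(-rates) + np.diag(rates[:-1], 1)  # bidiagonal sub-intensity
    alpha = np.zeros(len(ks))
    alpha[0] = 1.0  # start with all n lineages
    return ph_mean(alpha, S)
```

The closed form E[T_MRCA] = 2(1 - 1/n) drops out of the matrix expression, which is the kind of analytic tractability the paper exploits for the full SFS.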

Read more
Methodology

Narrowest Significance Pursuit: inference for multiple change-points in linear models

We propose Narrowest Significance Pursuit (NSP), a general and flexible methodology for automatically detecting localised regions in data sequences, each of which must contain a change-point, at a prescribed global significance level. Here, change-points are understood as abrupt changes in the parameters of an underlying linear model. NSP works by fitting the postulated linear model over many regions of the data, using a certain multiresolution sup-norm loss, and identifying the shortest interval on which the linearity is significantly violated. The procedure then continues recursively to the left and to the right until no further intervals of significance can be found. The use of the multiresolution sup-norm loss is a key feature of NSP, as it enables the transfer of significance considerations to the domain of the unobserved true residuals, a substantial simplification. It also guarantees important stochastic bounds which directly yield exact desired coverage probabilities, regardless of the form or number of the regressors. NSP works with a wide range of distributional assumptions on the errors, including Gaussian with known or unknown variance, some light-tailed distributions, and some heavy-tailed, possibly heterogeneous distributions via self-normalisation. It also works in the presence of autoregression. The mathematics of NSP is, by construction, uncomplicated, and its key computational component uses simple linear programming. In contrast to the widely studied "post-selection inference" approach, NSP enables the opposite viewpoint and paves the way for the concept of "post-inference selection". Pre-CRAN R code implementing NSP is available at this https URL.

Read more
