Featured Research

Methodology

Generalised Boosted Forests

This paper extends recent work on boosting random forests to model non-Gaussian responses. Given an exponential family with $\mathbb{E}[Y|X] = g^{-1}(f(X))$, our goal is to obtain an estimate for $f$. We start with an MLE-type estimate in the link space and then define generalised residuals from it. We use these residuals and some corresponding weights to fit a base random forest and then repeat the same procedure to obtain a boost random forest. We call the sum of these three estimators a generalised boosted forest. We show with simulated and real data that both random forest steps reduce test-set log-likelihood, which we treat as our primary metric. We also provide a variance estimator, which we can obtain with the same computational cost as the original estimate itself. Empirical experiments on real-world data and simulations demonstrate that the methods can effectively reduce bias, and that confidence interval coverage is conservative in the bulk of the covariate distribution.
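The residual-and-weight step can be sketched for one concrete case. The snippet below assumes a Poisson response with log link and assumes the generalised residuals take a Newton-step form; the abstract does not spell out either choice, and the function names are hypothetical.

```python
import math

def initial_link_estimate(y):
    """Constant MLE in link space for a Poisson model with log link."""
    return math.log(sum(y) / len(y))

def generalised_residuals(y, f):
    """Newton-style residuals and weights (an assumed form, not taken from the abstract).

    For the Poisson log-likelihood l(f) = y*f - exp(f):
      residual_i = -l'(f_i) / l''(f_i) = (y_i - mu_i) / mu_i
      weight_i   = -l''(f_i)           = mu_i
    """
    residuals, weights = [], []
    for y_i, f_i in zip(y, f):
        mu_i = math.exp(f_i)
        residuals.append((y_i - mu_i) / mu_i)
        weights.append(mu_i)
    return residuals, weights

# Toy counts, for illustration only.
y = [0, 1, 3, 4, 2]
f0 = initial_link_estimate(y)
res, w = generalised_residuals(y, [f0] * len(y))
print(f0, res, w)
```

Under these assumptions, a base learner fitted by weighted least squares to `res` with weights `w` would play the role of the base random forest. Recomputing residuals at the updated estimate and fitting again would give the boost step.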

Methodology

Generalized Forward Sufficient Dimension Reduction for Categorical and Ordinal Responses

We present a forward sufficient dimension reduction method for categorical or ordinal responses by extending the outer product of gradients and minimum average variance estimator to multinomial generalized linear models. Previous work in this direction extends forward regression to binary responses and is applied in a pairwise manner to multinomial data, which is less efficient than our approach. Like other forward regression-based sufficient dimension reduction methods, our approach avoids the relatively stringent distributional requirements necessary for inverse regression alternatives. We show consistency of our proposed estimator and derive its convergence rate. We develop an algorithm for our methods based on repeated applications of available algorithms for forward regression. We also propose a clustering-based tuning procedure to estimate the tuning parameters. The effectiveness of our estimator and related algorithms is demonstrated via simulations and applications.

Methodology

Generalized Liquid Association Analysis for Multimodal Data Integration

Multimodal data are now prevalent in scientific research. A central question in multimodal integrative analysis is to understand how two data modalities associate and interact with each other given another modality or demographic variables. The problem can be formulated as studying the associations among three sets of random variables, a question that has received relatively little attention in the literature. In this article, we propose a novel generalized liquid association analysis method, which offers a new and unique angle on this important class of problems of studying three-way associations. We extend the notion of liquid association of Li (2002) from the univariate setting to the sparse, multivariate, and high-dimensional setting. We establish a population dimension reduction model, transform the problem to sparse Tucker decomposition of a three-way tensor, and develop a higher-order orthogonal iteration algorithm for parameter estimation. We derive the non-asymptotic error bound and asymptotic consistency of the proposed estimator, while allowing the variable dimensions to be larger than and diverge with the sample size. We demonstrate the efficacy of the method through both simulations and a multimodal neuroimaging application for Alzheimer's disease research.
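For orientation, the univariate liquid association that the abstract starts from can be sketched in a few lines. This assumes the three variables are standardised and that Z is roughly standard normal, in which case the liquid association of X and Y given Z reduces to the mean of the triple product. The function names and simulated data are illustrative only, and the sketch does not cover the sparse multivariate extension proposed in the paper.

```python
import random
import statistics

def standardise(values):
    """Centre to mean 0 and scale to unit (population) standard deviation."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return [(v - m) / s for v in values]

def liquid_association(x, y, z):
    """Sample mean of X*Y*Z after standardisation (assumes Z is close to normal)."""
    xs, ys, zs = standardise(x), standardise(y), standardise(z)
    return statistics.fmean([a * b * c for a, b, c in zip(xs, ys, zs)])

rng = random.Random(0)
n = 5000
z = [rng.gauss(0, 1) for _ in range(n)]
x = [rng.gauss(0, 1) for _ in range(n)]
# The X-Y relationship flips sign with Z, so Z modulates their association.
y_modulated = [zi * xi + 0.5 * rng.gauss(0, 1) for zi, xi in zip(z, x)]
# Control: Y unrelated to both X and Z.
y_unrelated = [rng.gauss(0, 1) for _ in range(n)]

la_modulated = liquid_association(x, y_modulated, z)
la_unrelated = liquid_association(x, y_unrelated, z)
print(la_modulated, la_unrelated)
```

In the first case the score is clearly positive even though X and Y are marginally uncorrelated. In the control case it stays near zero.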

Methodology

Generalized Score Matching for General Domains

Estimation of density functions supported on general domains arises when the data is naturally restricted to a proper subset of the real space. This problem is complicated by typically intractable normalizing constants. Score matching provides a powerful tool for estimating densities with such intractable normalizing constants, but as originally proposed is limited to densities on $\mathbb{R}^m$ and $\mathbb{R}^m_+$. In this paper, we offer a natural generalization of score matching that accommodates densities supported on a very general class of domains. We apply the framework to truncated graphical and pairwise interaction models, and provide theoretical guarantees for the resulting estimators. We also generalize a recently proposed method from bounded to unbounded domains, and empirically demonstrate the advantages of our method.
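To illustrate why score matching sidesteps the normalizing constant, here is a minimal sketch of the earlier non-negative variant on the half-line, not the generalization proposed in the paper. It assumes a one-parameter model with density proportional to exp(-theta * x^2 / 2) on x > 0 and the weight h(x) = x^2. Under those assumptions the empirical objective, the mean of 0.5 * theta^2 * x^4 - 3 * theta * x^2, has a closed-form minimiser. The function name is hypothetical.

```python
import random
import statistics

def score_matching_precision_halfline(samples):
    """Minimise mean(0.5*theta^2*x^4 - 3*theta*x^2) over theta.

    This is the weighted score matching objective for a density proportional
    to exp(-theta*x^2/2) on x > 0 with weight h(x) = x^2; no normalizing
    constant appears. Setting the derivative to zero gives
    theta = 3 * mean(x^2) / mean(x^4).
    """
    m2 = statistics.fmean([x ** 2 for x in samples])
    m4 = statistics.fmean([x ** 4 for x in samples])
    return 3.0 * m2 / m4

rng = random.Random(1)
# Half-normal draws: a standard normal truncated to x > 0, so theta = 1.
samples = [abs(rng.gauss(0, 1)) for _ in range(20000)]
theta_hat = score_matching_precision_halfline(samples)
print(theta_hat)
```

The estimate should land close to 1, the true precision of the simulated half-normal data.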

Methodology

Generalized k-Means in GLMs with Applications to the Outbreak of COVID-19 in the United States

Generalized k-means can be combined with any similarity or dissimilarity measure for clustering. By choosing the dissimilarity measure as the well-known likelihood ratio or F-statistic, this work proposes a method based on generalized k-means to group statistical models. Given the number of clusters k, the method is established under hypothesis tests between statistical models. If k is unknown, then the method can be combined with GIC to automatically select the best k for clustering. The article investigates both AIC and BIC as special cases. Theoretical and simulation results show that the number of clusters can be identified by BIC but not AIC. The resulting method for GLMs is used to group the state-level time series patterns for the outbreak of COVID-19 in the United States. A further study shows that the statistical models between the clusters are significantly different from each other. This study confirms the result given by the proposed method based on generalized k-means.
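The role of the penalty in selecting k can be sketched as follows. This assumes the common form GIC = -2 * log-likelihood + kappa * (number of parameters), with kappa = 2 for AIC and kappa = log(n) for BIC; the paper's exact definition may differ. The log-likelihoods and parameter counts are made-up numbers for illustration, not results from the article.

```python
import math

def gic(log_likelihood, n_params, kappa):
    """Generalized information criterion in its common penalised form (assumed)."""
    return -2.0 * log_likelihood + kappa * n_params

# Hypothetical fits: k -> (maximised log-likelihood, number of parameters).
candidates = {
    1: (-520.0, 3),
    2: (-500.0, 6),
    3: (-496.5, 9),
}
n_obs = 200  # hypothetical sample size

# AIC uses kappa = 2; BIC uses kappa = log(n), a heavier penalty once n > e^2.
k_aic = min(candidates, key=lambda k: gic(*candidates[k], kappa=2.0))
k_bic = min(candidates, key=lambda k: gic(*candidates[k], kappa=math.log(n_obs)))
print(k_aic, k_bic)
```

With these illustrative numbers the lighter AIC penalty favours the larger k while BIC settles on the smaller one. That matches the direction of the abstract's claim but is not evidence for it.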

Methodology

Goal-oriented adaptive sampling under random field modelling of response probability distributions

In the study of natural and artificial complex systems, responses that are not completely determined by the considered decision variables are commonly modelled probabilistically, resulting in response distributions varying across decision space. We consider cases where the spatial variation of these response distributions concerns not only their mean and/or variance but also other features, including for instance shape or uni-modality versus multi-modality. Our contributions build upon a non-parametric Bayesian approach to modelling the thereby induced fields of probability distributions, and in particular upon a spatial extension of the logistic Gaussian model. The considered models deliver probabilistic predictions of response distributions at candidate points, making it possible, for instance, to perform (approximate) posterior simulations of probability density functions, to jointly predict multiple moments and other functionals of target distributions, and to quantify the impact of collecting new samples on the state of knowledge of the distribution field of interest. In particular, we introduce adaptive sampling strategies leveraging the potential of the considered random distribution field models to guide system evaluations in a goal-oriented way, with a view towards parsimoniously addressing calibration and related problems from non-linear (stochastic) inversion and global optimisation.

Methodology

Goodness-of-fit Test on the Number of Biclusters in Relational Data Matrix

Biclustering is a method for detecting homogeneous submatrices in a given observed matrix, and it is an effective tool for relational data analysis. Although there are many studies that estimate the underlying bicluster structure of a matrix, few have enabled us to determine the appropriate number of biclusters in an observed matrix. Recently, a statistical test on the number of biclusters has been proposed for a regular-grid bicluster structure, where we assume that the latent bicluster structure can be represented by row-column clustering. However, when the latent bicluster structure does not satisfy such a regular-grid assumption, the previous test requires a larger number of biclusters than necessary (i.e., a finer bicluster structure than necessary) for the null hypothesis to be accepted, which is not desirable in terms of interpreting the accepted bicluster structure. In this study, we propose a new statistical test on the number of biclusters that does not require the regular-grid assumption and derive the asymptotic behavior of the proposed test statistic in both null and alternative cases. To develop the proposed test, we construct a consistent submatrix localization algorithm, that is, one for which the probability of outputting the correct bicluster structure converges to one. We illustrate the effectiveness of the proposed method by applying it to both synthetic and practical relational data matrices.

Methodology

Goodness-of-fit tests for parametric regression models with circular response

Testing procedures for assessing a parametric regression model with circular response and $\mathbb{R}^d$-valued covariate are proposed and analyzed in this work, both for independent and for spatially correlated data. The test statistics are based on a circular distance comparing a (non-smoothed or smoothed) parametric circular estimator and a nonparametric one. Properly designed bootstrap procedures for calibrating the tests in practice are also presented. Finite sample performance of the tests in different scenarios with independent and spatially correlated samples is analyzed by simulations.
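A circular distance between two fitted angles can be sketched as below. The choice 1 - cos(difference) is one common circular distance and is an assumption here, as is the simple unweighted average over evaluation points; the statistics in the paper may use a different distance or weighting. The function names are hypothetical.

```python
import math

def circular_distance(theta, phi):
    """1 - cos(theta - phi): 0 for equal directions, 2 for opposite ones."""
    return 1.0 - math.cos(theta - phi)

def mean_circular_distance(parametric_fit, nonparametric_fit):
    """Average circular distance between two fits at the same covariate values."""
    pairs = list(zip(parametric_fit, nonparametric_fit))
    return sum(circular_distance(a, b) for a, b in pairs) / len(pairs)

# Angles in radians; adding 2*pi leaves the direction, and so the distance, unchanged.
same_direction = circular_distance(0.1, 0.1 + 2.0 * math.pi)
opposite_direction = circular_distance(0.0, math.pi)
discrepancy = mean_circular_distance([0.0, 1.0, 2.0], [0.0, 1.0, 2.0 + math.pi])
print(same_direction, opposite_direction, discrepancy)
```

Unlike an ordinary difference of angles, this distance respects the periodicity of the response, which is why a statistic built on it is a natural fit for circular regression.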

Methodology

Graphical Elastic Net and Target Matrices: Fast Algorithms and Software for Sparse Precision Matrix Estimation

We consider estimation of undirected Gaussian graphical models and inverse covariances in high-dimensional scenarios by penalizing the corresponding precision matrix. While single $L_1$ (Graphical Lasso) and $L_2$ (Graphical Ridge) penalties for the precision matrix have already been studied, we propose the combination of both, yielding an Elastic Net type penalty. We enable additional flexibility by allowing the inclusion of diagonal target matrices for the precision matrix. We generalize existing algorithms for the Graphical Lasso and provide corresponding software with an efficient implementation to facilitate usage for practitioners. Our software borrows computationally favorable parts from a number of existing packages for the Graphical Lasso, leading to an overall fast(er) implementation while also yielding much more methodological flexibility.
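One plausible way to write such a penalized estimator is sketched below. The symbols are assumptions for illustration, not notation from the abstract: $S$ is the sample covariance, $T$ a diagonal target matrix, $\lambda \ge 0$ the overall penalty strength, $\alpha \in [0, 1]$ the mixing weight, and the norms are taken elementwise. The paper's exact parameterization may differ.

```latex
\hat{\Theta} \;=\; \arg\min_{\Theta \succ 0}
  \Big\{ -\log\det\Theta + \operatorname{tr}(S\Theta)
  + \lambda \Big( \alpha \,\lVert \Theta - T \rVert_1
  + \tfrac{1-\alpha}{2}\, \lVert \Theta - T \rVert_2^2 \Big) \Big\}
```

In this form, $\alpha = 1$ with $T = 0$ recovers a Graphical Lasso type penalty and $\alpha = 0$ a Graphical Ridge type penalty.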

Methodology

Graphical Gaussian Process Models for Highly Multivariate Spatial Data

For multivariate spatial (Gaussian) process models, common cross-covariance functions do not exploit graphical models to ensure process-level conditional independence among the variables. This is undesirable, especially for highly multivariate settings, where popular cross-covariance functions such as the multivariate Matérn suffer from a "curse of dimensionality" as the number of parameters and floating point operations scale up in quadratic and cubic order, respectively, in the number of variables. We propose a class of multivariate "graphical Gaussian Processes" using a general construction called "stitching" that crafts cross-covariance functions from graphs and ensures process-level conditional independence among variables. For the Matérn family of functions, stitching yields a multivariate GP whose univariate components are exactly Matérn GPs, and which conforms to process-level conditional independence as specified by the graphical model. For highly multivariate settings and decomposable graphical models, stitching offers massive computational gains and parameter dimension reduction. We demonstrate the utility of the graphical Matérn GP to jointly model highly multivariate spatial data using simulation examples and an application to air-pollution modelling.

