Featured Research

Statistics Theory

Filtering of stationary Gaussian statistical experiments

This article proposes a new filtering model for stationary Gaussian Markov statistical experiments, given by diffusion-type difference stochastic equations.


Finite mixture models do not reliably learn the number of components

Scientists and engineers are often interested in learning the number of subpopulations (or components) present in a data set. A common suggestion is to use a finite mixture model (FMM) with a prior on the number of components. Past work has shown the resulting FMM component-count posterior is consistent; that is, the posterior concentrates on the true generating number of components. But existing results crucially depend on the assumption that the component likelihoods are perfectly specified. In practice, this assumption is unrealistic, and empirical evidence suggests that the FMM posterior on the number of components is sensitive to the likelihood choice. In this paper, we add rigor to data-analysis folk wisdom by proving that under even the slightest model misspecification, the FMM component-count posterior diverges: the posterior probability of any particular finite number of latent components converges to 0 in the limit of infinite data. We illustrate practical consequences of our theory on simulated and real data sets.


Finite sample breakdown point of multivariate regression depth median

Depth-induced multivariate medians (multi-dimensional maximum depth estimators) in regression serve as robust alternatives to the traditional least squares and least absolute deviations estimators. The median ($\boldsymbol{\beta}^*_{RD}$) induced from the regression depth (RD) of Rousseeuw and Hubert (1999) (RH99) is one of the most prevalent estimators in regression. The maximum regression depth median possesses outstanding robustness, similar to its univariate location counterpart. Indeed, the maximum depth estimator induced from RD, $\boldsymbol{\beta}^*_{RD}$, can asymptotically resist up to 33% contamination without breakdown, in contrast to 0% for the traditional estimators (i.e., they can break down because of a single bad point) (see Van Aelst and Rousseeuw, 2000) (VAR00). The results of VAR00 are pioneering and innovative, yet they are limited to regression-symmetric populations and the $\epsilon$-contamination and maximum bias model. In finite, fixed-sample-size practice, the most prevalent measure of robustness for estimators is the finite sample breakdown point (FSBP) (Donoho (1982), Donoho and Huber (1983)). A lower bound of the FSBP for $\boldsymbol{\beta}^*_{RD}$ was given in RH99 (in a corollary of a conjecture). This article establishes a sharper lower bound and an upper bound of the FSBP for $\boldsymbol{\beta}^*_{RD}$, revealing an intrinsic connection between the regression depth of $\boldsymbol{\beta}^*_{RD}$ and its FSBP, justifying the use of $\boldsymbol{\beta}^*_{RD}$ as a robust alternative to the traditional estimators, and demonstrating the necessity and merit of using the FSBP in finite-sample practice instead of an asymptotic breakdown value.
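
To make the notion of a finite sample breakdown point concrete, the following is a minimal sketch for the univariate location analogue only (sample mean versus sample median), not for the regression depth median studied in the article. The helper name `smallest_breaking_count`, the outlier value, and the bound are illustrative assumptions, and the contamination scheme (replacing the first m points with one huge value) is a single simple choice rather than a search over all contaminations.

```python
import numpy as np

def smallest_breaking_count(estimator, clean, outlier=1e12, bound=1e6):
    """Smallest number m of replaced points that pushes |estimate| beyond `bound`.

    Hypothetical helper: it replaces the first m observations with one huge
    value, which is an illustrative contamination, not a worst-case search.
    """
    n = len(clean)
    for m in range(1, n + 1):
        corrupted = clean.copy()
        corrupted[:m] = outlier          # replace m clean points by outliers
        if abs(estimator(corrupted)) > bound:
            return m
    return None

rng = np.random.default_rng(0)
clean = rng.standard_normal(21)          # n = 21 clean observations
n = len(clean)

m_mean = smallest_breaking_count(np.mean, clean)
m_median = smallest_breaking_count(np.median, clean)
print("mean breaks after replacing", m_mean, "of", n, "points")
print("median breaks after replacing", m_median, "of", n, "points")
```

With n = 21, a single replaced point already carries the mean away (fraction 1/21), whereas the median stays bounded until 11 points are replaced (fraction 11/21), which is the contrast between a 0% asymptotic breakdown value and a robust alternative that the abstract describes.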


Finite sample inference for generic autoregressive models

Autoregressive stationary processes are fundamental modeling tools in time series analysis. Conducting inference for such models usually requires asymptotic limit theorems. We establish finite-sample-valid tools for hypothesis testing and confidence set construction in such settings. Further results are established in the always-valid and sequential inference framework.


Fitting inhomogeneous phase-type distributions to data: the univariate and the multivariate case

The class of inhomogeneous phase-type distributions (IPH) was recently introduced in Albrecher and Bladt (2019) as an extension of the classical phase-type (PH) distributions. Like PH distributions, the class of IPH is dense in the class of distributions on the positive halfline, but leads to more parsimonious models in the presence of heavy tails. In this paper we propose a fitting procedure for this class to given data. We furthermore consider an analogous extension of Kulkarni's multivariate phase-type class (Kulkarni, 1989) to the inhomogeneous framework and study parameter estimation for the resulting new and flexible class of multivariate distributions. As a by-product, we amend a previously suggested fitting procedure for the homogeneous multivariate phase-type case and provide appropriate adaptations for censored data. The performance of the algorithms is illustrated in several numerical examples, both for simulated and real-life insurance data.


Fixed-Domain Asymptotics Under Vecchia's Approximation of Spatial Process Likelihoods

Statistical modeling for massive spatial data sets has generated a substantial literature on scalable spatial processes based upon a likelihood approximation proposed by Vecchia in 1988. Vecchia's approximation for Gaussian process models enables fast evaluation of the likelihood by restricting dependencies at a location to its neighbors. We establish inferential properties of microergodic spatial covariance parameters within the paradigm of fixed-domain asymptotics when they are estimated using Vecchia's approximation. The conditions required to formally establish these properties are explored, theoretically and empirically, and the effectiveness of Vecchia's approximation is further corroborated from the standpoint of fixed-domain asymptotics. These explorations suggest some practical diagnostics for evaluating the quality of the approximation.
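
As a rough illustration of how restricting dependencies to neighbors speeds up likelihood evaluation, the sketch below compares an exact Gaussian log-likelihood with a Vecchia-style product of conditional densities. Everything in it is an assumption made for illustration: one-dimensional locations, a zero-mean process, an exponential covariance, conditioning on the m nearest previously ordered locations, and the function names `exp_cov`, `exact_loglik`, and `vecchia_loglik`. It is not the estimation procedure analyzed in the article.

```python
import numpy as np

def exp_cov(x, y, variance=1.0, range_=0.5):
    """Exponential covariance between 1-D location arrays x and y (assumed model)."""
    d = np.abs(x[:, None] - y[None, :])
    return variance * np.exp(-d / range_)

def exact_loglik(z, s):
    """Exact zero-mean Gaussian log-likelihood via a Cholesky factorization."""
    n = len(z)
    L = np.linalg.cholesky(exp_cov(s, s))
    alpha = np.linalg.solve(L, z)
    return -0.5 * alpha @ alpha - np.log(np.diag(L)).sum() - 0.5 * n * np.log(2 * np.pi)

def vecchia_loglik(z, s, m):
    """Sum of conditional log-densities, each conditioning on at most m
    nearest locations among those that come earlier in the ordering."""
    total = 0.0
    for i in range(len(z)):
        prev = np.arange(i)
        nb = prev[np.argsort(np.abs(s[prev] - s[i]))[:m]]   # neighbor indices
        mean = 0.0
        var = exp_cov(s[i:i + 1], s[i:i + 1])[0, 0]
        if nb.size:
            k_in = exp_cov(s[i:i + 1], s[nb])[0]
            w = np.linalg.solve(exp_cov(s[nb], s[nb]), k_in)  # kriging weights
            mean = w @ z[nb]
            var = var - w @ k_in
        total += -0.5 * (np.log(2 * np.pi * var) + (z[i] - mean) ** 2 / var)
    return total

rng = np.random.default_rng(0)
s = np.sort(rng.uniform(0.0, 1.0, size=30))                  # ordered locations
z = np.linalg.cholesky(exp_cov(s, s)) @ rng.standard_normal(30)
print(exact_loglik(z, s), vecchia_loglik(z, s, m=1), vecchia_loglik(z, s, m=5))
```

In this particular setup the approximation happens to be exact even for m = 1, because a Gaussian process with exponential covariance on ordered one-dimensional locations is Markov. For other covariance families or in higher dimensions, the gap between the two values gives a quick sense of the approximation quality for a given m.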


Forecasting time series with encoder-decoder neural networks

In this paper, we consider high-dimensional stationary processes where a new observation is generated from a compressed version of past observations. The specific evolution is modeled by an encoder-decoder structure. We estimate the evolution with an encoder-decoder neural network and give upper bounds for the expected forecast error under specific structural and sparsity assumptions. The results are shown separately for conditions either on the absolutely regular mixing coefficients or the functional dependence measure of the observed process. In a quantitative simulation we discuss the behavior of the network estimator under different model assumptions. We corroborate our theory by a real data example where we consider forecasting temperature data.


From Smooth Wasserstein Distance to Dual Sobolev Norm: Empirical Approximation and Statistical Applications

Statistical distances, i.e., discrepancy measures between probability distributions, are ubiquitous in probability theory, statistics and machine learning. To combat the curse of dimensionality when estimating these distances from data, recent work has proposed smoothing out local irregularities in the measured distributions via convolution with a Gaussian kernel. Motivated by the scalability of the smooth framework to high dimensions, we conduct an in-depth study of the structural and statistical behavior of the Gaussian-smoothed $p$-Wasserstein distance $W_p^{(\sigma)}$, for arbitrary $p \geq 1$. We start by showing that $W_p^{(\sigma)}$ admits a metric structure that is topologically equivalent to classic $W_p$ and is stable with respect to perturbations in $\sigma$. Moving to statistical questions, we explore the asymptotic properties of $W_p^{(\sigma)}(\hat{\mu}_n, \mu)$, where $\hat{\mu}_n$ is the empirical distribution of $n$ i.i.d. samples from $\mu$. To that end, we prove that $W_p^{(\sigma)}$ is controlled by a $p$th order smooth dual Sobolev norm $d_p^{(\sigma)}$. Since $d_p^{(\sigma)}(\hat{\mu}_n, \mu)$ coincides with the supremum of an empirical process indexed by Gaussian-smoothed Sobolev functions, it lends itself well to analysis via empirical process theory. We derive the limit distribution of $\sqrt{n}\, d_p^{(\sigma)}(\hat{\mu}_n, \mu)$ in all dimensions $d$, when $\mu$ is sub-Gaussian. Through the aforementioned bound, this implies a parametric empirical convergence rate of $n^{-1/2}$ for $W_p^{(\sigma)}$, contrasting the $n^{-1/d}$ rate for unsmoothed $W_p$ when $d \geq 3$. As applications, we provide asymptotic guarantees for two-sample testing and minimum distance estimation. When $p = 2$, we further show that $d_2^{(\sigma)}$ can be expressed as a maximum mean discrepancy.
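
The smoothing step described above can be illustrated on the real line, where the $p$-Wasserstein distance between two equal-size empirical measures reduces to comparing sorted samples. The sketch below is a Monte Carlo illustration under stated assumptions, not the authors' estimator: the function names, the resampling scheme (draw points with replacement and add $N(0, \sigma^2)$ noise to mimic convolution with a Gaussian kernel), and the sample sizes are all hypothetical choices.

```python
import numpy as np

def wasserstein_1d(x, y, p=1):
    """p-Wasserstein distance between two equal-size empirical measures on the line,
    computed from the sorted (quantile) coupling."""
    x, y = np.sort(x), np.sort(y)
    return np.mean(np.abs(x - y) ** p) ** (1.0 / p)

def smooth_wasserstein_1d(x, y, sigma, p=1, n_mc=20000, seed=0):
    """Monte Carlo estimate of the Gaussian-smoothed distance: sample from each
    empirical measure convolved with N(0, sigma^2), then compare the samples."""
    rng = np.random.default_rng(seed)
    xs = rng.choice(x, size=n_mc) + sigma * rng.standard_normal(n_mc)
    ys = rng.choice(y, size=n_mc) + sigma * rng.standard_normal(n_mc)
    return wasserstein_1d(xs, ys, p)

rng = np.random.default_rng(1)
x = rng.standard_normal(500)
y = x + 0.5                      # a pure shift of the same sample
print(wasserstein_1d(x, y), smooth_wasserstein_1d(x, y, sigma=1.0))
```

For a pure shift, the unsmoothed distance equals the shift size, and the smoothed distance has the same population value, so the Monte Carlo estimate should land close to 0.5 up to sampling noise.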


Fréchet Sufficient Dimension Reduction for Random Objects

In this paper, we consider Fréchet sufficient dimension reduction where the responses are complex random objects in a metric space and the predictors are high-dimensional Euclidean vectors. We propose a novel approach, called the weighted inverse regression ensemble method, for linear Fréchet sufficient dimension reduction. The method is further generalized as a new operator defined on reproducing kernel Hilbert spaces for nonlinear Fréchet sufficient dimension reduction. We provide theoretical guarantees for the new method via asymptotic analysis. Intensive simulation studies verify the performance of our proposals. We also apply our methods to handwritten digits data to demonstrate their use in real applications.


Fully distribution-free center-outward rank tests for multiple-output regression and MANOVA

Extending rank-based inference to a multivariate setting such as multiple-output regression or MANOVA with unspecified d-dimensional error density has remained an open problem for more than half a century. None of the many solutions proposed so far enjoys the combination of distribution-freeness and efficiency that makes rank-based inference a successful tool in the univariate setting. A concept of center-outward multivariate ranks and signs based on measure transportation ideas has been introduced recently. Center-outward ranks and signs are not only distribution-free but also achieve, in dimension d > 1, the (essential) maximal ancillarity property of traditional univariate ranks, and hence carry all the "distribution-free information" available in the sample. We derive here the Hájek representation and asymptotic normality results required in the construction of center-outward rank tests for multiple-output regression and MANOVA. When based on appropriate spherical scores, these fully distribution-free tests achieve parametric efficiency in the corresponding models.
