Featured Research

Statistics Theory

Random and quasi-random designs in group testing

For large classes of group testing problems, we derive lower bounds on the probability that all significant factors are uniquely identified using specially constructed random designs. These bounds allow us to optimize the parameters of the randomization schemes. We also suggest, and numerically justify, a procedure for constructing designs with better separability properties than pure random designs. We illustrate the theoretical considerations with a large simulation-based study. This study indicates, in particular, that in the case of common binary group testing, the suggested families of designs have better separability than the popular designs constructed from disjunct matrices.
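As a rough illustration of the setting (not the paper's construction), the sketch below simulates a pure random Bernoulli design for binary group testing and decodes with the standard COMP rule, which clears every item that appears in a negative test. All parameters (n_items, n_tests, n_defective, p) are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

n_items, n_tests, n_defective = 50, 30, 3
p = 0.1  # Bernoulli inclusion probability (illustrative choice)

# Random design: each item enters each test independently with prob. p.
design = rng.random((n_tests, n_items)) < p

defective = rng.choice(n_items, size=n_defective, replace=False)
truth = np.zeros(n_items, dtype=bool)
truth[defective] = True

# Binary (OR-channel) group testing: a test is positive iff it
# contains at least one defective item.
outcomes = design[:, truth].any(axis=1)

# COMP decoding: any item appearing in a negative test is cleared;
# the remaining items are the candidate defectives.
cleared = design[~outcomes].any(axis=0)
candidates = ~cleared

# Separability check: did the design uniquely identify the defectives?
exact_recovery = np.array_equal(candidates, truth)
```

COMP never clears a true defective, so the candidate set always contains the truth; a design "separates" when no non-defective survives all negative tests.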

Statistics Theory

Rank-based Estimation under Asymptotic Dependence and Independence, with Applications to Spatial Extremes

Multivariate extreme value theory is concerned with modeling the joint tail behavior of several random variables. Existing work mostly focuses on asymptotic dependence, where the probability of observing a large value in one of the variables is of the same order as observing a large value in all variables simultaneously. However, there is growing evidence that asymptotic independence is equally important in real world applications. Available statistical methodology in the latter setting is scarce and not well understood theoretically. We revisit non-parametric estimation and introduce rank-based M-estimators for parametric models that simultaneously work under asymptotic dependence and asymptotic independence, without requiring prior knowledge on which of the two regimes applies. Asymptotic normality of the proposed estimators is established under weak regularity conditions. We further show how bivariate estimators can be leveraged to obtain parametric estimators in spatial tail models, and again provide a thorough theoretical justification for our approach.

Statistics Theory

Rates of Convergence for Laplacian Semi-Supervised Learning with Low Labeling Rates

We study graph-based Laplacian semi-supervised learning at low labeling rates. Laplacian learning uses harmonic extension on a graph to propagate labels. At very low label rates, Laplacian learning becomes degenerate and the solution is roughly constant with spikes at each labeled data point. Previous work has shown that this degeneracy occurs when the number of labeled data points is finite while the number of unlabeled data points tends to infinity. In this work we allow the number of labeled data points to grow to infinity alongside the number of unlabeled data points. Our results show that for a random geometric graph with length scale ε > 0 and labeling rate β > 0, if β ≪ ε^2 then the solution becomes degenerate and spikes form, and if β ≫ ε^2 then Laplacian learning is well-posed and consistent with a continuum Laplace equation. Furthermore, in the well-posed setting we prove quantitative error estimates of O(ε β^{−1/2}) for the difference between the solutions of the discrete problem and the continuum PDE, up to logarithmic factors. We also study p-Laplacian regularization and show the same degeneracy result when β ≪ ε^p. The proofs of our well-posedness results use the random walk interpretation of Laplacian learning and PDE arguments, while the proofs of the ill-posedness results use Γ-convergence tools from the calculus of variations. We also present numerical results on synthetic and real data to illustrate our results.
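The harmonic extension underlying Laplacian learning can be sketched in a few lines: labels are fixed at the labeled nodes, and the label function is required to be graph-harmonic elsewhere. The random geometric graph, length scale, and labels below are illustrative, not the paper's experimental setup.

```python
import numpy as np

rng = np.random.default_rng(1)

# Small random geometric graph on n points in the unit square;
# eps plays the role of the length scale (illustrative values).
n, eps = 200, 0.25
X = rng.random((n, 2))
D = np.linalg.norm(X[:, None] - X[None, :], axis=2)
W = (D < eps).astype(float)
np.fill_diagonal(W, 0.0)
L = np.diag(W.sum(axis=1)) - W  # unnormalized graph Laplacian

# Two labeled points with labels +1 and -1; the rest are unlabeled.
labeled = np.array([0, 1])
y = np.array([1.0, -1.0])
unlabeled = np.setdiff1d(np.arange(n), labeled)

# Harmonic extension: (L u)_i = 0 at unlabeled nodes i, with u fixed
# to y at labeled nodes, i.e. solve L_uu u = -L_ul y.
Luu = L[np.ix_(unlabeled, unlabeled)]
Lul = L[np.ix_(unlabeled, labeled)]
u = np.linalg.solve(Luu, -Lul @ y)
```

By the discrete maximum principle, the harmonic values u stay within the range of the boundary labels; the degeneracy the abstract describes appears when the label rate is far too small for the chosen length scale.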

Statistics Theory

Reciprocal Maximum Likelihood Degrees of Brownian Motion Tree Models

We give an explicit formula for the reciprocal maximum likelihood degree of Brownian motion tree models. To achieve this, we connect them to certain toric (or log-linear) models, and express the Brownian motion tree model of an arbitrary tree as a toric fiber product of star tree models.

Statistics Theory

Reconstructing measures on manifolds: an optimal transport approach

Assume that we observe i.i.d. points lying close to some unknown d-dimensional C^k submanifold M in a possibly high-dimensional space. We study the problem of reconstructing the probability distribution generating the sample. After remarking that this problem is degenerate for a large class of standard losses (L^p, Hellinger, total variation, etc.), we focus on the Wasserstein loss, for which we build an estimator, based on kernel density estimation, whose rate of convergence depends on d and the regularity s ≤ k − 1 of the underlying density, but not on the ambient dimension. In particular, we show that the estimator is minimax and matches previous rates in the literature in the case where the manifold M is a d-dimensional cube. The related problem of estimating the volume measure of M for the Wasserstein loss is also considered, for which a minimax estimator is exhibited.
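To see concretely what the Wasserstein loss measures, note that on the real line the Wasserstein-1 distance between two equal-size empirical measures reduces to the mean absolute difference of their order statistics. The sketch below is a generic illustration of the metric, not the paper's manifold estimator; all samples are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)

def wasserstein1(a, b):
    # For equal-size empirical measures on the line, W1 equals the
    # mean absolute difference of the sorted samples (order statistics).
    return float(np.mean(np.abs(np.sort(a) - np.sort(b))))

x = rng.standard_normal(1000)
y = rng.standard_normal(1000)

d_same = wasserstein1(x, y)         # small: same underlying distribution
d_shift = wasserstein1(x, y + 1.0)  # roughly 1: all mass moved by 1
```

Unlike total variation or Hellinger, this distance stays finite and informative even when the two measures are mutually singular, which is why it is the natural loss for measures concentrated near a lower-dimensional set.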

Statistics Theory

Regression modelling with I-priors

We introduce the I-prior methodology as a unifying framework for estimating a variety of regression models, including varying coefficient, multilevel, longitudinal models, and models with functional covariates and responses. It can also be used for multi-class classification, with low or high dimensional covariates. The I-prior is generally defined as a maximum entropy prior. For a regression function, the I-prior is Gaussian with covariance kernel proportional to the Fisher information on the regression function, which is estimated by its posterior distribution under the I-prior. The I-prior has the intuitively appealing property that the more information is available on a linear functional of the regression function, the larger the prior variance, and the smaller the influence of the prior mean on the posterior distribution. Advantages compared to competing methods, such as Gaussian process regression or Tikhonov regularization, are ease of estimation and model comparison. In particular, we develop an EM algorithm with a simple E and M step for estimating hyperparameters, facilitating estimation for complex models. We also propose a novel parsimonious model formulation, requiring a single scale parameter for each (possibly multidimensional) covariate and no further parameters for interaction effects. This simplifies estimation because fewer hyperparameters need to be estimated, and also simplifies model comparison of models with the same covariates but different interaction effects; in this case, the model with the highest estimated likelihood can be selected. Using a number of widely analyzed real data sets we show that predictive performance of our methodology is competitive. An R-package implementing the methodology is available (Jamil, 2019).

Statistics Theory

Regression-type analysis for block maxima on block maxima

This paper devises a regression-type model for the situation where both the response and covariates are extreme. The proposed approach is designed for the setting where both the response and covariates are themselves block maxima, and thus contrarily to standard regression methods it takes into account the key fact that the limiting distribution of suitably standardized componentwise maxima is an extreme value copula. An important target in the proposed framework is the regression manifold, which consists of a family of regression lines obeying the latter asymptotic result. To learn about the proposed model from data, we employ a Bernstein polynomial prior on the space of angular densities which leads to an induced prior on the space of regression manifolds. Numerical studies suggest a good performance of the proposed methods, and a finance real-data illustration reveals interesting aspects on the comovements of extreme losses between two leading stock markets.

Statistics Theory

Relaxing monotonicity in endogenous selection models and application to surveys

This paper considers endogenous selection models, in particular nonparametric ones. Estimating the unconditional law of the outcomes is possible when one uses instrumental variables. Using a selection equation which is additively separable in a one-dimensional unobservable has the sometimes undesirable property of instrument monotonicity. We present models which allow for nonmonotonicity and are based on nonparametric random coefficient indices. We discuss their nonparametric identification and apply these results to inference on nonlinear statistics, such as the Gini index, in surveys where the nonresponse is not missing at random.
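The Gini index mentioned above is a nonlinear statistic of the outcome distribution. A minimal sketch of one standard weighted estimator is below; the function and the midpoint-rank formula are a common textbook version, not the paper's proposed inference procedure, and survey weights are optional.

```python
import numpy as np

def gini(x, weights=None):
    """Gini index of nonnegative outcomes x, with optional survey weights.

    Uses the Lorenz-curve (covariance-with-rank) formula with weight
    midpoints; a common plug-in estimator, illustrative only.
    """
    x = np.asarray(x, dtype=float)
    w = np.ones_like(x) if weights is None else np.asarray(weights, dtype=float)
    order = np.argsort(x)
    x, w = x[order], w[order]
    cw = np.cumsum(w)
    mu = np.average(x, weights=w)
    # Midpoints of the cumulative weight intervals, normalized to (0, 1).
    F = (cw - 0.5 * w) / cw[-1]
    return 2.0 * np.average(x * F, weights=w) / mu - 1.0

g_equal = gini(np.ones(10))              # perfect equality -> 0
g_skew = gini(np.array([0, 0, 0, 10.0])) # one unit holds everything -> 0.75
```

When nonresponse is not missing at random, plugging raw design weights into such a statistic is biased, which is the inference problem the abstract targets.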

Statistics Theory

Reliable Covariance Estimation

Covariance or scatter matrix estimation is ubiquitous in most modern statistical and machine learning applications. The task becomes especially challenging since most real-world datasets are essentially non-Gaussian: the data is often contaminated by outliers and/or has a heavy-tailed distribution, causing the sample covariance to behave very poorly and calling for robust estimation methodology. The natural framework for robust scatter matrix estimation is based on elliptical populations. Here, Tyler's estimator stands out by being distribution-free within the elliptical family and easy to compute. Existing works thoroughly study the performance of Tyler's estimator assuming ellipticity, but without providing any tools to verify this assumption when the covariance is unknown in advance. We address the following open question: given the sampled data and having no prior on the data-generating process, how do we assess the quality of the scatter matrix estimator? In this work we show that this question can be reformulated as an asymptotic uniformity test for certain sequences of exchangeable vectors on the unit sphere. We develop a consistent and easily applicable goodness-of-fit test against all alternatives to ellipticity when the scatter matrix is unknown. The findings are supported by numerical simulations demonstrating the power of the suggested technique.
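Tyler's estimator is computed by a simple fixed-point iteration; the sketch below implements the standard iteration (normalized to trace p) on an illustrative heavy-tailed elliptical sample. The function name, sample sizes, and stopping rule are illustrative choices, not the paper's test procedure.

```python
import numpy as np

def tyler_estimator(X, n_iter=100, tol=1e-8):
    """Tyler's M-estimator of scatter via fixed-point iteration.

    X: (n, p) data, assumed centered. Returns a PD matrix with trace p.
    """
    n, p = X.shape
    Sigma = np.eye(p)
    for _ in range(n_iter):
        inv = np.linalg.inv(Sigma)
        # Mahalanobis-type weights q_i = x_i' Sigma^{-1} x_i.
        q = np.einsum('ij,jk,ik->i', X, inv, X)
        Sigma_new = (p / n) * (X / q[:, None]).T @ X
        Sigma_new *= p / np.trace(Sigma_new)  # fix the free scale
        if np.linalg.norm(Sigma_new - Sigma) < tol:
            Sigma = Sigma_new
            break
        Sigma = Sigma_new
    return Sigma

rng = np.random.default_rng(2)
# Heavy-tailed elliptical sample: Gaussian directions scaled by
# absolute Cauchy radii (true shape matrix is the identity).
Z = rng.standard_normal((500, 3))
r = np.abs(rng.standard_cauchy(500))
X = r[:, None] * Z
S = tyler_estimator(X)
```

Because the weights q_i are invariant to rescaling each x_i, the estimator depends on the data only through the directions x_i/||x_i||, which is what makes it distribution-free within the elliptical family.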

Statistics Theory

Representation of Context-Specific Causal Models with Observational and Interventional Data

We consider the problem of representing causal models that encode context-specific information for discrete data. To represent such models we use a proper subclass of staged tree models which we call CStrees. We show that the context-specific information encoded by a CStree can be equivalently expressed via a collection of DAGs. As not all staged tree models admit this property, CStrees are a subclass that provides a transparent, intuitive and compact representation of context-specific causal information. Model equivalence for CStrees also takes a simpler form than for general staged trees: We provide a characterization of the complete set of asymmetric conditional independence relations encoded by a CStree. As a consequence, we obtain a global Markov property for CStrees which leads to a graphical criterion of model equivalence for CStrees generalizing that of Verma and Pearl for DAG models. In addition, we provide a closed-form formula for the maximum likelihood estimator of a CStree and use it to show that the Bayesian information criterion is a locally consistent score function for this model class. We also give an analogous global Markov property and characterization of model equivalence for general interventions in CStrees. As examples, we apply these results to two real data sets, and examine how BIC-optimal CStrees for each provide a clear and concise representation of the learned context-specific causal structure.

