Featured Researches

Statistics Theory

Ridge Regression Revisited: Debiasing, Thresholding and Bootstrap

The success of the Lasso in the era of high-dimensional data can be attributed to its performing an implicit model selection, i.e., zeroing out regression coefficients that are not significant. By contrast, classical ridge regression cannot reveal a potential sparsity of parameters, and may also introduce a large bias under the high-dimensional setting. Nevertheless, recent work on the Lasso involves debiasing and thresholding, the latter in order to further enhance the model selection. As a consequence, ridge regression may be worth another look since -- after debiasing and thresholding -- it may offer some advantages over the Lasso, e.g., it can be easily computed using a closed-form expression. In this paper, we define a debiased and thresholded ridge regression method, and prove a consistency result and a Gaussian approximation theorem. We further introduce a wild bootstrap algorithm to construct confidence regions and perform hypothesis testing for a linear combination of parameters. In addition to estimation, we consider the problem of prediction, and present a novel, hybrid bootstrap algorithm tailored for prediction intervals. Extensive numerical simulations further show that the debiased and thresholded ridge regression has favorable finite-sample performance and may be preferable in some settings.
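As a rough illustration of the pipeline this abstract describes -- a closed-form ridge fit followed by a debiasing correction and hard thresholding -- here is a minimal numerical sketch. The debiasing step below is a generic correction in the spirit of debiased-Lasso constructions, not the paper's exact estimator, and the function name and tuning parameters are illustrative:

```python
import numpy as np

def ridge_debias_threshold(X, y, lam=1.0, tau=0.1):
    """Closed-form ridge fit, a simple debiasing correction, then hard
    thresholding. Illustrative sketch only, not the paper's exact
    debiasing construction."""
    n, p = X.shape
    G = X.T @ X / n
    M = np.linalg.inv(G + lam * np.eye(p))    # closed-form ridge inverse
    beta_ridge = M @ (X.T @ y / n)            # ridge estimate
    # Generic debiasing idea: add back a correction built from the
    # residuals, using M as an approximate inverse covariance.
    resid = y - X @ beta_ridge
    beta_debiased = beta_ridge + M @ (X.T @ resid / n)
    # Hard-threshold small coefficients to recover sparsity.
    return np.where(np.abs(beta_debiased) > tau, beta_debiased, 0.0)
```

On a sparse synthetic problem, the thresholded estimate typically recovers the support while the raw ridge coefficients are all nonzero.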

Read more
Statistics Theory

Right-truncated Archimedean and related copulas

The copulas of random vectors with standard uniform univariate margins truncated from the right are considered and a general formula for such right-truncated conditional copulas is derived. This formula is analytical for copulas that can be inverted analytically as functions of each single argument. This is the case, for example, for Archimedean and related copulas. The resulting right-truncated Archimedean copulas are not only analytically tractable but can also be characterized as tilted Archimedean copulas. This finding allows one, for example, to more easily derive analytical properties, such as the coefficients of tail dependence, or to construct sampling procedures for right-truncated Archimedean copulas. As another result, one can easily obtain a limiting Clayton copula for a general vector of truncation points converging to zero; this is an important property for (re)insurance and a fact already known in the special case of equal truncation points, but harder to prove without the aforementioned characterization. Furthermore, right-truncated Archimax copulas with logistic stable tail dependence functions are characterized as tilted outer power Archimedean copulas and an analytical form of right-truncated nested Archimedean copulas is also derived.
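For background on the limiting Clayton copula mentioned above, the classical Marshall-Olkin frailty sampler for the (untruncated) Clayton copula can be sketched as follows; this is standard Archimedean-copula machinery, not the paper's right-truncated sampling procedure:

```python
import numpy as np

def sample_clayton(n, d, theta, rng=None):
    """Marshall-Olkin sampling of an (untruncated) d-dimensional Clayton
    copula with generator psi(t) = (1 + t)^(-1/theta), theta > 0:
    draw a Gamma(1/theta, 1) frailty V, then set U_i = psi(E_i / V)
    for iid Exp(1) variables E_i."""
    rng = np.random.default_rng(rng)
    V = rng.gamma(shape=1.0 / theta, scale=1.0, size=(n, 1))  # frailty
    E = rng.exponential(size=(n, d))                          # iid Exp(1)
    return (1.0 + E / V) ** (-1.0 / theta)                    # in (0,1)^d
```

Each column is marginally uniform on (0,1), and larger theta gives stronger (lower-tail) dependence.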

Read more
Statistics Theory

Risk Bounds for Quantile Trend Filtering

We study quantile trend filtering, a recently proposed method for nonparametric quantile regression, with the goal of generalizing existing risk bounds known for the usual trend filtering estimators, which perform mean regression. We study both the penalized and the constrained versions (of order r ≥ 1) of univariate quantile trend filtering. Our results show that both versions attain the minimax rate up to log factors when the (r−1)-th discrete derivative of the true vector of quantiles belongs to the class of bounded variation signals. Moreover, we show that if the true vector of quantiles is a discrete spline with a few polynomial pieces, then both versions attain a near-parametric rate of convergence. Corresponding results for the usual trend filtering estimators are known to hold only when the errors are sub-Gaussian. In contrast, our risk bounds are shown to hold under minimal assumptions on the error variables. In particular, no moment assumptions are needed and our results hold under heavy-tailed errors. On the other hand, we prove all our results for a Huber-type loss, which can be smaller than the mean squared error loss employed for showing risk bounds for usual trend filtering. Our proof techniques are general and thus can potentially be used to study other nonparametric quantile regression methods. To illustrate this generality, we also employ our proof techniques to obtain new results for multivariate quantile total variation denoising and high-dimensional quantile linear regression.
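The penalized estimator being analyzed can be written down concretely: it minimizes a pinball-loss data term plus an l1 penalty on the order-r discrete differences of the fitted vector. A short sketch of the objective follows (solving the resulting convex program would require an LP or ADMM solver and is omitted; names are illustrative):

```python
import numpy as np

def diff_matrix(n, r):
    """Order-r discrete difference operator D^(r) as an (n-r) x n matrix."""
    D = np.eye(n)
    for _ in range(r):
        D = D[1:] - D[:-1]
    return D

def pinball(residual, tau):
    """Quantile (pinball/check) loss rho_tau(u) = u * (tau - 1{u < 0})."""
    return residual * (tau - (residual < 0))

def qtf_objective(theta, y, tau, lam, r):
    """Penalized quantile trend filtering objective:
    sum_i rho_tau(y_i - theta_i) + lam * ||D^(r) theta||_1."""
    D = diff_matrix(len(y), r)
    return pinball(y - theta, tau).sum() + lam * np.abs(D @ theta).sum()
```

For r = 2 the penalty is the total variation of the discrete first derivative, so piecewise-linear quantile fits are cheap under this objective.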

Read more
Statistics Theory

Risk upper bounds for RKHS ridge group sparse estimator in the regression model with non-Gaussian and non-bounded error

We consider the problem of estimating a meta-model of an unknown regression model with non-Gaussian and non-bounded error. The meta-model belongs to a reproducing kernel Hilbert space constructed as a direct sum of Hilbert spaces, leading to an additive decomposition including the variables and interactions between them. The estimator of this meta-model is calculated by minimizing an empirical least-squares criterion penalized by the sum of the Hilbert norm and the empirical L2-norm. In this context, upper bounds on the empirical L2 risk and the L2 risk of the estimator are established.
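As background for readers unfamiliar with RKHS-based regression, here is a minimal kernel ridge regression sketch: a single RKHS with only the Hilbert-norm penalty, i.e., without the additive decomposition and the group-sparse empirical-norm part of the estimator studied here. Kernel and parameter choices are illustrative:

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """Gaussian RBF kernel matrix k(a, b) = exp(-gamma * ||a - b||^2)."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def kernel_ridge_fit(X, y, lam=1e-3, gamma=1.0):
    """Minimize (1/n) * sum (y_i - f(x_i))^2 + lam * ||f||_H^2; by the
    representer theorem the solution is f = sum_i alpha_i k(x_i, .)."""
    K = rbf_kernel(X, X, gamma)
    n = len(y)
    return np.linalg.solve(K + n * lam * np.eye(n), y)

def kernel_ridge_predict(X_train, alpha, X_new, gamma=1.0):
    return rbf_kernel(X_new, X_train, gamma) @ alpha
```

The full estimator in the paper adds one Hilbert-norm and one empirical-norm penalty per additive component, which is what induces the group-sparse structure.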

Read more
Statistics Theory

Robust Kernel Density Estimation with Median-of-Means principle

In this paper, we introduce a robust nonparametric density estimator combining the popular Kernel Density Estimation method and the Median-of-Means principle (MoM-KDE). This estimator is shown to achieve robustness to any kind of anomalous data, even in the case of adversarial contamination. In particular, while previous works only prove consistency results under a known contamination model, this work provides finite-sample high-probability error bounds without a priori knowledge of the outliers. Finally, when compared with other robust kernel estimators, we show that MoM-KDE achieves competitive results while having significantly lower computational complexity.
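The MoM-KDE construction is simple to sketch: split the sample into blocks, compute an ordinary KDE on each block, and take the pointwise median of the block estimates. A minimal version, with illustrative bandwidth and block-count choices:

```python
import numpy as np

def gaussian_kde_block(block, x, h):
    """Standard Gaussian KDE of one block, evaluated at the points x."""
    z = (x[:, None] - block[None, :]) / h
    return np.exp(-0.5 * z**2).sum(axis=1) / (len(block) * h * np.sqrt(2 * np.pi))

def mom_kde(sample, x, h, n_blocks):
    """MoM-KDE: split the sample into blocks, run a KDE per block, and
    return the pointwise median of the block estimates."""
    blocks = np.array_split(sample, n_blocks)
    estimates = np.stack([gaussian_kde_block(b, x, h) for b in blocks])
    return np.median(estimates, axis=0)
```

Taking the median across blocks is what confines the influence of outliers to the minority of blocks that contain them.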

Read more
Statistics Theory

Robust Persistence Diagrams using Reproducing Kernels

Persistent homology has become an important tool for extracting geometric and topological features from data, whose multi-scale features are summarized in a persistence diagram. From a statistical perspective, however, persistence diagrams are very sensitive to perturbations in the input space. In this work, we develop a framework for constructing robust persistence diagrams from superlevel filtrations of robust density estimators constructed using reproducing kernels. Using an analogue of the influence function on the space of persistence diagrams, we show that the proposed framework is less sensitive to outliers. The robust persistence diagrams are shown to be consistent estimators in bottleneck distance, with the convergence rate controlled by the smoothness of the kernel. This, in turn, allows us to construct uniform confidence bands in the space of persistence diagrams. Finally, we demonstrate the superiority of the proposed approach on benchmark datasets.
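To make the superlevel-filtration idea concrete, here is a toy computation of 0-dimensional persistence for a function sampled on a 1-D grid (for instance, a density estimate): sweeping the level down, peaks are born at local maxima and die, by the elder rule, when merged into a higher peak. This is a simplified stand-in for illustration, not the paper's kernel-based robust construction:

```python
import numpy as np

def superlevel_persistence_0d(f):
    """0-dimensional persistence pairs (birth, death) of the superlevel
    filtration of a 1-D function sampled on a grid."""
    f = np.asarray(f)
    order = np.argsort(-f)   # sweep grid points from high to low value
    comp = {}                # grid index -> union-find parent
    birth = {}               # component representative -> birth value
    pairs = []

    def find(i):
        while comp[i] != i:
            comp[i] = comp[comp[i]]
            i = comp[i]
        return i

    for i in order:
        comp[i] = i
        birth[i] = f[i]
        for j in (i - 1, i + 1):       # merge with active neighbors
            if j in comp:
                ri, rj = find(i), find(j)
                if ri == rj:
                    continue
                if birth[ri] < birth[rj]:
                    ri, rj = rj, ri    # elder rule: younger peak dies
                pairs.append((birth[rj], f[i]))
                comp[rj] = ri
    top = find(order[0])               # the oldest peak never dies
    pairs.append((birth[top], -np.inf))
    return pairs
```

Outliers in the data create spurious peaks in a non-robust density estimate, i.e., extra points in this diagram; a robust density estimator suppresses them, which is the mechanism the abstract formalizes.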

Read more
Statistics Theory

Robust W-GAN-Based Estimation Under Wasserstein Contamination

Robust estimation is an important problem in statistics which aims at providing a reasonable estimator when the data-generating distribution lies within an appropriately defined ball around an uncontaminated distribution. Although minimax rates of estimation have been established in recent years, many existing robust estimators with provably optimal convergence rates are also computationally intractable. In this paper, we study several estimation problems under a Wasserstein contamination model and present computationally tractable estimators motivated by generative adversarial networks (GANs). Specifically, we analyze properties of Wasserstein GAN-based estimators for location estimation, covariance matrix estimation, and linear regression and show that our proposed estimators are minimax optimal in many scenarios. Finally, we present numerical results which demonstrate the effectiveness of our estimators.

Read more
Statistics Theory

Robust and efficient mean estimation: approach based on the properties of self-normalized sums

Let X be a random variable with unknown mean and finite variance. We present a new estimator of the mean of X that is robust with respect to the possible presence of outliers in the sample, provides tight sub-Gaussian deviation guarantees without any additional assumptions on the shape or tails of the distribution, and moreover is asymptotically efficient. This is the first estimator that provably combines all these qualities in one package. Our construction is inspired by robustness properties possessed by the self-normalized sums. Finally, theoretical findings are supplemented by numerical simulations highlighting the strong performance of the proposed estimator in comparison with previously known techniques.
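The abstract does not spell out the new construction, but one standard previously known robust mean estimator of the kind it is compared against, the classical median-of-means, is easy to sketch and already illustrates the robustness-plus-sub-Gaussian-deviations theme (block count is a tuning choice):

```python
import numpy as np

def median_of_means(sample, n_blocks):
    """Classical median-of-means estimator: average within blocks, then
    take the median of the block means. With k blocks it tolerates up to
    about k/2 corrupted observations and gives sub-Gaussian deviation
    bounds under a finite-variance assumption."""
    blocks = np.array_split(np.asarray(sample), n_blocks)
    return np.median([b.mean() for b in blocks])
```

Unlike the paper's estimator, plain median-of-means is not asymptotically efficient on clean data, which is precisely the gap the abstract claims to close.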

Read more
Statistics Theory

Robust regression with covariate filtering: Heavy tails and adversarial contamination

We study the problem of linear regression where both covariates and responses are potentially (i) heavy-tailed and (ii) adversarially contaminated. Several computationally efficient estimators have been proposed for the simpler setting where the covariates are sub-Gaussian and uncontaminated; however, these estimators may fail when the covariates are either heavy-tailed or contain outliers. In this work, we show how to modify the Huber regression, least trimmed squares, and least absolute deviation estimators to obtain estimators which are simultaneously computationally and statistically efficient in the stronger contamination model. Our approach is quite simple, and consists of applying a filtering algorithm to the covariates, and then applying the classical robust regression estimators to the remaining data. We show that the Huber regression estimator achieves near-optimal error rates in this setting, whereas the least trimmed squares and least absolute deviation estimators can be made to achieve near-optimal error after applying a postprocessing step.
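The two-stage approach can be sketched directly: filter the covariates, then run a classical robust regression on what remains. Below, the filtering step is a crude norm-based stand-in for the paper's filtering algorithm, and Huber regression is implemented by iteratively reweighted least squares; the quantile threshold and Huber constant are illustrative:

```python
import numpy as np

def filter_covariates(X, y, quantile=0.95):
    """Drop samples whose covariate norm falls in the top tail -- a crude
    stand-in for the paper's filtering algorithm."""
    norms = np.linalg.norm(X, axis=1)
    keep = norms <= np.quantile(norms, quantile)
    return X[keep], y[keep]

def huber_irls(X, y, delta=1.345, n_iter=50):
    """Huber regression via iteratively reweighted least squares:
    residuals beyond delta get weight delta / |r| instead of 1."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    for _ in range(n_iter):
        r = y - X @ beta
        w = np.minimum(1.0, delta / np.maximum(np.abs(r), 1e-12))
        beta = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * y))
    return beta
```

The point of the filtering step is exactly the failure mode named above: a Huber fit alone is still vulnerable to outliers in the covariates, since those points have high leverage.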

Read more
Statistics Theory

Row-column factorial designs with multiple levels

An m×n row-column factorial design is an arrangement of the elements of a factorial design into a rectangular array. Such an array is used in experimental design, where the rows and columns can act as blocking factors. If for each row/column and vector position, each element has the same regularity, then all main effects can be estimated without confounding by the row and column blocking factors. Formally, for any integer q, let [q] = {0, 1, …, q−1}. The q^k (full) factorial design with replication α is the multiset consisting of α occurrences of each element of [q]^k; we denote this by α·[q]^k. A regular m×n row-column factorial design is an arrangement of the elements of α·[q]^k into an m×n array (which we say is of type I_k(m,n;q)) such that for each row (respectively, column) and fixed vector position i ∈ [k], each element of [q] occurs n/q times (respectively, m/q times). Let m ≤ n. We show that an array of type I_k(m,n;q) exists if and only if (a) q | m and q | n; (b) q^k | mn; (c) (k,q,m,n) ≠ (2,6,6,6); and (d) if (k,q,m) = (2,2,2), then 4 divides n. This extends the work of Godolphin (2019), who showed the above is true for the case q = 2 when m and n are powers of 2. In the case k = 2, the above implies necessary and sufficient conditions for the existence of a pair of mutually orthogonal frequency rectangles (or F-rectangles) whenever each symbol occurs the same number of times in a given row or column.
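The row/column regularity condition is easy to verify computationally. The sketch below checks it for a candidate array, represented as an m × n × k integer array, and includes a small example array of type I_2(2,4;2); the multiset condition (that the entries form α·[q]^k) would be checked separately:

```python
from collections import Counter
import numpy as np

def is_regular_rc_factorial(array, q):
    """Check the regularity condition: in every row (column) and every
    vector position, each element of [q] must occur n/q (respectively,
    m/q) times. `array` has shape (m, n, k)."""
    m, n, k = array.shape
    for i in range(k):
        for r in range(m):
            counts = Counter(array[r, :, i])
            if any(counts[s] != n // q for s in range(q)):
                return False
        for c in range(n):
            counts = Counter(array[:, c, i])
            if any(counts[s] != m // q for s in range(q)):
                return False
    return True

# An array of type I_2(2,4;2): q=2, k=2, m=2, n=4 satisfy conditions
# (a)-(d), and each element of [2]^2 occurs alpha = mn/q^k = 2 times.
A = np.array([[[0, 0], [0, 1], [1, 0], [1, 1]],
              [[1, 1], [1, 0], [0, 1], [0, 0]]])
```

Here each row contains each symbol n/q = 2 times in each position, and each column contains each symbol m/q = 1 time in each position.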

Read more
