Featured Research

Methodology

Conditional Variance Estimator for Sufficient Dimension Reduction

Conditional Variance Estimation (CVE) is a novel sufficient dimension reduction (SDR) method for additive error regressions with continuous predictors and link function. It operates under the assumption that the predictors can be replaced by a lower dimensional projection without loss of information. In contrast to the majority of moment-based sufficient dimension reduction methods, Conditional Variance Estimation is fully data driven, does not require the restrictive linearity and constant variance conditions, and is not based on inverse regression. CVE is shown to be consistent and its objective function to be uniformly convergent. CVE outperforms minimum average variance estimation (MAVE), its main competitor, in several simulation settings, remains on par with it in others, and always outperforms the usual inverse-regression-based linear SDR methods, such as Sliced Inverse Regression.
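To illustrate the underlying idea in a toy setting: if Y depends on X only through a one-dimensional projection, that direction can be recovered by minimizing an estimate of the conditional variance of Y given the projected predictor. The numpy sketch below is a brute-force illustration of this principle on invented data, not the CVE estimator itself; all names and parameter values are ours.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 2000, 5

# Toy SDR working model: Y depends on X only through the projection b^T X.
b = np.array([1.0, -1.0, 0.0, 0.0, 0.0]) / np.sqrt(2.0)
X = rng.standard_normal((n, p))
y = (X @ b) ** 2 + 0.25 * rng.standard_normal(n)

def mean_conditional_variance(v, n_bins=20):
    """Crude estimate of E[Var(Y | v^T X)] obtained by binning the projection."""
    z = X @ v
    edges = np.quantile(z, np.linspace(0, 1, n_bins + 1))
    idx = np.clip(np.digitize(z, edges[1:-1]), 0, n_bins - 1)
    return np.mean([y[idx == k].var() for k in range(n_bins) if np.any(idx == k)])

# Brute-force search over random unit directions: a direction carrying the
# sufficient reduction leaves only the noise variance in Y.
cands = rng.standard_normal((500, p))
cands /= np.linalg.norm(cands, axis=1, keepdims=True)
best = min(cands, key=mean_conditional_variance)
print("alignment with true direction:", abs(best @ b))  # ideally close to 1
```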

Read more
Methodology

Conditioning on the pre-test versus gain score modeling: revisiting the controversy in a multilevel setting

We consider estimating the effect of a treatment on the progress of subjects tested both before and after treatment assignment. A vast literature compares the competing approaches of modeling the post-test score conditionally on the pre-test score versus modeling the difference, namely the gain score. Our contribution is to analyze the merits and drawbacks of the two approaches in a multilevel setting. This is relevant in many fields, for example education, with students nested within schools. The multilevel structure raises specific issues related to contextual effects and to the distinction between individual-level and cluster-level treatment. We derive approximate analytical results and compare the two approaches in a simulation study. For an individual-level treatment our findings are in line with the literature, whereas for a cluster-level treatment we point out the key role of the cluster mean of the pre-test score, which favors the conditioning approach in settings with large clusters.
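A minimal sketch of the two estimators on simulated two-level data with a randomized cluster-level treatment (all parameter values are invented for illustration; the paper's analysis covers settings where the two approaches genuinely diverge):

```python
import numpy as np

rng = np.random.default_rng(1)
G, m = 100, 30                      # clusters (e.g. schools) and cluster size

# Cluster-level treatment; a cluster random effect enters pre- and post-test.
treat = rng.integers(0, 2, G)                   # one arm per cluster
u = rng.standard_normal(G)                      # cluster random effect
pre = (u[:, None] + rng.standard_normal((G, m))).ravel()
z = np.repeat(treat, m)
post = 0.5 * pre + 2.0 * z + np.repeat(u, m) + rng.standard_normal(G * m)

def ols(cols, y):
    X = np.column_stack([np.ones_like(y)] + cols)
    return np.linalg.lstsq(X, y, rcond=None)[0]

gain_est = ols([z], post - pre)[1]              # gain-score model
cond_est = ols([z, pre], post)[1]               # conditioning on the pre-test
# Both are near 2.0 here because treatment is randomized; adding the cluster
# mean of the pre-test, np.repeat(pre.reshape(G, m).mean(1), m), as a regressor
# captures the contextual effect that the paper shows is key more generally.
print(f"gain-score: {gain_est:.2f}, conditioning: {cond_est:.2f}")
```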

Read more
Methodology

Confidence intervals for parameters in high-dimensional sparse vector autoregression

Vector autoregression (VAR) models are widely used to analyze the interrelationships between multiple variables over time. Estimation and inference for the transition matrices of VAR models are crucial for practitioners making decisions in fields such as economics and finance. However, when the number of variables is larger than the sample size, it remains challenging to perform statistical inference on the model parameters. In this article, we propose the de-biased Lasso and two bootstrap de-biased Lasso methods to construct confidence intervals for the elements of the transition matrices of high-dimensional VAR models. We show that the proposed methods are asymptotically valid under appropriate sparsity and other regularity conditions. To implement our methods, we develop feasible and parallelizable algorithms that save much of the computation required by the nodewise Lasso and the bootstrap. A simulation study illustrates that our methods perform well in finite samples. Finally, we apply our methods to analyze the price data of stocks in the S&P 500 index in 2019. We find that some stocks, such as Newmont Corporation, the largest gold producer in the world, have significant predictive power over most other stocks.
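For orientation, here is a schematic sketch of the generic nodewise/de-biased Lasso construction for a single entry of a VAR(1) transition matrix. It follows the textbook recipe rather than the authors' faster algorithms, and all tuning values are illustrative:

```python
import numpy as np
from scipy import stats
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
T, p = 200, 30

# Simulate a sparse VAR(1): x_t = A x_{t-1} + noise, with A = 0.5 * I.
A = 0.5 * np.eye(p)
xs = np.zeros((T + 1, p))
for t in range(T):
    xs[t + 1] = A @ xs[t] + rng.standard_normal(p)
X, Y = xs[:-1], xs[1:]          # rows of A are estimated equation by equation

i, j = 0, 0                     # target entry A[i, j]; true value is 0.5 here
lam = 0.1                       # illustrative tuning value
beta = Lasso(alpha=lam, fit_intercept=False).fit(X, Y[:, i]).coef_

# Nodewise Lasso for column j gives one row of an approximate precision matrix.
mask = np.arange(p) != j
gamma = Lasso(alpha=lam, fit_intercept=False).fit(X[:, mask], X[:, j]).coef_
resid_j = X[:, j] - X[:, mask] @ gamma
tau2 = resid_j @ X[:, j] / T
theta_j = np.zeros(p); theta_j[j] = 1.0; theta_j[mask] = -gamma
theta_j /= tau2

# One-step de-biasing and a normal-approximation confidence interval.
resid = Y[:, i] - X @ beta
b_deb = beta[j] + theta_j @ (X.T @ resid) / T
sigma = np.sqrt(resid @ resid / T)
se = sigma * np.sqrt(theta_j @ (X.T @ X / T) @ theta_j / T)
z = stats.norm.ppf(0.975)
print(f"95% CI for A[{i},{j}]: [{b_deb - z * se:.3f}, {b_deb + z * se:.3f}]")
```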

Read more
Methodology

Confidence intervals in general regression models that utilize uncertain prior information

We consider a general regression model without a scale parameter. Our aim is to construct a confidence interval for a scalar parameter of interest θ that utilizes the uncertain prior information that a distinct scalar parameter τ takes the specified value t. This confidence interval should have good coverage properties. Its expected length, scaled by that of the usual confidence interval, should (a) be substantially less than 1 when the prior information is correct, (b) have a maximum value that is not too large, and (c) be close to 1 when the data and the prior information are highly discordant. The asymptotic joint distribution of the maximum likelihood estimators of θ and τ is similar to the joint distribution of these estimators in the particular case of a linear regression with normally distributed errors having known variance. This similarity is used to construct a confidence interval with the desired properties by using the confidence interval, computed with the R package ciuupi, that utilizes the uncertain prior information in this linear regression case. An important practical application of this confidence interval is to a quantal bioassay carried out to compare two similar compounds. In this context, the uncertain prior information is that the hypothesis of "parallelism" holds. We provide extensive numerical results that illustrate the properties of this confidence interval in this context.
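Stated symbolically (in our notation, not necessarily the paper's), writing C(Y) for the proposed interval and C_std(Y) for the usual interval, the scaled expected length s(τ) and requirements (a)-(c) read:

```latex
s(\tau) \;=\; \frac{\mathbb{E}\!\left[\,\operatorname{length} C(Y)\,\right]}
                   {\mathbb{E}\!\left[\,\operatorname{length} C_{\mathrm{std}}(Y)\,\right]},
\qquad
\text{(a)}\ s(t) \ll 1, \quad
\text{(b)}\ \sup_{\tau} s(\tau)\ \text{not too large}, \quad
\text{(c)}\ s(\tau) \to 1\ \text{as}\ |\tau - t| \to \infty .
```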

Read more
Methodology

Confidently Comparing Estimators with the c-value

Modern statistics provides an ever-expanding toolkit for estimating unknown parameters. Consequently, applied statisticians frequently face a difficult decision: retain a parameter estimate from a familiar method or replace it with an estimate from a newer or more complex one. While it is traditional to compare estimators using risk, such comparisons are rarely conclusive in realistic settings. In response, we propose the "c-value" as a measure of confidence that a new estimate achieves smaller loss than an old estimate on a given dataset. We show that it is unlikely that the computed c-value is large while the new estimate has larger loss than the old. Therefore, just as a small p-value provides evidence to reject a null hypothesis, a large c-value provides evidence to use a new estimate in place of the old. For a wide class of problems and estimators, we show how to compute a c-value by first constructing a data-dependent high-probability lower bound on the difference in loss. The c-value is frequentist in nature, but we show that it can validate Bayesian estimates in real-data applications involving hierarchical models and Gaussian processes.
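One natural way to formalize the construction sketched above (notation ours): writing L for the loss and b(y, α) for the data-dependent lower bound on the loss difference, the c-value is the largest confidence level at which that bound is still nonnegative:

```latex
P_{\theta}\bigl\{\, \mathcal{L}\bigl(\theta, \hat{\theta}_{\mathrm{old}}\bigr)
                  - \mathcal{L}\bigl(\theta, \hat{\theta}_{\mathrm{new}}\bigr)
                  \;\ge\; b(y, \alpha) \,\bigr\} \;\ge\; \alpha,
\qquad
c(y) \;=\; \sup\bigl\{\, \alpha \in [0, 1] : b(y, \alpha) \ge 0 \,\bigr\}.
```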

Read more
Methodology

Consistent detection and optimal localization of all detectable change points in piecewise stationary arbitrarily sparse network-sequences

We consider the offline change point detection and localization problem in the context of piecewise stationary networks, where the observable is a finite sequence of networks. We develop algorithms based on suitably modified CUSUM statistics computed from adaptively trimmed adjacency matrices of the observed networks, for both detection and localization of single or multiple change points in the input data. We provide rigorous theoretical analysis and finite-sample estimates evaluating the performance of the proposed methods when the input (a finite sequence of networks) is generated from an inhomogeneous random graph model, where change points are characterized by changes in the mean adjacency matrix. We show that the proposed algorithms consistently detect (resp. localize) all change points at which the change in the expected adjacency matrix exceeds the minimax detectability (resp. localizability) threshold, without any a priori assumption on (a) a lower bound for the sparsity of the underlying networks, (b) an upper bound for the number of change points, or (c) a lower bound for the separation between successive change points, provided either the minimum separation between successive change points or the average degree of the underlying networks goes to infinity arbitrarily slowly. We also prove that this condition is necessary for consistency.
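As a point of reference, the sketch below runs a plain (untrimmed) Frobenius-norm CUSUM scan on a simulated sequence of networks with a single change in edge density; the adaptive trimming that underlies the paper's sparsity-free guarantees is deliberately omitted, and all sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)
n_nodes, T, tau = 60, 40, 25     # network size, sequence length, true change

def sample_adj(P):
    """Symmetric adjacency matrix with independent edges, mean P."""
    A = np.triu((rng.random((n_nodes, n_nodes)) < P).astype(float), 1)
    return A + A.T

P1 = np.full((n_nodes, n_nodes), 0.10)
P2 = np.full((n_nodes, n_nodes), 0.20)
seq = np.stack([sample_adj(P1 if t < tau else P2) for t in range(T)])
flat = seq.reshape(T, -1)

# Plain CUSUM: at each candidate split t, compare the average network before
# and after in Frobenius norm, with the usual sqrt(t(T-t)/T) weighting.
def cusum(t):
    return np.sqrt(t * (T - t) / T) * np.linalg.norm(flat[:t].mean(0) - flat[t:].mean(0))

scores = [cusum(t) for t in range(1, T)]
print("estimated change point:", 1 + int(np.argmax(scores)))  # near tau
```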

Read more
Methodology

Constructing Confidence Intervals for the Signals in Sparse Phase Retrieval

In this paper, we provide a general methodology for drawing statistical inferences on individual signal coordinates, or on linear combinations of them, in sparse phase retrieval. Given an initial estimator of the target parameter (some simple function of the signal) generated by an existing algorithm, we modify it so that the modified version is asymptotically normal and unbiased. Confidence intervals and hypothesis tests can then be constructed from this asymptotic normality. For conciseness, we focus on confidence intervals in this work; a similar procedure can be adopted for hypothesis tests. Under some mild assumptions on the signal and the sample size, we establish theoretical guarantees for the proposed method. These assumptions are generally weak in the sense that the dimension may exceed the sample size and many small non-zero coordinates are allowed. Furthermore, the theoretical analysis reveals that the modified estimators of individual coordinates have uniformly bounded variance, and hence simultaneous interval estimation is possible. Numerical simulations in a wide range of settings support our theoretical results.
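In our own notation (the abstract does not fix symbols), the setting and the resulting interval take the following form, with the usual caveat that a signal in phase retrieval is identified only up to a global sign:

```latex
y_k \;=\; \bigl|\langle a_k, x^{\star}\rangle\bigr|^{2} + \varepsilon_k,
\qquad k = 1, \dots, n,
\qquad x^{\star} \in \mathbb{R}^{p}\ \text{sparse},\ p > n\ \text{allowed};

\hat{\theta} \;\approx\; N\!\bigl(\theta, \hat{\sigma}^{2}\bigr)
\;\;\Longrightarrow\;\;
\bigl[\hat{\theta} - z_{1-\alpha/2}\,\hat{\sigma},\;
      \hat{\theta} + z_{1-\alpha/2}\,\hat{\sigma}\bigr]
\quad \text{for } \theta = c^{\top} x^{\star}.
```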

Read more
Methodology

Constructing a More Closely Matched Control Group in a Difference-in-Differences Analysis: Its Effect on History Interacting with Group Bias

Difference-in-differences analysis with a control group that differs considerably from a treated group is vulnerable to bias from historical events that affect the two groups differently. Constructing a more closely matched control group, by matching a subset of the overall control group to the treated group, may reduce this bias. We study this phenomenon in simulation studies. We then study the effect of mountaintop removal mining (MRM) on mortality with a difference-in-differences analysis that makes use of the increase in MRM following the 1990 Clean Air Act Amendments. Constructing a more closely matched control group for this analysis, we found a 95% confidence interval that contains substantial adverse effects along with no effect and small beneficial effects.
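A toy numpy sketch of the bias mechanism and the matching remedy (the covariate, effect sizes, and history-driven drift are all invented for illustration; this is not the MRM analysis):

```python
import numpy as np

rng = np.random.default_rng(4)
n_t, n_c = 200, 2000

# Treated units and a large, poorly comparable control pool (one covariate).
x_t = rng.normal(1.0, 1.0, n_t)
x_c = rng.normal(0.0, 1.5, n_c)

def outcomes(x, treated):
    pre = x + rng.standard_normal(x.size)
    # History interacts with the covariate: units with larger x drift more.
    post = pre + 0.5 * x + 1.0 * treated + rng.standard_normal(x.size)
    return pre, post

pre_t, post_t = outcomes(x_t, 1)
pre_c, post_c = outcomes(x_c, 0)

def did(idx):
    return (post_t - pre_t).mean() - (post_c[idx] - pre_c[idx]).mean()

# Full control group vs. a nearest-neighbor matched subset on the covariate.
matched = np.array([np.argmin(np.abs(x_c - xt)) for xt in x_t])
print(f"DiD, all controls:     {did(np.arange(n_c)):.2f}")  # biased above 1.0
print(f"DiD, matched controls: {did(matched):.2f}")         # closer to 1.0
```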

Read more
Methodology

Continuum centroid classifier for functional data

Aiming at the binary classification of functional data, we propose the continuum centroid classifier (CCC), built upon projections of functional data onto one specific direction. This direction is obtained by bridging regression and classification. By controlling the extent of supervision, our technique is neither unsupervised nor fully supervised. Thanks to the intrinsic infinite-dimensionality of functional data, one of the two subtypes of CCC enjoys an (asymptotically) zero misclassification rate. Our proposal includes an effective algorithm that yields a consistent empirical counterpart of CCC. Simulation studies demonstrate the performance of CCC in different scenarios. Finally, we apply CCC to two real-data examples.
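For intuition, the sketch below implements only the plain centroid classifier on discretized curves, i.e. projection onto the difference of class mean functions; the "continuum" direction interpolating between regression and classification is the paper's contribution and is not reproduced here. Data and parameters are invented:

```python
import numpy as np

rng = np.random.default_rng(5)
grid = np.linspace(0, 1, 100)          # functional data observed on a grid

def sample_curves(n, mean_fn):
    # Smooth-ish random curves around a class mean function.
    steps = rng.standard_normal((n, grid.size)) @ np.diag(np.exp(-2 * grid))
    return mean_fn(grid) + np.cumsum(steps, axis=1) / np.sqrt(grid.size)

X0 = sample_curves(100, np.sin)
X1 = sample_curves(100, lambda t: np.sin(t) + 0.8 * t)

# Plain centroid classifier: project each curve onto the difference of the
# class mean functions and threshold at the midpoint of the projected means.
w = X1.mean(0) - X0.mean(0)
mid = 0.5 * (X1.mean(0) + X0.mean(0)) @ w

def predict(X):
    return (X @ w > mid).astype(int)

X0t = sample_curves(200, np.sin)
X1t = sample_curves(200, lambda t: np.sin(t) + 0.8 * t)
acc = 0.5 * ((predict(X0t) == 0).mean() + (predict(X1t) == 1).mean())
print("balanced test accuracy:", acc)
```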

Read more
Methodology

Contrastive latent variable modeling with application to case-control sequencing experiments

High-throughput RNA-sequencing (RNA-seq) technologies are powerful tools for understanding cellular state. Often it is of interest to quantify and summarize changes in cell state that occur between experimental or biological conditions. Differential expression is typically assessed using univariate tests to measure gene-wise shifts in expression. However, these methods largely ignore changes in transcriptional correlation. Furthermore, there is a need to identify the low-dimensional structure of the gene expression shift to identify collections of genes that change between conditions. Here, we propose contrastive latent variable models designed for count data to create a richer portrait of differential expression in sequencing data. These models disentangle the sources of transcriptional variation in different conditions, in the context of an explicit model of variation at baseline. Moreover, we develop a model-based hypothesis testing framework that can test for global and gene subset-specific changes in expression. We test our model through extensive simulations and analyses with count-based gene expression data from perturbation and observational sequencing experiments. We find that our methods can effectively summarize and quantify complex transcriptional changes in case-control experimental sequencing data.
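As a rough linear analogue of the contrastive idea (classical contrastive PCA on Gaussian-scale data, not the authors' count-based models or their testing framework), the following sketch recovers variation enriched in the foreground (case) condition relative to the background (control):

```python
import numpy as np

rng = np.random.default_rng(6)
n, p = 500, 40

# Background variation shared by both conditions, plus a condition-specific
# signal present only in the foreground (case) data.
shared = rng.standard_normal((n, 2)) @ rng.standard_normal((2, p))
signal = rng.standard_normal((n, 1)) @ (2.0 * rng.standard_normal((1, p)))
background = shared + 0.5 * rng.standard_normal((n, p))
foreground = shared + signal + 0.5 * rng.standard_normal((n, p))

# Linear contrastive PCA: top eigenvectors of C_fg - gamma * C_bg pick out
# directions whose variance is enriched in the foreground condition.
gamma = 1.0
C_fg = np.cov(foreground, rowvar=False)
C_bg = np.cov(background, rowvar=False)
evals, evecs = np.linalg.eigh(C_fg - gamma * C_bg)
contrastive_dir = evecs[:, -1]          # eigenvector with largest eigenvalue
scores = foreground @ contrastive_dir   # case-specific latent representation
print("top contrastive eigenvalue:", evals[-1])
```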

Read more
