Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Jiashun Jin is active.

Publication


Featured research published by Jiashun Jin.


Annals of Statistics | 2004

Higher criticism for detecting sparse heterogeneous mixtures

David L. Donoho; Jiashun Jin

Higher criticism, or second-level significance testing, is a multiple-comparisons concept mentioned in passing by Tukey. It concerns a situation where there are many independent tests of significance and one is interested in rejecting the joint null hypothesis. Tukey suggested comparing the fraction of observed significances at a given α-level to the expected fraction under the joint null. In fact, he suggested standardizing the difference of the two quantities and forming a z-score; the resulting z-score tests the significance of the body of significance tests. We consider a generalization, where we maximize this z-score over a range of significance levels 0 < α ≤ α0. We are able to show that the resulting higher criticism statistic is effective at resolving a very subtle testing problem: testing whether n normal means are all zero versus the alternative that a small fraction is nonzero. The subtlety of this “sparse normal means” testing problem can be seen from work of Ingster and Jin, who studied such problems in great detail. In their studies, they identified an interesting range of cases where the small fraction of nonzero means is so small that the alternative hypothesis exhibits little noticeable effect on the distribution of the p-values either for the bulk of the tests or for the few most highly significant tests. In this range, when the amplitude of nonzero means is calibrated with the fraction of nonzero means, the likelihood ratio test for a precisely specified alternative would still succeed in separating the two hypotheses. We show that the higher criticism is successful throughout the same region of amplitude and sparsity where the likelihood ratio test would succeed.
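
To make the construction concrete, here is a minimal Python sketch of the higher criticism statistic described above, assuming a vector of two-sided p-values as input; the function name, the default α0 = 0.1, and the toy simulation are illustrative choices, not taken from the paper.

```python
import numpy as np
from scipy.stats import norm

def higher_criticism(pvalues, alpha0=0.1):
    """Sketch of the HC statistic: standardize the gap between the observed
    and expected fraction of significances at level alpha = p_(i), then take
    the maximum over significance levels 0 < alpha <= alpha0."""
    n = len(pvalues)
    p_sorted = np.sort(pvalues)
    i = np.arange(1, n + 1)
    z = np.sqrt(n) * (i / n - p_sorted) / np.sqrt(p_sorted * (1 - p_sorted))
    keep = p_sorted <= alpha0          # restrict to levels at or below alpha0
    return z[keep].max() if keep.any() else 0.0

# Toy example: n normal means, all zero under the null, a sparse fraction
# nonzero under the alternative.
rng = np.random.default_rng(0)
x = rng.normal(size=10_000)
x[:50] += 3.0                          # sparse, moderately strong signals
pv = 2 * norm.sf(np.abs(x))            # two-sided p-values
print(higher_criticism(pv))            # large value suggests rejecting the joint null
```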


The EMBO Journal | 2007

Counting of six pRNAs of phi29 DNA-packaging motor with customized single-molecule dual-view system

Dan Shu; Hui Zhang; Jiashun Jin; Peixuan Guo

Direct imaging or counting of RNA molecules has been difficult owing to its relatively low electron density for EM and insufficient resolution in AFM. Bacteriophage phi29 DNA‐packaging motor is geared by a packaging RNA (pRNA) ring. Currently, whether the ring is a pentagon or hexagon is under fervent debate. We report here the assembly of a highly sensitive imaging system for direct counting of the copy number of pRNA within this 20‐nm motor. Single fluorophore imaging clearly identified the quantized photobleaching steps from pRNA labeled with a single fluorophore and concluded its stoichiometry within the motor. Almost all of the motors contained six copies of pRNA before and during DNA translocation, identified by dual‐color detection of the stalled intermediates of motors containing Cy3‐pRNA and Cy5‐DNA. The stalled motors were restarted to observe the motion of DNA packaging in real time. Heat‐denaturation analysis confirmed that the stoichiometry of pRNA is the common multiple of 2 and 3. EM imaging of procapsid/pRNA complexes clearly revealed six ferritin particles that were conjugated to each pRNA ring.


Knowledge Discovery and Data Mining | 2004

When do data mining results violate privacy?

Murat Kantarcıoğlu; Jiashun Jin; Chris Clifton

Privacy-preserving data mining has concentrated on obtaining valid results when the input data is private. An extreme example is Secure Multiparty Computation-based methods, where only the results are revealed. However, this still leaves a potential privacy breach: Do the results themselves violate privacy? This paper explores this issue, developing a framework under which this question can be addressed. Metrics are proposed, along with analysis that those metrics are consistent in the face of apparent problems.


Annals of Statistics | 2010

Innovated higher criticism for detecting sparse signals in correlated noise

Peter Hall; Jiashun Jin

Higher Criticism is a method for detecting signals that are both sparse and weak. Although first proposed in cases where the noise variables are independent, Higher Criticism also has reasonable performance in settings where those variables are correlated. In this paper we show that, by exploiting the nature of the correlation, performance can be improved by using a modified approach which exploits the potential advantages that correlation has to offer. Indeed, it turns out that the case of independent noise is the most difficult of all, from a statistical viewpoint, and that more accurate signal detection (for a given level of signal sparsity and strength) can be obtained when correlation is present. We characterize the advantages of correlation by showing how to incorporate them into the definition of an optimal detection boundary. The boundary has particularly attractive properties when correlation decays at a polynomial rate or the correlation matrix is Toeplitz.
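
As a rough illustration of the idea, the sketch below applies HC after an innovation-style transform by the inverse of a known noise correlation matrix; the AR(1)-type Toeplitz correlation, the re-standardization step, and all names are assumptions made for illustration, not the paper's exact construction.

```python
import numpy as np
from scipy.linalg import toeplitz
from scipy.stats import norm

def innovated_hc(x, sigma, alpha0=0.1):
    """HC applied after an innovation-style transform that exploits a known
    noise correlation matrix sigma (a sketch, not the paper's exact recipe)."""
    sigma_inv = np.linalg.inv(sigma)
    u = sigma_inv @ x
    u = u / np.sqrt(np.diag(sigma_inv))        # unit null variance per coordinate
    pv = np.sort(2 * norm.sf(np.abs(u)))       # sorted two-sided p-values
    n = len(pv)
    i = np.arange(1, n + 1)
    hc = np.sqrt(n) * (i / n - pv) / np.sqrt(pv * (1 - pv))
    keep = pv <= alpha0
    return hc[keep].max() if keep.any() else 0.0

# AR(1)-type Toeplitz correlation, chosen purely for illustration.
n = 2000
sigma = toeplitz(0.5 ** np.arange(n))
rng = np.random.default_rng(1)
x = np.linalg.cholesky(sigma) @ rng.normal(size=n)
x[rng.choice(n, 20, replace=False)] += 2.5     # sparse signals in correlated noise
print(innovated_hc(x, sigma))
```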


Proceedings of the National Academy of Sciences of the United States of America | 2008

Higher criticism thresholding: Optimal feature selection when useful features are rare and weak

David Donoho; Jiashun Jin

In important application fields today—genomics and proteomics are examples—selecting a small subset of useful features is crucial for success of Linear Classification Analysis. We study feature selection by thresholding of feature Z-scores and introduce a principle of threshold selection, based on the notion of higher criticism (HC). For i = 1, 2, …, p, let πi denote the two-sided P-value associated with the ith feature Z-score and π(i) denote the ith order statistic of the collection of P-values. The HC threshold is the absolute Z-score corresponding to the P-value maximizing the HC objective (i/p − π(i)) / √((i/p)(1 − i/p)). We consider a rare/weak (RW) feature model, where the fraction of useful features is small and the useful features are each too weak to be of much use on their own. HC thresholding (HCT) has interesting behavior in this setting, with an intimate link between maximizing the HC objective and minimizing the error rate of the designed classifier, and very different behavior from popular threshold selection procedures such as false discovery rate thresholding (FDRT). In the most challenging RW settings, HCT uses an unconventionally low threshold; this keeps the missed-feature detection rate under better control than FDRT and yields a classifier with improved misclassification performance. Replacing cross-validated threshold selection in the popular Shrunken Centroid classifier with the computationally less expensive and simpler HCT reduces the variance of the selected threshold and the error rate of the constructed classifier. Results on standard real datasets and in asymptotic theory confirm the advantages of HCT.
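
The threshold selection rule can be sketched directly from the objective quoted above; in the snippet below the search range (the smallest half of the p-values by default), the √p factor (which does not change the maximizer), and the toy data are illustrative choices.

```python
import numpy as np
from scipy.stats import norm

def hc_threshold(z_scores, frac=0.5):
    """Sketch of HC thresholding: maximize sqrt(p) * (i/p - pi_(i)) /
    sqrt((i/p)(1 - i/p)) over the smallest p-values, then return the
    absolute Z-score that the maximizing p-value corresponds to."""
    p = len(z_scores)
    pv_sorted = np.sort(2 * norm.sf(np.abs(z_scores)))   # two-sided p-values
    k = int(frac * p)                                    # search range (illustrative)
    i = np.arange(1, k + 1)
    hc = np.sqrt(p) * (i / p - pv_sorted[:k]) / np.sqrt((i / p) * (1 - i / p))
    i_star = np.argmax(hc)
    return norm.isf(pv_sorted[i_star] / 2)               # back to an absolute Z-score

# Rare/weak toy example: a small fraction of features carry weak contrast.
rng = np.random.default_rng(2)
z = rng.normal(size=5000)
z[:60] += 2.0
t = hc_threshold(z)
print(f"HC threshold {t:.2f}; {np.sum(np.abs(z) >= t)} features selected")
```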


Annals of Statistics | 2015

Fast community detection by SCORE

Jiashun Jin

Consider a network where the nodes split into K different communities. The community labels for the nodes are unknown and it is of major interest to estimate them (i.e., community detection). The Degree Corrected Block Model (DCBM) is a popular network model. How to detect communities with the DCBM is an interesting problem, where the main challenge lies in the degree heterogeneity. We propose a new approach to community detection which we call Spectral Clustering On Ratios-of-Eigenvectors (SCORE). Compared to classical spectral methods, the main innovation is to use the entry-wise ratios between the first leading eigenvector and each of the other leading eigenvectors for clustering. Let X be the adjacency matrix of the network. We first obtain the K leading eigenvectors, say, η̂1, …, η̂K, and let R̂ be the n × (K − 1) matrix such that R̂(i, k) = η̂k+1(i)/η̂1(i), 1 ≤ i ≤ n, 1 ≤ k ≤ K − 1. We then use R̂ for clustering by applying the k-means method. The central surprise is that the effect of degree heterogeneity is largely ancillary, and can be effectively removed by taking entry-wise ratios between η̂k+1 and η̂1, 1 ≤ k ≤ K − 1. The method is successfully applied to the web blogs data and the karate club data, with error rates of 58/1222 and 1/34, respectively. These results are much more satisfactory than those obtained by classical spectral methods. Also, compared to modularity methods, SCORE is computationally much faster and has smaller error rates. We develop a theoretic framework in which we show that, under mild conditions, SCORE stably yields successful community detection. At the core of the analysis is the recent development in Random Matrix Theory (RMT), where the matrix-form Bernstein inequality is especially helpful.
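
A minimal sketch of the SCORE recipe described above, assuming a symmetric adjacency matrix and a known K; the use of scipy's eigensolver, scikit-learn's KMeans, the toy DCBM simulation, and the guard against near-zero entries of the leading eigenvector are implementation choices, not part of the paper.

```python
import numpy as np
from scipy.sparse.linalg import eigsh
from sklearn.cluster import KMeans

def score(adjacency, K):
    """SCORE sketch: take the K leading eigenvectors of the adjacency matrix,
    form the n x (K-1) matrix of entry-wise ratios against the first
    eigenvector, and run k-means on the rows."""
    A = np.asarray(adjacency, dtype=float)
    vals, vecs = eigsh(A, k=K, which="LA")       # K largest eigenvalues/vectors
    vecs = vecs[:, np.argsort(vals)[::-1]]       # order by decreasing eigenvalue
    lead = vecs[:, 0]
    lead = np.where(np.abs(lead) < 1e-12, 1e-12, lead)   # guard tiny entries
    R = vecs[:, 1:] / lead[:, None]              # ratios remove degree heterogeneity
    return KMeans(n_clusters=K, n_init=10, random_state=0).fit_predict(R)

# Toy degree-corrected block model with two communities.
rng = np.random.default_rng(3)
n, K = 300, 2
truth = rng.integers(0, K, size=n)
theta = rng.uniform(0.3, 1.0, size=n)            # degree heterogeneity parameters
P = np.where(truth[:, None] == truth[None, :], 0.9, 0.2)
probs = np.clip(np.outer(theta, theta) * P, 0, 1)
A = (rng.uniform(size=(n, n)) < probs).astype(float)
A = np.triu(A, 1); A = A + A.T                   # symmetric, no self-loops
print(score(A, K)[:20])                          # estimated labels (up to permutation)
```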


Philosophical Transactions of the Royal Society A | 2009

Feature selection by higher criticism thresholding achieves the optimal phase diagram.

David L. Donoho; Jiashun Jin

We consider two-class linear classification in a high-dimensional, small-sample-size setting. Only a small fraction of the features are useful, these being unknown to us, and each useful feature contributes weakly to the classification decision. This was called the rare/weak (RW) model in our previous study (Donoho, D. & Jin, J. 2008 Proc. Natl Acad. Sci. USA 105, 14790–14795). We select features by thresholding feature Z-scores. The threshold is set by higher criticism (HC). For 1 ≤ i ≤ N, let πi denote the p-value associated with the ith Z-score and π(i) denote the ith order statistic of the collection of p-values. The HC threshold (HCT) is the order statistic of the Z-score corresponding to index i maximizing (i/N − π(i)) / √((i/N)(1 − i/N)). The ideal threshold optimizes the classification error. In that previous study, we showed that HCT was numerically close to the ideal threshold. We formalize an asymptotic framework for studying the RW model, considering a sequence of problems with increasingly many features and relatively fewer observations. We show that, along this sequence, the limiting performance of ideal HCT is essentially just as good as the limiting performance of ideal thresholding. Our results describe a two-dimensional phase space, a diagram with coordinates quantifying ‘rare’ and ‘weak’ in the RW model. The phase space can be partitioned into two regions—one where ideal threshold classification is successful, and one where the features are so weak and so rare that it must fail. Surprisingly, the regions where ideal HCT succeeds and fails make exactly the same partition of the phase diagram. Other threshold methods, such as false (feature) discovery rate (FDR) threshold selection, are successful in a substantially smaller region of the phase space than either HCT or ideal thresholding. The FDR and local FDR of the ideal and HC threshold selectors have surprising phase diagrams, which are also described. Results showing the asymptotic equivalence of HCT with ideal HCT can be found in a forthcoming paper (Donoho, D. & Jin, J. In preparation).


Annals of Statistics | 2007

Estimation and Confidence Sets for Sparse Normal Mixtures

T. Tony Cai; Jiashun Jin; Mark G. Low

For high dimensional statistical models, researchers have begun to focus on situations which can be described as having relatively few moderately large coefficients. Such situations lead to some very subtle statistical problems. In particular, Ingster and Donoho and Jin have considered a sparse normal means testing problem, in which they described the precise demarcation or detection boundary. Meinshausen and Rice have shown that it is even possible to estimate consistently the fraction of nonzero coordinates on a subset of the detectable region, but leave unanswered the question of exactly in which parts of the detectable region consistent estimation is possible. In the present paper we develop a new approach for estimating the fraction of nonzero means for problems where the nonzero means are moderately large. We show that the detection region described by Ingster and Donoho and Jin turns out to be the region where it is possible to consistently estimate the expected fraction of nonzero coordinates. This theory is developed further and minimax rates of convergence are derived. A procedure is constructed which attains the optimal rate of convergence in this setting. Furthermore, the procedure also provides an honest lower bound for confidence intervals while minimizing the expected length of such an interval. Simulations are used to enable comparison with the work of Meinshausen and Rice, where a procedure is given but where rates of convergence have not been discussed. Extensions to more general Gaussian mixture models are also given.


Annals of Statistics | 2012

UPS Delivers Optimal Phase Diagram in High Dimensional Variable Selection

Pengsheng Ji; Jiashun Jin

We consider a linear regression model Y = Xβ + z, z ~ N(0, I_n), X = X_{n,p}, where both p and n are large but p > n. The vector β is unknown but is sparse in the sense that only a small proportion of its coordinates is nonzero, and we are interested in identifying these nonzero ones. We model the coordinates of β as samples from a two-component mixture (1 − ε)ν0 + εν, and the rows of X as samples from N(0, (1/n)Ω), where ν0 is the point mass at 0, ν is a distribution, and Ω is a p by p correlation matrix which is unknown but is presumably sparse. We propose a two-stage variable selection procedure which we call the UPS. This is a Screen and Clean procedure [32], in which we screen with Univariate thresholding and clean with the Penalized MLE. In many situations, the UPS possesses two important properties: Sure Screening and Separable After Screening (SAS). These properties enable us to reduce the original regression problem to many small-size regression problems that can be fitted separately.
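
A heavily simplified sketch of a Screen-and-Clean selector in this spirit: it screens by univariate thresholding and then cleans the (assumed small) set of survivors with an L0-penalized least-squares fit. The real UPS first splits survivors into small, nearly uncorrelated groups; all thresholds, penalties, names, and the toy data below are illustrative assumptions.

```python
import itertools
import numpy as np

def screen_and_clean(X, y, t_screen=4.0, lam=None, max_support=8):
    """Sketch: screen with univariate thresholding on X'y, then clean by
    exhaustive L0-penalized least squares over the surviving variables."""
    n, p = X.shape
    if lam is None:
        lam = 2 * np.log(p)                       # BIC-like penalty per variable
    w = X.T @ y                                   # univariate screening scores
    survivors = np.where(np.abs(w) > t_screen)[0]
    beta = np.zeros(p)
    best_score, best = np.sum(y ** 2), None       # baseline: empty model
    for k in range(1, min(len(survivors), max_support) + 1):
        for sub in itertools.combinations(survivors, k):
            Xs = X[:, list(sub)]
            coef, *_ = np.linalg.lstsq(Xs, y, rcond=None)
            penalized = np.sum((y - Xs @ coef) ** 2) + lam * k
            if penalized < best_score:
                best_score, best = penalized, (list(sub), coef)
    if best is not None:
        beta[best[0]] = best[1]
    return np.flatnonzero(beta), beta

# Toy sparse regression with p > n.
rng = np.random.default_rng(4)
n, p = 200, 400
X = rng.normal(size=(n, p)) / np.sqrt(n)          # roughly unit-norm columns
beta_true = np.zeros(p); beta_true[:5] = 8.0
y = X @ beta_true + rng.normal(size=n)
print(screen_and_clean(X, y)[0])                  # selected indices (ideally 0..4)
```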


Proceedings of the National Academy of Sciences of the United States of America | 2009

Impossibility of successful classification when useful features are rare and weak

Jiashun Jin

We study a two-class classification problem with a large number of features, out of which many are useless and only a few are useful, but we do not know which ones they are. The number of features is large compared with the number of training observations. Calibrating the model with four key parameters—the number of features, the size of the training sample, and the fraction and strength of useful features—we identify a region in parameter space where no trained classifier can reliably separate the two classes on fresh data. The complement of this region—where successful classification is possible—is also briefly discussed.

Collaboration


Dive into Jiashun Jin's collaborations.

Top Co-Authors

T. Tony Cai
University of Pennsylvania

Zhigang Yao
National University of Singapore

Peter Hall
University of California

Dan Shu
University of Kentucky