Publication


Featured research published by Hidetoshi Shimodaira.


Systematic Biology | 2002

An Approximately Unbiased Test of Phylogenetic Tree Selection

Hidetoshi Shimodaira

An approximately unbiased (AU) test that uses a newly devised multiscale bootstrap technique was developed for general hypothesis testing of regions in an attempt to reduce test bias. It was applied to maximum-likelihood tree selection for obtaining the confidence set of trees. The AU test is based on the theory of Efron et al. (Proc. Natl. Acad. Sci. USA 93:13429-13434; 1996), but the new method provides higher-order accuracy yet simpler implementation. The AU test, like the Shimodaira-Hasegawa (SH) test, adjusts the selection bias overlooked in the standard use of the bootstrap probability and Kishino-Hasegawa tests. The selection bias comes from comparing many trees at the same time and often leads to overconfidence in the wrong trees. The SH test, though safe to use, may exhibit another type of bias such that it appears conservative. Here I show that the AU test is less biased than other methods in typical cases of tree selection. These points are illustrated in a simulation study as well as in the analysis of mammalian mitochondrial protein sequences. The theoretical argument provides a simple formula that covers the bootstrap probability test, the Kishino-Hasegawa test, the AU test, and the Zharkikh-Li test. A practical suggestion is provided as to which test should be used under particular circumstances.


Bioinformatics | 2001

CONSEL: for assessing the confidence of phylogenetic tree selection

Hidetoshi Shimodaira; Masami Hasegawa

CONSEL is a program that assesses the confidence of tree selection by computing p-values for the trees. The main thrust of the program is to calculate the p-value of the approximately unbiased (AU) test using the multiscale bootstrap technique. This p-value is less biased than other conventional p-values, such as the bootstrap probability (BP) and those of the Kishino-Hasegawa (KH) test, the Shimodaira-Hasegawa (SH) test, and the weighted Shimodaira-Hasegawa (WSH) test. CONSEL calculates all of these p-values from the output of phylogeny program packages such as Molphy, PAML, and PAUP*. Furthermore, CONSEL is applicable to a wide class of problems where BPs are available. AVAILABILITY: The programs are written in C. The source code for Unix and the executable binary for DOS are available at http://www.ism.ac.jp/~shimo/


Bioinformatics | 2006

Pvclust: an R package for assessing the uncertainty in hierarchical clustering

Ryota Suzuki; Hidetoshi Shimodaira

Pvclust is an add-on package for the statistical software R that assesses the uncertainty in hierarchical cluster analysis. Pvclust can be applied easily to general statistical problems, such as DNA microarray analysis, to perform the bootstrap analysis of clustering that has been popular in phylogenetic analysis. Pvclust calculates probability values (p-values) for each cluster using bootstrap resampling techniques. Two types of p-values are available: the approximately unbiased (AU) p-value and the bootstrap probability (BP) value. Multiscale bootstrap resampling is used to calculate the AU p-value, which is less biased than the BP value calculated by ordinary bootstrap resampling. In addition, the computation time can be greatly reduced with the parallel computing option.


Journal of Statistical Planning and Inference | 2000

Improving predictive inference under covariate shift by weighting the log-likelihood function

Hidetoshi Shimodaira

A class of predictive densities is derived by weighting the observed samples in maximizing the log-likelihood function. This approach is effective in cases such as sample surveys or design of experiments, where the observed covariate follows a different distribution than that in the whole population. Under misspecification of the parametric model, the optimal choice of the weight function is asymptotically shown to be the ratio of the density function of the covariate in the population to that in the observations. This is the pseudo-maximum likelihood estimation of sample surveys. The optimality is defined by the expected Kullback–Leibler loss, and the optimal weight is obtained by considering the importance sampling identity. Under correct specification of the model, however, the ordinary maximum likelihood estimate (i.e. the uniform weight) is shown to be optimal asymptotically. For moderate sample size, the situation is in between the two extreme cases, and the weight function is selected by minimizing a variant of the information criterion derived as an estimate of the expected loss. The method is also applied to a weighted version of the Bayesian predictive density. Numerical examples as well as Monte-Carlo simulations are shown for polynomial regression. A connection with the robust parametric estimation is discussed.
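The weighting idea in this abstract can be illustrated with a minimal sketch. It is not the paper's code: the true regression function, the two covariate densities, the noise level, the sample size, and every function name below are assumptions chosen for illustration. For a Gaussian regression model with fixed variance, maximizing the weighted log-likelihood over the mean parameters reduces to weighted least squares, so the sketch fits a deliberately misspecified straight line twice, once with uniform weights and once with the density-ratio weights.

```python
import math
import random


def normal_pdf(x, mean, sd):
    """Density of N(mean, sd^2) at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def weighted_line_fit(xs, ys, ws):
    """Weighted least squares for y = a + b*x.

    Under a Gaussian model with fixed variance this maximizes the
    weighted log-likelihood sum_i w_i * log p(y_i | x_i, a, b).
    """
    sw = sum(ws)
    mx = sum(w * x for w, x in zip(ws, xs)) / sw
    my = sum(w * y for w, y in zip(ws, ys)) / sw
    sxx = sum(w * (x - mx) ** 2 for w, x in zip(ws, xs))
    sxy = sum(w * (x - mx) * (y - my) for w, x, y in zip(ws, xs, ys))
    b = sxy / sxx
    a = my - b * mx
    return a, b


def compare_fits(n=2000, seed=0):
    """Return (uniform-weight MSE, density-ratio-weight MSE) on the target population."""
    rng = random.Random(seed)

    def true_f(x):
        # Assumed cubic truth, so the straight-line model is misspecified.
        return -x + x ** 3

    # Assumed densities: observations from N(0.5, 0.5^2), population N(0, 0.3^2).
    obs_mean, obs_sd = 0.5, 0.5
    pop_mean, pop_sd = 0.0, 0.3

    xs = [rng.gauss(obs_mean, obs_sd) for _ in range(n)]
    ys = [true_f(x) + rng.gauss(0.0, 0.3) for x in xs]

    # Weight = population covariate density / observed covariate density.
    ratio_ws = [normal_pdf(x, pop_mean, pop_sd) / normal_pdf(x, obs_mean, obs_sd)
                for x in xs]
    uniform_ws = [1.0] * n

    fit_uniform = weighted_line_fit(xs, ys, uniform_ws)
    fit_ratio = weighted_line_fit(xs, ys, ratio_ws)

    # Evaluate both fitted lines against the truth on fresh population covariates.
    x_pop = [rng.gauss(pop_mean, pop_sd) for _ in range(n)]

    def mse(coef):
        a, b = coef
        return sum((a + b * x - true_f(x)) ** 2 for x in x_pop) / len(x_pop)

    return mse(fit_uniform), mse(fit_ratio)


if __name__ == "__main__":
    mse_uniform, mse_ratio = compare_fits()
    print("uniform weights :", mse_uniform)
    print("density ratio   :", mse_ratio)
```

In this assumed setup the density-ratio fit should have the smaller error on the target population, because the line is a poor global approximation to the cubic. If the model were correctly specified, the abstract's point is that uniform weights would be the better choice asymptotically.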


Annals of Statistics | 2004

Approximately unbiased tests of regions using multistep-multiscale bootstrap resampling

Hidetoshi Shimodaira

Approximately unbiased tests based on bootstrap probabilities are considered for the exponential family of distributions with unknown expectation parameter vector, where the null hypothesis is represented as an arbitrary-shaped region with smooth boundaries. This problem has been discussed previously in Efron and Tibshirani [Ann. Statist. 26 (1998) 1687–1718], and a corrected p-value with second-order asymptotic accuracy is calculated by the two-level bootstrap of Efron, Halloran and Holmes [Proc. Natl. Acad. Sci. U.S.A. 93 (1996) 13429–13434] based on the ABC bias correction of Efron [J. Amer. Statist. Assoc. 82 (1987) 171–185]. Our argument is an extension of their asymptotic theory, where the geometry, such as the signed distance and the curvature of the boundary, plays an important role. We give another calculation of the corrected p-value without finding the “nearest point” on the boundary to the observation, which is required in the two-level bootstrap and is an implementational burden in complicated problems. The key idea is to alter the sample size of the replicated dataset from that of the observed dataset. The frequency of the replicates falling in the region is counted for several sample sizes, and then the p-value is calculated by looking at the change in the frequencies along the changing sample sizes. This is the multiscale bootstrap of Shimodaira [Systematic Biology 51 (2002) 492–508], which is third-order accurate for the multivariate normal model. Here we introduce a newly devised multistep-multiscale bootstrap, calculating a third-order accurate p-value for the exponential family of distributions. In fact, our p-value is asymptotically equivalent to those obtained by the double bootstrap of Hall [The Bootstrap and Edgeworth Expansion (1992) Springer, New York] and the modified signed likelihood ratio of Barndorff-Nielsen [Biometrika 73 (1986) 307–322] ignoring O(n^{-3/2}) terms, yet the computation is less demanding and free from model specification. The algorithm is remarkably simple despite the complexity of the theory behind it. The differences of the p-values are illustrated in simple examples, and the accuracies of the bootstrap methods are shown in a systematic way.
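The single-step multiscale bootstrap described above (count how often replicates fall in the region at several sample sizes, then read the p-value off the trend) can be sketched for the multivariate normal model as follows. This is a minimal illustration under stated assumptions, not the paper's multistep procedure: the ball-shaped region, the observation, the list of scales, the replicate count, and all function names are hypothetical, and the two coefficients are fitted by plain rather than weighted least squares. Changing the replicate sample size from n to n' is represented by the variance scale sigma2 = n / n'. The sketch relies on the relation sigma * z(sigma2) ≈ v + c * sigma2, where z = -Phi^{-1}(BP), v plays the role of the signed distance and c the curvature term. The ordinary bootstrap probability is then 1 - Phi(v + c) and the approximately unbiased p-value is 1 - Phi(v - c).

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()

# Assumed scales sigma2 = n / n' for relative sample sizes n'/n = 0.5, ..., 1.4.
DEFAULT_SCALES = [1.0 / r for r in (0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.1, 1.2, 1.3, 1.4)]


def in_ball(point, radius=2.0):
    """Hypothetical null region: a ball of the given radius around the origin."""
    return math.sqrt(sum(p * p for p in point)) <= radius


def bootstrap_probability(y, in_region, sigma2, n_rep, rng):
    """Fraction of replicates Y* ~ N(y, sigma2 * I) that fall in the region."""
    sd = math.sqrt(sigma2)
    hits = 0
    for _ in range(n_rep):
        replicate = [yi + sd * rng.gauss(0.0, 1.0) for yi in y]
        if in_region(replicate):
            hits += 1
    # Keep the frequency strictly inside (0, 1) so the normal quantile is finite.
    return min(max(hits / n_rep, 0.5 / n_rep), 1.0 - 0.5 / n_rep)


def multiscale_bootstrap(y, in_region, scales, n_rep=5000, seed=0):
    """Fit sigma * z = v + c * sigma2 across scales and return v, c, BP and AU."""
    rng = random.Random(seed)
    xs, ts = [], []
    for sigma2 in scales:
        bp = bootstrap_probability(y, in_region, sigma2, n_rep, rng)
        z = -STD_NORMAL.inv_cdf(bp)
        xs.append(sigma2)
        ts.append(math.sqrt(sigma2) * z)

    # Ordinary least squares for the line ts = v + c * xs.
    mx = sum(xs) / len(xs)
    mt = sum(ts) / len(ts)
    c = sum((x - mx) * (t - mt) for x, t in zip(xs, ts)) / sum((x - mx) ** 2 for x in xs)
    v = mt - c * mx

    return {
        "v": v,
        "c": c,
        "bp": 1.0 - STD_NORMAL.cdf(v + c),  # fitted bootstrap probability at sigma2 = 1
        "au": 1.0 - STD_NORMAL.cdf(v - c),  # approximately unbiased p-value
    }


if __name__ == "__main__":
    observation = [3.0, 0.0, 0.0]  # assumed observation lying outside the ball
    print(multiscale_bootstrap(observation, in_ball, DEFAULT_SCALES))
```

For a convex region such as this ball, the fitted c should be positive, so the AU p-value comes out larger than the bootstrap probability. No nearest boundary point is computed anywhere; only region membership of the replicates is needed.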


Annals of the Institute of Statistical Mathematics | 1998

An Application of Multiple Comparison Techniques to Model Selection

Hidetoshi Shimodaira

Akaike's information criterion (AIC) is widely used to estimate the best model from a given candidate set of parameterized probabilistic models. In this paper, considering the sampling error of AIC, a set of good models is constructed rather than choosing a single model. This set is called a confidence set of models, which includes the minimum ε{AIC} model at an error rate smaller than the specified significance level. The result is given as a P-value for each model, from which the confidence set is immediately obtained. A variant of Gupta's subset selection procedure is devised, in which a standardized difference of AIC is calculated for every pair of models. The critical constants are computed by the Monte-Carlo method, where the asymptotic normal approximation of AIC is used. The proposed method neither requires the full model nor assumes a hierarchical structure of models, and it has higher power than similar existing methods.


Journal of Statistical Planning and Inference | 2003

Active learning algorithm using the maximum weighted log-likelihood estimator

Takafumi Kanamori; Hidetoshi Shimodaira

We study the problem of constructing designs for regression. Our aim is to estimate the mean value of the response variable. The distribution of the independent variable is chosen from among the continuous designs so as to decrease the integrated mean squared error (IMSE) of the fitted values. When using such a design, we face the obstacle that the true regression function may not belong to the statistical model; that is, the model may be misspecified. Under misspecification, the estimate of the mean response obtained with the design is biased. We propose a new method for constructing a design that does not incur this bias even when the statistical model is misspecified. The standard construction of the design uses the maximum log-likelihood estimator (MLE); we instead use the maximum weighted log-likelihood estimator (MWLE). The design with the MLE increases the bias under misspecification, whereas the MWLE corrects the MLE and decreases the bias term of the IMSE. We report numerical experiments that illustrate the efficiency of the proposed methods.


Communications in Statistics - Theory and Methods | 2001

MULTIPLE COMPARISONS OF LOG-LIKELIHOODS AND COMBINING NONNESTED MODELS WITH APPLICATIONS TO PHYLOGENETIC TREE SELECTION

Hidetoshi Shimodaira

We consider multiple comparisons of log-likelihoods to take account of the multiplicity of testing in the selection of nonnested models. A resampling version of the Gupta procedure for the selection problem is used to obtain a set of good models, which are not significantly worse than the maximum likelihood model; i.e., a confidence set of models. Our method is to test which model is better than the other, while the object of the classical testing methods is to find the correct model. Thus the null hypotheses behind these two approaches are very different. Our method and the other commonly used approaches, such as the approximate Bayesian posterior, the bootstrap selection probability, and the LR test against the full model, are applied to the selection of the molecular phylogenetic tree of mammal species. Tree selection is a version of model-based clustering, which is an example of nonnested model selection. It is shown that the structure of the tree selection problem is equivalent to that of the variable selection problem in multiple regression with some constraints on the combinations of the variables. It turns out that the LR test rejects all the possible trees because of the misspecification of the models, whereas our method gives a reasonable confidence set. For a better understanding of the uncertainty in the selection, we combine the maximum likelihood estimates (MLEs) of the trees to obtain the full model that includes the trees as submodels by using a linear approximation of the parametric models. The MLE of the phylogeny is then represented as a network of species rather than a tree. A geometrical interpretation of the problem is also discussed.


Archive | 1994

A new criterion for selecting models from partially observed data

Hidetoshi Shimodaira

A new criterion, PDIO (predictive divergence for indirect observation models), is proposed for selecting statistical models from partially observed data. PDIO is devised for “indirect observation models”, in which observations are only available indirectly through random variables. That is, some underlying hidden structure is assumed to generate the manifest variables. Examples include unsupervised learning recognition systems, clustering, latent structure analysis, mixture distribution models, missing data, noisy observations, and models whose maximum likelihood estimator is based on the EM (expectation-maximization) algorithm. PDIO is a natural extension of AIC (Akaike’s information criterion), and the two criteria are equivalent when direct observations are available. Both criteria are expressed as the sum of two terms: the first term represents the goodness of fit of the model to the observed data, and the second term represents the model complexity. The goodness of fit terms are equivalent in both criteria, but the complexity terms are different. The complexity term is a function of model structure and the number of samples and is added in order to take into account the reliability of the observed data. A mean fluctuation of the estimated true distribution is used as the model complexity in PDIO. The relative relation of the “model manifold” and the “observed manifold” is, therefore, reflected in the complexity term of PDIO from the information geometric point of view, whereas it reduces to the number of parameters in AIC. PDIO is unique in dealing with the unobservable underlying structure “positively.” In this paper the generalized expression of PDIO is shown using two Fisher information matrices. An approximated computation method for PDIO is also presented utilizing EM iterates. Some computer simulations are shown to demonstrate how this criterion works.


PLOS ONE | 2010

A scale-free structure prior for graphical models with applications in functional genomics.

Paul Sheridan; Takeshi Kamimura; Hidetoshi Shimodaira

The problem of reconstructing large-scale gene regulatory networks from gene expression data has garnered considerable attention in bioinformatics over the past decade, with the graphical modeling paradigm having emerged as a popular framework for inference. Analysis in a full Bayesian setting is contingent upon the assignment of a so-called structure prior: a probability distribution on networks, encoding a priori biological knowledge either in the form of supplemental data or high-level topological features. A key topological consideration is that a wide range of cellular networks are approximately scale-free, meaning that the fraction, P(k), of nodes in a network with degree k is roughly described by a power-law P(k) ∝ k^{-γ} with exponent γ between 2 and 3. The standard practice, however, is to utilize a random structure prior, which favors networks with binomially distributed degree distributions. In this paper, we introduce a scale-free structure prior for graphical models based on the formula for the probability of a network under a simple scale-free network model. Unlike the random structure prior, its scale-free counterpart requires a node labeling as a parameter. In order to use this prior for large-scale network inference, we design a novel Metropolis-Hastings sampler for graphical models that includes a node labeling as a state space variable. In a simulation study, we demonstrate that the scale-free structure prior outperforms the random structure prior at recovering scale-free networks while at the same time retaining the ability to recover random networks. We then estimate a gene association network from gene expression data taken from a breast cancer tumor study, showing that the scale-free structure prior recovers hubs, including the previously unknown hub SLC39A6, which is a zinc transporter that has been implicated in the spread of breast cancer to the lymph nodes. Our analysis of the breast cancer expression data underscores the value of the scale-free structure prior as an instrument to aid in the identification of candidate hub genes with the potential to direct the hypotheses of molecular biologists, and thus drive future experiments.

Collaboration

Top Co-Authors

Paul Sheridan
Northern General Hospital

Takeshi Kamimura
Tokyo Institute of Technology

Hidemoto Nakada
Tokyo Institute of Technology

Ryota Suzuki
Tokyo Institute of Technology