Jonathan M. Keith | Researchain

Publication

Featured research published by Jonathan M. Keith.

IEEE/ACM Transactions on Computational Biology and Bioinformatics | 2011

Methods for Identifying SNP Interactions: A Review on Variations of Logic Regression, Random Forest and Bayesian Logistic Regression

Carla Chen; Holger Schwender; Jonathan M. Keith; Robin Nunkesser; Kerrie Mengersen; Paula E. Macrossan

Due to advancements in computational ability, enhanced technology and a reduction in the price of genotyping, more data are being generated for understanding genetic associations with diseases and disorders. However, with the availability of large data sets come the inherent challenges of new methods of statistical analysis and modeling. Considering that a complex phenotype may be the effect of a combination of multiple loci, various statistical methods have been developed for identifying genetic epistasis effects. Among these methods, logic regression (LR) is an intriguing approach incorporating tree-like structures. Various methods have built on the original LR to improve different aspects of the model. In this study, we review four variations of LR, namely Logic Feature Selection, Monte Carlo Logic Regression, Genetic Programming for Association Studies, and Modified Logic Regression-Gene Expression Programming, and investigate the performance of each method using simulated and real genotype data. We contrast these with another tree-like approach, namely Random Forests, and a Bayesian logistic regression with stochastic search variable selection.

Methodology and Computing in Applied Probability | 2004

A Generalized Markov Sampler

Jonathan M. Keith; Dirk P. Kroese; Darryn E. Bryant

A recent development of the Markov chain Monte Carlo (MCMC) technique is the emergence of MCMC samplers that allow transitions between different models. Such samplers make possible a range of computational tasks involving models, including model selection, model evaluation, model averaging and hypothesis testing. An example of this type of sampler is the reversible jump MCMC sampler, which is a generalization of the Metropolis–Hastings algorithm. Here, we present a new MCMC sampler of this type. The new sampler is a generalization of the Gibbs sampler, but somewhat surprisingly, it also turns out to encompass as particular cases all of the well-known MCMC samplers, including those of Metropolis, Barker, and Hastings. Moreover, the new sampler generalizes the reversible jump MCMC. It therefore appears to be a very general framework for MCMC sampling. This paper describes the new sampler and illustrates its use in three applications in Computational Biology, specifically determination of consensus sequences, phylogenetic inference and delineation of isochores via multiple change-point analysis.
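As a point of reference for the samplers named in this abstract, the classic Metropolis sampler (one of the special cases the abstract mentions) can be sketched in a few lines. This is a minimal illustration, not the generalized sampler from the paper; the function name `metropolis`, the Gaussian random-walk proposal and the standard normal target in the usage note are all assumptions chosen for the example.

```python
import math
import random


def metropolis(log_target, x0, step, n_samples, seed=0):
    """Random-walk Metropolis sampler for a one-dimensional target.

    log_target: function returning the log of an unnormalized density.
    step: standard deviation of the symmetric Gaussian proposal (assumed).
    """
    rng = random.Random(seed)
    x, log_p = x0, log_target(x0)
    samples = []
    for _ in range(n_samples):
        # Propose a move from a symmetric proposal centred on the current state.
        y = x + rng.gauss(0.0, step)
        log_q = log_target(y)
        # Accept with probability min(1, target(y) / target(x)).
        if rng.random() < math.exp(min(0.0, log_q - log_p)):
            x, log_p = y, log_q
        samples.append(x)
    return samples
```

For example, `metropolis(lambda x: -0.5 * x * x, 0.0, 1.0, 20000)` draws a correlated sample whose mean and variance should be close to those of a standard normal distribution.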

Winter Simulation Conference | 2002

Sequence alignment by rare event simulation

Jonathan M. Keith; Dirk P. Kroese

We present a new stochastic method for finding the optimal alignment of DNA sequences. The method works by generating random paths through a graph (the edit graph) according to a Markov chain. Each path is assigned a score, and these scores are used to modify the transition probabilities of the Markov chain. This procedure converges to a fixed path through the graph, corresponding to the optimal (or near-optimal) sequence alignment. The rules with which to update the transition probabilities are based on Rubinstein's (1999, 2000) cross-entropy method, a new technique for stochastic optimization. This leads to very simple and natural updating formulas. Due to its versatility, mathematical tractability and simplicity, the method has great potential for a large class of combinatorial optimization problems, in particular in biological sciences.
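The procedure described in this abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the scoring scheme (match +1, mismatch -1, gap -1), the sample size, the elite fraction, the smoothing parameter `alpha` and every function name are hypothetical choices made for the example.

```python
import random

# Moves through the edit graph: diagonal (align two symbols),
# down (gap in b), right (gap in a).
MOVES = ((1, 1), (1, 0), (0, 1))


def score_path(a, b, path, match=1, mismatch=-1, gap=-1):
    """Score an alignment given as a list of move indices into MOVES."""
    i = j = 0
    total = 0
    for k in path:
        if k == 0:
            total += match if a[i] == b[j] else mismatch
        else:
            total += gap
        i += MOVES[k][0]
        j += MOVES[k][1]
    return total


def sample_path(probs, n, m, rng):
    """Walk from node (0, 0) to node (n, m) using the current move probabilities."""
    i = j = 0
    path = []
    while i < n or j < m:
        allowed = [k for k, (di, dj) in enumerate(MOVES)
                   if i + di <= n and j + dj <= m]
        weights = [probs[i][j][k] for k in allowed]
        k = rng.choices(allowed, weights=weights)[0]
        path.append(k)
        i += MOVES[k][0]
        j += MOVES[k][1]
    return path


def ce_align(a, b, n_samples=200, elite_frac=0.1, alpha=0.7, iters=30, seed=0):
    """Cross-entropy style search for a high-scoring alignment of a and b."""
    rng = random.Random(seed)
    n, m = len(a), len(b)
    # One probability vector over the three moves at every node.
    probs = [[[1.0 / 3.0] * 3 for _ in range(m + 1)] for _ in range(n + 1)]
    n_elite = max(1, int(elite_frac * n_samples))
    best_score, best_path = float("-inf"), None
    for _ in range(iters):
        paths = [sample_path(probs, n, m, rng) for _ in range(n_samples)]
        scored = sorted(((score_path(a, b, p), p) for p in paths),
                        key=lambda t: t[0], reverse=True)
        if scored[0][0] > best_score:
            best_score, best_path = scored[0]
        # Count which move the elite paths took at each node they visited.
        counts = {}
        for _, p in scored[:n_elite]:
            i = j = 0
            for k in p:
                counts.setdefault((i, j), [0, 0, 0])[k] += 1
                i += MOVES[k][0]
                j += MOVES[k][1]
        # Smoothed update of the probabilities toward the elite frequencies.
        for (i, j), c in counts.items():
            total = sum(c)
            for k in range(3):
                probs[i][j][k] = alpha * c[k] / total + (1 - alpha) * probs[i][j][k]
    return best_score, best_path
```

Calling `ce_align("GATTACA", "GCATGCT")` returns the best score found and the corresponding list of moves. Because the search is stochastic, the result is not guaranteed to be the optimum that a dynamic-programming aligner would return.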

Journal of Computational Biology | 2006

Segmenting eukaryotic genomes with the Generalized Gibbs Sampler.

Jonathan M. Keith

Eukaryotic genomes display segmental patterns of variation in various properties, including GC content and degree of evolutionary conservation. DNA segmentation algorithms are aimed at identifying statistically significant boundaries between such segments. Such algorithms may provide a means of discovering new classes of functional elements in eukaryotic genomes. This paper presents a model and an algorithm for Bayesian DNA segmentation and considers the feasibility of using it to segment whole eukaryotic genomes. The algorithm is tested on a range of simulated and real DNA sequences, and the following conclusions are drawn. Firstly, the algorithm correctly identifies non-segmented sequence, and can thus be used to reject the null hypothesis of uniformity in the property of interest. Secondly, estimates of the number and locations of change-points produced by the algorithm are robust to variations in algorithm parameters and initial starting conditions and correspond to real features in the data. Thirdly, the algorithm is successfully used to segment human chromosome 1 according to GC content, thus demonstrating the feasibility of Bayesian segmentation of eukaryotic genomes. The software described in this paper is available from the author's website (www.uq.edu.au/~uqjkeith/) or upon request to the author.

Journal of Computational Biology | 2008

Delineating slowly and rapidly evolving fractions of the Drosophila genome

Jonathan M. Keith; Peter Adams; Stuart Stephen; John S. Mattick

Evolutionary conservation is an important indicator of function and a major component of bioinformatic methods to identify non-protein-coding genes. We present a new Bayesian method for segmenting pairwise alignments of eukaryotic genomes while simultaneously classifying segments into slowly and rapidly evolving fractions. We also describe an information criterion similar to the Akaike Information Criterion (AIC) for determining the number of classes. Working with pairwise alignments enables detection of differences in conservation patterns among closely related species. We analyzed three whole-genome and three partial-genome pairwise alignments among eight Drosophila species. Three distinct classes of conservation level were detected. Sequences comprising the most slowly evolving component were consistent across a range of species pairs, and constituted approximately 62-66% of the D. melanogaster genome. Almost all (>90%) of the aligned protein-coding sequence is in this fraction, suggesting much of it (comprising the majority of the Drosophila genome, including approximately 56% of non-protein-coding sequences) is functional. The size and content of the most rapidly evolving component was species dependent, and varied from 1.6% to 4.8%. This fraction is also enriched for protein-coding sequence (while containing significant amounts of non-protein-coding sequence), suggesting it is under positive selection. We also classified segments according to conservation and GC content simultaneously. This analysis identified numerous sub-classes of those identified on the basis of conservation alone, but was nevertheless consistent with that classification. Software, data, and results available at www.maths.qut.edu.au/-keithj/. Genomic segments comprising the conservation classes available in BED format.

Molecular Biology and Evolution | 2010

Multiple evolutionary rate classes in animal genome evolution

Christopher Oldmeadow; Kerrie Mengersen; John S. Mattick; Jonathan M. Keith

The proportion of functional sequence in the human genome is currently a subject of debate. The most widely accepted figure is that approximately 5% is under purifying selection. In Drosophila, estimates are an order of magnitude higher, though this corresponds to a similar quantity of sequence. These estimates depend on the difference between the distribution of genomewide evolutionary rates and that observed in a subset of sequences presumed to be neutrally evolving. Motivated by the widening gap between these estimates and experimental evidence of genome function, especially in mammals, we developed a sensitive technique for evaluating such distributions and found that they are much more complex than previously apparent. We found strong evidence for at least nine well-resolved evolutionary rate classes in an alignment of four Drosophila species and at least seven classes in an alignment of four mammals, including human. We also identified at least three rate classes in human ancestral repeats. By positing that the largest of these ancestral repeat classes is neutrally evolving, we estimate that the proportion of nonneutrally evolving sequence is 30% of human ancestral repeats and 45% of the aligned portion of the genome. However, we also question whether any of the classes represent neutrally evolving sequences and argue that a plausible alternative is that they reflect variable structure-function constraints operating throughout the genomes of complex organisms.

Proceedings of the National Academy of Sciences of the United States of America | 2013

Agent-based Bayesian approach to monitoring the progress of invasive species eradication programs

Jonathan M. Keith; Daniel Spring

Eradication of an invasive species can provide significant environmental, economic, and social benefits, but eradication programs often fail. Constant and careful monitoring improves the chance of success, but an invasion may seem to be in decline even when it is expanding in abundance or spatial extent. Determining whether an invasion is in decline is a challenging inference problem for two reasons. First, it is typically infeasible to regularly survey the entire infested region owing to high cost. Second, surveillance methods are imperfect and fail to detect some individuals. These two factors also make it difficult to determine why an eradication program is failing. Agent-based methods enable inferences to be made about the locations of undiscovered individuals over time to identify trends in invader abundance and spatial extent. We develop an agent-based Bayesian method and apply it to Australia’s largest eradication program: the campaign to eradicate the red imported fire ant (Solenopsis invicta) from Brisbane. The invasion was deemed to be almost eradicated in 2004 but our analyses indicate that its geographic range continued to expand despite a sharp decline in number of nests. We also show that eradication would probably have been achieved with a relatively small increase in the area searched and treated. Our results demonstrate the importance of inferring temporal and spatial trends in ongoing invasions. The method can handle incomplete observations and takes into account the effects of human intervention. It has the potential to transform eradication practices.

Winter Simulation Conference | 2007

Parallel cross-entropy optimization

Gareth Evans; Jonathan M. Keith; Dirk P. Kroese

The cross-entropy (CE) method is a modern and effective optimization method well suited to parallel implementations. There is a vast array of problems today, some of which are highly complex and can take weeks or even longer to solve using current optimization techniques. This paper presents a general method for designing parallel CE algorithms for multiple instruction multiple data (MIMD) distributed memory machines using the message passing interface (MPI) library routines. We provide examples of its performance for two well-known test-cases: the (discrete) Max-Cut problem and (continuous) Rosenbrock problem. Speedup factors and a comparison to sequential CE methods are reported.
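For orientation, a sequential CE loop on the continuous Rosenbrock test case might look like the sketch below. It is not the parallel algorithm from the paper: it runs on a single process, uses independent Gaussian sampling per coordinate, and the names `ce_minimize` and `rosenbrock` as well as the sample size, elite count and iteration count are assumptions made for illustration.

```python
import random
import statistics


def rosenbrock(x):
    """Rosenbrock function; its global minimum is 0 at x = (1, ..., 1)."""
    return sum(100.0 * (x[i + 1] - x[i] ** 2) ** 2 + (1.0 - x[i]) ** 2
               for i in range(len(x) - 1))


def ce_minimize(f, mean, std, n_samples=100, n_elite=10, iters=50, seed=0):
    """Sequential cross-entropy minimization with a diagonal Gaussian model."""
    rng = random.Random(seed)
    mean, std = list(mean), list(std)
    best_x, best_f = list(mean), f(mean)
    for _ in range(iters):
        # Sample candidates from the current Gaussian model.
        samples = [[rng.gauss(m, s) for m, s in zip(mean, std)]
                   for _ in range(n_samples)]
        # Evaluating and ranking the candidates is the costly step.
        samples.sort(key=f)
        elite = samples[:n_elite]
        elite_f = f(elite[0])
        if elite_f < best_f:
            best_x, best_f = elite[0], elite_f
        # Refit the model to the elite samples, one coordinate at a time.
        for d in range(len(mean)):
            column = [x[d] for x in elite]
            mean[d] = statistics.fmean(column)
            std[d] = statistics.pstdev(column)
    return best_x, best_f
```

A call such as `ce_minimize(rosenbrock, [-1.5, 2.0], [2.0, 2.0])` returns the best point found and its function value. In a loop of this shape the candidate evaluations within one iteration are independent of each other, which is what makes the method a natural fit for distribution across processes.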

Congress on Evolutionary Computation | 2007

Bayesian inference in estimation of distribution algorithms

Marcus Gallagher; Ian A. Wood; Jonathan M. Keith; George Y. Sofronov

Metaheuristics such as Estimation of Distribution Algorithms and the Cross-Entropy method use probabilistic modelling and inference to generate candidate solutions in optimization problems. The model fitting task in this class of algorithms has largely been carried out to date based on maximum likelihood. An alternative approach that is prevalent in statistics and machine learning is to use Bayesian inference. In this paper, we provide a framework for the application of Bayesian inference techniques in probabilistic model-based optimization. Based on this framework, a simple continuous Bayesian Estimation of Distribution Algorithm is described. We evaluate and compare this algorithm experimentally with its maximum likelihood equivalent, UMDA_c^G.

PLOS ONE | 2012

Computational Characterization of 3′ Splice Variants in the GFAP Isoform Family

Sarah E. Boyd; Betina Nair; Sze Woei Ng; Jonathan M. Keith; Jacqueline M. Orian

Glial fibrillary acidic protein (GFAP) is an intermediate filament (IF) protein specific to central nervous system (CNS) astrocytes. It has been the subject of intense interest due to its association with neurodegenerative diseases, and because of growing evidence that IF proteins not only modulate cellular structure, but also cellular function. Moreover, GFAP has a family of splicing isoforms apparently more complex than that of other CNS IF proteins, consistent with it possessing a range of functional and structural roles. The gene consists of 9 exons, and to date all isoforms associated with 3′ end splicing have been identified from modifications within intron 7, resulting in the generation of exon 7a (GFAPδ/ε) and 7b (GFAPκ). To better understand the nature and functional significance of variation in this region, we used a Bayesian multiple change-point approach to identify conserved regions. This is the first successful application of this method to a single gene – it has previously only been used in whole-genome analyses. We identified several highly or moderately conserved regions throughout the intron 7/7a/7b regions, including untranslated regions and regulatory features, consistent with the biology of GFAP. Several putative unconfirmed features were also identified, including a possible new isoform. We then integrated multiple computational analyses on both the DNA and protein sequences from the mouse, rat and human, showing that the major isoform, GFAPα, has highly conserved structure and features across the three species, whereas the minor isoforms GFAPδ/ε and GFAPκ have low conservation of structure and features at the distal 3′ end, both relative to each other and relative to GFAPα. The overall picture suggests distinct and tightly regulated functions for the 3′ end isoforms, consistent with complex astrocyte biology. The results illustrate a computational approach for characterising splicing isoform families, using both DNA and protein sequences.