Gesine Reinert | Researchain

Publication

Featured research published by Gesine Reinert.

Journal of Computational Biology | 2000

Probabilistic and Statistical Properties of Words: An Overview

Gesine Reinert; Sophie Schbath; Michael S. Waterman

In the following, an overview is given of statistical and probabilistic properties of words, as occurring in the analysis of biological sequences. Counts of occurrence, counts of clumps, and renewal counts are distinguished, and exact distributions as well as normal approximations, Poisson process approximations, and compound Poisson approximations are derived. Here, a sequence is modelled as a stationary ergodic Markov chain; a test for determining the appropriate order of the Markov chain is described. The convergence results take the error made by estimating the Markovian transition probabilities into account. The main tools involved are moment generating functions, martingales, Stein's method, and the Chen-Stein method. Similar results are given for occurrences of multiple patterns, and, as an example, the problem of unique recoverability of a sequence from SBH chip data is discussed. Special emphasis lies on disentangling the complicated dependence structure between word occurrences, due to self-overlap as well as due to overlap between words. The results can be used to derive approximate, and conservative, confidence intervals for tests.
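
The three kinds of word count distinguished in this abstract can be illustrated with a minimal Python sketch. The function names are hypothetical, and the definitions used are the usual ones, assumed here rather than quoted from the paper: occurrences may overlap, a clump is a maximal run of mutually overlapping occurrences, and renewals are non-overlapping occurrences counted greedily from the left.

```python
def occurrence_positions(seq, word):
    """Start positions of all (possibly overlapping) occurrences of word."""
    k = len(word)
    return [i for i in range(len(seq) - k + 1) if seq[i:i + k] == word]


def count_clumps(positions, k):
    """Number of clumps: a new clump starts whenever an occurrence does
    not overlap the previous occurrence (start gap of at least k)."""
    clumps = 0
    prev = None
    for p in positions:
        if prev is None or p - prev >= k:
            clumps += 1
        prev = p
    return clumps


def count_renewals(positions, k):
    """Number of renewals: scan left to right and count an occurrence
    only if it does not overlap the last counted occurrence."""
    renewals = 0
    next_free = 0
    for p in positions:
        if p >= next_free:
            renewals += 1
            next_free = p + k
    return renewals


seq, word = "ACACACATACA", "ACA"
pos = occurrence_positions(seq, word)
print(len(pos), count_clumps(pos, len(word)), count_renewals(pos, len(word)))
```

For the self-overlapping word ACA the three counts differ, which is the dependence structure the overview sets out to disentangle.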

Annals of Applied Probability | 1997

Stein's Method and the Zero Bias Transformation with Application to Simple Random Sampling

Larry B. Goldstein; Gesine Reinert

Let $W$ be a random variable with mean zero and variance $\sigma^2$. The distribution of a variate $W^*$, satisfying $E\,W f(W) = \sigma^2 E f'(W^*)$ for smooth functions $f$, exists uniquely and defines the zero bias transformation on the distribution of $W$. The zero bias transformation shares many interesting properties with the well known size bias transformation for non-negative variables, but is applied to variables taking on both positive and negative values. The transformation can also be defined on more general random objects. The relation between the transformation and the expression $w f'(w) - \sigma^2 f''(w)$ which appears in the Stein equation characterizing the mean zero, variance $\sigma^2$ normal $\sigma Z$ can be used to obtain bounds on the difference $E\{h(W/\sigma) - h(Z)\}$ for smooth functions $h$ by constructing the pair $(W, W^*)$ jointly on the same space. When $W$ is a sum of $n$ not necessarily independent variates, under certain conditions which include a vanishing third moment, bounds on this difference of the order $1/n$ for classes of smooth functions $h$ may be obtained. The technique is illustrated by an application to simple random sampling.
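
As a small worked illustration of the zero bias transformation (an example added here, not taken from the abstract), let $W$ take the values $+1$ and $-1$ with probability $1/2$ each, so that $W$ has mean zero and variance $1$. For smooth $f$, the fundamental theorem of calculus gives:

```latex
\[
  \mathbb{E}\,[W f(W)]
  = \tfrac{1}{2}\bigl(f(1) - f(-1)\bigr)
  = \tfrac{1}{2}\int_{-1}^{1} f'(u)\,du
  = \mathbb{E}\,[f'(U)],
  \qquad U \sim \mathrm{Uniform}[-1, 1].
\]
```

Comparing with the defining identity, the zero biased variate of this $W$ is uniform on $[-1, 1]$: the transformation turns a two-point law into a continuous one.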

Journal of Computational Biology | 2009

Alignment-Free Sequence Comparison (I): Statistics and Power

Gesine Reinert; David Chew; Fengzhu Sun; Michael S. Waterman

Large-scale comparison of the similarities between two biological sequences is a major issue in computational biology; a fast method, the D2 statistic, relies on the comparison of the k-tuple content for both sequences. Although it has been known for some years that the D2 statistic is not suitable for this task, as it tends to be dominated by single-sequence noise, to date no suitable adjustments have been proposed. In this article, we suggest two new variants of the D2 word count statistic, which we call D2S and D2*. For D2S, which is a self-standardized statistic, we show that the statistic is asymptotically normally distributed, when sequence lengths tend to infinity, and not dominated by the noise in the individual sequences. The second statistic, D2*, outperforms D2S in terms of power for detecting the relatedness between the two sequences in our examples; but although it is straightforward to simulate from the asymptotic distribution of D2*, we cannot provide a closed form for power calculations.
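
The D2 statistic compares the k-tuple content of two sequences. A minimal Python sketch, assuming the usual reading of D2 as the number of matching k-tuple pairs (the sum over words of the product of the two counts), could look as follows. The function names are hypothetical.

```python
from collections import Counter


def ktuple_counts(seq, k):
    """Count every (overlapping) k-tuple in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def d2(seq_a, seq_b, k):
    """D2: sum over words w of (count of w in seq_a) * (count of w in seq_b).
    Equivalently, the number of matching k-tuple pairs between the sequences."""
    counts_a = ktuple_counts(seq_a, k)
    counts_b = ktuple_counts(seq_b, k)
    return sum(c * counts_b[w] for w, c in counts_a.items() if w in counts_b)


print(d2("ACGT", "ACGA", 2))
```

Because each product uses raw counts, a word that is frequent in just one sequence can dominate the sum, which is one way to picture the single-sequence noise mentioned in the abstract.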

Annals of Probability | 2009

Multivariate normal approximation with Stein’s method of exchangeable pairs under a general linearity condition

Gesine Reinert; Adrian Röllin

In this paper we establish a multivariate exchangeable pairs approach within the framework of Stein's method to assess distributional distances to potentially singular multivariate normal distributions. By extending the statistics into a higher-dimensional space, we also propose an embedding method which allows for a normal approximation even when the corresponding statistics of interest do not lend themselves easily to Stein's exchangeable pairs approach. To illustrate the method, we provide the examples of runs on the line as well as double-indexed permutation statistics.

Journal of Computational Biology | 2010

Alignment-Free Sequence Comparison (II): Theoretical Power of Comparison Statistics

Lin Wan; Gesine Reinert; Fengzhu Sun; Michael S. Waterman

Rapid methods for alignment-free sequence comparison make large-scale comparisons between sequences increasingly feasible. Here we study the power of the statistic D2, which counts the number of matching k-tuples between two sequences, as well as D2*, which uses centralized counts, and D2S, which is a self-standardized version, both from a theoretical viewpoint and numerically, providing an easy-to-use program. The power is assessed under two alternative hidden Markov models; the first one assumes that the two sequences share a common motif, whereas the second model is a pattern transfer model; the null model is that the two sequences are composed of independent and identically distributed letters and they are independent. Under the first alternative model, the means of the tuple counts in the individual sequences change, whereas under the second alternative model, the marginal means are the same as under the null model. Using the limit distributions of the count statistics under the null and the alternative models, we find that generally, asymptotically D2S has the largest power, followed by D2*, whereas the power of D2 can even be zero in some cases. In contrast, even for sequences of length 140,000 bp, in simulations D2* generally has the largest power. Under the first alternative model of a shared motif, the power of D2* approaches 100% when sufficiently many motifs are shared, and we recommend the use of D2* for such practical applications. Under the second alternative model of pattern transfer, the power for all three count statistics does not increase with sequence length when the sequence is sufficiently long, and hence none of the three statistics under consideration can be recommended in such a situation. We illustrate the approach on 323 transcription factor binding motifs with length at most 10 from JASPAR CORE (October 12, 2009 version), verifying that D2* is generally more powerful than D2. The program to calculate the power of D2, D2* and D2S can be downloaded from http://meta.cmb.usc.edu/d2. Supplementary Material is available at www.liebertonline.com/cmb.
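
To make the difference between raw and centralized counts concrete, here is a rough Python sketch of the two variants under the i.i.d. null model. The abstract does not spell out the normalisations, so the formulas below are an assumption based on the forms commonly quoted for these statistics: counts are centred by their null expectation, D2* divides each product by the geometric mean of the numbers of k-tuple positions times the word probability, and D2S divides each product by the root of the sum of squared centred counts. All function and parameter names are hypothetical, and this is not the downloadable program mentioned above.

```python
from collections import Counter
from itertools import product
from math import prod, sqrt


def centred_counts(seq, k, letter_probs):
    """Map each k-letter word to (count minus expected count, word probability)
    under an i.i.d. letter model. Assumes len(seq) >= k."""
    n_bar = len(seq) - k + 1  # number of k-tuple positions
    counts = Counter(seq[i:i + k] for i in range(n_bar))
    centred = {}
    for letters in product(sorted(letter_probs), repeat=k):
        word = "".join(letters)
        p_w = prod(letter_probs[c] for c in letters)
        centred[word] = (counts[word] - n_bar * p_w, p_w)
    return centred, n_bar


def d2_star(seq_a, seq_b, k, letter_probs):
    """Assumed form: sum of centred products / (sqrt(n_bar * m_bar) * p_w)."""
    ca, n_bar = centred_counts(seq_a, k, letter_probs)
    cb, m_bar = centred_counts(seq_b, k, letter_probs)
    scale = sqrt(n_bar * m_bar)
    return sum(ca[w][0] * cb[w][0] / (scale * ca[w][1]) for w in ca)


def d2_s(seq_a, seq_b, k, letter_probs):
    """Assumed form: sum of centred products / sqrt(x^2 + y^2), skipping
    words where both centred counts are exactly zero."""
    ca, _ = centred_counts(seq_a, k, letter_probs)
    cb, _ = centred_counts(seq_b, k, letter_probs)
    total = 0.0
    for w in ca:
        x, y = ca[w][0], cb[w][0]
        denom = sqrt(x * x + y * y)
        if denom > 0:
            total += x * y / denom
    return total


probs = {"A": 0.5, "C": 0.5}  # illustrative two-letter alphabet
print(d2_star("AACA", "ACCC", 1, probs), d2_s("AACA", "ACCC", 1, probs))
```

In practice the letter probabilities would have to be estimated from the data; the sketch takes them as given to keep the example short.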

Briefings in Bioinformatics | 2014

New developments of alignment-free sequence comparison: measures, statistics and next-generation sequencing

Kai Song; Jie Ren; Gesine Reinert; Minghua Deng; Michael S. Waterman; Fengzhu Sun

With the development of next-generation sequencing (NGS) technologies, a large amount of short read data has been generated. Assembly of these short reads can be challenging for genomes and metagenomes without template sequences, making alignment-based genome sequence comparison difficult. In addition, sequence reads from NGS can come from different regions of various genomes and they may not be alignable. Sequence signature-based methods for genome comparison based on the frequencies of word patterns in genomes and metagenomes can potentially be useful for the analysis of short reads data from NGS. Here we review the recent development of alignment-free genome and metagenome comparison based on the frequencies of word patterns with emphasis on the dissimilarity measures between sequences, the statistical power of these measures when two sequences are related and the applications of these measures to NGS data.

Journal of Computational Biology | 1996

Poisson Process Approximation for Sequence Repeats, and Sequencing by Hybridization

Richard Arratia; Daniela Martin; Gesine Reinert; Michael S. Waterman

Sequencing by hybridization is a tool to determine a DNA sequence from the unordered list of all l-tuples contained in this sequence; typical numbers for l are l = 8, 10, 12. For theoretical purposes we assume that the multiset of all l-tuples is known. This multiset determines the DNA sequence uniquely if none of the so-called Ukkonen transformations are possible. These transformations require repeats of (l-1)-tuples in the sequence, with these repeats occurring in certain spatial patterns. We model DNA as an i.i.d. sequence. We first prove Poisson process approximations for the process of indicators of all leftmost long repeats allowing self-overlap and for the process of indicators of all left-most long repeats without self-overlap. Using the Chen-Stein method, we get bounds on the error of these approximations. As a corollary, we approximate the distribution of longest repeats. In the second step we analyze the spatial patterns of the repeats. Finally we combine these two steps to prove an approximation for the probability that a random sequence is uniquely recoverable from its list of l-tuples. For all our results we give some numerical examples including error bounds.
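
The setup in this abstract can be sketched in a few lines of Python: build the multiset of l-tuples of a sequence, then look for repeated (l-1)-tuples. By the abstract, such repeats are required for an Ukkonen transformation, so a sequence with no repeated (l-1)-tuple is determined uniquely by its multiset. The converse does not hold, since the repeats must also occur in certain spatial patterns, which this sketch does not check. The function names are hypothetical.

```python
from collections import Counter


def tuple_multiset(seq, l):
    """Multiset of all l-tuples in seq, stored as a Counter."""
    return Counter(seq[i:i + l] for i in range(len(seq) - l + 1))


def repeated_tuples(seq, m):
    """Set of m-tuples that occur more than once in seq."""
    return {t for t, c in tuple_multiset(seq, m).items() if c > 1}


def no_repeat_guarantee(seq, l):
    """True if seq has no repeated (l-1)-tuple, which is sufficient (but
    not necessary) for unique recovery from its l-tuple multiset."""
    return not repeated_tuples(seq, l - 1)


print(sorted(tuple_multiset("ACGTAC", 3).items()))
print(repeated_tuples("ACGTAC", 2), no_repeat_guarantee("ACGTAC", 3))
```

A result of False from `no_repeat_guarantee` only means that the simple sufficient condition fails; deciding recoverability then requires the analysis of spatial patterns described in the abstract.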

BMC Systems Biology | 2010

Deciphering chemotaxis pathways using cross species comparisons

Rebecca Hamer; Pao-Yang Chen; Judith P. Armitage; Gesine Reinert; Charlotte M. Deane

Background: Chemotaxis is the process by which motile bacteria sense their chemical environment and move towards more favourable conditions. Escherichia coli utilises a single sensory pathway, but little is known about signalling pathways in species with more complex systems.

Results: To investigate whether chemotaxis pathways in other bacteria follow the E. coli paradigm, we analysed 206 species encoding at least 1 homologue of each of the 5 core chemotaxis proteins (CheA, CheB, CheR, CheW and CheY). 61 species encode more than one of all of these 5 proteins, suggesting they have multiple chemotaxis pathways. Operon information is not available for most bacteria, so we developed a novel statistical approach to cluster che genes into putative operons. Using operon-based models, we reconstructed putative chemotaxis pathways for all 206 species. We show that cheA-cheW and cheR-cheB have strong preferences to occur in the same operon as two-gene blocks, which may reflect a functional requirement for co-transcription. However, other che genes, most notably cheY, are more dispersed on the genome. Comparison of our operons with shuffled equivalents demonstrates that specific patterns of genomic location may be a determining factor for the observed in vivo chemotaxis pathways. We then examined the chemotaxis pathways of Rhodobacter sphaeroides. Here, the PpfA protein is known to be critical for correct partitioning of proteins in the cytoplasmically-localised pathway. We found ppfA in che operons of many species, suggesting that partitioning of cytoplasmic Che protein clusters is common. We also examined the apparently non-typical chemotaxis components, CheA3, CheA4 and CheY6. We found that though variants of CheA proteins are rare, the CheY6 variant may be a common type of CheY, with a significantly disordered C-terminal region which may be functionally significant.

Conclusions: We find that many bacterial species potentially have multiple chemotaxis pathways, with grouping of che genes into operons likely to be a major factor in keeping signalling pathways distinct. Gene order is highly conserved with cheA-cheW and cheR-cheB blocks, perhaps reflecting functional linkage. CheY behaves differently to other Che proteins, both in its genomic location and its putative protein interactions, which should be considered when modelling chemotaxis pathways.

Journal of Theoretical Biology | 2011

New powerful statistics for alignment-free sequence comparison under a pattern transfer model

Xuemei Liu; Lin Wan; Jing Li; Gesine Reinert; Michael S. Waterman; Fengzhu Sun

Alignment-free sequence comparison is widely used for comparing gene regulatory regions and for identifying horizontally transferred genes. Recent studies on the power of a widely used alignment-free comparison statistic D2 and its variants D2* and D2S showed that their power approximates a limit smaller than 1 as the sequence length tends to infinity under a pattern transfer model. We develop new alignment-free statistics based on D2, D2* and D2S by comparing local sequence pairs and then summing over all the local sequence pairs of certain length. We show that the new statistics are much more powerful than the corresponding statistics and the power tends to 1 as the sequence length tends to infinity under the pattern transfer model.

PLOS Computational Biology | 2008

Predicting and Validating Protein Interactions Using Network Structure

Pao-Yang Chen; Charlotte M. Deane; Gesine Reinert

Protein interactions play a vital part in the function of a cell. As experimental techniques for detection and validation of protein interactions are time consuming, there is a need for computational methods for this task. Protein interactions appear to form a network with a relatively high degree of local clustering. In this paper we exploit this clustering by suggesting a score based on triplets of observed protein interactions. The score utilises both protein characteristics and network properties. Our score based on triplets is shown to complement existing techniques for predicting protein interactions, outperforming them on data sets which display a high degree of clustering. The predicted interactions score highly against test measures for accuracy. Compared to a similar score derived from pairwise interactions only, the triplet score displays higher sensitivity and specificity. By looking at specific examples, we show how an experimental set of interactions can be enriched and validated. As part of this work we also examine the effect of different prior databases upon the accuracy of prediction and find that the interactions from the same kingdom give better results than from across kingdoms, suggesting that there may be fundamental differences between the networks. These results all emphasize that network structure is important and helps in the accurate prediction of protein interactions. The protein interaction data set and the program used in our analysis, and a list of predictions and validations, are available at http://www.stats.ox.ac.uk/bioinfo/resources/PredictingInteractions.
