Paula Tataru
Aarhus University
Publication
Featured research published by Paula Tataru.
BMC Bioinformatics | 2011
Paula Tataru; Asger Hobolth
Background: Continuous-time Markov chains (CTMCs) are a widely used model for describing the evolution of DNA sequences at the nucleotide, amino acid, or codon level. The sufficient statistics for CTMCs are the time spent in each state and the number of changes between any two states. In applications, past evolutionary events (the exact times and types of changes) are inaccessible, and the past must be inferred from DNA sequence data observed in the present. Results: We describe and implement three algorithms for computing linear combinations of expected values of the sufficient statistics, conditioned on the end-points of the chain, and compare their performance with respect to accuracy and running time. The first algorithm is based on an eigenvalue decomposition of the rate matrix (EVD), the second on uniformization (UNI), and the third on integrals of matrix exponentials (EXPM). The implementation in R of the algorithms is available at http://www.birc.au.dk/~paula/. Conclusions: We use two different models to analyze the accuracy and eight experiments to investigate the speed of the three algorithms. We find that they have similar accuracy and that EXPM is the slowest method. Furthermore, we find that UNI is usually faster than EVD.
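As an illustration of the uniformization idea mentioned in this abstract, the following is a minimal Python sketch, not the authors' R implementation. It assumes the standard uniformization identity: with mu at least as large as every exit rate and R = I + Q/mu, the integral of P_ac(s) P_db(t-s) over [0, t] equals (1/mu) times a Poisson-weighted sum of products of powers of R. The function names, the truncation rule, and the two-state rate matrix in the usage note are all assumptions made for this example.

```python
import math

def mat_mul(A, B):
    """Multiply two square matrices stored as nested lists."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def conditional_expectations(Q, t, a, b):
    """Endpoint-conditioned expectations for a CTMC with rate matrix Q.

    Returns (p_ab, time_in, jumps): the transition probability P_ab(t),
    the expected time spent in each state, and the expected number of
    c -> d jumps, all given X(0) = a and X(t) = b.
    """
    n = len(Q)
    mu = max(-Q[i][i] for i in range(n))          # uniformization rate
    R = [[(1.0 if i == j else 0.0) + Q[i][j] / mu for j in range(n)]
         for i in range(n)]
    # Heuristic truncation point (an assumption of this sketch).
    n_terms = int(mu * t + 10.0 * math.sqrt(mu * t) + 20)

    # powers[k] = R^k for k = 0..n_terms
    powers = [[[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]]
    for _ in range(n_terms):
        powers.append(mat_mul(powers[-1], R))

    # weights[k] = Poisson(k; mu * t) for k = 0..n_terms + 1
    weights = [math.exp(-mu * t)]
    for k in range(1, n_terms + 2):
        weights.append(weights[-1] * mu * t / k)

    p_ab = sum(weights[k] * powers[k][a][b] for k in range(n_terms + 1))

    # integral[c][d] = int_0^t P_ac(s) P_db(t - s) ds
    integral = [[0.0] * n for _ in range(n)]
    for m in range(n_terms + 1):
        for c in range(n):
            for d in range(n):
                conv = sum(powers[l][a][c] * powers[m - l][d][b]
                           for l in range(m + 1))
                integral[c][d] += weights[m + 1] / mu * conv

    time_in = [integral[c][c] / p_ab for c in range(n)]
    jumps = [[Q[c][d] * integral[c][d] / p_ab if c != d else 0.0
              for d in range(n)] for c in range(n)]
    return p_ab, time_in, jumps
```

For a hypothetical two-state chain such as `Q = [[-1.0, 1.0], [2.0, -2.0]]`, calling `conditional_expectations(Q, 1.5, 0, 1)` gives expected state times that sum to 1.5, which is a quick sanity check on the truncation.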
BMC Bioinformatics | 2012
James W. J. Anderson; Paula Tataru; Joe Staines; Jotun Hein; Rune B. Lyngsø
Background: Stochastic Context-Free Grammars (SCFGs) were applied successfully to RNA secondary structure prediction in the early 1990s, and used in combination with comparative methods in the late 1990s. The set of SCFGs potentially useful for RNA secondary structure prediction is very large, but a few intuitively designed grammars have remained dominant. In this paper we investigate two automatic search techniques for effective grammars: exhaustive search for very compact grammars and an evolutionary algorithm to find larger grammars. We also examine whether grammar ambiguity is as problematic to structure prediction as has been previously suggested. Results: These search techniques were applied to predict RNA secondary structure on a maximal data set and revealed new and interesting grammars, though none are dramatically better than classic grammars. In general, results showed that many grammars with quite different structure could have very similar predictive ability. Many ambiguous grammars were found which were at least as effective as the best current unambiguous grammars. Conclusions: Overall, the method of evolving SCFGs for RNA secondary structure prediction proved effective in finding many grammars that had strong predictive accuracy, as good as or slightly better than those designed manually. Furthermore, several of the best grammars found were ambiguous, demonstrating that such grammars should not be disregarded.
Genetics | 2017
Paula Tataru; Maeva Mollion; Sylvain Glémin; Thomas Bataillon
The distribution of fitness effects (DFE) encompasses the fraction of deleterious, neutral, and beneficial mutations. It conditions the evolutionary trajectory of populations, as well as the rate of adaptive molecular evolution (α). Inferring DFE and α from patterns of polymorphism, as given through the site frequency spectrum (SFS) and divergence data, has been a longstanding goal of evolutionary genetics. A widespread assumption shared by previous inference methods is that beneficial mutations contribute only negligibly to the polymorphism data. Hence, a DFE comprising only deleterious mutations tends to be estimated from SFS data, and α is then predicted by contrasting the SFS with divergence data from an outgroup. We develop a hierarchical probabilistic framework that extends previous methods to infer DFE and α from polymorphism data alone. We use extensive simulations to examine the performance of our method. While an outgroup is still needed to obtain an unfolded SFS, we show that a DFE comprising both deleterious and beneficial mutations, as well as α, can be inferred without using divergence data. We also show that not accounting for the contribution of beneficial mutations to polymorphism data leads to substantially biased estimates of the DFE and α. We compare our framework with one of the most widely used inference methods available and apply it to a recently published chimpanzee exome data set.
Systematic Biology | 2016
Paula Tataru; Maria Simonsen; Thomas Bataillon; Asger Hobolth
The Wright-Fisher model provides an elegant mathematical framework for understanding allele frequency data. In particular, the model can be used to infer the demographic history of species and identify loci under selection. A crucial quantity for inference under the Wright-Fisher model is the distribution of allele frequencies (DAF). Despite the apparent simplicity of the model, the calculation of the DAF is challenging. We review and discuss strategies for approximating the DAF, and how these are used in methods that perform inference from allele frequency data. Various evolutionary forces can be incorporated in the Wright-Fisher model, and we consider these in turn. We begin our review with the basic bi-allelic Wright-Fisher model where random genetic drift is the only evolutionary force. We then consider mutation, migration, and selection. In particular, we compare diffusion-based and moment-based methods in terms of accuracy, computational efficiency, and analytical tractability. We conclude with a brief overview of the multi-allelic process with a general mutation model.
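To make the DAF concrete for the basic bi-allelic model with drift only, here is a small Python sketch that computes the exact DAF by iterating the binomial transition matrix. It is an illustration under stated assumptions rather than a method from the review: the function names are hypothetical, and the brute-force approach is only practical for small population sizes, which is part of why approximations are needed.

```python
import math

def binom_pmf(n, k, p):
    """Binomial probability of k successes in n trials."""
    return math.comb(n, k) * p**k * (1.0 - p)**(n - k)

def wright_fisher_daf(two_n, start_count, generations):
    """Exact distribution of allele counts under pure drift.

    two_n is the number of gene copies (2N for a diploid population),
    start_count the initial number of copies of the allele.
    Returns a list where entry j is P(count = j) after `generations`.
    """
    size = two_n + 1
    # transition[i][j] = P(next count = j | current count = i)
    transition = [[binom_pmf(two_n, j, i / two_n) for j in range(size)]
                  for i in range(size)]
    dist = [0.0] * size
    dist[start_count] = 1.0
    for _ in range(generations):
        dist = [sum(dist[i] * transition[i][j] for i in range(size))
                for j in range(size)]
    return dist

def mean_and_variance(dist):
    """Mean and variance of the allele frequency implied by `dist`."""
    two_n = len(dist) - 1
    mean = sum(p * j / two_n for j, p in enumerate(dist))
    second = sum(p * (j / two_n) ** 2 for j, p in enumerate(dist))
    return mean, second - mean**2
```

Under pure drift the mean frequency stays at its starting value x0 and the variance is x0(1 - x0)(1 - (1 - 1/(2N))^t), so `mean_and_variance(wright_fisher_daf(20, 6, 10))` can be checked against those closed forms. The entries `dist[0]` and `dist[-1]` are the loss and fixation probabilities.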
Genetics | 2015
Paula Tataru; Thomas Bataillon; Asger Hobolth
The large amount and high quality of genomic data available today enable, in principle, accurate inference of evolutionary histories of observed populations. The Wright-Fisher model is one of the most widely used models for this purpose. It describes the stochastic behavior in time of allele frequencies and the influence of evolutionary pressures, such as mutation and selection. Despite its simple mathematical formulation, exact results for the distribution of allele frequency (DAF) as a function of time are not available in closed analytical form. Existing approximations build on the computationally intensive diffusion limit or rely on matching moments of the DAF. One of the moment-based approximations relies on the beta distribution, which can accurately describe the DAF when the allele frequency is not close to the boundaries (0 and 1). Nonetheless, under a Wright-Fisher model, the probability of being on the boundary can be positive, corresponding to the allele being either lost or fixed. Here we introduce the beta with spikes, an extension of the beta approximation that explicitly models the loss and fixation probabilities as two spikes at the boundaries. We show that the addition of spikes greatly improves the quality of the approximation. We additionally illustrate, using both simulated and real data, how the beta with spikes can be used for inference of divergence times between populations with comparable performance to an existing state-of-the-art method.
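The structure of such a distribution can be sketched in a few lines of Python. The sketch assumes that the overall mean and variance of the DAF and the two spike probabilities are already known (the paper's own procedure for obtaining them is not reproduced here), and it fits the beta part by matching the mean and variance conditional on the allele still segregating. The function name and this particular parameterization are assumptions of the example.

```python
def beta_with_spikes_shape(mean, var, p_loss, p_fix):
    """Beta shape parameters for the interior of a beta-with-spikes DAF.

    mean, var : mean and variance of the full distribution on [0, 1]
    p_loss    : probability mass at frequency 0
    p_fix     : probability mass at frequency 1
    Returns (alpha, beta) such that the mixture
        p_loss * delta_0 + p_fix * delta_1 + (1 - p_loss - p_fix) * Beta(alpha, beta)
    has the requested overall mean and variance.
    """
    interior = 1.0 - p_loss - p_fix
    if interior <= 0.0:
        raise ValueError("no probability mass left for the interior")
    # Moments conditional on 0 < X < 1: the spike at 1 contributes p_fix
    # to both E[X] and E[X^2]; the spike at 0 contributes nothing.
    cond_mean = (mean - p_fix) / interior
    cond_second = (var + mean**2 - p_fix) / interior
    cond_var = cond_second - cond_mean**2
    common = cond_mean * (1.0 - cond_mean) / cond_var - 1.0
    if common <= 0.0:
        raise ValueError("moments are not attainable by a beta distribution")
    return cond_mean * common, (1.0 - cond_mean) * common
```

As a consistency check, a mixture built from Beta(2, 3) with spikes of 0.1 at zero and 0.2 at one has mean 0.48 and second moment 0.34, and passing those moments back in recovers the shape parameters 2 and 3.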
Proceedings of the National Academy of Sciences of the United States of America | 2018
Jesper T. Bjerg; Henricus T. S. Boschker; Steffen Larsen; David Berry; Markus Schmid; Diego Millo; Paula Tataru; Filip J. R. Meysman; Michael Wagner; Lars Peter Nielsen; Andreas Schramm
Significance: Cable bacteria are centimeter-long, multicellular filamentous bacteria, which are globally occurring in marine and freshwater sediments. Their presence coincides with the occurrence of electrical fields, and gradients of oxygen and sulfide that are best explained by electron transport from sulfide to oxygen along the cable-bacteria filaments, implying electric conductance by living bacteria over centimeter distances. Until now, all indications for such long-distance electron transport were derived from bulk sediment incubations. Here we present measurements on individual cable-bacteria filaments that allow us to quantify a voltage drop along cable-bacteria filaments and show a transport of electrons over several millimeters. This is orders of magnitude longer than previously known for biological electron transport.
Abstract: Electron transport within living cells is essential for energy conservation in all respiring and photosynthetic organisms. While a few bacteria transport electrons over micrometer distances to their surroundings, filaments of cable bacteria are hypothesized to conduct electric currents over centimeter distances. We used resonance Raman microscopy to analyze cytochrome redox states in living cable bacteria. Cable-bacteria filaments were placed in microscope chambers with sulfide as electron source and oxygen as electron sink at opposite ends. Along individual filaments a gradient in cytochrome redox potential was detected, which immediately broke down upon removal of oxygen or laser cutting of the filaments. Without access to oxygen, a rapid shift toward more reduced cytochromes was observed, as electrons were no longer drained from the filament but accumulated in the cellular cytochromes. These results provide direct evidence for long-distance electron transport in living multicellular bacteria.
bioRxiv | 2016
Morten Muhlig Nielsen; Paula Tataru; Tobias Madsen; Asger Hobolth; Jakob Skou Pedersen
Motif analysis has long been an important method to characterize biological functionality and the current growth of sequencing-based genomics experiments further extends its potential. These diverse experiments often generate sequence lists ranked by some functional property. There is therefore a growing need for motif analysis methods that can exploit this coupled data structure and be tailored for specific biological questions. Here, we present a motif analysis tool, Regmex (REGular expression Motif EXplorer), which offers several methods to identify overrepresented motifs in a ranked list of sequences. Regmex uses regular expressions to define motifs or families of motifs and embedded Markov models to calculate exact probabilities for motif observations in sequences. Motif enrichment is optionally evaluated using random walks, Brownian bridges, or modified rank based statistics. These features make Regmex well suited for a range of biological sequence analysis problems related to motif discovery. We demonstrate different usage scenarios including rank correlation of microRNA binding sites co-occurring with a U-rich motif. The method is available as an R package.
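The general idea of scoring a regular-expression motif against a ranked list with a random walk can be illustrated as follows. This Python sketch is not Regmex and does not use its embedded Markov models or its exact probabilities: it simply marks each sequence as a hit or a miss with Python's `re` module and builds a running sum that returns to zero at the end of the list. The function names, the step sizes, and the example sequences are all assumptions of this illustration.

```python
import re

def motif_walk(ranked_sequences, motif):
    """Running-sum walk over a ranked list of sequences.

    Each sequence containing the motif moves the walk up by 1/n_hits,
    each sequence without it moves the walk down by 1/n_misses, so the
    walk always ends at zero. Hits concentrated near the top of the
    list produce a large positive excursion.
    """
    hits = [re.search(motif, seq) is not None for seq in ranked_sequences]
    n_hits = sum(hits)
    n_misses = len(hits) - n_hits
    if n_hits == 0 or n_misses == 0:
        raise ValueError("need at least one hit and one miss")
    walk, position = [], 0.0
    for hit in hits:
        position += 1.0 / n_hits if hit else -1.0 / n_misses
        walk.append(position)
    return walk

def max_deviation(walk):
    """Largest absolute excursion of the walk, a simple enrichment score."""
    return max(abs(x) for x in walk)
```

For the hypothetical ranked list `["AUUUA", "CUUUUG", "GGCC", "ACGA"]` and the motif `U{3,}`, both hits sit at the top of the ranking, so the walk climbs to its maximum before descending back to zero. Assessing significance would require a null distribution for the score, which this sketch does not provide.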
Biology | 2013
Paula Tataru; Andreas Sand; Asger Hobolth; Thomas Mailund; Christian N. S. Pedersen
Hidden Markov Models (HMMs) are widely used probabilistic models, particularly for annotating sequential data with an underlying hidden structure. Patterns in the annotation are often more relevant to study than the hidden structure itself. A typical HMM analysis consists of annotating the observed data using a decoding algorithm and analyzing the annotation to study patterns of interest. For example, given an HMM modeling genes in DNA sequences, the focus is on occurrences of genes in the annotation. In this paper, we define a pattern through a regular expression and present a restriction of three classical algorithms to take the number of occurrences of the pattern in the hidden sequence into account. We present a new algorithm to compute the distribution of the number of pattern occurrences, and we extend the two most widely used existing decoding algorithms to employ information from this distribution. We show experimentally that the expectation of the distribution of the number of pattern occurrences gives a highly accurate estimate, while the typical procedure can be biased in the sense that the identified number of pattern occurrences does not correspond to the true number. We furthermore show that using this distribution in the decoding algorithms improves the predictive power of the model.
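The "typical procedure" described in this abstract, decoding first and then counting pattern occurrences in the annotation, can be sketched in Python as below. This shows only that baseline, not the paper's new algorithm for the distribution of occurrence counts. The two-state model in the usage note, the single-character state names, and the function names are assumptions made for the example.

```python
import math
import re

def viterbi(obs, states, start, trans, emit):
    """Most probable hidden path, computed in log space.

    states must be single-character names so that the path can be
    returned as a string and searched with a regular expression.
    """
    score = {s: math.log(start[s]) + math.log(emit[s][obs[0]]) for s in states}
    back = []
    for symbol in obs[1:]:
        new_score, pointers = {}, {}
        for s in states:
            best_prev = max(states, key=lambda r: score[r] + math.log(trans[r][s]))
            new_score[s] = (score[best_prev] + math.log(trans[best_prev][s])
                            + math.log(emit[s][symbol]))
            pointers[s] = best_prev
        score = new_score
        back.append(pointers)
    # Trace back from the best final state.
    path = [max(states, key=lambda s: score[s])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return "".join(reversed(path))

def count_pattern(path, pattern):
    """Number of non-overlapping matches of `pattern` in the decoded path."""
    return len(re.findall(pattern, path))
```

With a hypothetical model where state `N` mostly emits `a`, state `G` mostly emits `b`, and both states tend to persist, decoding `"aabbaabba"` and counting matches of `G+` reports how many gene-like runs appear in the annotation. As the abstract notes, a count obtained this way can be biased relative to the true number of occurrences.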
Biology Letters | 2016
Belén Jiménez-Mena; Paula Tataru; Rasmus Froberg Brøndum; Goutam Sahana; Bernt Guldbrandtsen; Thomas Bataillon
Effective population size (Ne) is a central parameter in population and conservation genetics. It measures the magnitude of genetic drift and the rate of accumulation of inbreeding in a population, and it conditions the efficacy of selection. It is often assumed that a single Ne can account for the evolution of genomes. However, recent work provides indirect evidence for heterogeneity in Ne throughout the genome. We study this by examining genome-wide diversity in the Danish Holstein cattle breed. Using the differences in allele frequencies over a single generation, we directly estimated Ne among autosomes and smaller windows within autosomes. We found statistically significant variation in Ne at both scales. However, no correlation was found between the detected regional variability in Ne and proxies for the intensity of linked selection (local recombination rate, gene density), or the presence of either past strong selection or current artificial selection on traits of economic value. Our findings call for further caution regarding the wide applicability of the Ne concept for quantitatively understanding processes such as genetic drift and accumulation of consanguinity in both natural and managed populations.
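The link between single-generation allele frequency change and Ne can be illustrated with a simple moment estimator. Under Wright-Fisher drift, the expected squared change in frequency at a locus with frequency p is p(1 - p)/(2Ne), so pooling loci gives Ne as the sum of p(1 - p) divided by twice the sum of squared changes. The Python sketch below assumes frequencies are known without sampling error and is not necessarily the estimator used in the study; the function name and the example values are hypothetical.

```python
def estimate_ne(freq_parents, freq_offspring):
    """Moment estimate of Ne from one generation of allele frequency change.

    freq_parents, freq_offspring : allele frequencies at the same loci in
    two consecutive generations. Sampling noise in the frequencies is
    ignored (an assumption of this sketch), which would bias Ne downward
    if the frequencies were estimated from small samples.
    """
    if len(freq_parents) != len(freq_offspring):
        raise ValueError("frequency lists must have the same length")
    heterozygosity = sum(p * (1.0 - p) for p in freq_parents)
    squared_change = sum((q - p) ** 2
                         for p, q in zip(freq_parents, freq_offspring))
    if squared_change == 0.0:
        raise ValueError("no frequency change observed; Ne is not estimable")
    return heterozygosity / (2.0 * squared_change)
```

Applying the same function to the loci of one chromosome, or of one window within a chromosome, gives region-specific estimates that can then be compared across the genome.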
bioRxiv | 2018
Paula Tataru; Thomas Bataillon
Distributions of fitness effects (DFE) of mutations can be inferred from site frequency spectrum (SFS) data. There is mounting interest in determining whether distinct genomic regions and/or species share a common DFE, or whether evidence exists for differences among them. polyDFEv2.0 fits multiple SFS datasets at once and provides likelihood ratio tests for DFE invariance across datasets. Simulations show that testing for DFE invariance across genomic regions within a species requires models that account for the heterogeneous genealogical histories underlying SFS data in these regions. Not accounting for these heterogeneities results in the spurious detection of DFE differences.