André Gohr
Martin Luther University of Halle-Wittenberg
Publication
Featured research published by André Gohr.
PLOS ONE | 2014
Ralf Eggeling; André Gohr; Jens Keilwagen; Michaela Mohr; Stefan Posch; Andrew D. Smith; Ivo Grosse
The binding affinity of DNA-binding proteins such as transcription factors is mainly determined by the base composition of the corresponding binding site on the DNA strand. Most proteins do not bind only a single sequence, but rather a set of sequences, which may be modeled by a sequence motif. Algorithms for de novo motif discovery differ in their promoter models, learning approaches, and other aspects, but typically use the statistically simple position weight matrix model for the motif, which assumes statistical independence among all nucleotides. However, there is no clear justification for that assumption, leading to an ongoing debate about the importance of modeling dependencies between nucleotides within binding sites. In the past, modeling statistical dependencies within binding sites has been hampered by the problem of limited data. With the rise of high-throughput technologies such as ChIP-seq, this situation has now changed, making it possible to make use of statistical dependencies effectively. In this work, we investigate the presence of statistical dependencies in binding sites of the human enhancer-blocking insulator protein CTCF by using the recently developed model class of inhomogeneous parsimonious Markov models, which is capable of modeling complex dependencies while avoiding overfitting. These findings lead to a more detailed characterization of the CTCF binding motif, which is only poorly represented by independent nucleotide frequencies at several positions, predominantly at the 3′ end.
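The independence assumption of the position weight matrix (PWM) model mentioned in this abstract can be illustrated with a minimal Python sketch. The matrix values and the function name `pwm_log_likelihood` are hypothetical and are not taken from the paper; the sketch only shows that, under a PWM, the probability of a site factorizes over positions.

```python
import math

# Hypothetical 3-position PWM: one probability distribution over A, C, G, T
# per motif position. The values are illustrative only.
PWM = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
]


def pwm_log_likelihood(pwm, site):
    """Return the log-probability of `site` under a PWM.

    Positions are assumed to be statistically independent, so the joint
    probability is a product of per-position probabilities, computed here
    as a sum of logarithms.
    """
    if len(site) != len(pwm):
        raise ValueError("site length must equal motif length")
    return sum(math.log(column[base]) for column, base in zip(pwm, site))


# Compare a site matching the motif with a poorly matching one.
print(pwm_log_likelihood(PWM, "AGT"))
print(pwm_log_likelihood(PWM, "CCT"))
```

A model with intra-motif dependencies would instead condition the distribution at each position on some of the preceding bases, which is what this factorization cannot express.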
PLOS Computational Biology | 2012
Michael Seifert; André Gohr; Marc Strickert; Ivo Grosse
Array-based comparative genomic hybridization (Array-CGH) is an important technology in molecular biology for the detection of DNA copy number polymorphisms between closely related genomes. Hidden Markov Models (HMMs) are popular tools for the analysis of Array-CGH data, but current methods are only based on first-order HMMs having constrained abilities to model spatial dependencies between measurements of closely adjacent chromosomal regions. Here, we develop parsimonious higher-order HMMs enabling the interpolation between a mixture model ignoring spatial dependencies and a higher-order HMM exhaustively modeling spatial dependencies. We apply parsimonious higher-order HMMs to the analysis of Array-CGH data of the accessions C24 and Col-0 of the model plant Arabidopsis thaliana. We compare these models against first-order HMMs and other existing methods using a reference of known deletions and sequence deviations. We find that parsimonious higher-order HMMs clearly improve the identification of these polymorphisms. Moreover, we perform a functional analysis of identified polymorphisms revealing novel details of genomic differences between C24 and Col-0. Additional model evaluations are done on widely considered Array-CGH data of human cell lines indicating that parsimonious HMMs are also well-suited for the analysis of non-plant specific data. All these results indicate that parsimonious higher-order HMMs are useful for Array-CGH analyses. An implementation of parsimonious higher-order HMMs is available as part of the open source Java library Jstacs (www.jstacs.de/index.php/PHHMM).
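For orientation, the first-order HMM baseline that this abstract compares against can be sketched as a Viterbi decoder over three copy-number states. Everything concrete here is an assumption made for illustration: the state names, the Gaussian emission means, the standard deviation, and the self-transition probability are hypothetical, and the sketch is not the parsimonious higher-order HMM implemented in Jstacs.

```python
import math

# Hypothetical three-state model for log-ratio measurements.
STATES = ["loss", "neutral", "gain"]
MEANS = {"loss": -1.0, "neutral": 0.0, "gain": 1.0}  # assumed emission means
SIGMA = 0.5   # assumed common standard deviation
STAY = 0.9    # assumed probability of remaining in the same state


def log_gauss(x, mean, sigma):
    """Log-density of a normal distribution."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mean) ** 2 / (2 * sigma ** 2)


def log_trans(prev, cur):
    """First-order transition log-probability: the state depends only on its predecessor."""
    if prev == cur:
        return math.log(STAY)
    return math.log((1 - STAY) / (len(STATES) - 1))


def viterbi(observations):
    """Return the most probable state path for a sequence of measurements."""
    score = {
        s: math.log(1 / len(STATES)) + log_gauss(observations[0], MEANS[s], SIGMA)
        for s in STATES
    }
    back = []  # one dict of back-pointers per step after the first
    for x in observations[1:]:
        pointers, new_score = {}, {}
        for cur in STATES:
            best_prev = max(STATES, key=lambda p: score[p] + log_trans(p, cur))
            pointers[cur] = best_prev
            new_score[cur] = (
                score[best_prev] + log_trans(best_prev, cur) + log_gauss(x, MEANS[cur], SIGMA)
            )
        back.append(pointers)
        score = new_score
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]


print(viterbi([0.1, -0.1, -0.9, -1.1, -1.0, -1.2, 0.0, 0.1]))
```

A mixture model corresponds to dropping `log_trans` entirely, so that each measurement is classified on its own; a higher-order HMM would condition the transition on several preceding states rather than one.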
International Conference on Data Mining | 2007
Alexander Hinneburg; Hans-Henning Gabriel; André Gohr
Probabilistic latent semantic indexing (PLSI) represents documents of a collection as mixture proportions of latent topics, which are learned from the collection by an expectation maximization (EM) algorithm. New documents or queries need to be folded into the latent topic space by a simplified version of the EM-algorithm. During PLSI folding-in of a new document, the topic mixtures of the known documents are ignored. This may lead to a suboptimal model of the extended collection. Our new approach incorporates the topic mixtures of the known documents in a Bayesian way during folding-in. That knowledge is modeled as prior distribution over the topic simplex using a kernel density estimate of Dirichlet kernels. We demonstrate the advantages of the new Bayesian folding-in using real text data.
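The standard (non-Bayesian) folding-in step described in this abstract can be sketched in a few lines of Python: the topic-word distributions stay fixed and EM re-estimates only the topic proportions of the new document. The toy distributions, the word counts, and the function name `fold_in` are hypothetical, and the sketch deliberately omits the Dirichlet-kernel prior that the paper proposes.

```python
# Hypothetical learned topic-word distributions P(w | z) for two topics.
TOPIC_WORD = [
    {"gene": 0.5, "protein": 0.4, "market": 0.1},
    {"gene": 0.1, "protein": 0.1, "market": 0.8},
]


def fold_in(word_counts, topic_word, n_iter=50):
    """Estimate topic proportions P(z | d) for a new document by EM.

    The topic-word distributions are kept fixed; only the mixture
    proportions of the new document are updated.
    """
    n_topics = len(topic_word)
    total = sum(word_counts.values())
    theta = [1.0 / n_topics] * n_topics  # uniform starting point
    for _ in range(n_iter):
        new_theta = [0.0] * n_topics
        for word, count in word_counts.items():
            # E-step: posterior P(z | d, w) is proportional to P(w | z) * P(z | d).
            joint = [theta[z] * topic_word[z][word] for z in range(n_topics)]
            norm = sum(joint)
            for z in range(n_topics):
                new_theta[z] += count * joint[z] / norm
        # M-step: normalize the expected counts by the document length.
        theta = [value / total for value in new_theta]
    return theta


new_document = {"gene": 3, "protein": 2}
print(fold_in(new_document, TOPIC_WORD))
```

In this plain maximum-likelihood form, nothing ties the new document's proportions to those of the known documents, which is the gap the Bayesian folding-in is meant to address.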
European Conference on Machine Learning | 2013
Ralf Eggeling; André Gohr; Pierre-Yves Bourguignon; Edgar Wingender; Ivo Grosse
We introduce inhomogeneous parsimonious Markov models for modeling statistical patterns in discrete sequences. These models are based on parsimonious context trees, which are a generalization of context trees, and thus generalize variable order Markov models. We follow a Bayesian approach, consisting of structure and parameter learning. Structure learning is a challenging problem due to an overexponential number of possible tree structures, so we describe an exact and efficient dynamic programming algorithm for finding the optimal tree structures. We apply model and learning algorithm to the problem of modeling binding sites of the human transcription factor C/EBP, and find an increased prediction performance compared to fixed order and variable order Markov models. We investigate the reason for this improvement and find several instances of context-specific dependences that can be captured by parsimonious context trees but not by traditional context trees.
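To make the notion of a parsimonious context tree more concrete, the following Python sketch shows how such a tree might map a context to a leaf. The nested-tuple representation, the example tree, and the function name `find_leaf` are assumptions chosen for illustration; the structure-learning algorithm from the paper is not reproduced here.

```python
# Hypothetical parsimonious context tree of depth 2 over the DNA alphabet.
# Each node is (label, children), where the label is a set of symbols written
# as a string. The children of a node partition the alphabet; in a traditional
# context tree, each child label would be a single symbol.
TREE = [
    ("A", [("AC", []), ("GT", [])]),
    ("CGT", [("ACGT", [])]),
]


def find_leaf(tree, context):
    """Return the leaf reached by `context` as a tuple of node labels.

    `context` lists the preceding symbols, nearest first. All contexts that
    reach the same leaf share one conditional distribution.
    """
    path = []
    children = tree
    for symbol in context:
        if not children:
            break  # reached a leaf before the context was exhausted
        for label, grandchildren in children:
            if symbol in label:
                path.append(label)
                children = grandchildren
                break
        else:
            raise ValueError(f"symbol {symbol!r} is not covered by the tree")
    return tuple(path)


print(find_leaf(TREE, "AG"))
print(find_leaf(TREE, "CT"))
print(find_leaf(TREE, "GA"))
```

In this example the second context symbol matters only when the nearest symbol is A, which is one kind of context-specific dependence that grouping symbols into sets can express compactly.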
PLOS ONE | 2015
Constance Mehlgarten; Jorrit-Jan Krijger; Ioana M. Lemnian; André Gohr; Lydia Kasper; Anne-Kathrin Diesing; Ivo Grosse; Karin D. Breunig
Cellular responses to starvation are of ancient origin since nutrient limitation has always been a common challenge to the stability of living systems. Hence, signaling molecules involved in sensing or transducing information about limiting metabolites are highly conserved, whereas transcription factors and the genes they regulate have diverged. In eukaryotes the AMP-activated protein kinase (AMPK) functions as a central regulator of cellular energy homeostasis. The yeast AMPK ortholog SNF1 controls the transcriptional network that counteracts carbon starvation conditions by regulating a set of transcription factors. Among those Cat8 and Sip4 have overlapping DNA-binding specificity for so-called carbon source responsive elements and induce target genes upon SNF1 activation. To analyze the evolution of the Cat8-Sip4 controlled transcriptional network we have compared the response to carbon limitation of Saccharomyces cerevisiae to that of Kluyveromyces lactis. In high glucose, S. cerevisiae displays tumor cell-like aerobic fermentation and repression of respiration (Crabtree-positive) while K. lactis has a respiratory-fermentative life-style, respiration being regulated by oxygen availability (Crabtree-negative), which is typical for many yeasts and for differentiated higher cells. We demonstrate divergent evolution of the Cat8-Sip4 network and present evidence that a role of Sip4 in controlling anabolic metabolism has been lost in the Saccharomyces lineage. We find that in K. lactis, but not in S. cerevisiae, the Sip4 protein plays an essential role in C2 carbon assimilation including induction of the glyoxylate cycle and the carnitine shuttle genes. Induction of KlSIP4 gene expression by KlCat8 is essential under these growth conditions and a primary function of KlCat8. Both KlCat8 and KlSip4 are involved in the regulation of lactose metabolism in K. lactis. 
In chromatin-immunoprecipitation experiments, we demonstrate binding of both KlSip4 and KlCat8 to selected CSREs and provide evidence that KlSip4 counteracts KlCat8-mediated transcription activation by competing for binding to some but not all CSREs. The finding that the hierarchical relationship of these transcription factors differs between K. lactis and S. cerevisiae and that the sets of target genes have diverged contributes to explaining the phenotypic differences in metabolic life-style.
Journal of Bioinformatics and Computational Biology | 2007
Stefan Posch; Jan Grau; André Gohr; Irad Ben-Gal; Alexander E. Kel; Ivo Grosse
Variable order Markov models and variable order Bayesian trees have been proposed for the recognition of cis-regulatory elements, and it has been demonstrated that they outperform traditional models such as position weight matrices, Markov models, and Bayesian trees for the recognition of binding sites in prokaryotes. Here, we study to which degree variable order models can improve the recognition of eukaryotic cis-regulatory elements. We find that variable order models can improve the recognition of binding sites of all the studied transcription factors. To ease a systematic evaluation of different model combinations based on problem-specific data sets and allow genomic scans of cis-regulatory elements based on fixed and variable order Markov models and Bayesian trees, we provide the VOMBATserver to the public community.
Methods in Molecular Biology | 2010
Stefan Posch; Jan Grau; André Gohr; Jens Keilwagen; Ivo Grosse
Many different computer programs for the prediction of transcription factor binding sites have been developed over the last decades. These programs differ from each other by pursuing different objectives and by taking into account different sources of information. For methods based on statistical approaches, these programs differ at an elementary level from each other by the statistical models used for individual binding sites and flanking sequences and by the learning principles employed for estimating the model parameters. According to our experience, both the models and the learning principles should be chosen with great care, depending on the specific task at hand, but many existing programs do not allow the user to choose them freely. Hence, we developed Jstacs, an object-oriented Java framework for sequence analysis, which allows the user to combine different statistical models and different learning principles in a modular manner with little effort. In this chapter we explain how Jstacs can be used for the recognition of transcription factor binding sites.
Journal of Bioinformatics and Computational Biology | 2013
Jan Grau; Jens Keilwagen; André Gohr; Ivan A. Paponov; Stefan Posch; Michael Seifert; Marc Strickert; Ivo Grosse
DNA-binding proteins are a main component of gene regulation as they activate or repress gene expression by binding to specific binding sites in target regions of genomic DNA. However, de-novo discovery of these binding sites in target regions obtained by wet-lab experiments is a challenging problem in computational biology, which has not yet been solved satisfactorily. Here, we present a detailed description and analysis of the de-novo motif discovery tool Dispom, which has been developed for finding binding sites of DNA-binding proteins that are differentially abundant in a set of target regions compared to a set of control regions. Two additional features of Dispom are its capability of modeling positional preferences of binding sites and adjusting the length of the motif in the learning process. Dispom yields an increased prediction accuracy compared to existing tools for de-novo motif discovery, suggesting that the combination of searching for differentially abundant motifs, inferring their positional distributions, and adjusting the motif lengths is beneficial for de-novo motif discovery. When applying Dispom to promoters of auxin-responsive genes and those of ABI3 target genes from Arabidopsis thaliana, we identify relevant binding motifs with pronounced positional distributions. These results suggest that learning motifs, their positional distributions, and their lengths by a discriminative learning principle may aid motif discovery from ChIP-chip and gene expression data. We make Dispom freely available as part of Jstacs, an open-source Java library that is tailored to statistical sequence analysis. To facilitate extensions of Dispom, we describe its implementation using Jstacs in this manuscript. In addition, we provide a stand-alone application of Dispom at http://www.jstacs.de/index.php/Dispom for instant use.
Archive | 2018
Constance Mehlgarten; Ralf Eggeling; André Gohr; Markus Bönn; Ioana M. Lemnian; Martin Nettling; Katharina Strödecke; Carolin Kleindienst; Ivo Grosse; Karin D. Breunig
Alterations in gene regulation are considered major driving forces in divergent evolution. This is reflected in different species by the variable architecture of regulatory networks controlling highly conserved metabolic pathways. While many regulatory proteins are surprisingly conserved, their wiring has evolved more rapidly. This project focuses on the adaptation to nutrient limitation, which requires the activation of the conserved AMP-activated protein kinase (AMPK alias Snf1 in yeast) and its downstream effectors. The goal is to uncover basic principles of adaptation and steps in the evolutionary process associated with regulatory network rearrangement. This requires improving the prediction of gene regulation based on experimental data, DNA sequence information, and information theory. In this project, Context Tree (CT) models and Parsimonious Context Tree (PCT) models and the corresponding algorithms for extended Context Tree Maximization (CTM) and extended Parsimonious Context Tree Maximization (PCTM) are derived, implemented, and applied. Computational predictions and experimental validation will establish an iterative cycle that improves the algorithms in each round, leading to a growing set of experimentally verified and falsified predictions and finally allowing a deeper understanding of the evolution of the transcriptional regulatory network controlling energy metabolism, one of the most fundamental processes conserved across all kingdoms of life.
Bioinformatics | 2018
André Gohr; Manuel Irimia
Summary: Tracking thousands of alternative splicing (AS) events genome-wide makes their downstream analysis computationally challenging and laborious. Here, we present Matt, the first UNIX command-line toolkit with focus on high-level AS analyses. With 50 commands it facilitates computational AS analyses by (i) expediting repetitive data-preparation tasks, (ii) offering routine high-level analyses, including the extraction of exon/intron features, discriminative feature detection, motif enrichment analysis, and the generation of motif RNA-maps, (iii) improving reproducibility by documenting all analysis steps and (iv) accelerating the implementation of custom analysis pipelines by letting users exploit its modular functionality. Availability and implementation: matt.crg.eu under GNU LGPLv3, together with comprehensive documentation and application examples. Matt is implemented in Perl and R, invokes pdfLaTeX, and depends only on Perl core modules and the R base package, which simplifies its installation. Supplementary information: Supplementary data are available at Bioinformatics online.