Lee Aaron Newberg

New York State Department of Health

Publication

Featured research published by Lee Aaron Newberg.

Symposium on Discrete Algorithms | 1993

Physical mapping of chromosomes: a combinatorial problem in molecular biology

Farid Alizadeh; Richard M. Karp; Lee Aaron Newberg; Deborah K. Weisser

This paper is concerned with the physical mapping of DNA molecules using data about the hybridization of oligonucleotide probes to a library of clones. In mathematical terms, the DNA molecule corresponds to an interval on the real line, each clone to a subinterval, and each probe occurs at a finite set of points within the interval. A stochastic model for the occurrences of the probes and the locations of the clones is assumed. Given a matrix of incidences between probes and clones, the task is to reconstruct the most likely interleaving of the clones. Combinatorial algorithms are presented for solving approximations to this problem, and computational results are presented.

Nucleic Acids Research | 2007

The Gibbs Centroid Sampler

William A. Thompson; Lee Aaron Newberg; Sean Conlan; Lee Ann McCue; Charles E. Lawrence

The Gibbs Centroid Sampler is a software package designed for locating conserved elements in biopolymer sequences. The Gibbs Centroid Sampler reports a centroid alignment, i.e. an alignment that has the minimum total distance to the set of samples chosen from the a posteriori probability distribution of transcription factor binding-site alignments. In so doing, it garners information from the full ensemble of solutions, rather than only the single most probable point that is the target of many motif-finding algorithms, including its predecessor, the Gibbs Recursive Sampler. Centroid estimators have been shown to yield substantial improvements, in both sensitivity and positive predictive values, to the prediction of RNA secondary structure and motif finding. The Gibbs Centroid Sampler, along with interactive tutorials, an online user manual, and information on downloading the software, is available at: http://bayesweb.wadsworth.org/gibbs/gibbs.html.
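The centroid idea in the abstract above can be illustrated with a minimal sketch. It assumes each sample is represented as a set of hypothetical (sequence, position) pairs and that distance is the Hamming distance between the corresponding 0/1 indicator vectors; with no further constraints, the set of features present in more than half of the samples minimizes the total distance. This is an illustration of the concept only, not the Gibbs Centroid Sampler's own code, and the function names are invented for the example.

```python
from collections import Counter

def hamming_centroid(samples):
    """Return the features present in more than half of the samples.

    Each sample is a set of hashable features (here, hypothetical
    (sequence_id, position) pairs marking a predicted site). Viewed as
    0/1 indicator vectors, this majority-vote set minimizes the total
    Hamming distance to the samples when no constraints are imposed.
    """
    counts = Counter(feature for sample in samples for feature in set(sample))
    half = len(samples) / 2
    return {feature for feature, count in counts.items() if count > half}

def total_distance(candidate, samples):
    """Sum of Hamming distances (symmetric-difference sizes) to all samples."""
    return sum(len(set(candidate) ^ set(sample)) for sample in samples)

# Three hypothetical samples drawn from a posterior over site placements.
samples = [
    {("seq1", 12), ("seq2", 40)},
    {("seq1", 12), ("seq2", 41)},
    {("seq1", 12), ("seq2", 40), ("seq3", 7)},
]
centroid = hamming_centroid(samples)
```

A real sampler would also have to respect constraints on valid alignments (for example, non-overlapping sites), which a plain majority vote does not enforce.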

Bioinformatics | 2007

A phylogenetic Gibbs sampler that yields centroid solutions for cis-regulatory site prediction

Lee Aaron Newberg; William A. Thompson; Sean Conlan; Thomas M. Smith; Lee Ann McCue; Charles E. Lawrence

Motivation: Identification of functionally conserved regulatory elements in sequence data from closely related organisms is becoming feasible, due to the rapid growth of public sequence databases. Closely related organisms are most likely to have common regulatory motifs; however, the recent speciation of such organisms results in a high degree of correlation in their genome sequences, confounding the detection of functional elements. Additionally, alignment algorithms that use optimization techniques are limited to the detection of a single alignment that may not be representative. Comparative-genomics studies must be able to address the phylogenetic correlation in the data and efficiently explore the alignment space, in order to make specific and biologically relevant predictions. Results: We describe here a Gibbs sampler that employs a full phylogenetic model and reports an ensemble centroid solution. We describe regulatory motif detection using both simulated and real data, and demonstrate that this approach achieves improved specificity, sensitivity, and positive predictive value over non-phylogenetic algorithms, and over phylogenetic algorithms that report a maximum likelihood solution. Availability: The software is freely available at http://bayesweb.wadsworth.org/gibbs/gibbs.html. Supplementary information: Supplementary data are available at Bioinformatics online.

Journal of Computational Biology | 2009

Exact Calculation of Distributions on Integers, with Application to Sequence Alignment

Lee Aaron Newberg; Charles E. Lawrence

Computational biology is replete with high-dimensional discrete prediction and inference problems. Dynamic programming recursions can be applied to several of the most important of these, including sequence alignment, RNA secondary-structure prediction, phylogenetic inference, and motif finding. In these problems, attention is frequently focused on some scalar quantity of interest, a score, such as an alignment score or the free energy of an RNA secondary structure. In many cases, the score is naturally defined on integers, such as a count of the number of pairing differences between two sequence alignments, or else an integer score has been adopted for computational reasons, such as in the test of significance of motif scores. The probability distribution of the score under an appropriate probabilistic model is of interest, such as in tests of significance of motif scores, or in calculation of Bayesian confidence limits around an alignment. Here we present three algorithms for calculating the exact distribution of a score of this type; then, in the context of pairwise local sequence alignments, we apply the approach so as to find the alignment score distribution and Bayesian confidence limits.
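As a small illustration of computing an exact distribution over integer scores, the sketch below handles the simplest case mentioned in the abstract: the significance of a motif score. It assumes the score is a sum of independent per-position integer scores and that letters are drawn independently from a background distribution. The function names and the toy scoring matrix are made up for the example; this is not one of the paper's three algorithms.

```python
from collections import defaultdict

def score_distribution(score_columns, background):
    """Exact distribution of a sum of independent per-position integer scores.

    score_columns: list of dicts mapping letter -> integer score (one per position)
    background:    dict mapping letter -> probability under the null model
    Returns a dict mapping total score -> probability.
    """
    dist = {0: 1.0}
    for column in score_columns:
        nxt = defaultdict(float)
        # Convolve the running distribution with this position's score distribution.
        for total, p in dist.items():
            for letter, q in background.items():
                nxt[total + column[letter]] += p * q
        dist = dict(nxt)
    return dist

def p_value(dist, threshold):
    """Probability that the score is at least `threshold`."""
    return sum(p for score, p in dist.items() if score >= threshold)

# Toy two-position motif: +2 for the preferred letter, -1 otherwise.
background = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
columns = [
    {"A": 2, "C": -1, "G": -1, "T": -1},
    {"A": -1, "C": 2, "G": -1, "T": -1},
]
dist = score_distribution(columns, background)
```

Because the scores are integers, the table never holds more entries than the range of attainable totals, so the cost grows with motif length times score range rather than with the number of possible sequences.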

Algorithms for Molecular Biology | 2007

PhyloScan: identification of transcription factor binding sites using cross-species evidence

C. Steven Carmack; Lee Ann McCue; Lee Aaron Newberg; Charles E. Lawrence

Background: When transcription factor binding sites are known for a particular transcription factor, it is possible to construct a motif model that can be used to scan sequences for additional sites. However, few statistically significant sites are revealed when a transcription factor binding site motif model is used to scan a genome-scale database. Methods: We have developed a scanning algorithm, PhyloScan, which combines evidence from matching sites found in orthologous data from several related species with evidence from multiple sites within an intergenic region, to better detect regulons. The orthologous sequence data may be multiply aligned, unaligned, or a combination of aligned and unaligned. In aligned data, PhyloScan statistically accounts for the phylogenetic dependence of the species contributing data to the alignment and, in unaligned data, the evidence for sites is combined assuming phylogenetic independence of the species. The statistical significance of the gene predictions is calculated directly, without employing training sets. Results: In a test of our methodology on synthetic data modeled on seven Enterobacteriales, four Vibrionales, and three Pasteurellales species, PhyloScan produces better sensitivity and specificity than MONKEY, an advanced scanning approach that also searches a genome for transcription factor binding sites using phylogenetic information. The application of the algorithm to real sequence data from seven Enterobacteriales species identifies novel Crp and PurR transcription factor binding sites, thus providing several new potential sites for these transcription factors. These sites enable targeted experimental validation and thus further delineation of the Crp and PurR regulons in E. coli. Conclusion: Better sensitivity and specificity can be achieved through a combination of (1) using mixed alignable and non-alignable sequence data and (2) combining evidence from multiple sites within an intergenic region.

Journal of Computational Biology | 2008

Significance of gapped sequence alignments.

Lee Aaron Newberg

Measurement of the statistical significance of extreme sequence alignment scores is key to many important applications, but it is difficult. To precisely approximate alignment score significance, we draw random samples directly from a well-chosen, importance-sampling probability distribution. We apply our technique to pairwise local sequence alignment of nucleic acid and amino acid sequences of length up to 1000. For instance, using a BLOSUM62 scoring system for local sequence alignment, we compute that the p-value of a score of 6000 for the alignment of two sequences of length 1000 is (3.4 +/- 0.3) x 10^(-1314). Further, we show that the extreme value significance statistic for the local alignment model that we examine does not follow a Gumbel distribution. A web server for this application is available at http://bayesweb.wadsworth.org/alignmentSignificanceV1/.
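The general idea of importance sampling for a rare tail probability can be sketched on a much simpler problem than gapped alignment. The example below assumes an ungapped score that is a sum of n independent integer scores, and it samples from an exponentially tilted distribution chosen for illustration. The function name, the tilt parameter theta, and the toy scoring scheme are assumptions for this sketch; it does not reproduce the sampling distribution used in the paper.

```python
import math
import random

def tail_probability_is(scores, probs, n, threshold, theta, num_samples, seed=0):
    """Estimate P(S >= threshold), where S is a sum of n i.i.d. scores.

    scores, probs: the possible per-position scores and their null probabilities
    theta:         tilt parameter; samples are drawn with probability
                   proportional to probs[i] * exp(theta * scores[i])
    Each sample landing in the tail contributes its likelihood ratio
    (null probability / sampling probability) = z**n * exp(-theta * S).
    """
    rng = random.Random(seed)
    tilted = [p * math.exp(theta * s) for s, p in zip(scores, probs)]
    z = sum(tilted)  # normalizing constant of the tilted distribution
    tilted = [w / z for w in tilted]
    total = 0.0
    for _ in range(num_samples):
        s = sum(rng.choices(scores, weights=tilted, k=n))
        if s >= threshold:
            total += math.exp(-theta * s) * z ** n
    return total / num_samples

# Toy model: +2 for a match (probability 0.25), -1 otherwise, 10 positions.
# theta = ln(12)/3 makes matches occur with probability 0.8 under the tilt.
estimate = tail_probability_is([2, -1], [0.25, 0.75], n=10, threshold=14,
                               theta=math.log(12) / 3, num_samples=20000, seed=1)
```

Sampling where the rare event is common and reweighting by the likelihood ratio is what allows very small p-values to be estimated with modest sample sizes; naive sampling from the null model would almost never reach the tail.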

BMC Bioinformatics | 2009

Error statistics of hidden Markov model and hidden Boltzmann model results

Lee Aaron Newberg

Background: Hidden Markov models and hidden Boltzmann models are employed in computational biology and a variety of other scientific fields for a variety of analyses of sequential data. Whether the associated algorithms are used to compute an actual probability or, more generally, an odds ratio or some other score, a frequent requirement is that the error statistics of a given score be known. What is the chance that random data would achieve that score or better? What is the chance that a real signal would achieve a given score threshold? Results: Here we present a novel general approach to estimating these false positive and true positive rates that is significantly more efficient than are existing general approaches. We validate the technique via an implementation within the HMMER 3.0 package, which scans DNA or protein sequence databases for patterns of interest, using a profile-HMM. Conclusion: The new approach is faster than general naïve sampling approaches, and more general than other current approaches. It provides an efficient mechanism by which to estimate error statistics for hidden Markov model and hidden Boltzmann model results.

Nucleic Acids Research | 2010

Phyloscan: locating transcription-regulating binding sites in mixed aligned and unaligned sequence data

Michael J. Palumbo; Lee Aaron Newberg

The transcription of a gene from its DNA template into an mRNA molecule is the first, and most heavily regulated, step in gene expression. Especially in bacteria, regulation is typically achieved via the binding of a transcription factor (protein) or small RNA molecule to the chromosomal region upstream of a regulated gene. The protein or RNA molecule recognizes a short, approximately conserved sequence within a gene's promoter region and, by binding to it, either enhances or represses expression of the nearby gene. Since the sought-for motif (pattern) is short and accommodating to variation, computational approaches that scan for binding sites have trouble distinguishing functional sites from look-alikes. Many computational approaches are unable to find the majority of experimentally verified binding sites without also finding many false positives. Phyloscan overcomes this difficulty by exploiting two key features of functional binding sites: (i) these sites are typically more conserved evolutionarily than are non-functional DNA sequences; and (ii) these sites often occur two or more times in the promoter region of a regulated gene. The website is free and open to all users, and there is no login requirement. Address: (http://bayesweb.wadsworth.org/phyloscan/).

Bioinformatics | 2008

Memory-efficient dynamic programming backtrace and pairwise local sequence alignment

Lee Aaron Newberg

Motivation: A backtrace through a dynamic programming algorithm's intermediate results in search of an optimal path, or to sample paths according to an implied probability distribution, or as the second stage of a forward–backward algorithm, is a task of fundamental importance in computational biology. When there is insufficient space to store all intermediate results in high-speed memory (e.g. cache), existing approaches store selected stages of the computation, and recompute missing values from these checkpoints on an as-needed basis. Results: Here we present an optimal checkpointing strategy, and demonstrate its utility with pairwise local sequence alignment of sequences of length 10 000. Availability: Sample C++ code for optimal backtrace is available in the Supplementary Materials. Contact: [email protected] Supplementary information: Supplementary data is available at Bioinformatics online.
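The store-and-recompute approach described in the abstract can be sketched with the simplest possible scheme: checkpoints at a fixed interval. The sketch assumes a generic recurrence in which each stage is computed from the previous one by a caller-supplied `step` function (a hypothetical name); it is not the optimal checkpointing strategy the paper presents, only the baseline idea that strategy improves on.

```python
def stages_in_reverse(initial, step, n, interval):
    """Yield (i, stage_i) for i = n, n-1, ..., 0 using limited memory.

    stage_0 = initial and stage_i = step(stage_{i-1}, i). The forward pass
    keeps only every `interval`-th stage; during the backward pass each
    block between checkpoints is recomputed when it is reached.
    """
    if interval < 1:
        raise ValueError("interval must be at least 1")
    checkpoints = {0: initial}
    stage = initial
    for i in range(1, n + 1):
        stage = step(stage, i)
        if i % interval == 0:
            checkpoints[i] = stage
    start = (n // interval) * interval  # last checkpoint at or before n
    while start >= 0:
        end = min(start + interval - 1, n)
        # Recompute the block [start, end] from its checkpoint.
        block = [checkpoints[start]]
        for i in range(start + 1, end + 1):
            block.append(step(block[-1], i))
        # Hand the block's stages to the caller in reverse order.
        for offset in range(len(block) - 1, -1, -1):
            yield start + offset, block[offset]
        start -= interval

# Toy recurrence: stage_i = stage_{i-1} + i, so stage_i is the i-th triangular number.
reversed_stages = list(stages_in_reverse(0, lambda prev, i: prev + i, n=10, interval=3))
```

With the interval set near the square root of n, this keeps roughly 2 * sqrt(n) stages in memory at the cost of computing each stage about twice.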

Archive | 2012

Finding Protein Binding Sites Using Volunteer Computing Grids

Travis Desell; Lee Aaron Newberg; Malik Magdon-Ismail; Boleslaw K. Szymanski; William A. Thompson

This paper describes initial work in the development of the DNA@Home volunteer computing project, which aims to use Gibbs sampling for the identification and location of DNA control signals on full genome scale data sets. Most current research involving sequence analysis for these control signals involves significantly smaller data sets; however, volunteer computing can provide the necessary computational power to make full genome analysis feasible. A fault-tolerant and asynchronous implementation of Gibbs sampling using the Berkeley Open Infrastructure for Network Computing (BOINC) is presented, which is currently being used to analyze the intergenic regions of the Mycobacterium tuberculosis genome. In only three months of limited operation, the project has had over 1,800 volunteered computing hosts participate and obtains the number of samples required for analysis over 400 times faster than an average computing host for the Mycobacterium tuberculosis dataset. We feel that the preliminary results for this project provide a strong argument for the feasibility and public interest of a volunteer computing project for this type of bioinformatics.
