Mikhail A. Roytberg | Researchain


Publication

Featured research published by Mikhail A. Roytberg.

Journal of Bioinformatics and Computational Biology | 2006

A Unifying Framework for Seed Sensitivity and Its Application to Subset Seeds

Gregory Kucherov; Laurent Noé; Mikhail A. Roytberg

We propose a general approach to computing seed sensitivity that can be applied to different definitions of seeds. It treats separately three components of the seed sensitivity problem (a set of target alignments, an associated probability distribution, and a seed model), which are specified by distinct finite automata. The approach is then applied to a new concept of subset seeds, for which we propose an efficient automaton construction. Experimental results confirm that sensitive subset seeds can be efficiently designed using our approach, and can then be used in similarity search, producing better results than ordinary spaced seeds.
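
To make the notion of seed sensitivity concrete, here is a minimal Python sketch. It does not reproduce the automaton framework of the paper; it swaps in a plain dynamic program over the last few alignment columns, and it assumes the simplest setting: a single spaced seed written as a string of `#` (must match) and `-` (don't care), and a Bernoulli alignment model in which each of the `n` columns is a match with probability `p`. The function name `seed_sensitivity` is hypothetical.

```python
from itertools import product


def seed_sensitivity(seed: str, n: int, p: float) -> float:
    """Probability that `seed` hits a random gapless alignment of length n.

    Assumptions (not from the paper): Bernoulli model with match
    probability p; '#' is a required match, any other character is a
    don't-care position.
    """
    span = len(seed)
    required = [i for i, c in enumerate(seed) if c == "#"]
    if n < span:
        return 0.0  # the seed does not fit into the alignment at all

    # State = the last span-1 alignment columns (1 = match, 0 = mismatch).
    # We track the probability mass of prefixes that have NOT been hit yet.
    no_hit = {}
    for bits in product((0, 1), repeat=span - 1):
        prob = 1.0
        for b in bits:
            prob *= p if b else 1.0 - p
        no_hit[bits] = prob

    for _ in range(n - span + 1):
        nxt = {}
        for bits, prob in no_hit.items():
            for b in (0, 1):
                window = bits + (b,)
                if all(window[i] for i in required):
                    continue  # seed hit here: leaves the no-hit mass
                key = window[1:]
                nxt[key] = nxt.get(key, 0.0) + prob * (p if b else 1.0 - p)
        no_hit = nxt

    return 1.0 - sum(no_hit.values())
```

The state space grows as 2^(span-1), so this sketch is only practical for short seeds; the appeal of an automaton-based formulation is precisely that it can keep the number of states much smaller.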

IEEE/ACM Transactions on Computational Biology and Bioinformatics | 2005

Multiseed Lossless Filtration

Gregory Kucherov; Laurent Noé; Mikhail A. Roytberg

We study a method of seed-based lossless filtration for approximate string matching and related bioinformatics applications. The method is based on the simultaneous use of several spaced seeds rather than a single seed, as studied by Burkhardt and Kärkkäinen. We present algorithms to compute several important parameters of seed families, study their combinatorial properties, and describe several techniques to construct efficient families. We also report a large-scale application of the proposed technique to the problem of oligonucleotide selection for an EST sequence database.
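
The lossless property can be illustrated with a brute-force check. The sketch below is not one of the algorithms from the paper; it simply enumerates every placement of `k` mismatches in a gapless alignment of length `m` and tests whether at least one seed of the family still fits. Seeds are assumed to be strings of `#` (must match) and `-` (don't care), and the names `seed_hits` and `is_lossless` are hypothetical.

```python
from itertools import combinations


def seed_hits(seed: str, matches: list, m: int) -> bool:
    """True if `seed` fits somewhere in the match/mismatch pattern."""
    required = [i for i, c in enumerate(seed) if c == "#"]
    return any(
        all(matches[start + i] for i in required)
        for start in range(m - len(seed) + 1)
    )


def is_lossless(seeds: list, m: int, k: int) -> bool:
    """Check that the seed family detects every alignment of length m
    with at most k mismatches.

    Checking exactly k mismatches is enough: removing a mismatch can
    only add matching positions, so it never destroys a hit.
    """
    for mismatches in combinations(range(m), min(k, m)):
        matches = [True] * m
        for pos in mismatches:
            matches[pos] = False
        if not any(seed_hits(s, matches, m) for s in seeds):
            return False  # this mismatch placement escapes every seed
    return True
```

For example, the contiguous seed `###` is lossless for m = 9, k = 2 by a pigeonhole argument (seven matches split into at most three runs), but not for m = 8, k = 2. The enumeration is exponential in `k`, so this is only a way to experiment with small cases.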

PLOS Computational Biology | 2005

Analysis of Sequence Conservation at Nucleotide Resolution

Saurabh Asthana; Mikhail A. Roytberg; John A. Stamatoyannopoulos; Shamil R. Sunyaev

One of the major goals of comparative genomics is to understand the evolutionary history of each nucleotide in the human genome sequence, and the degree to which it is under selective pressure. Ascertainment of selective constraint at nucleotide resolution is particularly important for predicting the functional significance of human genetic variation and for analyzing the sequence substructure of cis-regulatory sequences and other functional elements. Current methods for analysis of sequence conservation are focused on delineation of conserved regions comprising tens or even hundreds of consecutive nucleotides. We therefore developed a novel computational approach designed specifically for scoring evolutionary conservation at individual base-pair resolution. Our approach estimates the rate at which each nucleotide position is evolving, computes the probability of neutrality given this rate estimate, and summarizes the result in a Sequence CONservation Evaluation (SCONE) score. We computed SCONE scores in a continuous fashion across 1% of the human genome for which high-quality sequence information from up to 23 genomes is available. We show that SCONE scores are clearly correlated with the allele frequency of human polymorphisms in both coding and noncoding regions. We find that the majority of noncoding conserved nucleotides lie outside of longer conserved elements predicted by other conservation analyses, and are experiencing ongoing selection in modern humans, as is evident from the allele frequency spectrum of human polymorphism. We also applied SCONE to analyze the distribution of conserved nucleotides within functional regions. These regions are markedly enriched in individually conserved positions and short (<15 bp) conserved “chunks.” Our results collectively suggest that the majority of functionally important noncoding conserved positions are highly fragmented and reside outside of canonically defined long conserved noncoding sequences.
A small subset of these fragmented positions may be identified with high confidence.

Bioinformatics | 2002

OWEN: aligning long collinear regions of genomes

Aleksey Y. Ogurtsov; Mikhail A. Roytberg; Svetlana A. Shabalina; Alexey S. Kondrashov

OWEN is an interactive tool for aligning two long DNA sequences that represents similarity between them by a chain of collinear local similarities. OWEN employs several methods for constructing and editing local similarities and for resolving conflicts between them. Alignments of sequences of lengths over $10^6$ can often be produced in minutes. OWEN requires memory below $20L$, where $L$ is the sum of lengths of the compared sequences.
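
The idea of a chain of collinear local similarities can be sketched as follows. This is not OWEN's own procedure, which the abstract does not spell out; it is a generic quadratic-time chaining dynamic program. Each local similarity is assumed to be a tuple `(x_start, x_end, y_start, y_end, score)` with half-open coordinates, and the function name `best_chain` is hypothetical.

```python
def best_chain(hits):
    """Return the highest-scoring chain of collinear local similarities.

    Two hits are collinear if one ends before the other starts in BOTH
    sequences. Simple O(n^2) dynamic programming; a generic sketch, not
    OWEN's algorithm.
    """
    if not hits:
        return []

    # Process hits by start in the first sequence, so that every possible
    # predecessor of a hit has already been scored.
    order = sorted(range(len(hits)), key=lambda i: (hits[i][0], hits[i][2]))
    best = {}  # best chain score ending at hit i
    prev = {}  # predecessor of hit i in that chain

    for idx, i in enumerate(order):
        xs, _, ys, _, score = hits[i]
        best[i], prev[i] = score, None
        for j in order[:idx]:
            _, pxe, _, pye, _ = hits[j]
            if pxe <= xs and pye <= ys and best[j] + score > best[i]:
                best[i], prev[i] = best[j] + score, j

    # Trace back from the best chain end.
    end = max(best, key=best.get)
    chain = []
    while end is not None:
        chain.append(hits[end])
        end = prev[end]
    return chain[::-1]
```

Hits that overlap or cross a chosen neighbour in either sequence are simply left out of the chain, which is one elementary way of resolving conflicts between local similarities.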

Journal of Computational Biology | 2000

DNA segmentation through the Bayesian approach

Vasily Ramensky; V. Ju. Makeev; Mikhail A. Roytberg; V. G. Tumanyan

We present a new approach to DNA segmentation into compositionally homogeneous blocks. The Bayesian estimator, which is applicable for both short and long segments, is used to obtain the measure of homogeneity. An exact optimal segmentation is found via the dynamic programming technique. After completion of the segmentation procedure, the sequence composition on different scales can be analyzed with filtration of boundaries via the partition function approach.
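
The dynamic-programming step can be sketched in a few lines of Python. The segment score used here is an assumption, not the estimator from the paper: each block is scored by the log marginal likelihood of its sequence under a multinomial model with a uniform Dirichlet prior over the four nucleotides. The names `log_marginal` and `segment` are hypothetical.

```python
from math import lgamma

ALPHABET = "ACGT"


def log_marginal(counts):
    """Log marginal likelihood of a block with the given letter counts,
    under a uniform Dirichlet prior (an assumed score, see lead-in)."""
    n, k = sum(counts), len(counts)
    return lgamma(k) - lgamma(n + k) + sum(lgamma(c + 1) for c in counts)


def segment(seq: str):
    """Return the block boundaries of the best-scoring segmentation.

    best[j] is the best total score of any segmentation of seq[:j];
    the O(n^2) recurrence tries every start i of the last block.
    """
    n = len(seq)
    # prefix[j][a] = number of occurrences of letter a in seq[:j]
    prefix = [[0] * len(ALPHABET)]
    for ch in seq:
        row = prefix[-1][:]
        row[ALPHABET.index(ch)] += 1
        prefix.append(row)

    best = [0.0] + [float("-inf")] * n
    back = [0] * (n + 1)
    for j in range(1, n + 1):
        for i in range(j):
            counts = [prefix[j][a] - prefix[i][a] for a in range(len(ALPHABET))]
            score = best[i] + log_marginal(counts)
            if score > best[j]:
                best[j], back[j] = score, i

    # Trace back the block starts; position 0 is not a boundary.
    boundaries = []
    j = n
    while j > 0:
        j = back[j]
        if j > 0:
            boundaries.append(j)
    return boundaries[::-1]
```

With this score, the sequence `AAAAAAAACCCCCCCC` is split once, at position 8. The marginal likelihood penalizes short blocks on its own, so no explicit per-segment penalty is added in this sketch.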

Algorithms for Molecular Biology | 2007

Exact p-value calculation for heterotypic clusters of regulatory motifs and its application in computational annotation of cis-regulatory modules

Valentina Boeva; Julien Clement; Mireille Régnier; Mikhail A. Roytberg; Vsevolod J. Makeev

Background: cis-Regulatory modules (CRMs) of eukaryotic genes often contain multiple binding sites for transcription factors. The phenomenon that binding sites form clusters in CRMs is exploited in many algorithms to locate CRMs in a genome. This gives rise to the problem of calculating the statistical significance of the event that multiple sites, recognized by different factors, would be found simultaneously in a text of a fixed length. The main difficulty comes from overlapping occurrences of motifs. So far, no tools have been developed allowing the computation of p-values for simultaneous occurrences of different motifs which can overlap. Results: We developed and implemented an algorithm computing the p-value that $s$ different motifs occur respectively $k_1, \ldots, k_s$ or more times, possibly overlapping, in a random text. Motifs can be represented with a majority of popular motif models, but in all cases, without indels. Zero or first order Markov chains can be adopted as a model for the random text. The computational tool was tested on the set of cis-regulatory modules involved in D. melanogaster early development, for which there exists an annotation of binding sites for transcription factors. Our test allowed us to correctly identify transcription factors cooperatively/competitively binding to DNA. Method: The algorithm that precisely computes the probability of simultaneous motif occurrences is inspired by the Aho-Corasick automaton and employs a prefix tree together with a transition function. The algorithm runs with $O(n|\Sigma|(m|\mathcal{H}| + K|\sigma|^K)\prod_i k_i)$ time complexity, where $n$ is the length of the text, $|\Sigma|$ is the alphabet size, $m$ is the maximal motif length, $|\mathcal{H}|$ is the total number of words in motifs, $K$ is the order of the Markov model, and $k_i$ is the number of occurrences of the $i$-th motif. Conclusion: The primary objective of the program is to assess the likelihood that a given DNA segment is CRM regulated with a known set of regulatory factors. In addition, the program can also be used to select the appropriate threshold for PWM scanning. Another application is assessing similarity of different motifs. Availability: Project web page, stand-alone version and documentation can be found at http://bioinform.genetika.ru/AhoPro/
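
A heavily simplified version of this computation is sketched below. It is not the AhoPro implementation: it assumes a single motif given as an explicit set of words (duplicates would be counted twice), a zero-order model with per-letter probabilities `probs`, and words that use only letters from `probs`. It shows the core idea of running a dynamic program over pairs (Aho-Corasick state, number of occurrences so far). The names `build_automaton` and `pvalue` are hypothetical.

```python
from collections import deque


def build_automaton(words, alphabet):
    """Aho-Corasick automaton with a complete transition function.

    out[s] = number of motif words ending at state s (overlaps included).
    """
    goto, out = [{}], [0]
    for w in words:
        s = 0
        for ch in w:
            if ch not in goto[s]:
                goto.append({})
                out.append(0)
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
        out[s] += 1

    fail = [0] * len(goto)
    queue = deque()
    for ch in alphabet:
        if ch in goto[0]:
            queue.append(goto[0][ch])
        else:
            goto[0][ch] = 0
    while queue:  # breadth-first, so failure targets are already complete
        s = queue.popleft()
        for ch in alphabet:
            if ch in goto[s]:
                t = goto[s][ch]
                fail[t] = goto[fail[s]][ch]
                out[t] += out[fail[t]]
                queue.append(t)
            else:
                goto[s][ch] = goto[fail[s]][ch]
    return goto, out


def pvalue(words, k, n, probs):
    """P(at least k possibly overlapping occurrences in a random text of
    length n), letters drawn independently with probabilities `probs`."""
    goto, out = build_automaton(words, list(probs))
    dist = {(0, 0): 1.0}  # (automaton state, count capped at k) -> probability
    for _ in range(n):
        nxt = {}
        for (s, c), pr in dist.items():
            for ch, p in probs.items():
                t = goto[s][ch]
                key = (t, min(k, c + out[t]))
                nxt[key] = nxt.get(key, 0.0) + pr * p
        dist = nxt
    return sum(pr for (_, c), pr in dist.items() if c >= k)
```

For the word `AA` over a uniform four-letter alphabet and a text of length 3, this gives 7/64 for at least one occurrence and 1/64 for at least two. Handling several motifs at once would require one counter per motif in the state, which is where the product of the $k_i$ in the complexity bound comes from.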

Bioinformatics | 2006

Analysis of internal loops within the RNA secondary structure in almost quadratic time

Aleksey Y. Ogurtsov; Svetlana A. Shabalina; Alexey S. Kondrashov; Mikhail A. Roytberg

MOTIVATION: Evaluating all possible internal loops is one of the key steps in predicting the optimal secondary structure of an RNA molecule. The best algorithm available runs in time $O(L^3)$, where $L$ is the length of the RNA. RESULTS: We propose a new algorithm for evaluating internal loops; its run-time is $O(M \log^2 L)$, where $M < L^2$ is the number of possible nucleotide pairings. We created a software tool, Afold, which predicts the optimal secondary structure of RNA molecules of lengths up to 28 000 nt, using a computer with 2 Gb RAM. We also propose algorithms constructing sets of conditionally optimal multi-branch loop free (MLF) structures, e.g. the set that for every possible pairing (x, y) contains an optimal MLF structure in which nucleotides x and y form a pair. All the algorithms have run-time $O(M \log^2 L)$.

Proteins | 2003

From analysis of protein structural alignments toward a novel approach to align protein sequences

Shamil R. Sunyaev; Gennady A. Bogopolsky; Natalia V. Oleynikova; Peter K. Vlasov; Alexei V. Finkelstein; Mikhail A. Roytberg

Alignment of protein sequences is a key step in most computational methods for prediction of protein function and homology‐based modeling of three‐dimensional (3D) structure. We investigated correspondence between “gold standard” alignments of 3D protein structures and the sequence alignments produced by the Smith–Waterman algorithm, currently the most sensitive method for pair‐wise alignment of sequences. The results of this analysis enabled development of a novel method to align a pair of protein sequences. The comparison of the Smith–Waterman and structure alignments focused on their inner structure and especially on the continuous ungapped alignment segments, “islands” between gaps. Approximately one third of the islands in the gold standard alignments have negative or low positive score, and their recognition is below the sensitivity limit of the Smith–Waterman algorithm. From the alignment accuracy perspective, the time spent by the algorithm while working in these unalignable regions is unnecessary. We considered features of the standard similarity scoring function responsible for this phenomenon and suggested an alternative hierarchical algorithm, which explicitly addresses high scoring regions. This algorithm is considerably faster than the Smith–Waterman algorithm, whereas the resulting alignments are on average of the same quality with respect to the gold standard. This finding shows that a decrease in alignment accuracy is not necessarily the price of computational efficiency.

IEEE/ACM Transactions on Computational Biology and Bioinformatics | 2009

On Subset Seeds for Protein Alignment

Mikhail A. Roytberg; Anna Gambin; Laurent Noé; Sławomir Lasota; Eugenia Furletova; Ewa Szczurek; Gregory Kucherov

We apply the concept of subset seeds proposed in [1] to similarity search in protein sequences. The main question studied is the design of efficient seed alphabets to construct seeds with optimal sensitivity/selectivity trade-offs. We propose several different design methods and use them to construct several alphabets. We then perform a comparative analysis of seeds built over those alphabets and compare them with the standard BLASTP seeding method [2], [3], as well as with the family of vector seeds proposed in [4]. While the formalism of subset seeds is less expressive (but less costly to implement) than the cumulative principle used in BLASTP and vector seeds, our seeds show a similar or even better performance than BLASTP on Bernoulli models of proteins compatible with the common BLOSUM62 matrix. Finally, we perform a large-scale benchmarking of our seeds against several main databases of protein alignments. Here again, the results show a comparable or better performance of our seeds versus BLASTP.

Combinatorial Pattern Matching | 2004

Multi-seed Lossless Filtration

Gregory Kucherov; Laurent Noé; Mikhail A. Roytberg

We study a method of seed-based lossless filtration for approximate string matching and related applications. The method is based on the simultaneous use of several spaced seeds rather than a single seed, as studied by Burkhardt and Kärkkäinen [1]. We present algorithms to compute several important parameters of seed families, study their combinatorial properties, and describe several techniques to construct efficient families. We also report a large-scale application of the proposed technique to the problem of oligonucleotide selection for an EST sequence database.
