Mary Ellen Bock
Purdue University
Publication
Featured research published by Mary Ellen Bock.
Journal of Computational Biology | 2000
Alberto Apostolico; Mary Ellen Bock; Stefano Lonardi; Xuyan Xu
Words that are, by some measure, over- or underrepresented in the context of larger sequences have been variously implicated in biological functions and mechanisms. In most approaches to such anomaly detection, the words (up to a certain length) are enumerated more or less exhaustively and are individually checked in terms of observed and expected frequencies, variances, and scores of discrepancy and significance thereof. Here we take the global approach of annotating the suffix tree of a sequence with some such values and scores, having in mind to use it as a collective detector of all unexpected behaviors, or perhaps just as a preliminary filter for words suspicious enough to undergo a more accurate scrutiny. We consider in depth the simple probabilistic model in which sequences are produced by a random source emitting symbols from a known alphabet independently and according to a given distribution. Our main result consists of showing that, within this model, full tree annotations can be carried out in a time-and-space optimal fashion for the mean, variance and some of the adopted measures of significance. This result is achieved by an ad hoc embedding in statistical expressions of the combinatorial structure of the periods of a string. Specifically, we show that the expected value and variance of all substrings in a given sequence of n symbols can be computed and stored in (optimal) O(n^2) overall worst-case, O(n log n) expected time and space. The O(n^2) time bound constitutes an improvement by a linear factor over direct methods. Moreover, we show that under several accepted measures of deviation from expected frequency, the candidate over- or underrepresented words are restricted to the O(n) words that end at internal nodes of a compact suffix tree, as opposed to the Θ(n^2) possible substrings. This surprising fact is a consequence of properties in the form that if a word that ends in the middle of an arc is, say, overrepresented, then its extension to the nearest node of the tree is even more so. Based on this, we design global detectors of favored and unfavored words for our probabilistic framework in overall linear time and space, discuss related software implementations and display the results of preliminary experiments.
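As a rough illustration of the probabilistic model in this abstract (symbols emitted independently from a known distribution), the sketch below compares the observed and expected number of occurrences of a single word. It is a direct, per-word computation, not the suffix-tree annotation the paper describes, and the function names and the example distribution are assumptions made here for illustration.

```python
from math import prod


def count_occurrences(text, word):
    """Count occurrences of `word` in `text`, allowing overlaps."""
    m = len(word)
    return sum(1 for i in range(len(text) - m + 1) if text[i:i + m] == word)


def expected_occurrences(n, word, probs):
    """Expected count of `word` in a random sequence of n i.i.d. symbols.

    Each of the n - m + 1 starting positions matches with probability
    equal to the product of the symbol probabilities of the word.
    """
    m = len(word)
    if m > n:
        return 0.0
    return (n - m + 1) * prod(probs[c] for c in word)


# Hypothetical example: a two-letter alphabet with uniform probabilities.
text = "abab"
probs = {"a": 0.5, "b": 0.5}
observed = count_occurrences(text, "ab")
expected = expected_occurrences(len(text), "ab", probs)
print(observed, expected)  # 2 0.75
```

A word whose observed count is well above its expected count is a candidate for being overrepresented; a proper score would also need the variance, which depends on the periods of the word and is not computed in this sketch.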
research in computational molecular biology | 2002
Alberto Apostolico; Mary Ellen Bock; Stefano Lonardi
The problem of characterizing and detecting recurrent sequence patterns such as substrings or motifs and related associations or rules is variously pursued in order to compress data, unveil structure, infer succinct descriptions, extract and classify features, etc. In molecular biology, exceptionally frequent or rare words in bio-sequences have been implicated in various facets of biological function and structure. The discovery, particularly on a massive scale, of such patterns poses interesting methodological and algorithmic problems, and often exposes scenarios in which tables and synopses grow faster and bigger than the raw sequences they are meant to encapsulate. In previous work, the ability to succinctly compute, store, and display unusual substrings has been linked to a subtle interplay between the combinatorics of the subwords of a word and local monotonicities of some scores used to measure the departure from expectation. In this paper, we carry out an extensive analysis of such monotonicities for a broader variety of scores. This supports the construction of data structures and algorithms capable of performing global detection of unusual substrings in time and space linear in the subject sequences, under various probabilistic models.
Biotechnology Advances | 2013
Paola Bertolazzi; Mary Ellen Bock; Concettina Guerra
A number of interesting issues have been addressed on biological networks about their global and local properties. The connection between the topological properties of proteins in Protein-Protein Interaction (PPI) networks and their biological relevance has been investigated focusing on hubs, i.e. proteins with a large number of interacting partners. We will survey the literature trying to answer the following questions: Do hub proteins have special biological properties? Do they tend to be more essential than non-hub proteins? Are they more evolutionarily conserved? Do they play a central role in modular organization of the protein interaction network? Are there structural properties that characterize hub proteins?
Journal of Computational Biology | 2003
Alberto Apostolico; Mary Ellen Bock; Stefano Lonardi
The problem of characterizing and detecting recurrent sequence patterns such as substrings or motifs and related associations or rules is variously pursued in order to compress data, unveil structure, infer succinct descriptions, extract and classify features, etc. In molecular biology, exceptionally frequent or rare words in bio-sequences have been implicated in various facets of biological function and structure. The discovery, particularly on a massive scale, of such patterns poses interesting methodological and algorithmic problems and often exposes scenarios in which tables and synopses grow faster and bigger than the raw sequences they are meant to encapsulate. In previous work, the ability to succinctly compute, store, and display unusual substrings has been linked to a subtle interplay between the combinatorics of the subwords of a word and local monotonicities of some scores used to measure the departure from expectation. In this paper, we carry out an extensive analysis of such monotonicities for a broader variety of scores. This supports the construction of data structures and algorithms capable of performing global detection of unusual substrings in time and space linear in the subject sequences, under various probabilistic models.
combinatorial pattern matching | 2005
Mary Ellen Bock; Guido M. Cortelazzo; Carlo Ferrari; Concettina Guerra
We apply a spin image representation for 3D objects used in computer vision to the problem of comparing protein surfaces. Due to the irregularities of the protein surfaces, this is a much more complex problem than comparing regular and smooth surfaces. The spin images capture local features in a way that is useful for finding related active sites on the surface of two proteins. They reduce the three-dimensional local information to two dimensions, which is a significant computational advantage. We try to find a collection of pairs of points on the two proteins such that the corresponding members of the pairs for one of the proteins form a surface patch for which the corresponding spin images are a “match”. Preliminary results are presented which demonstrate the feasibility of the method.
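To make the reduction from three dimensions to two concrete, the following sketch builds a spin image in the standard computer-vision sense: each neighboring point is described by its radial distance from the line through an oriented point along its normal (alpha) and its signed height along that normal (beta), and the pairs are accumulated in a 2D histogram. The function name, bin layout, and sample points are assumptions for illustration; the abstract does not specify the parameters used for protein surfaces.

```python
import math


def spin_image(points, p, normal, bin_size, n_bins):
    """Accumulate a 2D histogram of (alpha, beta) around oriented point (p, normal).

    alpha: distance of a point from the line through p along the normal.
    beta:  signed distance of a point from the tangent plane at p.
    Rows index beta (top row = largest beta), columns index alpha.
    """
    length = math.sqrt(sum(c * c for c in normal))
    n = [c / length for c in normal]  # unit normal
    half = n_bins * bin_size / 2.0    # beta range covered is (-half, half]
    image = [[0] * n_bins for _ in range(n_bins)]
    for x in points:
        d = [xi - pi for xi, pi in zip(x, p)]
        beta = sum(di * ni for di, ni in zip(d, n))
        alpha_sq = sum(di * di for di in d) - beta * beta
        alpha = math.sqrt(max(alpha_sq, 0.0))  # guard against rounding error
        row = int((half - beta) // bin_size)
        col = int(alpha // bin_size)
        if 0 <= row < n_bins and 0 <= col < n_bins:
            image[row][col] += 1
    return image


# Hypothetical points around the origin with the normal along the z axis.
pts = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 0.5)]
img = spin_image(pts, (0.0, 0.0, 0.0), (0.0, 0.0, 1.0), 1.0, 4)
for row in img:
    print(row)
```

The first two sample points lie at the same radius and height, so they fall into the same bin even though they sit in different directions around the normal; this rotational invariance is what makes the 2D images comparable across surfaces.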
computational systems bioinformatics | 2007
Mary Ellen Bock; Claudio Garutti; Concettina Guerra
We present a method for detecting and comparing cavities on protein surfaces that is useful for protein binding site recognition. The method is based on a representation of the protein structures by a collection of spin-images and their associated spin-image profiles. Results of the cavity detection procedure are presented for a large set of non-redundant proteins and compared with SURFNET-ConSurf. Our comparison method is used to find a surface region in one cavity of a protein that is geometrically similar to a surface region in the cavity of another protein. Such a finding would be an indication that the two regions likely bind to the same ligand. Our overall approach for cavity detection and comparison is benchmarked on several pairs of known complexes, obtaining a good coverage of the atoms of the binding sites.
Theoretical Computer Science | 2008
Mary Ellen Bock; Claudio Garutti; Concettina Guerra
We developed a suite of methods for the problem of protein binding site recognition, based on a representation of the protein structures by a collection of spin-images. A procedure for cavity detection is coupled with a method previously developed for the recognition of similar regions in two proteins, and applied to the comparison of two protein cavities, the all-to-all pairwise comparison of a set of cavities, and the recognition of multiple binding sites in one cavity. All the presented methods can be used to screen large collections of proteins. The detection of cavities in a given protein is often the preliminary step in protein binding site recognition, since binding sites usually lie in cavities. The comparison of two cavities identifies two similar regions in the two cavities, and hints at a common functional structure when one or both regions include a binding site. The results of the all-to-all pairwise comparison of a set of cavities are clustered according to the measure of similarity of the cavities, yielding a clustering that groups together cavities with the same binding sites when their structures are similar enough. Recognition of multiple binding sites in one cavity is performed by comparing a cavity, called the background cavity, with a data set of cavities, and clustering its residues that match the residues of other cavities in the data set. The four methods are benchmarked on different databases, and their effectiveness is discussed.
digital identity management | 1999
Mary Ellen Bock; Concettina Guerra
We present a novel geometric approach to extract planes from sets of 3D points. For a set with n points the algorithm has an O(n^3 log n) time complexity. We also discuss an implementation of the algorithm for range image segmentation. The performance of the new range image segmentation algorithm is compared to other existing methods.
Proceedings. Compression and Complexity of SEQUENCES 1997 | 1997
Alberto Apostolico; Mary Ellen Bock; Xuyan Xu
A statistical index for string x is a digital-search tree or trie that returns, for any query string ω and in a number of comparisons bounded by the length of ω, the number of occurrences of ω in x. Clever algorithms are available that support the construction and weighting of such indices in time and space linear in the length of x. This paper addresses the problem of annotating a statistical index with such parameters as the expected value and variance of the number of occurrences of each substring.
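The query behavior of such an index can be sketched with a naive, uncompacted suffix trie whose nodes carry occurrence counts. This is only an illustration of the lookup described above: the construction below takes quadratic time and space, unlike the linear-time compact indices the abstract refers to, and the class name `CountingTrie` is hypothetical.

```python
class CountingTrie:
    """Naive suffix trie of a text; each node stores how many suffixes pass through it."""

    def __init__(self, text):
        # The root maps a symbol to {"count": int, "next": dict of children}.
        self.root = {}
        for start in range(len(text)):
            node = self.root
            for ch in text[start:]:
                entry = node.setdefault(ch, {"count": 0, "next": {}})
                entry["count"] += 1
                node = entry["next"]

    def count(self, query):
        """Return the number of occurrences of `query`, using one step per query symbol.

        An empty query returns 0 by convention in this sketch.
        """
        node = self.root
        entry = None
        for ch in query:
            entry = node.get(ch)
            if entry is None:
                return 0
            node = entry["next"]
        return entry["count"] if entry is not None else 0


index = CountingTrie("banana")
print(index.count("ana"), index.count("a"), index.count("x"))  # 2 3 0
```

Each query walks at most one node per symbol, so the number of comparisons is bounded by the query length, independent of the length of the indexed text.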
data compression conference | 1999
Alberto Apostolico; Mary Ellen Bock; Stefano Lonardi
The identification of strings that are, by some measure, redundant or rare in the context of larger sequences is an implicit goal of any data compression method. In the straightforward approach to searching for unusual substrings, the words (up to a certain length) are enumerated more or less exhaustively and individually checked in terms of observed and expected frequencies, variances, and scores of discrepancy and significance thereof. As is well known, clever methods are available to compute and organize the counts of occurrences of all substrings of a given string. The corresponding tables take up the tree-like structure of a special kind of digital search index or trie. We show here that under several accepted measures of deviation from expected frequency, the candidate over- or under-represented words are restricted to the O(n) words that end at internal nodes of a compact suffix tree, as opposed to the Θ(n^2) possible substrings. This surprising fact is a consequence of properties in the form that if a word that ends in the middle of an arc is, say, over-represented, then its extension to the nearest node of the tree is even more so. Based on this, we design global linear detectors of favored and unfavored words for our probabilistic framework, and display the results of some preliminary experiments that apply our constructions to the analysis of genomic sequences.