Richard Beal
West Virginia University
Publication
Featured research published by Richard Beal.
Journal of Discrete Algorithms | 2012
Richard Beal; Donald A. Adjeroh
The challenge of direct parameterized suffix sorting (p-suffix sorting) for a parameterized string (p-string), say T of length n, is the dynamic nature of the n parameterized suffixes (p-suffixes) of T. In this work, we propose transformative approaches to direct p-suffix sorting by generating and sorting lexicographically numeric fingerprints and arithmetic codes that correspond to individual p-suffixes. Our algorithm to p-suffix sort via fingerprints is the first theoretical linear time algorithm for p-suffix sorting for non-binary parameter alphabets, which assumes that, in practice, all codes are within the range of an integral data type. We eliminate the key problems of fingerprints by introducing an algorithm that exploits the ordering of arithmetic codes to sort p-suffixes in linear time on average. The arithmetic coding approach is further extended to handle p-strings in the worst case. This algorithm is the first direct p-suffix sorting approach in theory to execute in o(n^2) time in the worst case, which improves on the best known theoretical result on this problem that sorts p-suffixes based on p-suffix classifications in O(n^2) time. We show that, based on the algorithmic parameters and the input data, our algorithm does indeed execute in linear time in various cases, which is confirmed with experimental results.
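To see why p-suffixes are "dynamic", the following minimal Python sketch (illustrative only, with assumed alphabets and an arbitrary tie-breaking order, not the paper's fingerprint or arithmetic-coding algorithm) uses the standard prev() encoding from the p-string literature: the encoding of a suffix is not a suffix of the encoding of T, so a naive approach must re-encode and re-sort every suffix.

```python
def prev_encode(s, constants):
    """Encode a p-string: constants stay as-is, each parameter symbol is replaced
    by the distance to its previous occurrence (0 for a first occurrence)."""
    last = {}
    out = []
    for i, c in enumerate(s):
        if c in constants:
            out.append(c)
        else:
            out.append(i - last[c] if c in last else 0)
            last[c] = i
    return tuple(out)

def p_suffix_sort_naive(t, constants):
    """Naive p-suffix sorting: every suffix is re-encoded from scratch, which is
    exactly the quadratic cost that direct p-suffix sorting tries to avoid."""
    n = len(t)
    suffixes = [(prev_encode(t[i:], constants), i) for i in range(n)]
    # mixed int/str tuples cannot be compared directly; map to a comparable key
    key = lambda enc: tuple((0, x) if isinstance(x, int) else (1, x) for x in enc)
    suffixes.sort(key=lambda pair: key(pair[0]))
    return [i for _, i in suffixes]

# Example: constants {'a','b'}, parameters {'x','y'}
print(p_suffix_sort_naive("axbxya", {'a', 'b'}))  # e.g. [3, 4, 1, 5, 0, 2]
```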
Journal of Discrete Algorithms | 2012
Richard Beal; Donald A. Adjeroh
The parameterized longest previous factor (pLPF) problem as defined for parameterized strings (p-strings) adds a level of parameterization to the longest previous factor (LPF) problem originally defined for traditional strings. In this work, we consider the construction of the pLPF data structure and identify the strong relationship between the pLPF linear time construction and several variations of the problem. Initially, we propose a taxonomy of classes for longest factor problems. Using this taxonomy, we show an interesting connection between the pLPF and popular data structures. It is shown that a subset of longest factor problems may be created with the pLPF construction. More specifically, the pLPF problem is used as a foundation to achieve the linear time construction of popular data structures such as the LCP, parameterized-LCP (pLCP), parameterized-border (p-border) array, and border array. We further generalize the permuted-LCP for p-strings and provide a linear time construction. A number of new variations of the pLPF problem are proposed and addressed in linear time for both p-strings and traditional strings, including the longest not-equal factor (LneF), longest reverse factor (LrF), and longest factor (LF). The framework of the pLPF construction is exploited to efficiently address a multitude of data structures with prospects in various applications. Finally, we implement our algorithms and perform various experiments to confirm theoretical results.
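As background for the longest-factor taxonomy, the sketch below gives a definition-level, quadratic computation of the classical LPF array for a traditional string (the paper's contribution is the linear-time pLPF framework; this is only meant to make the quantity being generalized concrete).

```python
def lpf_naive(s):
    """Quadratic, definition-level LPF: lpf[i] is the length of the longest
    prefix of s[i:] that also begins at some earlier position j < i."""
    n = len(s)
    lpf = [0] * n
    for i in range(n):
        for j in range(i):
            k = 0
            while i + k < n and s[j + k] == s[i + k]:
                k += 1
            lpf[i] = max(lpf[i], k)
    return lpf

print(lpf_naive("abaab"))  # [0, 0, 1, 2, 1]
```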
Theoretical Computer Science | 2016
Richard Beal; Donald A. Adjeroh
Pattern matching between traditional strings is well-defined for both uncompressed and compressed sequences. Prior to this work, parameterized pattern matching (p-matching) was defined predominately by the matching between uncompressed parameterized strings (p-strings) from the constant alphabet Σ and the parameter alphabet Π. In this work, we define the compressed parameterized pattern matching (compressed p-matching) problem to find all of the p-matches between a pattern P and text T, using only P and the compressed text Tc. Initially, we present parameterized compression (p-compression) as a new way to losslessly compress data. Experimentally, we show that p-compression is competitive with various other standard compression schemes. Subsequently, we provide the compression and decompression algorithms. Next, two different approaches are developed to address the compressed p-matching problem: (1) using the recently proposed parameterized arithmetic codes (pAC) and (2) using the parameterized border array (p-border). Our general solution is independent of the underlying compression scheme. The results are further examined for catenate, Tunstall codes, Huffman codes, and LZSS.
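For readers new to p-matching, a p-match between two equal-length strings requires identical constants and a one-to-one renaming between their parameter symbols. The sketch below is a definition-level check with assumed alphabets, not the paper's compressed algorithm, but it makes the matching relation concrete.

```python
def p_match(u, v, constants):
    """Check whether equal-length strings u and v p-match: constants must be
    identical, and the parameter symbols must be related by a bijection."""
    if len(u) != len(v):
        return False
    fwd, bwd = {}, {}
    for a, b in zip(u, v):
        if (a in constants) != (b in constants):
            return False
        if a in constants:
            if a != b:
                return False
        elif fwd.setdefault(a, b) != b or bwd.setdefault(b, a) != a:
            return False
    return True

# 'x' and 'y'/'z' are parameters; 'a' and 'b' are constants
print(p_match("axbx", "ayby", {'a', 'b'}))  # True: rename x -> y
print(p_match("axbx", "aybz", {'a', 'b'}))  # False: x cannot map to both y and z
```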
Theoretical Computer Science | 2015
Richard Beal; Donald A. Adjeroh
We propose efficient methods to address key pattern matching problems in RNA secondary structures using the notion of structural strings. A structural string (s-string) is composed of constant symbols and parameter symbols from the alphabets Σ and Π, respectively. An individual symbol in the Π alphabet may be considered a complement of another unique symbol in Π. The notion of matching constants, parameters, and complements is referred to as the structural matching (s-match) problem, which is helpful in matching RNA and was previously solved by the structural suffix tree (sST). Other approaches to RNA matching that do not openly consider the s-match include the use of affix data structures. In this paper, we provide new data structures and algorithms to address the s-match problem. Specifically, we introduce the structural suffix array and structural longest common prefix array and then identify how to s-match with these data structures. Our new s-matching solution is then used as the framework to answer various combinatorial queries encountered in matching RNA secondary structures.
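The s-match condition can be pictured with a small, definition-level check (a simplified reading for illustration, not the paper's suffix-array machinery): constants must agree, and the parameter renaming must be one-to-one and consistent with the complement map, here taken to be the Watson-Crick pairs.

```python
# Illustrative complement map over the parameter alphabet (Watson-Crick pairs).
COMPLEMENT = {'A': 'U', 'U': 'A', 'C': 'G', 'G': 'C'}

def s_match(u, v, constants, comp=COMPLEMENT):
    """Equal-length s-match check: constants must agree, and the parameter
    renaming must be a bijection that also commutes with the complement map."""
    if len(u) != len(v):
        return False
    fwd = {}
    for a, b in zip(u, v):
        if (a in constants) != (b in constants):
            return False
        if a in constants:
            if a != b:
                return False
            continue
        # the renaming must be consistent for the symbol and for its complement
        for x, y in ((a, b), (comp[a], comp[b])):
            if fwd.setdefault(x, y) != y:
                return False
    # bijection check: no two parameters may map to the same target
    return len(set(fwd.values())) == len(fwd)

print(s_match("GCAU", "CGUA", constants=set()))  # True: G<->C and A<->U renaming
```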
data compression conference | 2013
Richard Beal; Donald A. Adjeroh
Traditional pattern matching between strings, from the alphabet Σ, is well defined for both uncompressed and compressed sequences. Prior to this work, parameterized pattern matching (p-matching) was defined predominately by the matching between uncompressed parameterized strings (p-strings) from the constant alphabet Σ and the parameter alphabet Π. In this work, we define the compressed parameterized pattern matching (compressed p-matching) problem to find all of the p-matches between a pattern P and text T, using only P and the compressed text Tc. Initially, we present parameterized compression (p-compression) as a new way to losslessly compress data to support p-matching. Experimentally, we show that p-compression is competitive with various other standard compression schemes. Subsequently, we provide the compression and decompression algorithms. Using p-compression, we address the compressed p-matching problem. Our general solution is independent of the underlying compression scheme. The results are further examined for the specific case of Tunstall codes.
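Since Tunstall codes are the specific case examined here, the standalone sketch below builds a small Tunstall dictionary (variable-length source words mapped to fixed-length codewords). It illustrates the coder itself under assumed symbol probabilities, not the paper's p-compression pipeline.

```python
import heapq

def tunstall_dictionary(probs, codeword_bits):
    """Build a Tunstall dictionary: repeatedly expand the most probable word by
    every alphabet symbol until no further expansion fits in 2**codeword_bits slots."""
    symbols = list(probs)
    # max-heap on word probability (negated for heapq's min-heap)
    heap = [(-p, s) for s, p in probs.items()]
    heapq.heapify(heap)
    size = len(symbols)
    while size + len(symbols) - 1 <= 2 ** codeword_bits:
        neg_p, word = heapq.heappop(heap)
        for s in symbols:
            heapq.heappush(heap, (neg_p * probs[s], word + s))
        size += len(symbols) - 1
    return sorted(word for _, word in heap)

print(tunstall_dictionary({'a': 0.7, 'b': 0.3}, codeword_bits=2))
# ['aaa', 'aab', 'ab', 'b']: 4 source words, each mapped to a 2-bit codeword
```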
Journal of Discrete Algorithms | 2013
Richard Beal; Donald A. Adjeroh
The border and parameterized border (p-border) arrays are data structures used in pattern matching applications for traditional strings from the constant alphabet Σ, and parameterized strings (p-strings) from the constant alphabet Σ and the parameter alphabet Π. In this work, we introduce the structural border array (s-border) as defined for an n-length structural string (s-string) T. The s-string is a p-string with the existence of symbol complements in some alphabet Γ. These different alphabets add to both the intricacies and capabilities of pattern matching. For example, the s-string can handle the Watson-Crick base pairings in biological sequences and thus, assists in applications that deal with efficient pattern matching between RNA strands that share similar secondary structures. Initially, we provide a construction that executes in O(n^2) time to build the s-border array. The paper establishes theory to improve the result to O(n) by proving particular properties of the s-border data structure. This result is significant because of the generalization of the s-string, which is a step beyond the p-string. Using the same construction algorithm, we show how to modify the s-string alphabets to also construct the p-border and the traditional border arrays in linear time. The generality of the s-border construction algorithm motivates us to devise pattern matching algorithms for s-matching, p-matching, and traditional matching. Our pattern matching algorithms are ultimately used to address the p-match problem with run-length encoded strings.
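For context, the classical border array for a traditional string can be built in linear time with the KMP failure-function recurrence; the p-border and s-border constructions, roughly speaking, replace the plain symbol comparison with prev-/complement-aware comparisons, which is where the paper's analysis lies. A minimal sketch of the classical case:

```python
def border_array(s):
    """Classical linear-time border array: border[i] is the length of the
    longest proper prefix of s[:i+1] that is also a suffix of it."""
    n = len(s)
    border = [0] * n
    k = 0
    for i in range(1, n):
        while k > 0 and s[i] != s[k]:
            k = border[k - 1]     # fall back to the next shorter border
        if s[i] == s[k]:
            k += 1
        border[i] = k
    return border

print(border_array("abaabab"))  # [0, 0, 1, 1, 2, 3, 2]
```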
international workshop on combinatorial algorithms | 2012
Richard Beal; Donald A. Adjeroh
The border and parameterized border (p-border) arrays are data structures used in pattern matching applications for traditional strings from the constant alphabet Σ and parameterized strings (p-strings) from the constant alphabet Σ and the parameter alphabet Π. In this work, we introduce the structural border (s-border) array as defined for an n-length structural string (s-string) T. The s-string is a p-string with the existence of symbol complements in some alphabet Γ. These different alphabets add to both the intricacies and capabilities of pattern matching. Initially, we provide a construction that executes in O(n^2) time to build the s-border array. The paper establishes theory to improve the result to O(n) by proving particular properties of the s-border data structure. This result is significant because of the generalization of the s-string, which is a step beyond the p-string. Using the same construction algorithm, we show how to modify the s-string alphabets to also construct the p-border and the traditional border arrays in linear time.
BMC Genomics | 2016
Richard Beal; Tazin Afrin; Aliya Farheen; Donald A. Adjeroh
Background: The longest common subsequence (LCS) problem is a classical problem in computer science, and forms the basis of the current best-performing reference-based compression schemes for genome resequencing data. Methods: First, we present a new algorithm for the LCS problem. Using the generalized suffix tree, we identify the common substrings shared between the two input sequences. Using the maximal common substrings, we construct a directed acyclic graph (DAG), based on which we determine the LCS as the longest path in the DAG. Then, we introduce an LCS-motivated reference-based compression scheme using the components of the LCS, rather than the LCS itself. Results: Our basic scheme compressed the Homo sapiens genome (with an original size of 3,080,436,051 bytes) to 15,460,478 bytes. An improvement on the basic method further reduced this to 8,556,708 bytes, or an overall compression ratio of 360. This can be compared to the previous state-of-the-art compression ratios of 157 (Wang and Zhang, 2011) and 171 (Pinho, Pratas, and Garcia, 2011). Conclusion: We propose a new algorithm to address the longest common subsequence problem. Motivated by our LCS algorithm, we introduce a new reference-based compression scheme for genome resequencing data. Comparative results against state-of-the-art reference-based compression algorithms demonstrate the performance of the proposed method.
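The chaining idea can be seen in miniature below: a naive routine stands in for the generalized suffix tree to collect common-substring blocks, and a quadratic DP finds the heaviest chain of compatible blocks. All names and parameters here are illustrative assumptions, and the result lower-bounds the LCS length rather than reproducing the paper's exact construction.

```python
def common_substring_blocks(a, b, min_len=2):
    """Naive stand-in for the suffix-tree step: collect common substrings of
    a and b as (start_in_a, start_in_b, length) blocks."""
    blocks = []
    for i in range(len(a)):
        for j in range(len(b)):
            l = 0
            while i + l < len(a) and j + l < len(b) and a[i + l] == b[j + l]:
                l += 1
            if l >= min_len:
                blocks.append((i, j, l))
    return blocks

def chain_blocks(blocks):
    """Longest path in the DAG whose edges connect non-overlapping, in-order
    blocks; the total length of the best chain lower-bounds the LCS length."""
    blocks = sorted(blocks)
    best = [l for _, _, l in blocks]
    for v, (i, j, l) in enumerate(blocks):
        for u in range(v):
            iu, ju, lu = blocks[u]
            if iu + lu <= i and ju + lu <= j:   # block u can precede block v
                best[v] = max(best[v], best[u] + l)
    return max(best, default=0)

a, b = "ACGTACGT", "ACTTAGGT"
print(chain_blocks(common_substring_blocks(a, b)))  # 6
```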
international conference on bioinformatics | 2013
Richard Beal; Donald A. Adjeroh; Ahmed Abbasi
With the rapid growth in available genomic data, robust and efficient methods for identifying RNA secondary structure elements, such as hairpins, have become a significant challenge in computational biology, with potential applications in prediction of RNA secondary and tertiary structures, functional classification of RNA structures, microRNA target prediction, and discovery of RNA structure motifs. In this work, we propose the Forward Stem Matrix (FSM), a data structure to efficiently represent all k-length stem options, for k ∈ K, within an n-length RNA sequence T. We show that the FSM structure is of size O(n|K|) and still permits efficient access to stems. In this paper, we provide a linear O(n|K|) construction for the FSM using suffix arrays and data structures related to the Longest Previous Factor (LPF), namely, the Furthest Previous Non-Overlapping Factor (FPnF) and Furthest Previous Factor (FPF) arrays. We also provide new constructions for the FPnF and FPF via a novel application of parameterized string (p-string) theory and suffix trees. As an application of the FSM, we show how to efficiently find all hairpin structures in an RNA sequence. Experimental results show the practical performance of the proposed data structures.
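To make the object being indexed concrete, the brute-force sketch below enumerates candidate k-length stems (positions whose substrings can base-pair when one is reversed and complemented, leaving room for a loop). It is not the FSM or its linear construction, just an illustration under assumed parameters.

```python
RC = {'A': 'U', 'U': 'A', 'C': 'G', 'G': 'C'}  # Watson-Crick complements

def k_stems(t, k, min_loop=3):
    """Report (i, j) pairs where t[i:i+k] can pair with t[j:j+k] reversed and
    complemented, leaving a loop of at least min_loop bases between the halves."""
    n = len(t)
    revcomp = lambda s: ''.join(RC[c] for c in reversed(s))
    stems = []
    for i in range(n - k + 1):
        target = revcomp(t[i:i + k])
        for j in range(i + k + min_loop, n - k + 1):
            if t[j:j + k] == target:
                stems.append((i, j))
    return stems

print(k_stems("GGGAAAUCCC", k=3))  # [(0, 7)]: GGG pairs with CCC across the AAAU loop
```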
international workshop on combinatorial algorithms | 2011
Richard Beal; Donald A. Adjeroh
The longest previous factor (LPF) problem is defined for traditional strings exclusively from the constant alphabet Σ. A parameterized string (p-string) is a sophisticated string composed of symbols from a constant alphabet Σ and a parameter alphabet Π. We generalize the LPF problem to the parameterized longest previous factor (pLPF) problem defined for p-strings. Subsequently, we present a linear time solution to construct the pLPF array. Given our pLPF algorithm, we show how to construct the pLCP (parameterized longest common prefix) array in linear time. Our algorithm is further exploited to construct the standard LPF and LCP arrays all in linear time.
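As background for the LPF/LCP connection mentioned above, the standard Kasai construction builds the traditional LCP array from a suffix array in linear time. The sketch below shows that classical baseline only; it is not the pLPF-based construction developed in the paper.

```python
def lcp_from_suffix_array(s, sa):
    """Kasai's linear-time LCP construction for a traditional string: lcp[r] is
    the length of the longest common prefix of the suffixes ranked r-1 and r."""
    n = len(s)
    rank = [0] * n
    for r, i in enumerate(sa):
        rank[i] = r
    lcp = [0] * n
    h = 0
    for i in range(n):                 # process suffixes in text order
        if rank[i] > 0:
            j = sa[rank[i] - 1]        # suffix ranked just before suffix i
            while i + h < n and j + h < n and s[i + h] == s[j + h]:
                h += 1
            lcp[rank[i]] = h
            if h > 0:
                h -= 1                 # the next comparison can reuse h-1 matches
        else:
            h = 0
    return lcp

s = "banana"
sa = sorted(range(len(s)), key=lambda i: s[i:])  # [5, 3, 1, 0, 4, 2]
print(lcp_from_suffix_array(s, sa))              # [0, 1, 3, 0, 0, 2]
```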