Sharma V. Thankachan | Researchain

Publication

Featured research published by Sharma V. Thankachan.

Journal of the ACM | 2014

Space-Efficient Frameworks for Top-k String Retrieval

Wing-Kai Hon; Rahul Shah; Sharma V. Thankachan; Jeffrey Scott Vitter

The inverted index is the backbone of modern web search engines. For each word in a collection of web documents, the index records the list of documents where this word occurs. Given a set of query words, the job of a search engine is to output a ranked list of the most relevant documents containing the query. However, if the query consists of an arbitrary string (a partial word, a multiword phrase, or more generally any sequence of characters), then word boundaries are no longer relevant and we need a different approach. In string retrieval settings, we are given a set D = {d1, d2, d3, …, dD} of D strings with n characters in total taken from an alphabet set Σ = [σ], and the task of the search engine, for a given query pattern P of length p, is to report the “most relevant” strings in D containing P. The query may also consist of two or more patterns. The notion of relevance can be captured by a function score(P, dr), which indicates how relevant document dr is to the pattern P. Some example score functions are the frequency of pattern occurrences, proximity between pattern occurrences, or pattern-independent PageRank of the document. The first formal framework to study such kinds of retrieval problems was given by Muthukrishnan [SODA 2002]. He considered two metrics for relevance: frequency and proximity. He took a threshold-based approach on these metrics and gave data structures that use O(n log n) words of space. We study this problem in a somewhat more natural top-k framework. Here, k is a part of the query, and the top k most relevant (highest-scoring) documents are to be reported in sorted order of score. We present the first linear-space framework (i.e., using O(n) words of space) that is capable of handling arbitrary score functions with near-optimal O(p + k log k) query time. The query time can be made optimal, O(p + k), if sorted order is not necessary. Further, we derive compact space and succinct space indexes (for some specific score functions). This space compression comes at the cost of higher query time. Finally, we extend our framework to handle the case of multiple patterns. Apart from providing a robust framework, our results also improve many earlier results in index space or query time or both.
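To make the problem statement concrete, here is a minimal brute-force sketch of top-k retrieval with frequency as the score function. It is not the linear-space index from the paper: it rescans every document per query and serves only as a baseline that pins down what a correct answer looks like. The function names and the tie-breaking rule (lower document index first) are assumptions made for illustration.

```python
import heapq


def count_occurrences(pattern: str, text: str) -> int:
    """Count possibly overlapping occurrences of pattern in text."""
    if not pattern:
        return 0
    count = 0
    start = text.find(pattern)
    while start != -1:
        count += 1
        # Advance by one position so overlapping matches are counted.
        start = text.find(pattern, start + 1)
    return count


def top_k_by_frequency(docs, pattern, k):
    """Return up to k (doc_index, score) pairs, highest score first.

    Only documents that contain the pattern are eligible.
    Ties are broken by lower document index (an arbitrary choice).
    """
    scored = [(count_occurrences(pattern, d), i) for i, d in enumerate(docs)]
    scored = [(s, i) for s, i in scored if s > 0]
    best = heapq.nlargest(k, scored, key=lambda t: (t[0], -t[1]))
    return [(i, s) for s, i in best]
```

Each query here costs time proportional to the total collection size, which is exactly the cost that an index with O(p + k log k) query time avoids.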

research in computational molecular biology | 2015

Efficient Alignment Free Sequence Comparison with Bounded Mismatches

Srinivas Aluru; Alberto Apostolico; Sharma V. Thankachan

Alignment-free sequence comparison methods are attracting persistent interest, driven by data-intensive applications in genome-wide molecular taxonomy and phylogenetic reconstruction. Among the methods based on substring composition, the Average Common Substring (\(\mathsf{ACS}\)) measure proposed by Burstein et al. (RECOMB 2005) admits a straightforward linear time sequence comparison algorithm, while yielding impressive results in multiple applications. An important direction of research is to extend the approach to permit a bounded edit/Hamming distance between substrings, so as to reflect more accurately the evolutionary process. To date, however, algorithms designed to incorporate \(k \ge 1\) mismatches have \(O(kn^2)\) worst-case complexity, worse than the \(O(n^2)\) alignment algorithms they are meant to replace. On the other hand, accounting for mismatches has been shown to lead to much improved classification, while heuristics can improve practical performance. In this paper, we close the gap by presenting the first provably efficient algorithm for the \(k\)-mismatch average common substring (\(\mathsf{ACS}_k\)) problem that takes \(O(n)\) space and \(O(n\log^{k+1} n)\) time in the worst case for any constant \(k\). Our method extends the generalized suffix tree model to incorporate a carefully selected bounded set of perturbed suffixes, and can be applicable to other complex approximate sequence matching problems.
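As an illustration of the quantity being computed, the sketch below evaluates a k-mismatch average common substring value by brute force. It assumes the commonly used definition: for each position i of X, take the longest prefix of X[i:] that matches a prefix of some suffix of Y with at most k mismatches, then average these lengths over all positions of X. This is the slow baseline of the kind the abstract contrasts against, not the suffix-tree method of the paper; the function names are hypothetical.

```python
def longest_k_mismatch_prefix(x, i, y, j, k):
    """Length of the longest common prefix of x[i:] and y[j:]
    when up to k mismatching positions are tolerated."""
    length = 0
    mismatches = 0
    while i + length < len(x) and j + length < len(y):
        if x[i + length] != y[j + length]:
            mismatches += 1
            if mismatches > k:
                break
        length += 1
    return length


def acs_k(x, y, k=0):
    """Average, over positions i of x, of the longest k-mismatch match
    between a prefix of x[i:] and a prefix of any suffix of y.

    Brute force: tries every pair (i, j), so it is at least quadratic.
    """
    if not x:
        return 0.0
    total = 0
    for i in range(len(x)):
        total += max(
            (longest_k_mismatch_prefix(x, i, y, j, k) for j in range(len(y))),
            default=0,
        )
    return total / len(x)
```

With k = 0 this reduces to the exact-match ACS value; a distance between two sequences is then typically derived from this average, a step omitted here.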

BMC Bioinformatics | 2017

A greedy alignment-free distance estimator for phylogenetic inference

Sharma V. Thankachan; Sriram P. Chockalingam; Yongchao Liu; Ambujam Krishnan; Srinivas Aluru

Background: Alignment-free sequence comparison approaches have been garnering increasing interest in various data- and compute-intensive applications such as phylogenetic inference for large-scale sequences. While k-mer based methods are predominantly used in real applications, the average common substring (ACS) approach is emerging as one of the prominent alignment-free approaches. This ACS approach has been further generalized by some recent work, either greedily or exactly, by allowing a bounded number of mismatches in the common substrings. Results: We present ALFRED-G, a greedy alignment-free distance estimator for phylogenetic tree reconstruction based on the concept of the generalized ACS approach. In this algorithm, we have investigated a new heuristic to efficiently compute the lengths of common strings with mismatches allowed, and have further applied this heuristic to phylogeny reconstruction. Performance evaluation using real sequence datasets shows that our heuristic is able to reconstruct comparable, or even more accurate, phylogenetic tree topologies than the kmacs heuristic algorithm at highly competitive speed. Conclusions: ALFRED-G is an alignment-free heuristic for evolutionary distance estimation between two biological sequences. This algorithm is implemented in C++ and has been incorporated into our open-source ALFRED software package (http://alurulab.cc.gatech.edu/phylo).

research in computational molecular biology | 2018

Algorithmic Framework for Approximate Matching Under Bounded Edits with Applications to Sequence Analysis

Sharma V. Thankachan; Chaitanya Aluru; Sriram P. Chockalingam; Srinivas Aluru

We present a novel algorithmic framework for solving approximate sequence matching problems that permit a bounded total number k of mismatches, insertions, and deletions. The core of the framework relies on transforming an approximate matching problem into a corresponding exact matching problem on suitably edited string suffixes, while carefully controlling the required number of such edited suffixes to enable the design of efficient algorithms. For a total input size of n, our framework limits the number of generated edited suffixes to no more than a factor of \(O(\log^k n)\) of the input size (for any constant k), and restricts the algorithm to linear space usage by overlapping the generation and processing of edited suffixes. Our framework improves the best known upper bound of \(n^2 k^{1.5}/ 2^{\Omega(\sqrt{\log n / k})}\) for the classic k-edit longest common substring problem [Abboud, Williams, and Yu; SODA 2015] to yield the first strictly sub-quadratic time algorithm that runs in \(O(n\log^k n)\) time and O(n) space for any constant k. We present similar subquadratic time and linear space algorithms for (i) computing the alignment-free distance between two genomes based on the k-edit average common substring measure, (ii) mapping reads/read fragments to a reference genome while allowing up to k edits, and (iii) computing all-pair maximal k-edit common substrings (also, suffix/prefix overlaps), which has applications in clustering and assembly. We expect our algorithmic framework to be a broadly applicable theoretical tool that may inspire the design of practical heuristics and software.

BMC Genomics | 2017

Efficient detection of viral transmissions with Next-Generation Sequencing data

Inna Rytsareva; David S. Campo; Yueli Zheng; Seth Sims; Sharma V. Thankachan; Cansu Tetik; Jain Chirag; Sriram P. Chockalingam; Amanda Sue; Srinivas Aluru; Yury Khudyakov

Background: Hepatitis C is a major public health problem in the United States and worldwide. Outbreaks of hepatitis C virus (HCV) infections associated with unsafe injection practices, drug diversion, and other exposures to blood are difficult to detect and investigate. Molecular analysis has been frequently used in the study of HCV outbreaks and transmission chains, helping identify a cluster of sequences as linked by transmission if their genetic distances are below a previously defined threshold. However, HCV exists as a population of numerous variants in each infected individual and it has been observed that minority variants in the source are often the ones responsible for transmission, a situation that precludes the use of a single sequence per individual because many such transmissions would be missed. The use of Next-Generation Sequencing immensely increases the sensitivity of transmission detection but brings a considerable computational challenge because all sequences need to be compared among all pairs of samples. Methods: We developed a three-step strategy that filters pairs of samples according to different criteria: (i) a k-mer Bloom filter, (ii) a Levenshtein filter and (iii) a filter of identical sequences. We applied these three filters on a set of samples that cover the spectrum of genetic relationships among HCV cases, from being part of the same transmission cluster, to belonging to different subtypes. Results: Our three-step filtering strategy rapidly removes 85.1% of all the pairwise sample comparisons and 91.0% of all pairwise sequence comparisons, accurately establishing which pairs of HCV samples are below the relatedness threshold. Conclusions: We present a fast and efficient three-step filtering strategy that removes most sequence comparisons and accurately establishes transmission links of any threshold-based method. This highly efficient workflow will allow a faster response and molecular detection capacity, improving the rate of detection of viral transmissions with molecular data.
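The abstract does not spell out how the filters are implemented, so the following is only a rough sketch of the kind of threshold test such a pipeline might rely on, under stated assumptions: samples are lists of sequence strings, relatedness is measured by Levenshtein distance, and two samples count as linked when some pair of sequences is within the threshold (inclusive). The shortcut for shared identical sequences and the early-exit distance check are illustrative choices, not the authors' code, and the k-mer Bloom filter step is left out entirely.

```python
def within_edit_distance(a, b, threshold):
    """True if the Levenshtein distance between a and b is <= threshold.

    Uses the standard row-by-row dynamic program and stops early once
    every entry of the current row already exceeds the threshold.
    """
    if abs(len(a) - len(b)) > threshold:
        return False  # the length difference is a lower bound on the distance
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        if min(cur) > threshold:
            return False  # the final distance cannot drop below the row minimum
        prev = cur
    return prev[-1] <= threshold


def samples_linked(sample_a, sample_b, threshold):
    """Hypothetical linkage test between two samples (lists of sequences)."""
    unique_a, unique_b = set(sample_a), set(sample_b)
    # Cheap check first: a shared identical sequence has distance 0.
    if unique_a & unique_b:
        return True
    return any(within_edit_distance(a, b, threshold)
               for a in unique_a for b in unique_b)
```

Ordering cheap tests before expensive ones is the general idea behind filtering strategies of this kind: most pairs are resolved without running the full quadratic comparison.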

Theoretical Computer Science | 2017

In-place algorithms for exact and approximate shortest unique substring problems

Wing-Kai Hon; Sharma V. Thankachan; Bojian Xu

We revisit the exact shortest unique substring (SUS) finding problem, and propose its approximate version where mismatches are allowed, due to its applications in subfields such as computational biology. We design a generic in-place framework that fits to solve both the exact and approximate k-mismatch SUS finding, using the minimum 2n memory words, each of ⌈log₂(n)⌉ bits, plus n bytes space, where n is the input string size. By using the in-place framework, we can find the exact and approximate k-mismatch SUS for every string position using a total of O(n) and O(n²) time, respectively, regardless of the value of k. Our framework does not involve any compressed or succinct data structures and thus is practical and easy to implement. Experimental study shows that the peak memory usage of our proposal is consistently 9n bytes for any string of size n, validating the claim that our solution is in-place. Further, our proposal uses much less memory and is much faster than the currently best work that has implementation for exact SUS finding.
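For readers unfamiliar with the problem, the sketch below states the exact SUS problem as code, assuming the usual definition: for a position p, find a shortest substring that covers p and occurs exactly once in the whole string. It is a deliberately naive baseline with polynomial running time, not the O(n)-time in-place algorithm of the paper; the function names and the leftmost tie-breaking rule are assumptions.

```python
def occurrences(text, sub):
    """Number of (possibly overlapping) occurrences of sub in text."""
    return sum(1 for s in range(len(text) - len(sub) + 1)
               if text.startswith(sub, s))


def shortest_unique_substring(text, p):
    """Return (start, end) with start <= p < end such that text[start:end]
    is a shortest substring covering position p that occurs exactly once.

    Among equally short candidates, the leftmost one is returned.
    """
    n = len(text)
    if not 0 <= p < n:
        raise IndexError("position out of range")
    for length in range(1, n + 1):
        first = max(0, p - length + 1)  # leftmost start that still covers p
        last = min(p, n - length)       # rightmost start that fits in text
        for i in range(first, last + 1):
            if occurrences(text, text[i:i + length]) == 1:
                return i, i + length
    # Not reached for non-empty text: the whole string occurs exactly once.
    raise AssertionError("unreachable")
```

An answer always exists because the full string is itself unique, which is why the loop over lengths is guaranteed to return.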

symposium on discrete algorithms | 2017

pBWT: achieving succinct data structures for parameterized pattern matching and related problems

Arnab Ganguly; Rahul Shah; Sharma V. Thankachan

The fields of succinct data structures and compressed text indexing have seen quite a bit of progress over the last two decades. An important achievement, primarily using techniques based on the Burrows-Wheeler Transform (BWT), was obtaining the full functionality of the suffix tree in the optimal number of bits. A crucial property that allows the use of BWT for designing compressed indexes is order-preserving suffix links. Specifically, the relative order between two suffixes in the subtree of an internal node is the same as that of the suffixes obtained by truncating the first character of the two suffixes. Unfortunately, in many variants of the text-indexing problem, e.g., parameterized pattern matching, 2D pattern matching, and order-isomorphic pattern matching, this property does not hold. Consequently, the compressed indexes based on BWT do not directly apply. Furthermore, a compressed index for any of these variants has been elusive throughout the advancement of the field of succinct data structures. We achieve a positive breakthrough on one such problem, namely the Parameterized Pattern Matching problem. Let T be a text that contains n characters from an alphabet Σ, which is the union of two disjoint sets: Σs containing static characters (s-characters) and Σp containing parameterized characters (p-characters). A pattern P (also over Σ) matches an equal-length substring S of T iff the s-characters match exactly, and there exists a one-to-one function that renames the p-characters in S to those in P. The task is to find the starting positions (occurrences) of all such substrings S. The previous index [Baker, STOC 1993], known as the Parameterized Suffix Tree, requires Θ(n log n) bits of space, and can find all occ occurrences in time O(|P| log σ + occ), where σ = |Σ|. We introduce an n log σ + O(n)-bit index with O(|P| log σ + occ · log n log σ) query time. At the core lies a new BWT-like transform, which we call the Parameterized Burrows-Wheeler Transform (pBWT). The techniques are extended to obtain a succinct index for the Parameterized Dictionary Matching problem of Idury and Schaffer [CPM, 1994].
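The matching rule defined above can be written down directly. The following sketch checks the definition by brute force: static characters must agree exactly, and the parameterized characters must be related by a one-to-one renaming. It scans every window of the text, so it illustrates only the problem, not the pBWT index; the function names and the representation of Σs as a Python set are assumptions.

```python
def p_match(sub, pattern, static_chars):
    """True if sub parameterized-matches pattern.

    static_chars is the set of s-characters; every other character
    is treated as a p-character.
    """
    if len(sub) != len(pattern):
        return False
    rename = {}   # p-character of sub     -> p-character of pattern
    inverse = {}  # p-character of pattern -> p-character of sub
    for cs, cp in zip(sub, pattern):
        if cs in static_chars or cp in static_chars:
            # s-characters must match exactly; an s-character can never
            # be paired with a p-character.
            if cs != cp:
                return False
            continue
        # Enforce a consistent, one-to-one renaming of p-characters.
        if rename.setdefault(cs, cp) != cp or inverse.setdefault(cp, cs) != cs:
            return False
    return True


def find_p_occurrences(text, pattern, static_chars):
    """Starting positions of all substrings of text that p-match pattern."""
    m = len(pattern)
    return [i for i in range(len(text) - m + 1)
            if p_match(text[i:i + m], pattern, static_chars)]
```

For example, with "b" as the only static character, the pattern "aba" matches "xbx" (rename x to a) but not "ybz", because y and z cannot both be renamed to a.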

international conference on bioinformatics | 2018

Faster Computation of Genome Mappability

Sahar Hooshmand; Paniz Abedin; Daniel Gibney; Srinivas Aluru; Sharma V. Thankachan

international conference on bioinformatics | 2018

A Practical and Efficient Algorithm for the k-mismatch Shortest Unique Substring Finding Problem

Daniel R. Allen; Sharma V. Thankachan; Bojian Xu

This paper revisits the k-mismatch shortest unique substring finding problem and demonstrates that a technique recently presented in the context of solving the k-mismatch average common substring problem can be adapted and combined with parts of the existing solution, resulting in a new algorithm which has expected time complexity of

international conference on bioinformatics | 2018