Solon P. Pissis
King's College London
Network
Latest external collaboration on country level. Dive into details by clicking on the dots.
Publication
Featured researches published by Solon P. Pissis.
workshop on algorithms in bioinformatics | 2015
Roberto Grossi; Costas S. Iliopoulos; Robert Mercaş; Nadia Pisanti; Solon P. Pissis; Ahmad Retha; Fatima Vayani
Sequence comparison is a fundamental step in many important tasks in bioinformatics. Traditional algorithms for measuring approximation in sequence comparison are based on the notions of distance or similarity, and are generally computed through sequence alignment techniques. As circular genome structure is a common phenomenon in nature, a caveat of specialized alignment techniques for circular sequence comparison is that they are computationally expensive, requiring from super-quadratic to cubic time in the length of the sequences. In this paper, we introduce a new distance measure based on q-grams, and show how it can be computed efficiently for circular sequence comparison. Experimental results, using real and synthetic data, demonstrate orders-of-magnitude superiority of our approach in terms of efficiency, while maintaining an accuracy very competitive to the state of the art.
bioinformatics and biomedicine | 2010
Tomáš Flouri; Jan Holub; Costas S. Iliopoulos; Solon P. Pissis
The constant advances in sequencing technology have redefined the way genome sequencing is performed. They are able to produce tens of millions of short sequences (reads), during a single experiment, and with a much lower cost than previously possible. Due to this massive amount of data, efficient algorithms for mapping these reads to reference sequences are in great demand, and recently, there has been ample work for publishing such algorithms. In this paper, we study a different version of this problem: mapping these reads to a dynamically changing genomic sequence. We propose a new practical algorithm, which employs a suitable data structure that takes into account potential dynamic effects (replacements, insertions, deletions) on the genomic sequence. The presented experimental results demonstrate that the proposed approach can be applied to address the problem of mapping millions of reads to multiple genomic sequences.
Advances in Experimental Medicine and Biology | 2010
Pavlos Antoniou; Costas S. Iliopoulos; Laurent Mouchard; Solon P. Pissis
Novel high-throughput (Deep) sequencing technology methods have redefined the way genome sequencing is performed. They are able to produce tens of millions of short sequences (reads) in a single experiment and with a much lower cost than previous sequencing methods. In this paper, we present a new algorithm for addressing the problem of efficiently mapping millions of short reads to a reference genome. In particular, we define and solve the Massive Approximate Pattern Matching problem for mapping short sequences to a reference genome.
international conference on bioinformatics | 2010
Kimon Frousios; Costas S. Iliopoulos; Laurent Mouchard; Solon P. Pissis; German Tischler
Motivation: The constant advances in sequencing technology are turning whole-genome sequencing into a routine procedure, resulting in massive amounts of data that need to be processed. Tens of gigabytes of data in the form of short reads need to be mapped back to reference sequences, a few gigabases long. A first generation of short read alignment software successfully employed hash tables, and the current second generation uses Burrows-Wheeler Transform, further improving mapping speed. However, there is still demand for faster and more accurate mapping. Results: In this paper, we present REad ALigner, an efficient, accurate and consistent tool for aligning short reads obtained from next generation sequencing. It is based on a new, simple, yet efficient mapping algorithm that can match and outperform current BWT-based software.
combinatorial pattern matching | 2010
Maxime Crochemore; Costas S. Iliopoulos; Solon P. Pissis; German Tischler
A proper factor u of a string y is a cover of y if every letter of y is within some occurrence of u in y. The concept generalises the notion of periods of a string. An integer array C is the minimal-cover (resp. maximal-cover) array of y if C[i] is the minimal (resp. maximal) length of covers of y[0 . . i], or zero if no cover exists. In this paper, we present a constructive algorithm checking the validity of an array as a minimal-cover or maximal-cover array of some string. When the array is valid, the algorithm produces a string over an unbounded alphabet whose cover array is the input array. All algorithms run in linear time due to an interesting combinatorial property of cover arrays: the sum of important values in a cover array is bounded by twice the length of the string.
Algorithms for Molecular Biology | 2014
Carl Barton; Costas S. Iliopoulos; Solon P. Pissis
AbstractBackgroundCircular string matching is a problem which naturally arises in many biological contexts. It consists in finding all occurrences of the rotations of a pattern of length m in a text of length n. There exist optimal average-case algorithms for exact circular string matching. Approximate circular string matching is a rather undeveloped area.ResultsIn this article, we present a suboptimal average-case algorithm for exact circular string matching requiring time O(n) . Based on our solution for the exact case, we present two fast average-case algorithms for approximate circular string matching with k-mismatches, under the Hamming distance model, requiring time O(n) for moderate values of k, that is k=O(m/logm) . We show how the same results can be easily obtained under the edit distance model. The presented algorithms are also implemented as library functions. Experimental results demonstrate that the functions provided in this library accelerate the computations by more than three orders of magnitude compared to a naïve approach.ConclusionsWe present two fast average-case algorithms for approximate circular string matching with k-mismatches; and show that they also perform very well in practice. The importance of our contribution is underlined by the fact that the provided functions may be seamlessly integrated into any biological pipeline. The source code of the library is freely available at http://www.inf.kcl.ac.uk/research/projects/asmf/.
ieee international conference on information technology and applications in biomedicine | 2009
Pavlos Antoniou; Jackie W. Daykin; Costas S. Iliopoulos; Derrick G. Kourie; Laurent Mouchard; Solon P. Pissis
Novel high throughput sequencing technology methods have redefined the way genome sequencing is performed. They are able to produce tens of millions of short sequences (reads) in a single experiment and with a much lower cost than previous sequencing methods. Due to this massive amount of data generated by the above systems, efficient algorithms for mapping short sequences to a reference genome are in great demand. In this paper, we present a practical algorithm for addressing the problem of efficiently mapping uniquely occuring short reads to a reference genome. This requires the classification of these short reads into unique and duplicate matches. In particular, we define and solve the Massive Exact Unique Pattern Matching problem in genomes.
language and automata theory and applications | 2015
Carl Barton; Costas S. Iliopoulos; Solon P. Pissis
Approximate string matching is the problem of finding all factors of a text \(t\) of length \(n\) that are at a distance at most \(k\) from a pattern \(x\) of length \(m\). Approximate circular string matching is the problem of finding all factors of \(t\) that are at a distance at most \(k\) from \(x\) or from any of its rotations. In this article, we present a new algorithm for approximate circular string matching under the edit distance model with optimal average-case search time \(\mathcal {O}(n(k + \log m) /m)\). Optimal average-case search time can also be achieved by the algorithms for multiple approximate string matching (Fredriksson and Navarro, 2004) using \(x\) and its rotations as the set of multiple patterns. Here we reduce the preprocessing time and space requirements compared to that approach.
BMC Bioinformatics | 2014
Solon P. Pissis
BackgroundIdentifying repeated factors that occur in a string of letters or common factors that occur in a set of strings represents an important task in computer science and biology. Such patterns are called motifs, and the process of identifying them is called motif extraction. In biology, motif extraction constitutes a fundamental step in understanding regulation of gene expression. State-of-the-art tools for motif extraction have their own constraints. Most of these tools are only designed for single motif extraction; structured motifs additionally allow for distance intervals between their single motif components. Moreover, motif extraction from large-scale datasets—for instance, large-scale ChIP-Seq datasets—cannot be performed by current tools. Other constraints include high time and/or space complexity for identifying long motifs with higher error thresholds.ResultsIn this article, we introduce MoTeX-II, a word-based high-performance computing tool for structured MoTif eXtraction from large-scale datasets. Similar to its predecessor for single motif extraction, it uses state-of-the-art algorithms for solving the fixed-length approximate string matching problem. It produces similar and partially identical results to state-of-the-art tools for structured motif extraction with respect to accuracy as quantified by statistical significance measures. Moreover, we show that it matches or outperforms these tools in terms of runtime efficiency by merging single motif occurrences efficiently. MoTeX-II comes in three flavors: a standard CPU version; an OpenMP-based version; and an MPI-based version. For instance, the MPI-based version of MoTeX-II requires only a couple of hours to process all human genes for structured motif extraction on 1056 processors, while current sequential tools require more than a week for this task. Finally, we show that MoTeX-II is successful in extracting known composite transcription factor binding sites from real datasets.ConclusionsUse of MoTeX-II in biological frameworks may enable deriving reliable and important information since real full-length datasets can now be processed with almost any set of input parameters for both single and structured motif extraction in a reasonable amount of time. The open-source code of MoTeX-II is freely available at http://www.inf.kcl.ac.uk/research/projects/motex/.
BMC Bioinformatics | 2014
Carl Barton; Alice Héliou; Laurent Mouchard; Solon P. Pissis
BackgroundAn absent word of a word y of length n is a word that does not occur in y. It is a minimal absent word if all its proper factors occur in y. Minimal absent words have been computed in genomes of organisms from all domains of life; their computation also provides a fast alternative for measuring approximation in sequence comparison. There exists an O(n)-time and O(n)-space algorithm for computing all minimal absent words on a fixed-sized alphabet based on the construction of suffix automata (Crochemore et al., 1998). No implementation of this algorithm is publicly available. There also exists an O(n2)-time and O(n)-space algorithm for the same problem based on the construction of suffix arrays (Pinho et al., 2009). An implementation of this algorithm was also provided by the authors and is currently the fastest available.ResultsOur contribution in this article is twofold: first, we bridge this unpleasant gap by presenting an O(n)-time and O(n)-space algorithm for computing all minimal absent words based on the construction of suffix arrays; and second, we provide the respective implementation of this algorithm. Experimental results, using real and synthetic data, show that this implementation outperforms the one by Pinho et al. The open-source code of our implementation is freely available at http://github.com/solonas13/maw.ConclusionsClassical notions for sequence comparison are increasingly being replaced by other similarity measures that refer to the composition of sequences in terms of their constituent patterns. One such measure is the minimal absent words. In this article, we present a new linear-time and linear-space algorithm for the computation of minimal absent words based on the suffix array.