Publication


Featured research published by Thierry Lecroq.


Algorithms on Strings | 2007

Algorithms on Strings

Maxime Crochemore; Christophe Hancart; Thierry Lecroq

Describing algorithms in a C-like language, this text presents examples related to the automatic processing of natural language, to the analysis of molecular sequences and to the management of textual databases.


Algorithmica | 1994

Speeding up two string-matching algorithms

Maxime Crochemore; Artur Czumaj; Leszek Gasieniec; Stefan Jarominek; Thierry Lecroq; Wojciech Plandowski; Wojciech Rytter

We show how to speed up two string-matching algorithms: the Boyer-Moore algorithm (BM algorithm) and its version called here the reverse factor algorithm (RF algorithm). The RF algorithm is based on factor graphs for the reverse of the pattern. The main feature of both algorithms is that they scan the text right to left from the supposed right position of the pattern. The BM algorithm proceeds as long as the scanned segment (factor) is a suffix of the pattern; the RF algorithm scans while the segment is a factor of the pattern. Both algorithms then shift the pattern, forget the history, and start again. The RF algorithm usually makes bigger shifts than BM, but is quadratic in the worst case. We show that it is enough to remember the last matched segment (represented by two pointers into the text) to speed up the RF algorithm considerably (to a linear number of inspections of text symbols, with a small coefficient) and to speed up the BM algorithm (to at most 2n comparisons). Only constant additional memory is needed for the search phase. We give alternative versions of the accelerated RF algorithm: the first is based on combinatorial properties of primitive words, and the other two make extensive use of suffix trees. The paper demonstrates techniques for transforming algorithms, and also shows interesting new applications of data structures that represent all subwords of the pattern in compact form.
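The right-to-left scanning scheme shared by both algorithms can be illustrated with a minimal Python sketch. This uses Horspool's simplification of the bad-character shift as a stand-in, not the accelerated (Turbo) variants the paper develops; the function name is chosen here for illustration.

```python
def bm_horspool(pattern, text):
    """Right-to-left window scan with Horspool's bad-character shift.

    A simplified stand-in for the BM scheme described above, not the
    accelerated variants from the paper.
    """
    m, n = len(pattern), len(text)
    if m == 0 or n < m:
        return []
    # shift[c]: distance from the rightmost occurrence of c in
    # pattern[:-1] to the end of the pattern
    shift = {c: m - 1 - i for i, c in enumerate(pattern[:-1])}
    occurrences, pos = [], 0
    while pos <= n - m:
        j = m - 1
        while j >= 0 and text[pos + j] == pattern[j]:
            j -= 1                      # scan the window right to left
        if j < 0:
            occurrences.append(pos)     # whole window matched
        # shift on the window's last character; full length if unseen
        pos += shift.get(text[pos + m - 1], m)
    return occurrences
```

The shift table guarantees that no occurrence is skipped: any smaller shift would misalign the window's last character with every occurrence of it in the pattern.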


Information Processing Letters | 2007

Fast exact string matching algorithms

Thierry Lecroq

String matching is the problem of finding all the occurrences of a pattern in a text. We propose a very fast new family of string matching algorithms based on hashing q-grams. The new algorithms are the fastest in many cases, in particular on small alphabets.
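The core idea of shifting on q-grams rather than single characters can be sketched as follows. For clarity this hypothetical sketch uses the q-gram string itself as a dictionary key, whereas the paper's algorithms hash q-grams into a small integer table.

```python
def qgram_search(pattern, text, q=3):
    """Sketch of shifting on q-grams instead of single characters.

    The q-gram is used directly as the key here; the actual algorithms
    hash q-grams into a fixed-size table instead.
    """
    m, n = len(pattern), len(text)
    if m < q or n < m:
        return []
    no_occurrence = m - q + 1            # shift when the q-gram is absent
    shift = {}
    for i in range(m - q + 1):           # rightmost occurrence wins
        shift[pattern[i:i + q]] = m - q - i
    occurrences, pos = [], 0
    while pos <= n - m:
        s = shift.get(text[pos + m - q:pos + m], no_occurrence)
        if s == 0:                       # window ends like the pattern: verify
            if text[pos:pos + m] == pattern:
                occurrences.append(pos)
            pos += 1                     # conservative shift after an attempt
        else:
            pos += s
    return occurrences
```

On a small alphabet a single character carries little information, but a q-gram rarely occurs in the pattern, so the average shift stays close to its maximum m - q + 1; this is what makes the q-gram family fast exactly where character-based shifts degrade.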


ACM Computing Surveys | 2013

The exact online string matching problem: A review of the most recent results

Simone Faro; Thierry Lecroq

This article addresses the online exact string matching problem which consists in finding all occurrences of a given pattern p in a text t. It is an extensively studied problem in computer science, mainly due to its direct applications to such diverse areas as text, image and signal processing, speech analysis and recognition, information retrieval, data compression, computational biology and chemistry. In the last decade more than 50 new algorithms have been proposed for the problem, which add up to a wide set of (almost 40) algorithms presented before 2000. In this article we review the string matching algorithms presented in the last decade and present experimental results in order to bring order among the dozens of articles published in this area.


Software - Practice and Experience | 1995

Experimental results on string matching algorithms

Thierry Lecroq

We present experimental results for string matching algorithms which are known to be fast in practice. We compare these algorithms on two criteria: the number of text character inspections and the running time. These experiments show that for large alphabets and small patterns the Quick Search algorithm of Sunday is the most efficient, and that for small alphabets and large patterns the Reverse Factor algorithm of Crochemore et al. is the most efficient.
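Sunday's Quick Search, the winner for large alphabets and small patterns, is short enough to sketch in full. Its trick is to base the shift on the text character immediately after the current window, which always permits a shift of at least one.

```python
def quick_search(pattern, text):
    """Sunday's Quick Search: shift on the character just AFTER the window."""
    m, n = len(pattern), len(text)
    if m == 0 or n < m:
        return []
    # shift[c] = m - (rightmost index of c in pattern); m + 1 if absent
    shift = {c: m - i for i, c in enumerate(pattern)}
    occurrences, pos = [], 0
    while pos <= n - m:
        if text[pos:pos + m] == pattern:
            occurrences.append(pos)
        if pos + m >= n:
            break                        # no character after the window
        pos += shift.get(text[pos + m], m + 1)
    return occurrences
```

When the alphabet is large, the character after the window is usually absent from the pattern, so the shift is the maximal m + 1, which explains the algorithm's strong practical behavior in that regime.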


ACM Computing Surveys | 1996

Pattern-matching and text-compression algorithms

Maxime Crochemore; Thierry Lecroq

Pattern matching is the problem of locating a specific pattern inside raw data. The pattern is usually a collection of strings described in some formal language. Applications require two kinds of solution depending upon which string, the pattern or the text, is given first. Solutions based on the use of automata or on combinatorial properties of strings are commonly implemented to preprocess the pattern. The notion of indices, realized by trees or automata, is used in the second kind of solution. The aim of data compression is to provide a representation of data in a reduced form in order to save both storage space and transmission time. There is no loss of information: the compression processes are reversible. Pattern-matching and text-compression algorithms are two important subjects in the wider domain of text processing. They apply to the manipulation of texts (word editors), to the storage of textual data (text compression), and to data retrieval systems (full-text search). They are basic components used in implementations of practical software existing under most operating systems. Moreover, they emphasize programming methods that serve as paradigms in other fields of computer science (system or software design). Finally, they also play an important role in theoretical computer science by providing challenging problems. Although data are recorded in various ways, text remains the main way to exchange information. This is particularly evident in literature or linguistics, where data are composed of huge corpora and dictionaries, but it applies as well to computer science, where a large amount of data is stored in linear files. It is also the case, for instance, in molecular biology, because biological molecules can often be approximated as sequences of nucleotides or amino acids. Furthermore, the quantity of available data in these fields tends to double every 18 months. This is the reason that algorithms must be efficient even though the speed and storage capacity of computers increase continuously.


Bioinformatics | 2003

FORRepeats: detects repeats on entire chromosomes and between genomes.

Arnaud Lefebvre; Thierry Lecroq; Hélène Dauchel; Joël Alexandre

Motivation: As more and more whole genomes become available, there is a need for new methods to compare large sequences and to transfer biological knowledge from annotated genomes to related new ones. BLAST is not suitable for comparing multimegabase DNA sequences. MegaBLAST is designed to compare closely related large sequences. Some tools to detect repeats in large sequences have already been developed, such as MUMmer or REPuter, but they too have time or space restrictions. Moreover, in terms of applications, REPuter only computes repeats and MUMmer works better with related genomes.
Results: We present a heuristic method, named FORRepeats, based on a novel data structure called the factor oracle. In a first step it detects exact repeats in large sequences; in a second step it computes approximate repeats and performs pairwise comparison. We compared its computational characteristics with BLAST and REPuter. Results demonstrate that it is fast and space economical. We show FORRepeats' ability to perform intra-genomic comparison and to detect repeated DNA sequences in the complete genome of the model plant Arabidopsis thaliana.
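The factor oracle underlying FORRepeats admits a very compact online construction, sketched below in Python (following the classic Allauzen-Crochemore-Raffinot scheme, not FORRepeats' own implementation). Every factor of the word is readable from the initial state, although the oracle may also accept a few strings that are not factors, so candidate repeats need verification.

```python
def factor_oracle(word):
    """Online factor oracle construction.

    Returns trans[state][char] -> state over states 0..len(word).
    Every factor of `word` labels a path from state 0.
    """
    m = len(word)
    trans = [dict() for _ in range(m + 1)]
    supply = [-1] * (m + 1)            # supply (failure-like) function
    for i, c in enumerate(word):
        trans[i][c] = i + 1            # spine transition
        k = supply[i]
        while k != -1 and c not in trans[k]:
            trans[k][c] = i + 1        # external transition
            k = supply[k]
        supply[i + 1] = 0 if k == -1 else trans[k][c]
    return trans

def oracle_accepts(trans, s):
    """Check whether s labels a path from the initial state."""
    state = 0
    for c in s:
        if c not in trans[state]:
            return False
        state = trans[state][c]
    return True
```

The structure is what makes the method space economical: the oracle has exactly len(word) + 1 states and at most 2·len(word) - 1 transitions, far fewer than a suffix tree over the same sequence.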


Information Processing Letters | 1999

Fast practical multi-pattern matching

Maxime Crochemore; Artur Czumaj; L. Ga̧sieniec; Thierry Lecroq; Wojciech Plandowski; Wojciech Rytter

The multi-pattern matching problem consists in finding all occurrences of the patterns from a finite set X in a given text T of length n. We present a new and simple algorithm combining the ideas of the Aho-Corasick algorithm and directed acyclic word graphs. The algorithm has worst-case linear time complexity (it makes at most 2n symbol comparisons) and good average-case time complexity, assuming the shortest pattern is sufficiently long. Denote the length of the shortest pattern by m and the total length of all patterns by M. Assume that M is polynomial with respect to m, that the alphabet contains at least 2 symbols, and that the text (in which the patterns are to be found) is random, each letter occurring at each position independently with the same probability. Then the average number of comparisons is O((n/m) log m), which matches the lower bound of the problem. For sufficiently large values of m the algorithm behaves well in practice.
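The Aho-Corasick half of this combination can be sketched on its own; the paper's algorithm additionally uses directed acyclic word graphs to obtain Boyer-Moore-style shifts, which this minimal version omits.

```python
from collections import deque

def aho_corasick(patterns, text):
    """Classic Aho-Corasick automaton: one left-to-right pass over the text.

    Returns (start_index, pattern) pairs for every occurrence.
    """
    # --- build the trie ---
    goto, fail, out = [{}], [0], [[]]
    for p in patterns:
        s = 0
        for c in p:
            if c not in goto[s]:
                goto.append({}); fail.append(0); out.append([])
                goto[s][c] = len(goto) - 1
            s = goto[s][c]
        out[s].append(p)
    # --- BFS to fill the failure links ---
    queue = deque(goto[0].values())
    while queue:
        s = queue.popleft()
        for c, t in goto[s].items():
            queue.append(t)
            f = fail[s]
            while f and c not in goto[f]:
                f = fail[f]
            fail[t] = goto[f][c] if c in goto[f] and goto[f][c] != t else 0
            out[t] += out[fail[t]]       # inherit matches ending here
    # --- scan the text ---
    matches, s = [], 0
    for i, c in enumerate(text):
        while s and c not in goto[s]:
            s = fail[s]
        s = goto[s].get(c, 0)
        for p in out[s]:
            matches.append((i - len(p) + 1, p))
    return matches
```

Because every text character is read exactly once and failure links only move toward the root, the scan is linear in n plus the number of reported occurrences.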


International Journal of Foundations of Computer Science | 2009

Efficient variants of the Backward-Oracle-Matching algorithm

Simone Faro; Thierry Lecroq

In this article we present two variants of the BOM string matching algorithm which are more efficient and flexible than the original algorithm. We also present bit-parallel versions of them, obtaining an efficient variant of the BNDM algorithm. We then compare the new algorithms with some of the most recent and effective string matching algorithms. It turns out that the proposed variants are very flexible and achieve very good results, especially in the case of large alphabets.
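For context, the baseline BNDM algorithm that the bit-parallel variants build on can be sketched as follows: it simulates the nondeterministic suffix automaton of the reversed pattern in a single machine word (Python's unbounded integers lift the usual word-size limit on the pattern length). This is the textbook algorithm, not the paper's variants.

```python
def bndm(pattern, text):
    """Bit-parallel BNDM: right-to-left window scan via a bit mask."""
    m, n = len(pattern), len(text)
    if m == 0 or n < m:
        return []
    mask = (1 << m) - 1
    B = {}
    for i, c in enumerate(pattern):      # bit m-1-i marks pattern[i]
        B[c] = B.get(c, 0) | (1 << (m - 1 - i))
    occurrences, pos = [], 0
    while pos <= n - m:
        j, last, D = m, m, mask          # D: set of live pattern factors
        while D:
            D &= B.get(text[pos + j - 1], 0)  # read window right to left
            j -= 1
            if D & (1 << (m - 1)):       # a pattern prefix ends here
                if j > 0:
                    last = j             # remember it for the shift
                else:
                    occurrences.append(pos)
            D = (D << 1) & mask
        pos += last
    return occurrences
```

After scanning a window, `last` is the start of the longest pattern prefix seen inside it, so shifting by `last` can never skip an occurrence; in the best case the whole window is skipped at once.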


Combinatorial Pattern Matching | 1992

A variation on the Boyer-Moore algorithm

Thierry Lecroq

String matching consists in finding all the occurrences of a word w in a text t. Several algorithms have been found for solving this problem; they are presented by Aho in a recent book [1]. Among these algorithms, the Boyer-Moore approach [5, 11] seems to lead to the fastest algorithms for the search phase. Even if the original version of the Boyer-Moore algorithm has a quadratic worst case, its behavior in practice seems to be sublinear. Furthermore, other authors [9, 2] have improved this worst-case time complexity for the search phase so that it becomes linear in the length of the text. The best bound for the number of letter comparisons is due to Apostolico and Giancarlo [2] and is 2n - m + 1, where n is the length of the text and m the length of the word. Another particularity of the Boyer-Moore algorithm is that the study of its complexity is not obvious; see [10, 7]. Basically, the Boyer-Moore algorithm tries to find, for a given position in the text, the longest suffix of the word which ends at that position. A new approach is, for a given position in the text, to compute the length of the longest prefix of the word which ends at that position. Once this length is known, we can compute a better shift than the Boyer-Moore approach. In a first version we make a new attempt at matching, forgetting all the previous prefixes matched. This leads to a very simple algorithm, but it has a quadratic worst-case running time. In an improved version we memorize the position where the previous longest prefix found ends, and we make a new attempt at matching only the number of characters corresponding to the complement of this prefix. We are then able to compute a shift without reading backwards again more than half the characters of the prefix found in the previous attempt. This leads to a linear-time algorithm which scans each text character at most three times.
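The first, quadratic version described above can be sketched directly: at each attempt, compute the longest prefix of the word ending at the window's right end and shift past every alignment that such a prefix rules out. This hypothetical sketch omits the memorization that the improved version uses to reach linear time.

```python
def prefix_shift_search(pattern, text):
    """Quadratic prefix-based variation on Boyer-Moore (simple version).

    `end` is the index just past the current window. The longest prefix
    of the pattern ending at `end` has length k, so no occurrence can
    end strictly between `end` and `end + m - k`.
    """
    m, n = len(pattern), len(text)
    if m == 0 or n < m:
        return []
    occurrences, end = [], m
    while end <= n:
        k = 0                          # longest pattern prefix ending at end
        for l in range(1, m + 1):
            if text[end - l:end] == pattern[:l]:
                k = l
        if k == m:
            occurrences.append(end - m)
            end += 1                   # conservative shift after a match
        else:
            end += m - k               # no occurrence can end in between
    return occurrences
```

The shift m - k is safe because an occurrence ending any earlier would force a pattern prefix longer than k to end at `end`, contradicting the maximality of k.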
