PaREM: A Novel Approach for Parallel Regular Expression Matching (CSE-2014, ©IEEE)
Suejb Memeti and Sabri Pllana
Department of Computer Science, Linnaeus University, 351 95 Växjö, Sweden
{suejb.memeti, sabri.pllana}@lnu.se

Abstract—Regular expression matching is essential for many applications, such as finding patterns in text, exploring substrings in large DNA sequences, or lexical analysis. However, sequential regular expression matching may be time-prohibitive for large problem sizes. In this paper, we describe a novel algorithm for parallel regular expression matching via deterministic finite automata. Furthermore, we present our tool PaREM that accepts regular expressions and finite automata as input and automatically generates the corresponding code for our algorithm that is amenable for parallel execution on shared-memory systems. We evaluate our parallel algorithm empirically by comparing it with a commonly used algorithm for sequential regular expression matching. Experiments on a dual-socket shared-memory system with 24 physical cores show speed-ups of up to 21× for 48 threads.

Index Terms—parallel processing, multi-core, regular expression, finite automata
I. INTRODUCTION
There are many relevant applications of regular expression matching (REM) and finite automata (FA), including DNA sequence matching [1], network intrusion detection [2], and information extraction from web-based documents [3]. The computational complexity of pattern finding grows with the number of states of the automaton and the size of the input. While the stagnation in processor clock rates promises no performance increases for sequential implementations of REM, the availability of affordable multicore processors provides opportunities for significant improvement. For instance, the recently introduced Intel® Xeon® Processor E7-8890 v2, manufactured at 22 nm, comprises 15 physical cores and supports 30 threads, or so-called logical cores. Shared-memory systems with up to eight processors of this type are feasible, which would lead to a system with 240 logical cores. To exploit these powerful systems, scalable parallel REM implementations are required.

Programming and solving problems within automata theory is a relatively complex and time-consuming process, and still the results may not be reliable because of the chance of an incorrect FA representation. Furthermore, efficient parallel programming of multicore systems is complex; this issue is known in the literature as the "programmability wall" [4]. Democratization of parallel REM would benefit from tools that hide parallel programming from the end-user and automatically generate a correct parallel implementation that is ready for compilation and efficient execution.

Various approaches for increasing the performance of REM evaluation have been proposed. For instance, Maine [5] is a library for data-parallel FA, which formalizes the evaluation of a FA as a matrix multiplication. Holub and Stekr [6] propose an algorithm for parallel execution of synchronized deterministic finite automata (DFA).
Yang and Prasanna [7] introduce an approach that uses segmentation for regular expression evaluation via nondeterministic finite automata (NFA). In [8] the authors propose a range-coalesced representation of the transition table to optimize the cost of the transition table lookup for each active state. While there are model-to-text generators (such as Acceleo [9]) and RE to NFA/DFA converters (such as JFLAP [10]), to the best of our knowledge there are no automatic parallel code generators for RE or FA.

In this paper, we describe a novel algorithm for Parallel Regular Expression Matching (PaREM) that scales gracefully for various problem sizes and numbers of threads. The algorithm was devised to be efficient for general automata independently of the number of states, and for a large spectrum of input text sizes. Our algorithm is optimized to make very accurate speculations on the possible initial states for each of the sub-inputs (split among the available processing units), instead of calculating the possible routes considering each state of the automaton as an initial state. This method is most effective when the adjacency matrix (used for the graph representation of the automaton) is sparse, although it shows major improvements for dense matrices as well. To ease access to the proposed parallel algorithm for a broad spectrum of users (including users without a background in parallel programming), we have developed our tool PaREM, which can automatically transform a Regular Expression (RE) or FA into the corresponding code (C++ and OpenMP) for our algorithm that is amenable for parallel execution on shared-memory systems.
Experimental results on a dual-socket shared-memory system with 24 physical cores show a close to linear speedup compared to the sequential implementation for problem sizes comparable to the cache size, and significant speedup for larger problem sizes that use further levels of the memory hierarchy.

The main contributions of this paper include:
• A scalable algorithm for parallel regular expression matching;
• The PaREM tool that automatically generates parallel code from a given regular expression or finite automaton;
• An empirical evaluation of the proposed parallel algorithm and the PaREM tool using a modern dual-socket shared-memory system with 24 physical cores.

The rest of the paper is organized as follows. Section II provides background information on regular expressions and finite automata and presents our parallel algorithm. Section III describes the implementation of the PaREM tool, and Section IV the corresponding experimental evaluation. The work described in this paper is compared and contrasted with related work in Section V. Section VI provides a summary of our work and a description of future work.

II. METHODOLOGY
A. Background
A regular expression is a string for describing search patterns. A finite automaton is a graph-based way of specifying patterns [11]. Finite automata and regular expressions may be used in pattern finding algorithms.

A Deterministic Finite Automaton (DFA) is a quintuple (Q, Σ, δ, q0, F), where Q is a finite set of states, Σ is a set of symbols (the alphabet), δ : Q × Σ → Q is the transition function, q0 is the initial state, and F is the set of final states [11] [12]. A DFA operates in the following manner: when a program starts, the current state is assumed to be the initial state q0; on each input character the automaton moves from the current state to another state (possibly itself). When the input reaches the last character, the string is accepted if and only if the current state is in the set of final states. It is called deterministic because in each state and for each input symbol a unique transition is defined.

A Nondeterministic Finite Automaton (NFA) is defined by the quintuple (Q, Σ, δ, q0, F) as for a DFA, except that the alphabet may contain an empty symbol and the transition function returns a set of states rather than a single state. It is called nondeterministic because of the choice of moves that may lead from one state to another.

B. Parallel REM Algorithm
Existing approaches for parallel REM (such as [6]) split the input into smaller substrings among all or a selected number of processing units, run the automaton on each of them, and join the sub-results. While other approaches calculate the possible initial states from each state of the automaton, our algorithm takes a step ahead by excluding all the states for which the automaton has no outgoing or incoming transitions on the relevant characters. Calculating the possible routes from each state of the automaton becomes time-consuming and memory-expensive for large finite automata.

The basic idea of sequential REM via a DFA is that one starts from q0 and after n (input length) steps another state from the set Q is reached. Its time complexity depends only on the input length.

Our algorithm is based on domain decomposition, which means it slices the input into p parts (see Algorithm 1), where p is the number of processing units (line 3). For each P_i, the set of possible initial states R is determined as the intersection R = S ∩ L (lines 5–15). S is the set of source states for the first character of T_{P_i} (that is, the input slice for this specific processor), where q_i ∈ S if δ(q_i, first character of T_{P_i}) is defined. L is the set of states reachable on the last character of T_{P_{i-1}}, that is, L = { δ(q_i, last character of T_{P_{i-1}}) }. Each chunk of the input is mapped to a processing unit, and each processing unit is responsible for finding the possible initial states for its own chunk of the input. The processing unit with ID = 0 already knows its initial state, namely q0, so a calculation for determining the possible initial states is not necessary. For each state in R, a REM is done and the result is stored in I (lines 16–25).

When all processors have finished their jobs, a binary reduction of the partial results is performed. The reduction is done by connecting the last active state of P_i to the first active state of P_{i+1}. The connection is accepted only if a transition from the last active state of P_i to the first active state of P_{i+1} exists on the first character of the sub-input of the next processor. An input is accepted only if for each processor there exists a sub-route that can be connected with the results of the previous and next processors, and the last state of the automaton is a member of the final state set. The worst-case scenario occurs when all the states have the same incoming and outgoing transitions.

C. Description of the PaREM Algorithm with an Example
To show how the possible initial states are determined, the following example for the automaton in Fig. 1 is used. Let T be an input string, T = "plaraparallelapareparapl", and assume that we will use four processing units (that is, threads).

Fig. 1: Automaton A for matching the pattern "parallel".

Algorithm 1 Parallel Regular Expression Matching (PaREM)
% Input: transition table Tt, set of final states F, input T %
% Output: result of REM %
 1: I = vector(p)   /* initialize final result vector */
 2: % P_0 ... P_p: processing units, p is the total number of processing units %
 3: for P_0, P_1, ..., P_p do in parallel
 4:   start_position = i * (T.length / p); pi_input = substring(start_position, T.length / p)
    % start: find possible initial states %
 5:   for q_0, q_1, ..., q_n do
 6:     if (Tt[q_i][pi_input.at(0)] ∈ Q) then   % pi_input.at(0): first char of pi_input %
 7:       S[i] = q_i
 8:     end if
 9:   end for
10:   for q_0, q_1, ..., q_n do
11:     if (Tt[q_i][pi_input.back()] ∈ Q) then  % pi_input.back(): last char of pi_input %
12:       L[i] = Tt[q_i][pi_input.back()]
13:     end if
14:   end for
15:   R = S ∩ L   % intersection of possible initial and last states %
    % end: find possible initial states %
16:   for r ∈ R do
17:     Rr = vector(pi_input.length())
18:     for char ∈ pi_input do
19:       if (Tt[r][char] ∈ F) then
20:         found++
21:       end if
22:       Rr[i] = r = Tt[r][char]
23:     end for
24:     I[i].push_back(Rr)
25:   end for
26: end for
% wait for the slowest processor %
% perform a reduction of I %

The transition table corresponding to the automaton in Fig. 1 is shown in Table I. The transition table for this automaton is dense, which produces a dense adjacency matrix.

TABLE I: Transition table for the automaton in Fig. 1

δ | p a r e l
0 | 1 0 0 0 0
1 | 1 2 0 0 0
2 | 1 0 3 0 0
3 | 1 4 0 0 0
4 | 1 0 0 0 5
5 | 1 0 0 0 6
6 | 1 0 0 7 0
7 | 1 0 0 0 8
8 | 1 0 0 0 0

The input length is 24 characters, so when split among the processing units we get four substrings of six characters (P_0 = "plarap", P_1 = "aralle", P_2 = "lapare", and P_3 = "parapl"). Table II shows the possible initial states found for each processor's input, and the visited states starting from each of the possible initial states. In this example, each state has exactly the same number of outgoing transitions, which means there is a transition from each state for each symbol of the alphabet.

TABLE II: Possible initial states for P_0, P_1, P_2 and P_3

      S ∩ L    Visited states
P_0   {0}      0: 1, 0, 0, 0, 0, 1
P_1   {1}      1: 2, 3, 4, 5, 6, 7
P_2   {0, 7}   0: 0, 0, 1, 2, 3, 0;   7: 8, 0, 1, 2, 3, 0
P_3   {0, 7}   0: 1, 2, 3, 4, 1, 0;   7: 1, 2, 3, 4, 1, 0

The set of possible initial states R is equal to the set of states L reached on the last character of the input string of the previous processor, because S is equal to the set of all states; therefore, R = S ∩ L = L. This applies only to dense transition tables, because from each state, on any symbol, it is possible to move to another state (including itself). In practice, most DFAs produce a sparse transition table. In sparse transition tables, the set of states S reachable on the first character of the input string mapped to the processing unit is determined by the outgoing transitions of the states for that specific character. We treat each matrix as sparse, which is why R = S ∩ L. It is possible to identify a sparse matrix, but inspecting each element of a large matrix for emptiness may be time-consuming.

In Table I, the sets S and L for P_2 are as follows: S is the set of source states for which a transition exists on "l" (the first character of the input mapped to P_2), and L is the set of unique destination states reachable on "e" (the last character of the input string mapped to P_1).

The general enumeration approach of REM algorithms calculates possible routes (moving from one state to another) considering each state of the automaton as an initial state. In this example, the enumeration approach would have performed 3 × 9 + 1 = 28 calculations (three processing units (P_1, P_2 and P_3) would start from all nine possible states, and P_0 would start from state q_0). Our algorithm performs only five calculations for this example, and this number becomes even lower for sparse transition tables.
If the input of processing unit P_{i-1} ended with "l", there would be four possible initial states (0, 5, 6, 8). The worst-case scenario occurs when each of the sub-inputs ends with "l"; in that case 3 × 4 + 1 = 13 calculations are performed for this dense matrix, which is still an improvement of 2.15 = (3 × 9 + 1) / (3 × 4 + 1) compared to the general approach.

III. IMPLEMENTATION
Fig. 2 depicts our PaREM tool, which takes as input a RE or a FA and generates the corresponding C++ code representation of the given RE or FA. The generated C++ code includes OpenMP [13] directives and routines and is in accordance with our Algorithm 1. In the process of implementing PaREM, we have specified a context-free grammar to define the language that accepts regular expressions as input. Table III lists the operators accepted by PaREM's context-free language.

Fig. 2: The use of the PaREM tool for translating regular expressions into equivalent finite automata (NFA, then DFA) and generating source code (C++ and OpenMP) that represents the same given RE or FA.

TABLE III: PaREM's Accepted Regular Expression Operators

Operator   Name              Description
ab         Concatenation     b right after a
a*         Kleene Star       zero or more a's
a|b        Union             either a or b
a+         Positive closure  one or more a's
[0..9]     Range             either 0, 1, ... or 9
a?         Optionality       zero or one a
(ab|c)*    Group             zero or more of either ab's or c's

The Kleene Star denotes zero or more occurrences of a symbol or sub-expression (for instance, φ, a, aa, aaa, where φ is an empty transition). The NFA representation of the Kleene Star is shown in Fig. 4d. The Positive Closure, also known as Repetition, is an extended operator of the Kleene Star, which denotes one or more occurrences of a symbol or sub-expression (for instance, a+, Fig. 4f, is equal to aa*, which yields the possibilities a, aa, aaa, ...).

The Union operator (represented as an NFA in Fig. 4c), expressed by a vertical bar, provides the possibility to choose between two or more sub-expressions (such as a or b). The Range operator (defined based on ASCII code order), or Character Class, is an extended operator of Union: instead of writing 0|1|...|9, the Range operator [0..9] can be used. It applies to integers and characters.

The Optionality operator (shown as an NFA in Fig. 4e) denotes zero or one occurrence of a symbol or sub-expression (for instance, a? = φ|a). The Group operator is introduced to change operator precedence. For instance, a|b* and (a|b)* produce different results: in the first example the Kleene Star operator has priority over the Union operator, while in the second example the Union operator has the higher priority. By combining these operations (using the Concatenation operator, Fig. 4b) arbitrarily complex regular expressions can be written.

For each RE a specific Abstract Syntax Tree (AST) is generated that represents the abstract syntactic structure of the RE. For easier translation into a target structure, additional details (such as the node type) have been added to the AST. The generated AST can have an arbitrary number of sub-trees, which in essence are ASTs [14]. Fig. 3 shows an example of how an AST is constructed for a given RE. Dashed-line compartments indicate the sub-trees. The priority of the Union operator over the Quantifier operator in the sub-expression "(a|b)?" is depicted in Fig. 3. The deeper an operator is in the AST hierarchy, the higher priority it has.

Fig. 3: Abstract Syntax Tree representation for the RE (a|b)?c*[0..9]b+.

We transform the AST into an NFA graph using the McNaughton-Yamada-Thompson algorithm. To preserve operator priority, a depth-first traversal of the tree is performed while constructing the NFA graph. Each of the sub-expressions creates a sub-graph, and the sub-graphs are merged into the main graph using empty transitions. Removing the unnecessary empty transitions further optimizes the final NFA. The optimized NFA for the RE example in Fig. 3 is shown in Fig. 4g. Fig. 4a–4f depicts the transformation process for each operator from the RE (or AST) into an equivalent NFA.

Fig. 4: Transformation of RE operators into NFAs. (a) a; (b) ab; (c) a|b; (d) a*; (e) a?; (f) a+; (g) (a|b)?c*[0..9]b+.

Using the Subset Construction Algorithm [15], the optimized NFA is converted into an equivalent DFA. During this transformation, PaREM creates a log file with the transition table. Theoretically, the DFA's number of states may have an exponential relationship to the NFA's number of states, which leads to the well-known state explosion issue. However, most real-world NFAs produce a DFA with approximately the same number of states.

Finally, from the DFA we generate executable source code that implements the REM for the corresponding DFA [14] [16]. There are different possible ways of representing a DFA; we have selected two forms: (1) if-else statements, and (2) graphs.

The if-else approach is a straightforward way of implementing a DFA. This approach creates an if-statement for each transition of the automaton. However, this approach is not recommended for large automata. The if-else approach provides a sequential solution for regular expression matching.

The graph-based approach provides an easy way to add/remove transitions or states in the automaton, and consequently reduces the risk of an incorrect representation of the automaton. For the graph-based representation in the source code, we have used an adjacency matrix, which represents the transition table. This approach has faster lookups to check for the presence or absence of a specific transition, compared to the adjacency list representation of the automaton. The graph-based solution provides the implementation of the parallel regular expression matching algorithm presented in this paper.

IV. EXPERIMENTAL EVALUATION

TABLE IV: System Configuration

Operating System   CentOS 6.2 (Linux kernel 2.6.32)
Processor          2 × Intel® Xeon® Processor E5-2695 v2 (2.40 GHz, 30 MB Cache, 12 Cores)
RAM                × GB
OpenMP             3.1
For experimental purposes, an automaton that finds all occurrences of the word "parallel" has been implemented, which results in an automaton with nine states (shown in Fig. 1) and an alphabet of five characters. Table IV lists the major features of the experimentation platform. We use a shared-memory system with two 12-core Intel® Xeon® processors of type E5-2695 v2 for the evaluation of our approach. Each of the 12 physical cores supports two threads (also known as logical cores). In total, our system has 24 physical cores or 48 logical cores.

Fig. 5a–5e depicts the performance results for five problem sizes and various numbers of threads. Each experiment has been repeated 20 times to address random performance fluctuations. The string length determines the problem size, and in our experiments we used five strings of the following lengths: 6.69e+07, 1.34e+08, 2.68e+08, 5.36e+08 and 1.07e+09.

Execution times are shown in Fig. 5a–5e, whereas the speedup is depicted in Fig. 5f. The speedup for the smallest input length (6.69e+07 characters) in our set of experiments closely follows the linear speedup up to 24 threads (Fig. 5f). For larger input lengths, we may observe noteworthy speedup improvements for 24 and 48 threads. Considering all experiments, the highest speedup of 21.08× was achieved for an input length of 6.69e+07 characters and 48 threads.

Fig. 5: Performance results for five input strings of lengths 6.69e+07, 1.34e+08, 2.68e+08, 5.36e+08 and 1.07e+09. Execution times are shown in (a)–(e), the speedup in (f).

TABLE V: Influence of input length on cache misses and speedup for 24 and 48 threads

              24 threads                48 threads
Input Length  Cache Misses   Speedup   Cache Misses   Speedup
              [10^6]                   [10^6]
6.69e+07      36.34          19.32     36.76          21.08
1.34e+08      70.15          10.44     71.07          17.27
2.68e+08      167.57         7.87      140.57         11.81
5.36e+08      339.26         7.18      367.07         9.99
1.07e+09      681.71         5.62      716.02         6.69

Table V shows the influence of input length on cache misses and speedup. We varied the input length using 24 and 48 threads. With increasing input length, the number of cache misses increases and the speedup decreases. For the smallest input length in our set of experiments (6.69e+07 characters), which largely fits in the available cache, using 24 threads the number of cache misses is 36.34e+06 and the speedup is 19.32×. For the largest input length (1.07e+09) we obtained 681.71e+06 cache misses and a speedup of 5.62×.

The obtained cache misses for 48 threads are comparable to those for 24 threads (see Table V). For the smallest input length the number of cache misses is 36.76e+06 and the speedup is 21.08×. For the largest input length (1.07e+09) we obtained 716.02e+06 cache misses and a speedup of 6.69×. We may observe that for all tested input lengths there is a speedup gain when 48 logical cores (hyper-threading) are used compared to 24 physical cores.

A. Performance Comparison of the PaREM Algorithm with the General Enumeration Approach
The main difference between the PaREM algorithm and the General Enumeration Approach (Enum) proposed in [6] is the way the set of possible initial states for each chunk of the input string is speculated. While the Enum algorithm for general DFAs considers all the states of the automaton as initial states, the PaREM algorithm finds the most accurate initial states. Compared to PaREM, which requires only five calculations to find the correct path, the Enum algorithm requires 28 calculations to find the correct initial states for the example described in Section II.C.

We have run the experiment from Section II.C with the same input sizes and numbers of threads for the General Enumeration Approach as well. Fig. 6a–6e depicts the impact of finding the most accurate initial states on the execution time. The sequential version (running in one thread) is the same for both algorithms, because both start the calculations from state q0 on processing unit P_0. The Enumeration Approach requires more calculations for finite automata with a larger number of states, a larger input size, and a higher number of processing units.

The execution time advantage of PaREM over the Enumeration Approach increases as we increase either the input size or the number of threads: the largest advantage is achieved for the largest number of threads (48) and the biggest problem size (1.07e+09), and the smallest advantage for the smallest number of threads (6) and the smallest input size (6.69e+07).

Fig. 6: Comparison between the PaREM algorithm and the General Enumeration Approach for the five input lengths (a–e).

V. RELATED WORK
Holub and Stekr [6] propose an approach for parallel REM via DFA by splitting the input string into small chunks and running these chunks on separate cores; however, due to the pre-calculation of initial states for each sub-input, this is not efficient for general DFAs. Their algorithm runs efficiently for a specific type of DFA, so-called synchronizing automata, and relies on the input automaton being k-local.

Yang and Prasanna [7] propose the segmentation of regular expressions and perform the REM evaluation via nondeterministic finite automata. Their major aim is to optimize the use of the memory hierarchy in the case of automata with many states and a large transition table. In contrast to our approach, the authors of [7] focus on large automata but do not specifically address algorithmic optimizations with respect to large input strings.

Mytkowicz and Schulte [8] propose an approach that exploits SIMD, instruction-level and thread-level parallelism in the context of finite state machine computations. To increase the opportunities for data-parallelism, the authors devised a method for breaking data-dependencies with enumeration. This approach is not based on speculation with respect to initial state determination.

Kumar et al. [17] address the issue of large-scale finite automata (also known as the state explosion problem) by splitting regular expressions into two parts: (1) a prefix that contains the frequently visited parts of the automaton, and (2) a suffix that is the rest of the automaton. The aim is to have a small DFA for the frequently accessed parts of the automaton that fits in cache memory.

Luchaup et al. [18] propose an approach of finding the correct initial state by speculation. They argue that guessing the state of the DFA at a certain position has a very good chance of reaching the correct state after a few steps, because DFA-based scanning in network intrusion detection spends most of its time in a few hot states. They validate these guesses using a history of speculated states.
In comparison to our algorithm, the convergence of the guessed state and the correct state is not guaranteed. Furthermore, if a thread does not converge on its sub-input, then the next thread is forced to start from a new state, which limits the scalability [8].

Our algorithm is based on splitting the input into smaller sub-inputs (domain decomposition); however, we have devised a method to bypass the need for pre-calculation of all initial states by finding the most accurate possible initial states. Our approach is not limited to a particular type of DFA, and is efficient for a large spectrum of input sizes.

In contrast to the related work, our tool is capable of automatically generating ready-to-compile-and-execute code for shared-memory systems, taking a RE or FA as input.

VI. SUMMARY AND FUTURE WORK
Regular expression matching is essential for many applications such as lexical analysis, data mining [19], or network security. We have presented a parallel algorithm for regular expression matching that is based on our improved speculative determination of initial states. Our tool PaREM automatically transforms any regular expression or finite automaton into the corresponding parallel code (C++ and OpenMP), and consequently eases access to the proposed parallel algorithm for users without a background in parallel programming. Preliminary experimental results show that the performance of our algorithm scales gracefully for various string lengths and numbers of threads. For an input string of 6.69e+07 characters, we obtained a speedup of 21.08× with 48 threads.

In the future, we plan to evaluate our approach for other types of problems, such as DNA sequencing or Network Intrusion Detection Systems. We also plan to extend our implementation to heterogeneous systems.

REFERENCES

[1] A. Nowzari-Dalini, E. Elahi, H. Ahrabian, and M. Ronaghi, "A new DNA implementation of finite state machines," IJCSA, vol. 3, no. 1, pp. 51–60, 2006.
[2] A. BabuKaruppiah and S. Rajaram, "Deterministic finite automata for pattern matching in FPGA for intrusion detection," in Computer, Communication and Electrical Technology (ICCCET), 2011 International Conference on, March 2011, pp. 167–170.
[3] R. Kosala, M. Bruynooghe, J. V. den Bussche, and H. Blockeel, "Information extraction from web documents based on local unranked tree automaton inference," in IJCAI, G. Gottlob and T. Walsh, Eds. Morgan Kaufmann, 2003, pp. 403–408.
[4] S. Pllana, S. Benkner, E. Mehofer, L. Natvig, and F. Xhafa, "Towards an intelligent environment for programming multi-core computing systems," in Euro-Par Workshops, ser. Lecture Notes in Computer Science, vol. 5415. Springer, 2008, pp. 141–151.
[5] T. Mytkowicz and W. Schulte, "Maine: A library for data parallel finite automata," Tech. Rep. MSR-TR-2012-62, July 2012. [Online]. Available: http://research.microsoft.com/apps/pubs/default.aspx?id=168379
[6] J. Holub and S. Stekr, "On parallel implementations of deterministic finite automata," in CIAA, ser. Lecture Notes in Computer Science, S. Maneth, Ed., vol. 5642. Springer, 2009, pp. 54–64.
[7] Y.-H. E. Yang and V. K. Prasanna, "Optimizing regular expression matching with SR-NFA on multi-core systems," in PACT.
[11] Foundations of Computer Science: C Edition, ser. Principles of Computer Science Series. W. H. Freeman, 1994.
[12] J. Hopcroft and J. D. Ullman, Introduction to Automata Theory, Languages and Computation.
[14] Compilers: Principles, Techniques and Tools (for Anna University), 2/e. Pearson Education India.
[15] C.-H. Chang and R. Paige, "From regular expressions to DFA's using compressed NFA's," Theor. Comput. Sci., vol. 178, no. 1-2, pp. 1–36, 1997.
[16] A. Arora and A. S. Bansal, Comprehensive Computer and Languages. Laxmi Publications, 2005.
[17] S. Kumar, B. Chandrasekaran, J. Turner, and G. Varghese, "Curing regular expressions matching algorithms from insomnia, amnesia, and acalculia," in ANCS, R. Yavatkar, D. Grunwald, and K. K. Ramakrishnan, Eds. ACM, 2007, pp. 155–164.
[18] D. Luchaup, R. Smith, C. Estan, and S. Jha, "Speculative parallel pattern matching," IEEE Transactions on Information Forensics and Security, vol. 6, no. 2, pp. 438–451, 2011.
[19] R. Trasarti, F. Bonchi, and B. Goethals, "Sequence mining automata: A new technique for mining frequent sequences under regular expressions."