PaREM: A Novel Approach for Parallel Regular Expression Matching (CSE-2014, ©IEEE)
Suejb Memeti and Sabri Pllana
Department of Computer Science, Linnaeus University, 351 95 Växjö, Sweden
{suejb.memeti, sabri.pllana}@lnu.se

Abstract—Regular expression matching is essential for many applications, such as finding patterns in text, exploring substrings in large DNA sequences, or lexical analysis. However, sequential regular expression matching may be time-prohibitive for large problem sizes. In this paper, we describe a novel algorithm for parallel regular expression matching via deterministic finite automata. Furthermore, we present our tool PaREM that accepts regular expressions and finite automata as input and automatically generates the corresponding code for our algorithm that is amenable for parallel execution on shared-memory systems. We evaluate our parallel algorithm empirically by comparing it with a commonly used algorithm for sequential regular expression matching. Experiments on a dual-socket shared-memory system with 24 physical cores show speed-ups of up to 21× for 48 threads.

Index Terms—parallel processing, multi-core, regular expression, finite automata
I. INTRODUCTION
There are many relevant applications of regular expression matching (REM) and finite automata (FA), including DNA sequence matching [1], network intrusion detection [2], and information extraction from web-based documents [3]. The computational complexity of pattern finding grows with the number of states of the automaton and the size of the input. While the stagnation in processor clock rates promises no performance increases for sequential implementations of REM, the availability of affordable multicore processors provides opportunities for significant improvement. For instance, the recently introduced Intel® Xeon® Processor E7-8890 v2, manufactured at 22 nm, comprises 15 physical cores and supports 30 threads, or so-called logical cores. Shared-memory systems with up to eight processors of this type are feasible, which would lead to a system with 240 logical cores. To exploit these powerful systems, scalable parallel REM implementations are required.

Programming and solving problems within automata theory is a relatively complex and time-consuming process, and still the results may not be reliable because of the chance of an incorrect FA representation. Furthermore, efficient parallel programming of multicore systems is complex; this issue is known in the literature as the "programmability wall" [4]. Democratization of parallel REM would benefit from tools that hide parallel programming from the end-user and automatically generate a correct parallel implementation that is ready for compilation and efficient execution.

Various approaches for increasing the performance of REM evaluation have been proposed. For instance, Maine [5] is a library for data-parallel FA, which formalizes the evaluation of a FA as a matrix multiplication. Holub and Stekr [6] propose an algorithm for parallel execution of synchronized deterministic finite automata (DFA).
Yang and Prasanna [7] introduce an approach that uses segmentation for regular expression evaluation via nondeterministic finite automata (NFA). In [8] the authors propose a range-coalesced representation of the transition table to optimize the cost of the transition table lookup for each active state. While there are model-to-text generators (such as Acceleo [9]) and RE to NFA/DFA converters (such as JFLAP [10]), to the best of our knowledge there are no automatic parallel code generators for RE or FA.

In this paper, we describe a novel algorithm for Parallel Regular Expression Matching (PaREM) that scales gracefully for various problem sizes and numbers of threads. The algorithm was devised to be efficient for general automata independently of the number of states, and for a large spectrum of input text sizes. Our algorithm is optimized to make very accurate speculations on the possible initial states for each of the sub-inputs (split among the available processing units), instead of calculating the possible routes considering each state of the automaton as an initial state. This method is most effective when the adjacency matrix (used for the graph representation of the automaton) is sparse, although it shows major improvements for dense matrices as well. To ease access to the proposed parallel algorithm for a broad spectrum of users (including users without a background in parallel programming), we have developed our tool PaREM, which can automatically transform a Regular Expression (RE) or FA into the corresponding code (C++ and OpenMP) for our algorithm that is amenable for parallel execution on shared-memory systems.
Experimental results on a dual-socket shared-memory system with 24 physical cores show a close to linear speedup compared to the sequential implementation for problem sizes comparable to the cache size, and significant speedup for larger problem sizes that use further levels of the memory hierarchy.

The main contributions of this paper include:
• A scalable algorithm for parallel regular expression matching;
• The PaREM tool that automatically generates parallel code from a given regular expression or finite automaton;
• An empirical evaluation of the proposed parallel algorithm and the PaREM tool using a modern dual-socket shared-memory system with 24 physical cores.

The rest of the paper is organized as follows. Section II provides background information on regular expressions and finite automata and presents our parallel algorithm. Section III describes the implementation of the PaREM tool, and Section IV the corresponding experimental evaluation. The work described in this paper is compared and contrasted with related work in Section V. Section VI provides a summary of our work and a description of future work.

II. METHODOLOGY
A. Background
A regular expression is a string for describing search patterns. A finite automaton is a graph-based way of specifying patterns [11]. Finite automata and regular expressions may be used in pattern finding algorithms.

A Deterministic Finite Automaton (DFA) is a quintuple (Q, Σ, δ, q0, F), where Q is a finite set of states, Σ is a set of symbols (the alphabet), δ : Q × Σ → Q is the transition function, q0 is the initial state, and F is the set of final states [11] [12]. A DFA operates in the following manner: when a program starts, the current state is assumed to be the initial state q0; on each input character the automaton moves from the current state to another state (possibly itself). When the input reaches the last character, the string is accepted if and only if the current state is in the set of final states. It is called deterministic because in each state and for each input symbol a unique transition is defined.

A Nondeterministic Finite Automaton (NFA) is defined by the quintuple (Q, Σ, δ, q0, F) as for a DFA, except that the alphabet may contain an empty symbol and the transition function returns a set of states rather than a single state. It is called nondeterministic because of the choice of moves that may lead from one state to another.

B. Parallel REM Algorithm
Existing approaches for parallel REM (such as [6]) split the input into smaller substrings among all or a selected number of processing units, run the automaton on each of them, and join the sub-results. While other approaches calculate the possible initial states from each state of the automaton, our algorithm takes a step ahead by excluding all the states for which the automaton has no outgoing or incoming transitions on the relevant characters. Calculating the possible routes from each state of the automaton becomes time-consuming and memory-expensive for large finite automata.

The basic idea of sequential REM via a DFA is that one starts from q0 and after n (input length) steps another state from the set Q is reached. Its time complexity depends only on the input length.

Our algorithm is based on domain decomposition, which means it slices the input into p parts (see Algorithm 1), where p is the number of processing units (line 3). For each P_i, the set of possible initial states R is determined as the intersection R = S ∩ L (lines 5–15). S is the set of source states for the first character of T_{P_i} (that is, the input slice for this specific processor), where q_i ∈ S if δ(q_i, first character of T_{P_i}) is defined. L is the set of states reachable on the last character of T_{P_{i-1}}, that is, L = { δ(q_i, last character of T_{P_{i-1}}) }. Each chunk of the input is mapped to a processing unit, and each processing unit is responsible for finding the possible initial states for its own chunk of the input. The processing unit with ID = 0 already knows its initial state, namely q0, so a calculation for determining the possible initial states is not necessary. For each state in R, a REM is done and the result is stored in I (lines 16–25).

When all processors have finished their jobs, a binary reduction of the partial results is performed. The reduction is done by connecting the last active state of P_i to the first active state of P_{i+1}. The connection is accepted only if a transition from the last active state of P_i to the first active state of P_{i+1} exists on the first character of the sub-input of the next processor. An input is accepted only if for each processor there exists a sub-route that can be connected with the results of the previous and next processors, and the last state of the automaton is a member of the final state set. The worst-case scenario occurs when all the states have the same incoming and outgoing transitions.

C. Description of the PaREM Algorithm with an Example
To show how the possible initial states are determined, the following example for the automaton in Fig. 1 is used. Let T be an input string, T = "plaraparallelapareparapl", and assume that we will use four processing units (that is, threads).

Fig. 1: Automaton A for matching the pattern "parallel".

Algorithm 1 Parallel Regular Expression Matching (PaREM)
% Input: transition table Tt, set of final states F, input T %
% Output: result of REM %
 1: I = vector(p)   /* initialize final result vector */
 2: % P_0 ... P_p: processing units, p is the total number of processing units %
 3: for P_0, P_1, ..., P_p do in parallel
 4:   start_position = i * (T.length / p); pi_input = substring(start_position, T.length / p)
    % start: find possible initial states %
 5:   for q_0, q_1, ..., q_n do
 6:     if (Tt[q_i][pi_input.at(0)] ∈ Q) then   % pi_input.at(0): first char of pi_input %
 7:       S[i] = q_i
 8:     end if
 9:   end for
10:   for q_0, q_1, ..., q_n do
11:     if (Tt[q_i][pi_input.back()] ∈ Q) then  % pi_input.back(): last char of pi_input %
12:       L[i] = Tt[q_i][pi_input.back()]
13:     end if
14:   end for
15:   R = S ∩ L   % intersection of possible initial and last states %
    % end: find possible initial states %
16:   for r ∈ R do
17:     Rr = vector(pi_input.length())
18:     for char ∈ pi_input do
19:       if (Tt[r][char] ∈ F) then
20:         found++
21:       end if
22:       Rr[i] = r = Tt[r][char]
23:     end for
24:     I[i].push_back(Rr)
25:   end for
26: end for
% wait for the slowest processor %
% perform a reduction of I %

The transition table corresponding to the automaton in Fig. 1 is shown in Table I. The transition table for this automaton is dense, which produces a dense adjacency matrix.

TABLE I: Transition table for the automaton in Fig. 1

δ | p a r e l
0 | 1 0 0 0 0
1 | 1 2 0 0 0
2 | 1 0 3 0 0
3 | 1 4 0 0 0
4 | 1 0 0 0 5
5 | 1 0 0 0 6
6 | 1 0 0 7 0
7 | 1 0 0 0 8
8 | 1 0 0 0 0

The input length is 24 characters, so when split among the processing units we get four substrings of six characters (P_0 = "plarap", P_1 = "aralle", P_2 = "lapare", and P_3 = "parapl"). Table II shows the possible initial states found for each processor's input, and the visited states starting from each of the possible initial states. In this example, each state has exactly the same number of outgoing transitions, which means there is a transition from each state for each symbol of the alphabet.

TABLE II: Possible initial states for P_0, P_1, P_2 and P_3

      S ∩ L    Visited states
P_0   {0}      0: 1, 0, 0, 0, 0, 1
P_1   {1}      1: 2, 3, 4, 5, 6, 7
P_2   {0, 7}   0: 0, 0, 1, 2, 3, 0;   7: 8, 0, 1, 2, 3, 0
P_3   {0, 7}   0: 1, 2, 3, 4, 1, 0;   7: 1, 2, 3, 4, 1, 0

The set of possible initial states R is equal to the set of states L reached on the last character of the input string of the previous processor, because S is equal to the set of all states; therefore, R = S ∩ L = L. This applies only to dense transition tables, because from each state, on any symbol, it is possible to move to another state (including itself). In practice, most DFAs produce a sparse transition table. In sparse transition tables, the set of states S reachable on the first character of the input string mapped to the processing unit is determined by the outgoing transitions of the states for that specific character. We treat each matrix as sparse, which is why R = S ∩ L. It is possible to identify a sparse matrix, but inspecting each element of a large matrix for emptiness may be time-consuming.

In Table I, the sets S and L for P_2 are as follows: S is the set of source states for which a transition exists on "l" (the first character of the input mapped to P_2), and L is the set of unique destination states reachable on "e" (the last character of the input string mapped to P_1).

The general enumeration approach of REM algorithms calculates possible routes (moving from one state to another) considering each state of the automaton as an initial state. In this example, the enumeration approach would have performed 3 × 9 + 1 = 28 calculations (three processing units (P_1, P_2 and P_3) would start from all nine possible states, and P_0 would start from state q_0). Our algorithm performs only five calculations for this example, and this number becomes even lower for sparse transition tables.
If the input of processing unit P_{i-1} ended with "l", there would be four possible initial states (0, 5, 6, 8). The worst-case scenario occurs when each of the sub-inputs ends with "l"; in that case 3 × 4 + 1 = 13 calculations are performed for this dense matrix, which is still an improvement of 2.15 = (3 × 9 + 1) / (3 × 4 + 1) compared to the general approach.

III. IMPLEMENTATION
Fig. 2 depicts our PaREM tool, which takes as input a RE or a FA and generates the corresponding C++ code representation of the given RE or FA. The generated C++ code includes OpenMP [13] directives and routines and is in accordance with our Algorithm 1. In the process of implementing PaREM, we have specified a context-free grammar to define the language that accepts regular expressions as input. Table III lists the operators accepted by PaREM's context-free language.

Fig. 2: The use of the PaREM tool for translating regular expressions into equivalent finite automata (NFA, then DFA) and generating source code (C++ and OpenMP) that represents the same given RE or FA.

TABLE III: PaREM's Accepted Regular Expression Operators

Operator   Name              Description
ab         Concatenation     b right after a
a*         Kleene Star       zero or more a's
a|b        Union             either a or b
a+         Positive closure  one or more a's
[0..9]     Range             either 0, 1, ... or 9
a?         Optionality       zero or one a
(ab|c)*    Group             zero or more of either ab's or c's

The Kleene Star denotes zero or more occurrences of a symbol or sub-expression (for instance, φ, a, aa, aaa, where φ is an empty transition). The NFA representation of the Kleene Star is shown in Fig. 4d. The Positive Closure, also known as Repetition, is an extended operator of the Kleene Star, which denotes one or more occurrences of a symbol or sub-expression (for instance, a+, Fig. 4f, is equal to aa*, which yields the possibilities a, aa, aaa, ...).

The Union operator (represented as an NFA in Fig. 4c), expressed by a vertical bar, provides the possibility to choose between two or more sub-expressions (such as a or b). The Range operator (defined based on ASCII code order), or Character Class, is an extended operator of Union: instead of writing 0|1|...|9, the Range operator [0..9] can be used. It applies to integers and characters.

The Optionality operator (shown as an NFA in Fig. 4e) denotes zero or one occurrence of a symbol or sub-expression (for instance, a? = φ|a). The Group operator is introduced to change operator precedence. For instance, a|b* and (a|b)* produce different results: in the first example the Kleene Star operator has priority over the Union operator, while in the second example the Union operator has the higher priority. By combining these operations (using the Concatenation operator, Fig. 4b) arbitrarily complex regular expressions can be written.

For each RE a specific Abstract Syntax Tree (AST) is generated that represents the abstract syntactic structure of the RE. For easier translation into a target structure, additional details (such as the node type) have been added to the AST. The generated AST can have an arbitrary number of sub-trees, which in essence are ASTs [14]. Fig. 3 shows an example of how an AST is constructed for a given RE. Dashed-line compartments indicate the sub-trees. The priority of the Union operator over the Quantifier operator in the sub-expression "(a|b)?" is depicted in Fig. 3. The deeper an operator is in the AST hierarchy, the higher priority it has.

Fig. 3: Abstract Syntax Tree representation for the RE (a|b)?c*[0..9]b+.

We transform the AST into an NFA graph using the McNaughton-Yamada-Thompson algorithm. To preserve operator priority, a depth-first traversal of the tree is performed while constructing the NFA graph. Each of the sub-expressions creates a sub-graph, and the sub-graphs are merged into the main graph using empty transitions. Removing the unnecessary empty transitions further optimizes the final NFA. The optimized NFA for the RE example in Fig. 3 is shown in Fig. 4g. Fig. 4a–4f depicts the transformation process for each operator from the RE (or AST) into an equivalent NFA.

Fig. 4: Transformation of RE operators into NFAs. (a) a; (b) ab; (c) a|b; (d) a*; (e) a?; (f) a+; (g) (a|b)?c*[0..9]b+.

Using the Subset Construction Algorithm [15], the optimized NFA is converted into an equivalent DFA. During this transformation, PaREM creates a log file with the transition table. Theoretically, the DFA's number of states may have an exponential relationship to the NFA's number of states, which leads to the well-known state explosion issue. However, most real-world NFAs produce a DFA with approximately the same number of states.

Finally, from the DFA we generate executable source code that implements the REM for the corresponding DFA [14] [16]. There are different possible ways of representing a DFA; we have selected two forms: (1) if-else statements, and (2) graphs.

The if-else approach is a straightforward way of implementing a DFA. This approach creates an if-statement for each transition of the automaton. However, this approach is not recommended for large automata. The if-else approach provides a sequential solution for regular expression matching.

The graph-based approach provides an easy way to add/remove transitions or states in the automaton, and consequently reduces the risk of an incorrect representation of the automaton. For the graph-based representation in the source code, we have used an adjacency matrix, which represents the transition table. This approach has faster lookups to check for the presence or absence of a specific transition, compared to the adjacency list representation of the automaton. The graph-based solution provides the implementation of the parallel regular expression matching algorithm presented in this paper.

IV. EXPERIMENTAL EVALUATION

TABLE IV: System Configuration

Operating System   CentOS 6.2 (Linux kernel 2.6.32)
Processor          2 × Intel® Xeon® Processor E5-2695 v2 (2.40 GHz, 30 MB Cache, 12 Cores)
RAM                × GB
OpenMP             3.1
For experimental purposes, an automaton that finds all occurrences of the word "parallel" has been implemented, which results in an automaton with nine states (shown in Fig. 1) and an alphabet of five characters. Table IV lists the major features of the experimentation platform. We use a shared-memory system with two 12-core Intel® Xeon® processors of type E5-2695 v2 for the evaluation of our approach. Each of the 12 physical cores supports two threads (also known as logical cores). In total, our system has 24 physical cores or 48 logical cores.

Fig. 5a–5e depicts the performance results for five problem sizes and various numbers of threads. Each experiment has been repeated 20 times to address random performance fluctuations. The string length determines the problem size, and in our experiments we used five strings of the following lengths: 6.69e+07, 1.34e+08, 2.68e+08, 5.36e+08 and 1.07e+09.

Execution times are shown in Fig. 5a–5e, whereas the speedup is depicted in Fig. 5f. The speedup for the smallest input length (6.69e+07 characters) in our set of experiments closely follows the linear speedup up to 24 threads (Fig. 5f). For larger input lengths, we may observe noteworthy speedup improvements for 24 and 48 threads. Considering all experiments, the highest speedup of 21.08× was achieved for an input length of 6.69e+07 characters and 48 threads.

Fig. 5: Performance results for five input strings of lengths 6.69e+07, 1.34e+08, 2.68e+08, 5.36e+08 and 1.07e+09. Execution times are shown in (a)–(e), the speedup in (f).

TABLE V: Influence of input length on cache misses and speedup for 24 and 48 threads

              24 threads                48 threads
Input Length  Cache Misses   Speedup   Cache Misses   Speedup
              [10^6]                   [10^6]
6.69e+07      36.34          19.32     36.76          21.08
1.34e+08      70.15          10.44     71.07          17.27
2.68e+08      167.57         7.87      140.57         11.81
5.36e+08      339.26         7.18      367.07         9.99
1.07e+09      681.71         5.62      716.02         6.69

Table V shows the influence of input length on cache misses and speedup. We varied the input length using 24 and 48 threads. With increasing input length, the number of cache misses increases and the speedup decreases. For the smallest input length in our set of experiments (6.69e+07 characters), which largely fits in the available cache, using 24 threads the number of cache misses is 36.34e+06 and the speedup is 19.32×. For the largest input length (1.07e+09) we obtained 681.71e+06 cache misses and a speedup of 5.62×.

The obtained cache misses for 48 threads are comparable to those for 24 threads (see Table V). For the smallest input length the number of cache misses is 36.76e+06 and the speedup is 21.08×. For the largest input length (1.07e+09) we obtained 716.02e+06 cache misses and a speedup of 6.69×. We may observe that for all tested input lengths there is a speedup gain when 48 logical cores (hyper-threading) are used compared to 24 physical cores.

A. Performance Comparison of the PaREM Algorithm with the General Enumeration Approach
The main difference between the PaREM algorithm and the General Enumeration Approach (Enum) proposed in [6] is the way the set of possible initial states for each chunk of the input string is speculated. While the Enum algorithm for general DFAs considers all the states of the automaton as initial states, the PaREM algorithm finds the most accurate initial states. Compared to PaREM, which requires only five calculations to find the correct path, the Enum algorithm requires 28 calculations to find the correct initial states for the example described in Section II.C.

We have run the experiment from Section II.C with the same input sizes and numbers of threads for the General Enumeration Approach as well. Fig. 6a–6e depicts the impact of finding the most accurate initial states on the execution time. The sequential version (running in one thread) is the same for both algorithms, because both start the calculations from state q0 on processing unit P_0. The Enumeration Approach requires more calculations for finite automata with a larger number of states, a larger input size, and a higher number of processing units.

The execution time advantage of PaREM over the Enumeration Approach increases as we increase either the input size or the number of threads: the largest advantage is achieved for the largest number of threads (48) and the biggest problem size (1.07e+09), and the smallest advantage for the smallest number of threads (6) and the smallest input size (6.69e+07).

Fig. 6: Comparison between the PaREM algorithm and the General Enumeration Approach for the five input lengths (a–e).

V. RELATED WORK
Holub and Stekr [6] propose an approach for parallel REM via DFA by splitting the input string into small chunks and running these chunks on separate cores; however, due to the pre-calculation of initial states for each sub-input, this is not efficient for general DFAs. Their algorithm runs efficiently for a specific type of DFA, so-called synchronizing automata, and relies on the input automaton being k-local.

Yang and Prasanna [7] propose the segmentation of regular expressions and perform the REM evaluation via nondeterministic finite automata. Their major aim is to optimize the use of the memory hierarchy in the case of automata with many states and a large transition table. In contrast to our approach, the authors of [7] focus on large automata but do not specifically address algorithmic optimizations with respect to large input strings.

Mytkowicz and Schulte [8] propose an approach that exploits SIMD, instruction-level and thread-level parallelism in the context of finite state machine computations. To increase the opportunities for data-parallelism, the authors devised a method for breaking data-dependencies with enumeration. This approach is not based on speculation with respect to initial state determination.

Kumar et al. [17] address the issue of large-scale finite automata (also known as the state explosion problem) by splitting regular expressions into two parts: (1) a prefix that contains the frequently visited parts of the automaton, and (2) a suffix that is the rest of the automaton. The aim is to have a small DFA for the frequently accessed parts of the automaton that fits in cache memory.

Luchaup et al. [18] propose an approach of finding the correct initial state by speculation. They argue that guessing the state of the DFA at a certain position has a very good chance of reaching the correct state after a few steps, because DFA-based scanning in network intrusion detection spends most of its time in a few hot states. They validate these guesses using a history of speculated states.
In comparison to our algorithm, the convergence of the guessed state and the correct state is not guaranteed. Furthermore, if a thread does not converge on its sub-input, then the next thread is forced to start from a new state, which limits the scalability [8].

Our algorithm is based on splitting the input into smaller sub-inputs (domain decomposition); however, we have devised a method to bypass the need for pre-calculation of all initial states by finding the most accurate possible initial states. Our approach is not limited to a particular type of DFA, and is efficient for a large spectrum of input sizes.

In contrast to the related work, our tool is capable of automatically generating ready-to-compile-and-execute code for shared-memory systems, taking a RE or FA as input.

VI. SUMMARY AND FUTURE WORK
Regular expression matching is essential for many applications such as lexical analysis, data mining [19], or network security. We have presented a parallel algorithm for regular expression matching that is based on our improved speculative determination of initial states. Our tool PaREM automatically transforms any regular expression or finite automaton into the corresponding parallel code (C++ and OpenMP), and consequently eases access to the proposed parallel algorithm for users without a background in parallel programming. Preliminary experimental results show that the performance of our algorithm scales gracefully for various string lengths and numbers of threads. For an input string of 6.69e+07 characters, we obtained a speedup of 21.08× with 48 threads.

In the future, we plan to evaluate our approach for other types of problems, such as DNA sequencing or Network Intrusion Detection Systems. We also plan to extend our implementation to heterogeneous systems.

REFERENCES

[1] A. Nowzari-Dalini, E. Elahi, H. Ahrabian, and M. Ronaghi, "A new DNA implementation of finite state machines," IJCSA, vol. 3, no. 1, pp. 51–60, 2006.
[2] A. BabuKaruppiah and S. Rajaram, "Deterministic finite automata for pattern matching in FPGA for intrusion detection," in Computer, Communication and Electrical Technology (ICCCET), 2011 International Conference on, March 2011, pp. 167–170.
[3] R. Kosala, M. Bruynooghe, J. V. den Bussche, and H. Blockeel, "Information extraction from web documents based on local unranked tree automaton inference," in IJCAI, G. Gottlob and T. Walsh, Eds. Morgan Kaufmann, 2003, pp. 403–408.
[4] S. Pllana, S. Benkner, E. Mehofer, L. Natvig, and F. Xhafa, "Towards an intelligent environment for programming multi-core computing systems," in Euro-Par Workshops, ser. Lecture Notes in Computer Science, vol. 5415. Springer, 2008, pp. 141–151.
[5] T. Mytkowicz and W. Schulte, "Maine: A library for data parallel finite automata," Tech. Rep. MSR-TR-2012-62, July 2012. [Online]. Available: http://research.microsoft.com/apps/pubs/default.aspx?id=168379
[6] J. Holub and S. Stekr, "On parallel implementations of deterministic finite automata," in CIAA, ser. Lecture Notes in Computer Science, S. Maneth, Ed., vol. 5642. Springer, 2009, pp. 54–64.
[7] Y.-H. E. Yang and V. K. Prasanna, "Optimizing regular expression matching with SR-NFA on multi-core systems," in PACT.
[11] Foundations of Computer Science: C Edition, ser. Principles of Computer Science Series. W. H. Freeman, 1994.
[12] J. Hopcroft and J. D. Ullman, Introduction to Automata Theory, Languages and Computation.
[14] Compilers: Principles, Techniques and Tools (for Anna University), 2/e. Pearson Education India.
[15] C.-H. Chang and R. Paige, "From regular expressions to DFA's using compressed NFA's," Theor. Comput. Sci., vol. 178, no. 1-2, pp. 1–36, 1997.
[16] A. Arora and A. S. Bansal, Comprehensive Computer and Languages. Laxmi Publications, 2005.
[17] S. Kumar, B. Chandrasekaran, J. Turner, and G. Varghese, "Curing regular expressions matching algorithms from insomnia, amnesia, and acalculia," in ANCS, R. Yavatkar, D. Grunwald, and K. K. Ramakrishnan, Eds. ACM, 2007, pp. 155–164.
[18] D. Luchaup, R. Smith, C. Estan, and S. Jha, "Speculative parallel pattern matching," IEEE Transactions on Information Forensics and Security, vol. 6, no. 2, pp. 438–451, 2011.
[19] R. Trasarti, F. Bonchi, and B. Goethals, "Sequence mining automata: A new technique for mining frequent sequences under regular expressions."