Better Automata through Process Algebra
RANCE CLEAVELAND
Department of Computer Science, University of Maryland, College Park MD 20742 USA
e-mail address: [email protected]
Abstract.
This paper shows how the use of Structural Operational Semantics (SOS) in the style popularized by the process-algebra community can lead to a more succinct and useful construction for building finite automata from regular expressions. Such constructions have been known for decades, and form the basis for the proofs of one direction of Kleene's Theorem. The purpose of the new construction is, on the one hand, to show students how small automata can be constructed, without the need for empty transitions, and on the other hand to show how the construction method admits closure proofs of regular languages with respect to other operators as well. These results, while not theoretically surprising, point to an additional influence of process-algebraic research: in addition to providing fundamental insights into the nature of concurrent computation, it also sheds new light on old, well-known constructions in automata theory.

1. Introduction
It is an honor to write this paper in celebration of Jos Baeten on the occasion of the publication of his Festschrift. I recall first becoming aware of Jos late in my PhD studies at Cornell University. Early in my doctoral career I had become independently interested in process algebra, primarily through Robin Milner's original monograph, A Calculus of Communicating Systems [Mil80], and indeed wound up writing my dissertation on the topic. I was working largely on my own; apart from very stimulating interactions with Prakash Panangaden, who was at Cornell at the time, there were no researchers in the area at Cornell. It was in this milieu that I stumbled across the seminal papers by Jos' colleagues, Jan Bergstra and Jan Willem Klop, describing the Algebra of Communicating Processes [BK84, BK85]. I was impressed with their classically algebraic approach, and their semantic accounts based on graph constructions. This, together with Milner's focus on operational semantics and the Communicating Sequential Processes community's focus on denotational semantics [BHR84], finally enabled me to truly understand the deep and satisfying links between operational, denotational and axiomatic approaches not only to process algebra, but to program semantics in general.

While Jos was not a co-author of the two papers just cited, he was an early contributor to the process-algebraic field and has remained a prolific researcher in both theoretical and
Key words and phrases:
Process algebra; finite automata; regular expressions; operational semantics.
Research supported by US Office of Naval Research Grant N000141712622.
Logical Methods in Computer Science, DOI:10.2168/LMCS-??? ©
R. Cleaveland, Creative Commons

applied aspects of the discipline. I have followed his career, and admired his interest in both foundational theory and practical applications of process theory, since completing my PhD in 1987. It is this broader view on the impact of process algebra that is the motivation for this note. Indeed, I will not focus so much on new theoretical results, satisfying though they can be. Rather, I want to recount a story about my usage of process-algebra-inspired techniques to redevelop part of an undergraduate course on automata theory that I taught for a number of years. Specifically, I will discuss how I have used the Structural Operational Semantics (SOS) techniques used extensively in process algebra to present what I have found to be more satisfying ways than those typically covered in textbooks to construct finite automata from regular expressions. Such constructions constitute a proof of one half of Kleene's Theorem [Kle56], which asserts a correspondence between regular languages and those accepted by finite automata.

In the rest of this paper I present the construction and contrast it with the constructions found in classical automata-theory textbooks such as [HMU06], explaining why I find the work presented here preferable from a pedagogical point of view. I also briefly situate the work in the setting of an efficient technique [BS86] used in practice for converting regular expressions to finite automata. The message I hope to convey is that in addition to contributing foundational understanding to notions of concurrent computation, process algebra can also cast new light on well-understood automaton constructions, and that pioneers in process algebra, such as Jos Baeten, are doubly deserving of the accolades they receive from the research community.
2. Alphabets, Languages, Regular Expressions and Automata
This section reviews the definitions and notation used later in this note for formal languages, regular expressions and finite automata. In the interest of succinctness the definitions depart slightly from those found in automata-theory textbooks, although notationally I try to follow the conventions used in those books.
2.1. Alphabets and Languages.
At their most foundational level digital computers are devices for computing with symbols. Alphabets and languages formalize this intuition mathematically.
Definition 2.1 (Alphabet, word).
(1) An alphabet is a finite non-empty set Σ of symbols.
(2) A word w over alphabet Σ is a finite sequence a1 . . . ak of elements from Σ. We say that k is the length of w in this case. If k = 0 we say w is empty; we write ε for the (unique) empty word over Σ. Note that every a ∈ Σ is also a (length-one) word over Σ. We write Σ∗ for the set of all words over Σ.
(3) If w1 = a1 . . . ak and w2 = b1 . . . bℓ are words over Σ then the concatenation, w1 · w2, of w1 and w2 is the word a1 . . . ak b1 . . . bℓ. Note that w · ε = ε · w = w for any word w. We often omit · and write w1 w2 for the concatenation of w1 and w2.
(4) A language L over alphabet Σ is a subset of Σ∗. The set of all languages over Σ is the set of all subsets of Σ∗, and is written 2^Σ∗ following standard mathematical conventions.
Since languages over Σ are sets, general set-theoretic operations, including ∪ (union), ∩ (intersection) and − (set difference), may be applied to them. Other, language-specific operations may also be defined.

Definition 2.2 (Language concatenation, Kleene closure). Let Σ be an alphabet.
(1) Let L1, L2 ⊆ Σ∗ be languages over Σ. Then the concatenation, L1 · L2, of L1 and L2 is defined as follows.
    L1 · L2 = { w1 · w2 | w1 ∈ L1 and w2 ∈ L2 }
(2) Let L ⊆ Σ∗ be a language over Σ. Then the Kleene closure, L∗, of L is defined inductively as follows.
    • ε ∈ L∗
    • If w1 ∈ L and w2 ∈ L∗ then w1 · w2 ∈ L∗.

2.2. Regular Expressions.
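The two operations in Definition 2.2 can be made concrete for finite language fragments; the following is a minimal Python sketch (the function names and the length bound on the closure are my own devices, since L∗ is infinite whenever L contains a non-empty word):

```python
def concat(L1, L2):
    """Language concatenation of Definition 2.2(1), for languages given as finite sets."""
    return {w1 + w2 for w1 in L1 for w2 in L2}

def kleene_up_to(L, n):
    """Words of the Kleene closure L* of length at most n, built by the inductive
    rule: eps is in L*, and if w1 in L and w2 in L* then w1 . w2 is in L*."""
    words = {""}
    frontier = {""}
    while frontier:
        frontier = {w1 + w2 for w1 in L if w1 for w2 in frontier
                    if len(w1 + w2) <= n} - words
        words |= frontier
    return words
```

For example, concat({"a", "b"}, {"c"}) is {"ac", "bc"}, and kleene_up_to({"ab"}, 4) is {"", "ab", "abab"}.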
Regular expressions provide a notation for defining languages.
Definition 2.3 (Regular expression). Let Σ be an alphabet. Then the set, R(Σ), of regular expressions over Σ is defined inductively as follows.
• ∅ ∈ R(Σ).
• ε ∈ R(Σ).
• If a ∈ Σ then a ∈ R(Σ).
• If r1 ∈ R(Σ) and r2 ∈ R(Σ) then r1 + r2 ∈ R(Σ) and r1 · r2 ∈ R(Σ).
• If r ∈ R(Σ) then r∗ ∈ R(Σ).

It should be noted that R(Σ) is a set of expressions; the occurrences of ∅, ε, +, · and ∗ are symbols that do not innately possess any meaning, but must instead be given a semantics. This is done by interpreting regular expressions mathematically as languages. The formal definition takes the form of a function, L ∈ R(Σ) → 2^Σ∗, assigning a language L(r) ⊆ Σ∗ to regular expression r.

Definition 2.4 (Language of a regular expression, regular language). Let Σ be an alphabet, and r ∈ R(Σ) a regular expression over Σ. Then the language, L(r) ⊆ Σ∗, associated with r is defined inductively as follows.
    L(r) = ∅               if r = ∅
           { ε }           if r = ε
           { a }           if r = a and a ∈ Σ
           L(r1) ∪ L(r2)   if r = r1 + r2
           L(r1) · L(r2)   if r = r1 · r2
           (L(r′))∗        if r = (r′)∗
A language L ⊆ Σ∗ is regular if and only if there is a regular expression r ∈ R(Σ) such that L(r) = L.

Textbooks typically define L∗ differently, by first introducing L^i for i ≥ 0 and then defining L∗ = ∪_{i=0}^∞ L^i.
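Definition 2.4 can be animated for bounded word lengths; a hedged sketch follows (the tuple encoding of expressions and the length cutoff n are my own devices, needed because L(r) is infinite in general):

```python
def lang(r, n):
    """Words of L(r) of length <= n (Definition 2.4), for r encoded as nested
    tuples: ('empty',), ('eps',), ('sym', a), ('+', r1, r2), ('.', r1, r2), ('*', r1)."""
    tag = r[0]
    if tag == 'empty':
        return set()
    if tag == 'eps':
        return {''}
    if tag == 'sym':
        return {r[1]} if n >= 1 else set()
    if tag == '+':
        return lang(r[1], n) | lang(r[2], n)
    if tag == '.':
        return {u + v for u in lang(r[1], n)
                      for v in lang(r[2], n) if len(u + v) <= n}
    # tag == '*': saturate with non-empty words of the body, as in Definition 2.2(2)
    words, frontier = {''}, {''}
    base = lang(r[1], n) - {''}
    while frontier:
        frontier = {u + v for u in base for v in frontier
                    if len(u + v) <= n} - words
        words |= frontier
    return words
```

For the expression (abb + a)∗ used later in this note, lang with n = 3 yields {ε, a, aa, aaa, abb}.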
2.3. Finite Automata.
Traditional accounts of finite automata typically introduce three variations of the notion: deterministic (DFA), nondeterministic (NFA), and nondeterministic with ε-transitions (NFA-ε). I will do the same, although I will do so in a somewhat different order than is typical.

Definition 2.5 (Nondeterministic Finite Automaton (NFA)). A nondeterministic finite automaton (NFA) is a tuple (Q, Σ, δ, qI, F), where:
• Q is a finite non-empty set of states;
• Σ is an alphabet;
• δ ⊆ Q × Σ × Q is the transition relation;
• qI ∈ Q is the initial state; and
• F ⊆ Q is the set of accepting, or final, states.

This definition of NFA differs slightly from e.g. [HMU06] in that δ is given as a relation rather than a function in Q × Σ → 2^Q. It also defines the form of a NFA but not the sense in which it is indeed a machine for processing words in a language. The next definition does this by associating a language L(M) with a given NFA M = (Q, Σ, δ, qI, F).

Definition 2.6 (Language of a NFA). Let M = (Q, Σ, δ, qI, F) be a NFA.
(1) Let q ∈ Q be a state of M and w ∈ Σ∗ be a word over Σ. Then M accepts w from q if and only if one of the following holds.
• w = ε and q ∈ F; or
• w = aw′ for some a ∈ Σ and w′ ∈ Σ∗, and there exists (q, a, q′) ∈ δ such that M accepts w′ from q′.
(2) The language, L(M), accepted by M is defined as follows.
    L(M) = { w ∈ Σ∗ | M accepts w from qI }

Deterministic Finite Automata (DFAs) constitute a subclass of NFAs whose transition relation is deterministic, in a precisely defined sense.
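The acceptance clauses of Definition 2.6 transcribe directly into a recursive check; a sketch follows (the example machine used below, which accepts words ending in ab, is my own):

```python
def accepts(delta, F, q, w):
    """Does the NFA with transition relation delta (a set of (q, a, q') triples)
    and accepting states F accept word w from state q?  (Definition 2.6)"""
    if w == "":
        return q in F                      # w = eps and q in F
    a, w_rest = w[0], w[1:]                # w = a w'
    return any(accepts(delta, F, q2, w_rest)
               for (q1, s, q2) in delta if q1 == q and s == a)
```

With δ = {(0,'a',0), (0,'b',0), (0,'a',1), (1,'b',2)} and F = {2}, the call accepts(δ, F, 0, "aab") returns True, while accepts(δ, F, 0, "aba") returns False.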
Definition 2.7 (Deterministic Finite Automaton (DFA)). NFA M = (Q, Σ, δ, qI, F) is a deterministic finite automaton (DFA) if and only if δ satisfies the following: for every q ∈ Q and a ∈ Σ, there exists exactly one q′ such that (q, a, q′) ∈ δ.

Since DFAs are NFAs, the definition of L in Definition 2.6 is directly applicable to them as well. NFAs with ε-transitions are now defined as follows.

Definition 2.8 (NFAs with ε-Transitions). A nondeterministic automaton with ε-transitions (NFA-ε) is a tuple (Q, Σ, δ, qI, F), where:
• Q is a nonempty finite set of states;
• Σ is an alphabet, with ε ∉ Σ;
• δ ⊆ Q × (Σ ∪ {ε}) × Q is the transition relation;
• qI ∈ Q is the initial state; and
• F ⊆ Q is the set of accepting, or final, states.

An NFA-ε is like a NFA except that some transitions can be labeled with the empty string ε rather than a symbol from Σ. The intuition is that a transition of form (q, ε, q′) can occur without consuming any symbol as an input. Formalizing this intuition, and defining L(M) for NFA-ε, may be done as follows.

Definition 2.9 (Language of a NFA-ε). Let M = (Q, Σ, δ, qI, F) be a NFA-ε.
(1) Let q ∈ Q and w ∈ Σ∗. Then M accepts w from q if and only if one of the following holds.
• w = ε and q ∈ F; or
• w = aw′ for some a ∈ Σ and w′ ∈ Σ∗ and there exists q′ ∈ Q such that (q, a, q′) ∈ δ and M accepts w′ from q′; or
• there exists q′ ∈ Q such that (q, ε, q′) ∈ δ and M accepts w from q′.
(2) The language, L(M), accepted by M is defined as follows.
    L(M) = { w ∈ Σ∗ | M accepts w from qI }

Defining the language of a NFA-ε requires redefining the notion of a machine accepting a string from state q as given in the definition of the language of a NFA.
This redefinition reflects the essential difference between ε-transitions and those labeled by alphabet symbols.

The three types of automata have differences in form, but equivalent expressive power. It should first be noted that, just as every DFA is already a NFA, every NFA is also a NFA-ε, namely, a NFA-ε with no ε-transitions. Thus, every language accepted by some DFA is also accepted by some NFA, and every language accepted by some NFA is accepted by some NFA-ε. The next theorem establishes the converses of these implications.

Theorem 2.10 (Equivalence of DFAs, NFAs and NFA-εs).
(1) Let M be a NFA. Then there is a DFA D(M) such that L(D(M)) = L(M).
(2) Let M be a NFA-ε. Then there is a NFA N(M) such that L(N(M)) = L(M).

Proof. The proof of Case (1) involves the well-known subset construction, whereby each subset of states in M is associated with a single state in D(M). The proof of Case (2) typically relies on defining the ε-closure of a set of states, namely, the set of states reachable from the given set via a sequence of zero or more ε-transitions. This notion is used to define the transition relation of N(M) as well as its set of accepting states.

3. Kleene's Theorem
Given the definitions in the previous section it is now possible to state Kleene's Theorem succinctly.
Theorem 3.1 (Kleene's Theorem). Let Σ be an alphabet. Then L ⊆ Σ∗ is regular if and only if there is a DFA M such that L(M) = L.

The proof of this theorem is usually split into two pieces. The first involves showing that for any regular expression r, there is a finite automaton M (DFA, NFA or NFA-ε) such that L(M) = L(r). Theorem 2.10 then ensures that the resulting finite automaton, if it is not already a DFA, can be converted into one in a language-preserving manner. The second shows how to convert a DFA M into a regular expression r in such a way that L(r) = L(M); there are several algorithms for this in the literature, including the classic dynamic-programming-based method of Kleene [Kle56] and equation-solving methods that rely on Arden's Lemma [Ard61].

From a practical standpoint, the conversion of regular expressions to finite automata is the more important, since regular expressions are textual and are consequently used as the basis for string search and processing. For this reason, I believe that teaching this construction is especially key in automata-theory classes, and this is where my complaint with the approaches in traditional automata-theory texts originates.
To understand the basis for my dissatisfaction, let us review the construction presented in [HMU06], which explains how to convert regular expression r into NFA-ε Mr in such a way that L(r) = L(Mr). The method is based on the construction due to Ken Thompson [Tho68] and produces NFA-ε Mr with the following properties.
• The initial state qI has no incoming transitions: that is, there exists no (q, α, qI) ∈ δ.
• There is a single accepting state qF, and qF has no outgoing transitions: that is, F = {qF}, and there exists no (qF, α, q′) ∈ δ.
The approach proceeds inductively on the structure of r. For example, if r = (r′)∗, then assume that Mr′ = (Q, Σ, δ, qI, {qF}) meeting the above constraints has been constructed. Then Mr is built as follows. First, let q′I ∉ Q and q′F ∉ Q be new states. Then Mr = (Q ∪ {q′I, q′F}, Σ, δ′, q′I, {q′F}), where
    δ′ = δ ∪ { (q′I, ε, qI), (q′I, ε, q′F), (qF, ε, qI), (qF, ε, q′F) }.
It can be shown that Mr satisfies the requisite properties and that L(Mr) = (L(r′))∗.

Mathematically, the construction of Mr is wholly satisfactory: it has the required properties and can be defined relatively easily, albeit at the cost of introducing new states and transitions. The proof of correctness is perhaps somewhat complicated, owing to the definition of L(M) and the subtlety of ε-transitions, but it does acquaint students with definitions via structural induction on regular expressions.

My concern with the construction, however, is several-fold. On the one hand, it does require the introduction of the notion of NFA-ε, which is indeed more complex than that of NFA. In particular, the definition of acceptance requires allowing transitions that consume no symbol in the input word.
On the other hand, the accretion of new states at each stage in the construction makes it difficult to test students on their understanding of the construction in an exam setting. Specifically, even for relatively small regular expressions the literal application of the construction yields automata with too many states and transitions to be doable during the typical one-hour midterm exam in which US students would be tested on the material. Finally, the construction bears no resemblance to algorithms used in practice for constructing finite automata from regular expressions. In particular, routines such as the Berry-Sethi procedure [BS86] construct DFAs directly from regular expressions, completely avoiding the need for NFA-εs, or indeed NFAs, altogether.

The Berry-Sethi procedure is subtle and elegant, and relies on concepts, such as Brzozowski derivatives [Brz64], that I would view as too specialized for an undergraduate course on automata theory. Consequently, I would not be in favor of covering them in an undergraduate classroom setting. Instead, in the next section I give a technique, based on operational semantics in process algebra, for constructing NFAs from regular expressions. The resulting NFAs are small enough for students to construct during exams, and the construction has other properties, including the capacity for introducing other operations that preserve regularity, that are pedagogically useful.

4. NFAs via Structural Operational Semantics
This section describes an approach based on
Structural Operational Semantics (SOS) [Plo81, Plo04] for constructing NFAs from regular expressions. Specifically, I will define a (small-step) operational semantics for regular expressions on the basis of the structure of regular expressions, and use the semantics to construct the requisite NFAs. The construction requires no ε-transitions and yields automata with at most one more state than the size of the regular expression from which they are derived.

Following the conventions in the other parts of this paper I give the SOS rules using notation typically found in automata-theory texts. In particular, the SOS specification is given in natural language, as a collection of if-then statements, and not via inference rules. I use this approach in the classroom to avoid having to introduce notations for inference rules. In the appendix I give the more traditional SOS presentation.

4.1. An Operational Semantics for Regular Expressions.
In what follows fix alphabet Σ. The basis for the operational semantics of regular expressions consists of a relation, −→ ⊆ R(Σ) × Σ × R(Σ), and a predicate √ ⊆ R(Σ). In what follows I will write r −a→ r′ and r√ in lieu of (r, a, r′) ∈ −→ and r ∈ √. The intuitions are as follows.
(1) r√ is intended to hold if and only if ε ∈ L(r). This is used in defining accepting states.
(2) r −a→ r′ is intended to reflect the following about L(r): one way to build a word in L(r) is to start with a ∈ Σ and then finish it with a word from L(r′).
Using these relations, I then show how to build a NFA from r whose states are regular expressions, whose transitions are given by −→, and whose final states are defined using √.

Defining √ and −→. We now define √.

Definition 4.1 (Definition of √). Predicate r√ is defined inductively on the structure of r ∈ R(Σ) as follows.
• If r = ε then r√.
• If r = (r′)∗ for some r′ ∈ R(Σ) then r√.
• If r = r1 + r2 for some r1, r2 ∈ R(Σ), and r1√, then r√.
• If r = r1 + r2 for some r1, r2 ∈ R(Σ), and r2√, then r√.
• If r = r1 · r2 for some r1, r2 ∈ R(Σ), and r1√ and r2√, then r√.

From the definition, one can see that it is not the case that ∅√ or a√, for any a ∈ Σ, while both ε√ and r∗√ always hold. This accords with the definition of L(r): ε ∉ L(∅) = ∅ and ε ∉ L(a) = {a}, while ε ∈ L(ε) = {ε} and ε ∈ L∗ for any language L ⊆ Σ∗, and in particular for L = L(r) for regular expression r. The other cases in the definition reflect the fact that ε ∈ L(r1 + r2) can only hold if ε ∈ L(r1) or ε ∈ L(r2), since + is interpreted as set union, and that ε ∈ L(r1 · r2) can only be true if ε ∈ L(r1) and ε ∈ L(r2), since regular-expression operator · is interpreted as language concatenation. We have the following examples.
• (ε · a∗)√, since ε√ and a∗√.
• ¬(a + b)√, since neither a√ nor b√.
• (01 + (1 + 01)∗)√, since (1 + 01)∗√.
• ¬(01(1 + 01)∗)√, since ¬(01)√.

We also use structural induction to define −→.

Definition 4.2 (Definition of −→). Relation r −a→ r′, where r, r′ ∈ R(Σ) and a ∈ Σ, is defined inductively on r.
• If r = a and a ∈ Σ then r −a→ ε.
• If r = r1 + r2 and r1 −a→ r′ then r −a→ r′.
• If r = r1 + r2 and r2 −a→ r′ then r −a→ r′.
• If r = r1 · r2 and r1 −a→ r′ then r −a→ r′ · r2.
• If r = r1 · r2, r1√ and r2 −a→ r′ then r −a→ r′.
• If r = (r′)∗ and r′ −a→ r′′ then r −a→ r′′ · (r′)∗.

The definition of this relation is somewhat complex, but the idea that it is trying to capture is relatively simple: r −a→ r′ if one can build words in L(r) by taking the a labeling −→ and appending a word from L(r′). So we have the rule a −a→ ε for a ∈ Σ, while the rules for + follow from the fact that L(r1 + r2) = L(r1) ∪ L(r2). The cases for r1 · r2 in essence state that aw ∈ L(r1 · r2) can hold either if there is a way of splitting w into w1 and w2 such that aw1 is in the language of r1 and w2 is in the language of r2, or if ε is in the language of r1 and aw is in the language of r2. Finally, the rule for (r′)∗ essentially permits "looping". As examples, we have the following.
• a + b −a→ ε, by the rules for a and +.
• (abb + a)∗ −a→ εbb(abb + a)∗, by the rules for a, ·, + and ∗.

In this latter example, note that applying the definition literally requires the inclusion of the ε in εbb(abb + a)∗. This is because the case for a says that a −a→ ε, meaning that abb −a→ εbb, etc. However, when there are leading instances of ε like this, I will sometimes leave them out, and write abb −a→ bb rather than abb −a→ εbb.

The following lemmas about √ and −→ formally establish the intuitive properties that they should have.

Lemma 4.3.
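Definitions 4.1 and 4.2 are directly executable; the following Python sketch transcribes them (the tuple encoding of expressions is my own; nullable plays the role of √):

```python
def nullable(r):
    """The predicate of Definition 4.1: does eps belong to L(r)?
    Expressions are tuples: ('empty',), ('eps',), ('sym', a),
    ('+', r1, r2), ('.', r1, r2), ('*', r1)."""
    tag = r[0]
    if tag in ('empty', 'sym'):
        return False
    if tag in ('eps', '*'):
        return True
    if tag == '+':
        return nullable(r[1]) or nullable(r[2])
    return nullable(r[1]) and nullable(r[2])          # tag == '.'

def step(r):
    """The relation r -a-> r' of Definition 4.2, returned as a set of (a, r') pairs."""
    tag = r[0]
    if tag in ('empty', 'eps'):
        return set()
    if tag == 'sym':
        return {(r[1], ('eps',))}                     # a -a-> eps
    if tag == '+':
        return step(r[1]) | step(r[2])                # both + rules
    if tag == '.':
        moves = {(a, ('.', t, r[2])) for (a, t) in step(r[1])}
        if nullable(r[1]):
            moves |= step(r[2])
        return moves
    return {(a, ('.', t, r)) for (a, t) in step(r[1])}   # tag == '*': "looping"
```

On the examples above, nullable holds of ε · a∗ but not of a + b, while step applied to a + b yields the two transitions labeled a and b to ε.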
Let r ∈ R(Σ) be a regular expression. Then r√ if and only if ε ∈ L(r).

Proof. The proof proceeds by structural induction on r. Most cases are left to the reader; we only consider the r = r1 · r2 case here. The induction hypothesis states that r1√ if and only if ε ∈ L(r1), and that r2√ if and only if ε ∈ L(r2). One reasons as follows.
    r√  iff  r1√ and r2√                     Definition of √
        iff  ε ∈ L(r1) and ε ∈ L(r2)         Induction hypothesis
        iff  ε ∈ (L(r1)) · (L(r2))           Property of concatenation
        iff  ε ∈ L(r1 · r2)                  Definition of L(r1 · r2)
        iff  ε ∈ L(r)                        r = r1 · r2

Lemma 4.4.
Let r ∈ R(Σ), a ∈ Σ, and w ∈ Σ∗. Then aw ∈ L(r) if and only if there is an r′ ∈ R(Σ) such that r −a→ r′ and w ∈ L(r′).

Proof. The proof proceeds by structural induction on r. We only consider the case r = (r′)∗ in detail; the others are left to the reader. The induction hypothesis asserts that for all a and w′, aw′ ∈ L(r′) if and only if there is an r′′ such that r′ −a→ r′′ and w′ ∈ L(r′′). We reason as follows.
    aw ∈ L(r)
    iff aw ∈ L((r′)∗)                                       r = (r′)∗
    iff aw ∈ (L(r′))∗                                       Definition of L((r′)∗)
    iff aw = w1 · w2 for some w1 ∈ L(r′), w2 ∈ (L(r′))∗     Definition of Kleene closure
    iff w1 = a · w′ for some w′                             Property of Kleene closure
    iff r′ −a→ r′′ for some r′′ with w′ ∈ L(r′′)            Induction hypothesis
    iff w′ · w2 ∈ L(r′′) · L((r′)∗)                         Definition of concatenation
    iff w′ · w2 ∈ L(r′′ · (r′)∗)                            Definition of L(r′′ · (r′)∗)
    iff r −a→ r′′ · (r′)∗ and w ∈ L(r′′ · (r′)∗)            Definition of −→; w = w′ · w2

(The convention of suppressing leading εs can be formalized by introducing a special case in the definition of −→ for a · r and distinguishing the current two cases for r1 · r2 to apply only when r1 ∉ Σ.)

Appendix A contains definitions of √ and −→ in the more usual inference-rule style used in SOS specifications.

4.2. Building Automata using √ and −→. That √ and −→ may be used to build NFAs derives from how they may be used to determine whether a string is in the language of a regular expression. Consider the following sequence of transitions starting from the regular expression (abb + a)∗.
    (abb + a)∗ −a→ bb(abb + a)∗ −b→ b(abb + a)∗ −b→ (abb + a)∗ −a→ (abb + a)∗
Using Lemma 4.4 four times, we can conclude that if w ∈ L((abb + a)∗), then abba · w ∈ L((abb + a)∗) also. In addition, since (abb + a)∗√, it follows from Lemma 4.3 that ε ∈ L((abb + a)∗).
Since abba · ε = abba, it follows that abba ∈ L((abb + a)∗).

More generally, if there is a sequence of transitions r −a1→ r1 · · · −an→ rn and rn√, then it follows that a1 . . . an ∈ L(r), and vice versa. This observation suggests the following strategy for building a NFA from a regular expression r.
(1) Let the states be all possible regular expressions that can be reached by some sequence of transitions from r.
(2) Take r to be the start state.
(3) Let the transitions be given by −→.
(4) Let the accepting states be those regular expressions r′ reachable from r for which r′√ holds.
Of course, this construction is only valid if the set of all possible regular expressions mentioned in Step (1) is finite, since NFAs are required to have a finite number of states. In fact, a stronger result can be proved. First, recall the definition of the size, |r|, of regular expression r.

Definition 4.5 (Size of a regular expression). The size, |r|, of r ∈ R(Σ) is defined inductively as follows.
    |r| = 1                 if r = ε, r = ∅, or r = a for some a ∈ Σ
    |r| = |r′| + 1          if r = (r′)∗
    |r| = |r1| + |r2| + 1   if r = r1 + r2 or r = r1 · r2
Intuitively, |r| counts the number of regular-expression operators in r. The reachability set of regular expression r can now be defined in the usual manner.

Definition 4.6.
Let r ∈ R(Σ) be a regular expression. Then the set RS(r) ⊆ R(Σ) of regular expressions reachable from r is defined recursively as follows.
• r ∈ RS(r).
• If r1 ∈ RS(r) and r1 −a→ r2 for some a ∈ Σ, then r2 ∈ RS(r).
As an example, note that |(abb + a)∗| = 8 and that
    RS((abb + a)∗) = { (abb + a)∗, εbb(abb + a)∗, εb(abb + a)∗, ε(abb + a)∗ }.
(In this case I have not applied my heuristic of suppressing leading ε expressions.) The following can now be proved.

Theorem 4.7.
Let r ∈ R(Σ) be a regular expression. Then |RS(r)| ≤ |r| + 1.

Proof. The proof proceeds by structural induction on r. There are six cases to consider.
r = ∅: In this case RS(r) = {∅}, and |RS(r)| = 1 = |r| < |r| + 1.
r = ε: In this case RS(r) = {ε}, and |RS(r)| = 1 = |r| < |r| + 1.
r = a for some a ∈ Σ: In this case RS(r) = {a, ε}, and |RS(r)| = 2 = |r| + 1.
r = r1 + r2: In this case, RS(r) ⊆ RS(r1) ∪ RS(r2), and the induction hypothesis guarantees that |RS(r1)| ≤ |r1| + 1 and |RS(r2)| ≤ |r2| + 1. It then follows that
    |RS(r)| ≤ |RS(r1)| + |RS(r2)| ≤ |r1| + |r2| + 2 = |r| + 1.
r = r1 · r2: In this case it can be shown that RS(r) ⊆ { r′ · r2 | r′ ∈ RS(r1) } ∪ RS(r2). Since |{ r′ · r2 | r′ ∈ RS(r1) }| = |RS(r1)|, similar reasoning as in the + case applies.
r = (r′)∗: In this case we have that RS(r) ⊆ {r} ∪ { r′′ · r | r′′ ∈ RS(r′) }. Thus
    |RS(r)| ≤ |RS(r′)| + 1 ≤ |r′| + 2 = |r| + 1.

This result shows not only that the sketched NFA construction given above yields a finite number of states for given r, it in fact establishes that this set of states is no larger than |r| + 1. This highlights one of the main reasons I opted to introduce this construction in my classes: small regular expressions yield NFAs that are almost as small, and can be constructed manually in an exam setting. We can now formally define the construction of NFA Mr from regular expression r as follows.

Definition 4.8.
Let r ∈ R(Σ) be a regular expression. Then Mr = (Q, Σ, δ, qI, F) is the NFA defined as follows.
• Q = RS(r).
• qI = r.
• δ = { (r1, a, r2) | r1 −a→ r2 }.
• F = { r′ ∈ Q | r′√ }.
The next theorem establishes that r and Mr define the same languages.

Theorem 4.9.
Let r ∈ R(Σ) be a regular expression. Then L(r) = L(Mr).

Proof. Relies on the fact that Lemmas 4.3 and 4.4 guarantee that w = a1 . . . an ∈ L(r) if and only if there is a regular expression r′ such that r −a1→ · · · −an→ r′ and r′√.

    out(r) = ∅                                                 if r = ∅ or r = ε
    out(r) = { (a, ε) }                                        if r = a ∈ Σ
    out(r) = out(r1) ∪ out(r2)                                 if r = r1 + r2
    out(r) = { (a, r′ · r2) | (a, r′) ∈ out(r1) }
             ∪ { (a, r′) | (a, r′) ∈ out(r2) ∧ r1√ }           if r = r1 · r2
    out(r) = { (a, r′′ · (r′)∗) | (a, r′′) ∈ out(r′) }         if r = (r′)∗

Figure 1: Calculating the outgoing transitions of regular expressions.
4.3. Computing Mr. This section gives a routine for computing Mr. It intertwines the computation of the reachability set from regular expression r with the updating of the transition relation and set of accepting states. It relies on the computation of the so-called outgoing transitions of r; these are defined as follows.

Definition 4.10.
Let r ∈ R(Σ) be a regular expression. Then the set of outgoing transitions from r is defined as the set { (a, r′) | r −a→ r′ }.

The outgoing transitions from r consist of pairs (a, r′) that, when combined with r, constitute a valid transition r −a→ r′. Figure 1 defines a recursive function, out, for computing the outgoing transitions of r. The routine uses the structure of r and the definition of −→ to guide its computation. For regular expressions of the form ∅, ε and a ∈ Σ, the definition of −→ in Definition 4.2 immediately gives all the transitions. For regular expressions built using +, · and ∗, one must first recursively compute the outgoing transitions of the subexpressions of r and then combine the results appropriately, based on the cases given in Definition 4.2. The next lemma states that out(r) correctly computes the outgoing transitions of r.

Lemma 4.11.
Let r ∈ R(Σ) be a regular expression, and let out(r) be as defined in Figure 1. Then out(r) = { (a, r′) | r −a→ r′ }.

Proof. By structural induction on r. The details are left to the reader.

Algorithm 1 contains pseudo-code for computing Mr. It maintains four sets.
• Q, a set that will eventually contain the states of Mr.
• F, a set that will eventually contain the accepting states of Mr.
• δ, a set that will eventually contain the transition relation of Mr.
• W, the work set, a subset of Q containing states that have not yet had their outgoing transitions computed or acceptance status determined.
The procedure begins by adding r, its input parameter, to both Q and W. It then repeatedly removes a state from W, determines if it should be added to F, computes its outgoing transitions and updates δ appropriately, and finally adds the target states in the outgoing transition set to both Q and W if they are not yet in Q (meaning they have not yet been encountered in the construction of Mr). The algorithm terminates when W is empty. Figure 2 gives the NFA resulting from applying the procedure to (abb + a)∗. Figure 3, by way of contrast, shows the result of applying the routine in [HMU06] to produce a NFA-ε from the same regular expression.

Algorithm 1:
Algorithm for computing NFA Mr from regular expression r

Algorithm NFA(r)
Input: Regular expression r ∈ R(Σ)
Output: NFA Mr = (Q, Σ, δ, qI, F)
    Q := {r}                                       // State set
    qI := r                                        // Start state
    W := {r}                                       // Working set
    δ := ∅                                         // Transition relation
    F := ∅                                         // Accepting states
    while W ≠ ∅ do
        choose r′ ∈ W
        W := W − {r′}
        if r′√ then F := F ∪ {r′}                  // r′ is an accepting state
        T := out(r′)                               // Outgoing transitions of r′
        δ := δ ∪ { (r′, a, r′′) | (a, r′′) ∈ T }   // Update transition relation
        foreach (a, r′′) ∈ T do
            if r′′ ∉ Q then
                Q := Q ∪ {r′′}                     // r′′ is a new expression
                W := W ∪ {r′′}
        end
    end
    return Mr = (Q, Σ, δ, qI, F)

5. Discussion
The title of this note is "Better Automata through Process Algebra," and I want to revisit it in order to explain in what respects I regard the method presented here as producing "better automata." Earlier I identified the following motivations that prompted me to incorporate this approach in my classroom instruction.
• I wanted to produce NFAs rather than NFA-εs. In large part this was due to my desire not to cover the notion of NFA-ε. The only place this material is used in typical automata-theory textbooks is as a vehicle for converting regular expressions into finite automata. By giving a construction that avoids the use of ε-transitions, I could avoid covering NFA-εs and devote the newly freed lecture time to other topics. Of course, this is only possible if the NFA-based construction does not require more time to describe than the introduction of NFA-ε and the NFA-ε construction.
• I wanted the construction to be one that students could apply during an exam to generate finite automata from regular expressions. The classical construction found in [HMU06] and other books fails this test, in my opinion; while the inductive definitions are mathematically pleasing, they yield automata with too many states for students to be expected to apply them in a time-constrained setting.
Figure 2: NFA(r) for r = (abb + a)∗.

• Related to the preceding point, I wanted a technique that students could imagine being implemented and used in the numerous applications to which regular expressions are applied. In such a setting, fewer states is better than more states, all things considered.

This note has attempted to argue these points by giving a construction in Definition 4.8 for constructing NFAs directly from regular expressions. Theorem 4.7 establishes that the number of states in these NFAs is at most one larger than the size of the regular expression from which the NFAs are generated; this provides guidance in preparing exam questions, as the size of the NFAs students can be asked to generate is tightly bounded by the size of the regular expression given in the exam. Finally, Algorithm 1 gives a “close-to-code” account of the construction that hints at its implementability. Indeed, several years ago a couple of students to whom I presented this material independently implemented the algorithm.

Beyond the points mentioned above, I think this approach has two other points in its favor. The first is that it provides a basis for defining other operators over regular expressions and proving that the class of regular languages is closed with respect to these operations. The ingredients for introducing such a new operator and proving closure of regular languages with respect to it can be summarized as follows.
(1) Extend the definition of L(r) given in Definition 2.4 to give a language-theoretic semantics for the operator.
(2) Extend the definitions of √ and −→ in Definitions 4.1 and 4.2 to give a small-step operational semantics for the operator.
Figure 3: NFA-ε for (abb + a)∗.

(3) Extend the proofs of Lemmas 4.3 and 4.4 to establish connections between the language semantics and the operational semantics.
(4) Prove that expressions extended with the new operator yield finite sets of reachable expressions.

All of these steps involve adding new cases to the existing definitions and lemmas, and altering Theorem 4.7 in the case of the last point. Once these are done, Algorithm 1, with the definition of out given in Figure 1 suitably modified to cover the new operator, can be used as is as a basis for constructing NFAs from these extended classes of regular languages.
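By way of illustration, the sketch below carries out step (2) of this recipe for an interleaving (“shuffle”) operator on a small fragment of the syntax. It is not code from the paper: the tuple encoding and the names nullable, out are my own, and only the cases needed to show the new operator are included.

```python
# Sketch: adding an interleaving ("shuffle") operator to the acceptance
# predicate and the transition function.  Expressions are nested tuples:
# ('eps',) for the empty word, ('sym', a) for a single symbol, and
# ('shuffle', r1, r2) for the new operator.  The encoding is illustrative.

def nullable(r):
    """Acceptance predicate (written with a check mark in the paper)."""
    if r[0] == 'eps':
        return True
    if r[0] == 'sym':
        return False
    if r[0] == 'shuffle':
        # new case: an interleaving accepts the empty word
        # exactly when both operands do
        return nullable(r[1]) and nullable(r[2])
    raise ValueError(r[0])

def out(r):
    """Outgoing transitions {(a, r')} such that r -a-> r'."""
    if r[0] == 'eps':
        return set()
    if r[0] == 'sym':
        return {(r[1], ('eps',))}
    if r[0] == 'shuffle':
        # new cases: either operand may perform the next step,
        # leaving the other unchanged
        r1, r2 = r[1], r[2]
        return ({(a, ('shuffle', s, r2)) for (a, s) in out(r1)} |
                {(a, ('shuffle', r1, s)) for (a, s) in out(r2)})
    raise ValueError(r[0])
```

Because out still returns a finite set of (symbol, expression) pairs, the worklist construction of Algorithm 1 can be reused unchanged; what remains is step (4), checking that only finitely many expressions are reachable.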
I have used parts of this approach in the classroom to ask students to prove that synchronous product and interleaving operators preserve language regularity. Other operators, such as ones from process algebra, are also candidates for these kinds of questions.

The second feature of the approach in this paper that I believe recommends it is that the NFA construction is “on-the-fly”: the construction of an automaton from a regular expression does not require the a priori construction of automata from subexpressions, meaning that the actual production of the automaton can be intertwined with other operations, such as checking whether a word belongs to the regular expression’s language. One does not need to wait for the construction of the full automaton, in other words, before putting it to use.

Criticisms that I have heard of this approach center around two issues. The first is that the construction of NFA M_r from regular expression r does not use structural induction on r, unlike the classical constructions in e.g. [HMU06]. I do not have much patience with this complaint, as the concepts that M_r is built on, namely √ and −→, are defined inductively, and the results proven about them require substantial use of induction. The other complaint is that the notion of r −a→ r′ is “hard to understand.” It is indeed the case that equipping regular expressions with an operational semantics is far removed from the language-theoretic semantics typically given to these expressions. That said, I would argue that the small-step operational semantics considered here in fact exposes the essence of the relationship between regular expressions and finite automata: this semantics enables regular expressions to be executed, and in a way that can be captured via automata.

I close this section with a brief discussion of the Berry-Sethi algorithm [BS86], which is used in practice and produces deterministic finite automata.
This feature enables their technique to accommodate complementation, an operation with respect to which regular languages are closed but which fits uneasily with NFAs. From a pedagogical perspective, however, the algorithm suffers somewhat, as the number of states in a DFA can be exponentially larger than the size of the regular expression from which it is derived. A similar criticism can be made of other techniques that rely on Brzozowski derivatives [Brz64], which also produce DFAs. There are interesting connections between our operational semantics and these derivatives, but we exploit nondeterminacy to keep the sizes of the resulting finite automata small.

6. Conclusions and Directions for Future Work
In this note I have presented an alternative approach for converting regular expressions into finite automata. The method relies on defining an operational semantics for regular expressions, and as such draws inspiration from the work on process algebra undertaken by pioneers in that field, including Jos Baeten. In contrast with classical techniques, the construction here does not require transitions labeled by the empty word ε, and it yields automata whose state sets are proportional in size to the regular expressions they come from. The procedure can also be implemented in an on-the-fly manner, meaning that the production of the automaton can be intertwined with other analysis procedures as well.

Other algorithms studied in process algebra also have pedagogical promise, in my opinion. One method, the Kanellakis-Smolka algorithm for computing bisimulation equivalence [KS90], is a case in point. Partition-refinement algorithms for computing language equivalence of deterministic automata have been in existence for decades, but the details underpinning them are subtle and difficult to present in an undergraduate automata-theory class, where instructional time is at a premium. While not as efficient asymptotically as the best procedures, the simplicity of the Kanellakis-Smolka technique recommends it, in my opinion, both for equivalence checking and state-machine minimization. Simulation-checking algorithms [HHK95] can also be used as a basis for checking language containment among finite automata; these are interesting because they do not, in general, require determinization of both automata being compared.

References

[Ard61] Dean N. Arden. Delayed-logic and finite-state machines. In , pages 133–151. IEEE, 1961.
[BHR84] Stephen D. Brookes, C. A. R. Hoare, and A. W. Roscoe. A theory of communicating sequential processes. Journal of the ACM, 31(3):560–599, 1984.
[BK84] J. A. Bergstra and J. W. Klop. Process algebra for synchronous communication. Information and Control, 60(1):109–137, 1984.
[BK85] Jan A. Bergstra and Jan Willem Klop. Algebra of communicating processes with abstraction. Theoretical Computer Science, 37:77–121, 1985.
[Brz64] Janusz A. Brzozowski. Derivatives of regular expressions. Journal of the ACM, 11(4):481–494, 1964.
[BS86] Gerard Berry and Ravi Sethi. From regular expressions to deterministic automata. Theoretical Computer Science, 48:117–126, 1986.
[HHK95] Monika Rauch Henzinger, Thomas A. Henzinger, and Peter W. Kopke. Computing simulations on finite and infinite graphs. In Proceedings of the 36th Annual IEEE Symposium on Foundations of Computer Science, pages 453–462. IEEE, 1995.
[HMU06] John E. Hopcroft, Rajeev Motwani, and Jeffrey D. Ullman. Introduction to Automata Theory, Languages, and Computation (3rd Edition). Addison-Wesley Longman Publishing Co., Inc., Boston, 2006.
[Kle56] S. C. Kleene. Representation of events in nerve nets and finite automata. In Automata Studies, pages 3–41. Princeton University Press, 1956.
[KS90] Paris C. Kanellakis and Scott A. Smolka. CCS expressions, finite state processes, and three problems of equivalence. Information and Computation, 86(1):43–68, 1990.
[Mil80] Robin Milner. A Calculus of Communicating Systems, volume 92 of Lecture Notes in Computer Science. Springer, 1980.
[Plo81] Gordon D. Plotkin. A structural approach to operational semantics. Technical report, Aarhus University, Denmark, 1981.
[Plo04] Gordon D. Plotkin. The origins of structural operational semantics. The Journal of Logic and Algebraic Programming, 60:3–15, 2004.
[Tho68] Ken Thompson. Programming techniques: Regular expression search algorithm. Communications of the ACM, 11(6):419–422, June 1968.
Appendix A. SOS Rules for √ and −→

Here are the inference rules used to define √. They are given in the form

    premises
    ──────────
    conclusion

with − denoting an empty list of premises.

    −          −           r1 √            r2 √            r1 √   r2 √
    ────       ─────       ───────────     ───────────     ───────────
    ε √        r∗ √        (r1 + r2) √     (r1 + r2) √     (r1 · r2) √
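Read operationally, these rules translate directly into a recursive predicate. The sketch below uses a tuple encoding of expressions of my own devising (('eps',), ('sym', a), ('alt', r1, r2), ('cat', r1, r2), ('star', r)); it is illustrative rather than the paper's code.

```python
# Recursive rendering of the rules for the acceptance predicate:
# nullable(r) holds exactly when the empty word belongs to L(r).

def nullable(r):
    tag = r[0]
    if tag == 'eps':
        return True                               # axiom for epsilon
    if tag == 'sym':
        return False                              # no rule for a bare symbol
    if tag == 'star':
        return True                               # axiom for r*
    if tag == 'alt':
        return nullable(r[1]) or nullable(r[2])   # the two rules for +
    if tag == 'cat':
        return nullable(r[1]) and nullable(r[2])  # the rule for ·
    raise ValueError('unknown operator: %r' % (tag,))
```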
Next are the rules for −→.

    −             r1 −a→ r1′            r2 −a→ r2′
    ────────      ────────────────      ────────────────
    a −a→ ε       r1 + r2 −a→ r1′       r1 + r2 −a→ r2′

    r1 −a→ r1′                 r1 √   r2 −a→ r2′        r −a→ r′
    ────────────────────       ──────────────────       ──────────────
    r1 · r2 −a→ r1′ · r2       r1 · r2 −a→ r2′          r∗ −a→ r′ · r∗
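Taken together with Algorithm 1, these rules yield a compact implementation. The Python sketch below again uses an illustrative tuple encoding; the names out, nullable and nfa are mine, not the paper's, and the acceptance predicate is repeated so that the sketch is self-contained.

```python
# out(r) computes the transitions licensed by the rules for -a->, one
# clause per rule; nfa(r) then performs the worklist construction of
# Algorithm 1.  Expressions are nested tuples: ('eps',), ('sym', a),
# ('alt', r1, r2), ('cat', r1, r2), ('star', r).

def nullable(r):
    # acceptance predicate (the rules for the check mark)
    tag = r[0]
    if tag in ('eps', 'star'):
        return True
    if tag == 'sym':
        return False
    if tag == 'alt':
        return nullable(r[1]) or nullable(r[2])
    if tag == 'cat':
        return nullable(r[1]) and nullable(r[2])
    raise ValueError(tag)

def out(r):
    # outgoing transitions {(a, r')} with r -a-> r'
    tag = r[0]
    if tag == 'eps':
        return set()
    if tag == 'sym':
        return {(r[1], ('eps',))}             # axiom: a -a-> eps
    if tag == 'alt':
        return out(r[1]) | out(r[2])          # the two rules for +
    if tag == 'cat':
        r1, r2 = r[1], r[2]
        moves = {(a, ('cat', s, r2)) for (a, s) in out(r1)}
        if nullable(r1):                      # rule with premise r1-check
            moves |= out(r2)
        return moves
    if tag == 'star':
        # r -a-> r'  implies  r* -a-> r' · r*
        return {(a, ('cat', s, r)) for (a, s) in out(r[1])}
    raise ValueError(tag)

def nfa(r):
    # worklist construction of M_r (Algorithm 1)
    Q, F, delta, W = {r}, set(), set(), [r]
    while W:
        q = W.pop()
        if nullable(q):
            F.add(q)                          # q is an accepting state
        for (a, q2) in out(q):
            delta.add((q, a, q2))
            if q2 not in Q:                   # q2 is a new expression
                Q.add(q2)
                W.append(q2)
    return Q, delta, r, F
```

On (abb + a)∗ this produces the four-state NFA of Figure 2, with the start state among the two accepting states.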