A Semantic Framework for PEGs
Sérgio Queiroz de Medeiros and Carlos Olarte∗
[email protected]@gmail.com
ECT, Federal University of Rio Grande do Norte, Natal, Brazil
Abstract
Parsing Expression Grammars (PEGs) are a recognition-based formalism for describing the syntactical and the lexical elements of a language. The main difference between Context-Free Grammars (CFGs) and PEGs lies in the interpretation of the choice operator: while the CFGs' unordered choice e | e′ is interpreted as the union of the languages recognized by e and e′, the PEGs' prioritized choice e / e′ discards e′ if e succeeds. This subtle but important difference changes the language recognized and yields more efficient parsing algorithms. This paper proposes a rewriting logic semantics for PEGs. We start with a rewrite theory giving meaning to the usual constructs in PEGs. Later, we show that cuts, a mechanism for controlling backtracks in PEGs, also find a natural representation in our framework. We generalize this mechanism, allowing for both local and global cuts with a precise, unified and formal semantics. Hence, our work strives for a better understanding and control of backtracks in parsers for PEGs. The semantics we propose is executable and, besides being a parser with modest efficiency, it can be used as a playground to test different optimization ideas. More importantly, it is a mathematical tool that can be used for different analyses.

CCS Concepts: • Theory of computation → Grammars and context-free languages; Rewrite systems; • Software and its engineering → Syntax; Parsers.

Keywords: parsing expression grammars, rewriting logic
ACM Reference Format:
Sérgio Queiroz de Medeiros and Carlos Olarte. 2020. A Semantic Framework for PEGs. In Proceedings of the 13th ACM SIGPLAN International Conference on Software Language Engineering (SLE '20), November 16–17, 2020, Virtual, USA. ACM, New York, NY, USA, 17 pages. https://doi.org/10.1145/3426425.3426944

∗ Carlos Olarte was funded by CNPq.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

SLE '20, November 16–17, 2020, Virtual, USA
© 2020 Association for Computing Machinery.
ACM ISBN 978-1-4503-8176-5/20/11...$15.00
https://doi.org/10.1145/3426425.3426944
1 Introduction

Parsing Expression Grammars (PEGs) [8] are the core of several widely used parsing tools. Visually, the description of a PEG is similar to that of a Context-Free Grammar (CFG). Unlike CFGs, PEGs have a deterministic ordered choice operator, which allows a limited form of backtracking. This makes PEGs a suitable formalism for representing deterministic context-free languages, i.e., the LR(k) languages. Another key difference between the two formalisms is the presence of syntactic predicates, which allow PEGs to describe the lexical elements of a language. PEGs were conceived as a formalism to recognize strings, while CFGs are commonly used to generate strings. Hence, it may not be trivial to determine the language described by a PEG.

Although PEGs' ordered choice operator avoids ambiguities when writing a grammar, it also poses some difficulties. To correctly recognize a language, the user of a PEG-based tool needs to be careful about the ordering of the alternatives in a choice e1 / e2, as e2 will never match a string x when e1 matches a prefix of x. Regarding performance, PEGs' local backtracking may still impose a drawback. Thus, when possible, it is desirable to avoid the backtracking associated with a choice.

In this paper we study the problem of backtracking in PEGs through the lens of a formal approach based on the rewriting logic (RL) [17] framework. RL can be seen as a flexible and general model of computation where important properties of the modeled system can be specified and proved. We thus provide both a formal foundation for better understanding and controlling backtracks in PEGs, and tools that can help the user check whether her grammar complies with the intended meaning. As an interesting side effect, we show that our specification is not only a (correct-by-construction) recognition-based algorithm, but also a derivative parser, able to generate strings from a grammar.

Plan and contributions.
After recalling PEGs in §2, we start in §3 with a rewrite theory modeling the natural semantics rules for PEGs originally proposed in [15]. Hence, we obtain a formal model of the derivability relation in PEGs.
Due to the small representational distance typical of RL specifications, the model and the actual system are very close, which makes it easier to reason about grammars in our framework. Inspired by a small-step semantics approach, §3.2 proposes an alternative (and equivalent) rewrite theory that eliminates the (unnecessary) non-deterministic steps of the first one. Both theories are executable in Maude [4], an efficient rewriting engine. We compare the two theories and show that the second one is more amenable to automatic verification techniques. Hence, besides being a formal tool for reasoning about PEGs, the proposed specification is actually a correct-by-construction parser with modest performance.

Cut operators [20] have been introduced in PEGs to reduce the number of backtracks and improve efficiency. However, the semantics of these operators has remained informal in the literature. Using RL, §4 gives a precise meaning to cuts. The proposed semantics makes evident why less memory is needed when cut annotations are added to a grammar. More importantly, this formal account allows us to generalize the concept of cuts from local to global cuts in §4.2. In some cases, global cuts may save more computations than the cuts proposed in [20]. In §4.3 and §4.4 we show that local and global cuts can coexist coherently, with a clear and uniform semantics. §4.5 reports some benchmarks on grammars annotated with cuts.

The machinery developed here can be leveraged to perform other analyses of PEGs. We explore one such analysis in §B, where we report on a preliminary attempt to use our specification not only to recognize a given string but to produce, symbolically, all the possible strings (up to a bounded length) of a grammar. This kind of analysis can be useful to highlight unexpected behaviors when a PEG is designed, and also to explore optimization ideas.

§5 discusses related work and §6 concludes the paper. The companion appendix contains detailed proofs of the main results.
All the rewrite theories proposed here and the benchmarks reported in §4.5 can be found (and reproduced) with the Maude and script files available at the public repository https://github.com/carlosolarte/RESPEG.

2 Parsing Expression Grammars

From now on, we shall write PEG to refer to the following definition of parsing expression grammars.
Definition 2.1 (Syntax). A PEG G is a tuple (V, T, P, e_ι), where V is a finite set of non-terminals, T is a finite set of terminals, P is a total function from non-terminals to parsing expressions and e_ι is the initial parsing expression. We shall use A, B, C to range over elements of V and a, b, c to range over elements of T. Parsing expressions, ranged over by e, e′, e′′, etc., are built from the syntax:

  e ::= ε | A | a | e e′ | e / e′ | e* | !e

We assume that V is partitioned into two (disjoint) sets V_Lex and V_Syn, where V_Lex is the set of non-terminals that match lexical elements, also known as tokens, and V_Syn contains the non-terminals that match syntactical elements. The empty parsing expression is ε. The expression e e′ stands for the (sequential) composition of e and e′. The expression e / e′ denotes an ordered choice. The repetition of e is written e*. The look-ahead or negative predicate is written !e; a predicate !e tests whether the expression e matches the input, without consuming it. Predicates are handy to describe lexical elements and to act as guards in choice alternatives.

The function P maps non-terminals to parsing expressions; P(A) denotes the expression associated to A. Alternatively, P can be seen as a set of rules of the form A ← e. Strings are built from terminal symbols and ε denotes the empty string. We shall use x, y, w to range over (possibly empty) strings, and xy denotes the concatenation of x and y.

Semantics.
A parsing expression e, when applied to an input string x, either succeeds or fails. When the matching of e succeeds, it consumes a prefix of the input. Such a prefix can be the empty string ε (and nothing is consumed). We define the states of a PEG parser as follows.

Definition 2.2 (States). Parsing states are built from

  S ::= G[e] x | fail | x

In G[e] x, the expression e, in the context of the PEG G, is matched against the string x. The state fail represents a failed matching attempt. The state x represents the successful matching of an expression, returning the suffix x.

Definition 2.3 (Semantics). The reduction relation
⇝ is the least binary relation on parsing states satisfying the rules in Fig. 1. The language recognized by e in the context of G is the set L(e, G) = {x | G[e] x ⇝ y, for some y}. Two parsing expressions are equivalent (in the context of G), notation e ≡ e′, if L(e, G) = L(e′, G).

Let us dwell upon the rules in Fig. 1. If G[e] x ⇝ y, the expression e consumes a (possibly empty) prefix of x and returns the remaining suffix y. For instance, rule term.1 consumes the terminal a in ax and returns x. If G[e] x ⇝ fail, e fails to match the input x (see term.2 and term.3). In G[e1 e2] x, either e1 fails to match x and the whole expression fails (seq.2), or e1 succeeds on x and the final result depends on the outcome of matching e2 against the remaining suffix y (seq.1). (Failing states can also take the form (fail, y), where y is the suffix of the input that could not be recognized; for the sake of presentation, we ignore the suffix y here.)

  (empty)   G[ε] x ⇝ x
  (var)     G[A] x ⇝ S         if G[P(A)] x ⇝ S
  (term.1)  G[a] ax ⇝ x
  (term.2)  G[b] ax ⇝ fail     if b ≠ a
  (term.3)  G[a] ε ⇝ fail
  (seq.1)   G[e1 e2] x ⇝ S     if G[e1] x ⇝ y and G[e2] y ⇝ S
  (seq.2)   G[e1 e2] x ⇝ fail  if G[e1] x ⇝ fail
  (ord.1)   G[e1 / e2] x ⇝ y   if G[e1] x ⇝ y
  (ord.2)   G[e1 / e2] x ⇝ S   if G[e1] x ⇝ fail and G[e2] x ⇝ S
  (not.1)   G[!e] x ⇝ x        if G[e] x ⇝ fail
  (not.2)   G[!e] x ⇝ fail     if G[e] x ⇝ y
  (rep.1)   G[e*] x ⇝ x        if G[e] x ⇝ fail
  (rep.2)   G[e*] x ⇝ z        if G[e] x ⇝ y and G[e*] y ⇝ z

Figure 1. Semantics of PEGs. S is a parsing state denoting either fail or a string x (Definition 2.2).

When the ordered choice e1 / e2 is applied to x, if e1 matches x, then the alternative e2 is discarded (ord.1). (Note the difference w.r.t. CFGs.) On the other hand, if e1 fails, a backtrack is performed and e2 is applied to the whole string x, regardless of whether e1 consumed part of it or not (ord.2).

The look-ahead operator !e (or negative predicate) fails when e succeeds (not.2) and succeeds (consuming no input) when e fails (not.1). Note that the expression !e does not consume any prefix of the input.

The repetition e* greedily matches e against the input. If e fails on x, e* does not consume any input (rep.1). Rule rep.2 specifies the recursive case where e* continues matching the suffix y. Note that this operator is different from the usual one in regular expressions: G[a*] xy ⇝ y iff x is a (possibly empty) string containing only the terminal symbol a and y is either empty or starts with some b ≠ a. This operator can in fact be derived from the others: let e be a parsing expression and A_e be a distinguished non-terminal symbol. Then, e* is equivalent to A_e where A_e ← e A_e / ε [8]. We shall keep this operator in the syntax since it greatly simplifies some examples.

Clearly, if G[e] x ⇝ y then y is a suffix of x. In order to guarantee termination of ⇝, a well-formedness condition on grammars is assumed [8]: there are no left-recursive rules such as A ← A e / e′, and there are no expressions of the form e* where e succeeds on some input x without consuming any prefix of it ([25]). Unlike CFGs, PEGs are deterministic [8]: ⇝ is a function.

Without loss of generality, our analyses consider only PEGs that satisfy the unique token prefix condition [6]. Roughly, tokens of the grammar are described by non-terminals A ∈ V_Lex. Hence, at most one non-terminal A ∈ V_Lex matches a prefix of the current input. This is the typical behavior of parsing tools that have a separate lexer (e.g., yacc gets tokens from lex for a given input).
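To make the rules of Fig. 1 concrete, here is a small Python transcription (an illustrative sketch of ours, not part of the paper's Maude development): expressions are tuples tagged with their constructor, a grammar is a dictionary from non-terminal names to expressions, and None plays the role of fail.

```python
# Illustrative Python reading of the natural-semantics rules in Fig. 1.
# Expressions: ("eps",) | ("t", a) | ("nt", A) | ("seq", e1, e2)
#            | ("ord", e1, e2) | ("star", e) | ("not", e)
def peg(G, e, x):
    """Return the unconsumed suffix of x on success, or None (fail)."""
    tag = e[0]
    if tag == "eps":                      # (empty): consume nothing
        return x
    if tag == "t":                        # (term.1), (term.2), (term.3)
        return x[1:] if x[:1] == e[1] else None
    if tag == "nt":                       # (var): unfold P(A)
        return peg(G, G[e[1]], x)
    if tag == "seq":                      # (seq.1), (seq.2)
        y = peg(G, e[1], x)
        return None if y is None else peg(G, e[2], y)
    if tag == "ord":                      # (ord.1), (ord.2): backtrack to x
        s = peg(G, e[1], x)
        return s if s is not None else peg(G, e[2], x)
    if tag == "not":                      # (not.1), (not.2): consume nothing
        return x if peg(G, e[1], x) is None else None
    if tag == "star":                     # (rep.1), (rep.2): greedy repetition
        y = peg(G, e[1], x)
        return x if y is None else peg(G, e, y)
    raise ValueError(f"unknown expression: {e!r}")

# The ordering pitfall from the introduction: in "a" / "a" "b", the second
# alternative is unreachable because "a" already matches a prefix of "ab".
bad = ("ord", ("t", "a"), ("seq", ("t", "a"), ("t", "b")))
print(peg({}, bad, "ab"))                 # "b": only "a" was consumed
```

For instance, peg({"A": ("t", "a")}, ("star", ("nt", "A")), "aab") returns "b", mirroring rules rep.1 and rep.2.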
Definition 2.4 (Unique token prefix). Let G = (V, T, P, e_ι) with V = V_Lex ∪ V_Syn, let A, B ∈ V_Lex with A ≠ B, let a ∈ T and let xy be a string. G has the unique token prefix property iff G[A] axy ⇝ y implies G[B] axy ⇝ fail.

3 A Rewriting Logic Semantics for PEGs

This section proposes an executable rewriting logic semantics for PEGs that we later use to control backtracks (§4) and as a basis for a derivative parser (§B).

Rewriting Logic (RL) [17] is a general model of computation in which proof systems, semantics of programming languages and, in general, transition systems can be specified and verified. RL can be seen as a logic of change that naturally deals with states and concurrent computations. The reader can find detailed surveys of RL in [18] and [7].

In the following, we briefly introduce the main concepts of RL needed to understand this paper and, at the same time, introduce step by step the proposed semantics for PEGs. For the sake of readability, we shall in most cases adopt the notation of Maude [4], a high-level language that supports membership equational logic and rewriting logic specifications. Thanks to its efficient rewriting engine and its metalanguage capabilities, Maude turns out to be an excellent tool for creating executable environments for various logics and models of computation. This makes our specification executable.

A rewrite theory is a tuple R = (Σ, E ⊎ B, R). The static behavior of the system is modeled by the order-sorted equational theory (Σ, E ⊎ B) and the dynamic behavior by the set of rewrite rules R. These components are explained below.

Equational theory.
The signature Σ defines a set of typed operators used to build the terms of the language (i.e., the syntax of the modeled system). E is a set of equations over T_Σ (the set of terms built from Σ) of the form t = t′ if φ. The equations specify the algebraic identities that terms of the language must satisfy (e.g., |ε| = 0 and |ax| = 1 + |x|, where |x| denotes the length of x). B is a set of structural axioms over T_Σ for which there is a finitary matching algorithm. Such axioms include associativity, commutativity, and identity, or combinations of them. For instance, ε is the identity for concatenation and then, modulo this axiom, the term xε is equivalent to x. The equational theory associated to R thus defines deterministic and finite computations, as in a functional programming language.

Our semantics for PEGs starts by defining an appropriate equational theory or functional module (fmod, in Maude's terminology). Here we define the set of terminal and non-terminal symbols and strings:

  fmod PEG-SYNTAX is
    sort NTSymbol .     --- Non-terminal symbols
    sort TSymbol .      --- Terminal symbols
    sort Str .          --- Strings
    sorts TChar TExp .  --- Chars and character classes
    ...                 --- to be completed/explained below
  endfm
First, sorts (or types) are declared. As we shall see, TChar is the basic block for building terms of type Str (by juxtaposition/concatenation), and TExp will be populated with the usual patterns such as [0-9], [a-z], etc. As said before, the equational theory is order-sorted: besides having many sorts, there is a partial order on sorts defining a sub-typing relation:

  subsorts String < TChar < Str .   --- Sub-typing
  subsorts TChar TExp < TSymbol .
  subsort Qid < NTSymbol .

The sort
String is part of the standard library of Maude and represents the usual strings of programming languages. Hence, "(" is a Maude String and also a TChar in this specification (String being the least sort). Note that terms of type TChar and TExp are also terminal symbols due to the subsort relation. Hence, examples of valid terminal symbols are ";", "c", "3", [0-9], etc. A Qid is a qualified identifier of the form 'X (with an apostrophe at the beginning of the expression). Examples of non-terminal symbols are 'Begin, 'Statement, etc. To make the presentation cleaner, in the forthcoming examples we shall omit the apostrophe and simply write Begin instead of 'Begin.

Besides the sorts, the signature also specifies the functional symbols or operators that define the syntax of the model. Here are some examples of constructors for the type TExp:

  op [.] : -> TExp .   --- Any character
  ops [0-9] [a-z] [A-Z] ... : -> TExp .

All these operators are constants (functions without parameters) of type TExp. For the sort Str, we have:

  op eps : -> Str .    --- empty string
  --- Strings (concatenated with whitespace)
  op __ : TChar Str -> Str [right id: eps] .

eps is the constant denoting the empty string ε. In Maude, "_" marks the position of an argument. The operator __ (usually called empty syntax in Maude) receives two parameters and returns a Str. Note that eps is the right identity for this operator (an axiom associated to __), i.e., xε ≡ x. As an example, the term "(" "3" "+" "2" ")" is an inhabitant of Str. For the sake of readability, when no confusion arises, we shall omit the quotes on characters and simply write ( 3 + 2 ).

Deterministic and terminating computations are specified via equations. In fact, an equational theory is executable only if it is terminating, confluent and sort-decreasing [4]. Under these conditions, the mathematical meaning of the equality t ≡ t′ coincides with the following strategy: reduce t and t′ to their unique (due to termination and confluence) normal forms t↓ and t′↓, using the equations of the theory as simplification rules from left to right. Then, t ≡ t′ iff t↓ =B t′↓ (note that =B, equality modulo B, is decidable since a finitary matching algorithm for B is assumed).

As an example, the following specification checks whether a terminal symbol matches a TChar:

  op match : TSymbol TChar -> Bool .
  vars tc tc' : TChar .   --- logical variables
  eq match(tc, tc') = tc == tc' .
  eq match([.], tc) = true .
  eq match([0-9], tc) = tc >= "0" and-then tc <= "9" .
  ...

The == symbol is built-in in Maude (in this case, checking whether two Strings are equal). Variables in equations are implicitly universally quantified, and the second equation must be read as

  (∀ tc : TChar) . match([.], tc) = true

An equation rewrites its left-hand side into its right-hand side, simplifying terms. For instance, the term match([.], "a") reduces to true.
The syntax of parsing expressions is defined as follows:

  sort Exp .              --- Parsing expressions
  op emp : -> Exp .       --- the empty expression
  --- Terminal and non-terminal symbols are expressions
  subsorts TSymbol NTSymbol < Exp .
  op _._ : Exp Exp -> Exp .   --- Sequence
  op _/_ : Exp Exp -> Exp .   --- Ordered choice
  op _* : Exp -> Exp .        --- Repetition
  op !_ : Exp -> Exp .        --- Negative predicate
Note that sequential composition is represented as e . e'. For the sake of readability, we shall omit the "." and simply write e e' when no confusion arises.

It is also possible to specify derived constructors by defining appropriate equations. For instance, the and-predicate &e can be defined as !!e [8]:

  op &_ : Exp -> Exp .   --- derived operator
  eq & e = ! ! e .       --- equation giving meaning to & e

Hence, &e attempts to match e and, if it succeeds, backtracks to the starting point (not consuming any part of the input). More examples of derived constructors:

  op _? : Exp -> Exp .   --- zero or one
  op _+ : Exp -> Exp .   --- one or more
  eq e ? = e / emp .
  eq e + = e . e * .
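The same derived operators can be written as plain abbreviations over a tuple encoding of expressions (again an illustrative sketch of ours, not the Maude source), mirroring the equations & e = !!e, e ? = e / emp, and e + = e . e*:

```python
# Derived PEG operators as abbreviations over core constructors.
def and_pred(e):                 # & e : succeeds iff e does, consumes nothing
    return ("not", ("not", e))

def opt(e):                      # e ? : zero or one
    return ("ord", e, ("eps",))

def plus(e):                     # e + : one or more
    return ("seq", e, ("star", e))

print(and_pred(("t", "a")))      # ('not', ('not', ('t', 'a')))
```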
Production rules (A ← e), grammars (as sets of production rules) and parsing states are defined as follows:

  sorts Rule Grammar State .
  subsort Rule < Grammar .   --- Every rule is a grammar
  op _<-_ : NTSymbol Exp -> Rule .
  op nil : -> Grammar .      --- empty set of rules
  --- Concatenating rules
  op _,_ : Grammar Grammar -> Grammar [comm assoc id: nil] .
  --- Parsing states
  op _[_]_ : Grammar Exp Str -> State .   --- G[e]x
  op fail : -> State .       --- fail
  subsort Str < State .      --- x is also a state

Note the axioms imposed on the operator for "concatenating" rules: the order of the rules (comm, for commutativity) as well as the parenthesization (assoc, for associativity) are irrelevant. Moreover, nil can be removed/added at will.

Thanks to the almost zero representational distance [18] between the above specification and the syntax of PEGs, we can see the system and its specification as isomorphic structures (with slightly different notations). From now on, it should be clear from the context that G in the expression "G[e]x" is a grammar, while G in the monospace "G[e]x" is a term of sort Grammar. As noticed, we consistently use the monospace font for objects in the specification.

This concludes the specification of the equational theory defining the syntax and basic operations on PEGs. The next step is to define rules encoding the semantics of PEG's constructors.
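As an aside, the Maude mixfix declarations above can be mimicked in other host languages. As a hedged illustration (our own encoding, not from the paper), Python operator overloading recovers a similar surface syntax for building expression trees:

```python
# Hypothetical embedding of the Exp constructors via operator overloading:
# e1 / e2 builds an ordered choice, e1 * e2 a sequence (Maude writes e . e'),
# ~e a negative predicate, and e.star() a repetition.
class Exp:
    def __init__(self, tag, *kids):
        self.tag, self.kids = tag, kids
    def __truediv__(self, other):   # e / e' : ordered choice
        return Exp("ord", self, other)
    def __mul__(self, other):       # e * e' : sequence
        return Exp("seq", self, other)
    def __invert__(self):           # ~e : negative predicate !e
        return Exp("not", self)
    def star(self):                 # e.star() : repetition e*
        return Exp("star", self)
    def __repr__(self):
        return self.tag if not self.kids else f"{self.tag}{list(self.kids)}"

def t(a):  return Exp("t", a)       # terminal
def nt(A): return Exp("nt", A)      # non-terminal

e = (t("a") / t("b")).star() * ~t("c")   # (a / b)* !c
print(e.tag)   # "seq"
```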
Rewrite rules. The last component R in the rewrite theory R = (Σ, E ⊎ B, R) is a finite set of rewriting rules of the form t → t′ if φ. A rule defines a state transformation, and R models the dynamic behavior of the system (which is not necessarily deterministic, nor terminating). In Maude, rewrite theories are defined in modules:

  mod PEG-SEMANTICS is
    pr PEG-SYNTAX .   --- Importing PEG-SYNTAX
    var x : Str .
    var G : Grammar .
    rl [empty] : G[ emp ] x => x .
    ...
  endm

"empty" is the name of the rule and variables are implicitly quantified. Hence, the above rule must be interpreted as:

  (∀ G : Grammar, x : Str) . (G[emp] x ⇒ x)

This rule reflects the behavior of the rule empty in Fig. 1. The rules for terminal symbols are:

  rl [Terminal12] : G[t] tc x => if match(t, tc) then x else fail fi .
  rl [Terminal3]  : G[t] eps => fail .
The first rule encodes both term.1 and term.2, where the outcome depends on whether t matches tc. The second rule encodes the behavior of term.3. The other rules are conditional rules (crl instead of rl), where the transition happens only if the condition after the symbol if holds:

  crl [NTerminal] : (G, N <- e) [N] x => S
    if (G, N <- e) [e] x => S .
  crl [Seq1] : G[e . e'] x => S
    if G[ e ] x => y /\ G[ e' ] y => S .
  crl [Seq2] : G[e . e'] x => fail
    if G[ e ] x => fail .
  crl [Choice1] : G[ e / e' ] x => y
    if G[ e ] x => y .
  crl [Choice2] : G[e / e'] x => S
    if G[ e ] x => fail /\ G[ e' ] x => S .
  crl [Star1] : G[ e * ] x => x
    if G[ e ] x => fail .
  crl [Star2] : G[ e * ] x => z
    if G[ e ] x => y /\ G[e *] y => z .
  crl [Neg1] : G[ ! e ] x => x
    if G[ e ] x => fail .
  crl [Neg2] : G[ ! e ] x => fail
    if G[ e ] x => y .

In the case of [NTerminal], due to the axioms of the operator _,_ (i.e., [comm assoc id: nil]), the order of the rules is irrelevant. Hence, if the current expression is N (N is a variable of sort NTSymbol), this rule unfolds the production rule N ← e and tries to match e against the current input. In Seq1, it must be the case that G[e] x reduces to y, and the configuration G[e'] y must reduce to the state S. This reflects exactly the behavior of the rule seq.1. The other rules can be explained similarly.

A rewrite theory R induces a rewrite relation →R on T_Σ(X) (the set of terms built from Σ and the countable set of variables X), defined for every t, u ∈ T_Σ(X) by t →R u if and only if there is a rule (l → r if φ) ∈ R and a substitution θ : X → T_Σ(X) satisfying t =_{E⊎B} lθ, u =_{E⊎B} rθ, and φθ is (equationally) provable from E ⊎ B [2]. In words, t matches modulo E ⊎ B the left-hand side of the rewrite rule under a suitable substitution, and then evolves into the right-hand side of the rule if the condition φθ holds. The relation →R* is the reflexive and transitive closure of →R. Moreover, we shall use t →R! t′ to denote that t →R* t′ and t′ cannot be further reduced (t′ is called a normal form).

We shall use ⇒ to denote the rewrite relation induced by the above-described theory. As a simple example, note that G["a" . "b"] "a" "b" ⇒! eps. Note also that the states x (for any Str x) and fail are the only normal forms of ⇒.
Theorem 3.1 (Adequacy). Let G be a grammar, e a parsing expression and x a string. Then the following holds: (1) G[e] x ⇝ y iff G[e]x ⇒! y. (2) G[e] x ⇝ fail iff G[e]x ⇒! fail.

Proof (sketch). We can show that, for any state S, G[e] x ⇝ S iff G[e]x ⇒! S. This discharges both (1) and (2). Note that such an S must be a normal form and then, either S = fail or S = x for some string x. The implication (⇒) is proved by induction on the height of the derivation of G[e] x ⇝ S. For the (⇐) side, assume that t = G[e]x ⇒! S. This means that there exist n > 0 (since G[e] x is not a normal form) and a derivation of the form t = t0 ⇒ t1 ⇒ ··· ⇒ tn = S, where tn cannot be further reduced. Due to the side conditions in the rules, each step in this derivation may include the application of several rules. Hence, we proceed by induction on m, where m is the total number of rules applied (including side conditions) in the above derivation. See §A for more details. □

We can use this theory as a (naive) PEG parser. It suffices to use Maude's rewriting mechanism on a PEG state:

  rewrite ('A <- "a" , 'B <- "b") ['A . 'B] "a" "b" "c" .
  result State: "c"
As noticed, our specification follows exactly the semantic rules in Fig. 1, which makes it easier to prove its correctness. However, the resulting rewrite theory is not completely satisfactory due to the conditional rules used in choices, sequences and negative predicates: a whole derivation must be built to check whether the associated conditions are met or not. As we know, PEGs are deterministic, and it should be possible to make a more guided decision on, e.g., whether ord.1 or ord.2 should be used on a given string. For that, the strategy on e1 / e2 could be: reduce e1 and, in the end, decide whether or not e2 is discarded. This reduction strategy resembles a small-step semantics and we shall explore it in the next section. We shall show that the new specification is more efficient for checking x ∈ L(G) and, more importantly, that it is appropriate for other (symbolic) analyses.

3.2 An Alternative Rewrite Theory

Semantic and logical frameworks are adequate tools for specifying and reasoning about different systems. By choosing the right abstractions in the framework, we can also obtain efficient procedures for such specifications. In the following, we introduce a rewrite theory that gives an alternative representation of choices and failures in PEGs. We first introduce semantic-level constructs for negation, choices, sequential composition and repetition:

  op NEG    : State State -> State [frozen (2)] .
  op CHOICE : State State -> State [frozen (2)] .
  op COMP   : State State -> State [frozen (2)] .
  op STAR   : State State -> State [frozen (2)] .
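Before turning to the rules that define these operators (given below in the text), their intended operational reading can be previewed in Python (a sketch of ours, not the Maude module): evaluation is a loop that only ever reduces the leftmost state inside a COMP/CHOICE/STAR/NEG term, which is exactly what frozen(2) enforces.

```python
# Sketch of the small-step theory: states are terms and step() applies one rule.
# ("st", e, x): the configuration G[e]x;  ("suf", x): success with suffix x;
# "fail": failure. The second argument of each frame stays frozen: we keep only
# the data needed to resume (a simplification of the Maude rules).
def step(G, s):
    if s[0] == "st":
        e, x = s[1], s[2]
        tag = e[0]
        if tag == "eps":  return ("suf", x)
        if tag == "t":    return ("suf", x[1:]) if x[:1] == e[1] else "fail"
        if tag == "nt":   return ("st", G[e[1]], x)                       # [NTerm]
        if tag == "seq":  return ("COMP", ("st", e[1], x), e[2])          # [Sequence]
        if tag == "ord":  return ("CHOICE", ("st", e[1], x), (e[2], x))   # [Choice]
        if tag == "star": return ("STAR", ("st", e[1], x), (e[1], x))     # [Star]
        if tag == "not":  return ("NEG", ("st", e[1], x), x)              # [Negative]
        raise ValueError(f"unknown expression: {e!r}")
    frame, inner, arg = s
    if inner == "fail" or inner[0] == "suf":        # first argument is a normal form
        if frame == "COMP":                         # [Seq1]/[Seq2]
            return "fail" if inner == "fail" else ("st", arg, inner[1])
        if frame == "CHOICE":                       # [Choice1]/[Choice2]
            return ("st", arg[0], arg[1]) if inner == "fail" else inner
        if frame == "STAR":                         # [Star1]/[Star2]
            e, x = arg
            return ("suf", x) if inner == "fail" else \
                   ("STAR", ("st", e, inner[1]), (e, inner[1]))
        if frame == "NEG":                          # [Neg1]/[Neg2]
            return ("suf", arg) if inner == "fail" else "fail"
    return (frame, step(G, inner), arg)             # frozen(2): reduce left state only

def run(G, e, x):
    s = ("st", e, x)
    while s != "fail" and s[0] != "suf":
        s = step(G, s)
    return s

print(run({"A": ("t", "a")}, ("seq", ("nt", "A"), ("t", "b")), "abc"))
```

The rule names in the comments anticipate the Maude rules introduced next; note that, unlike the conditional theory, each step here is an unconditional local rewrite.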
Note that these operators are defined on states. The attribute frozen will be explained shortly. The rules empty, Terminal12 and Terminal3 are the same as in the previous theory; note that those rules are not conditional. Let us introduce the new rules:

  rl [NTerm] : (G, N <- e) [N] x => (G, N <- e) [e] x .
  rl [Sequence] : G[e . e'] x => COMP(G[e] x , G[e'] x) .
  rl [Seq1] : COMP(y , G[e'] x) => G[e'] y .
  rl [Seq2] : COMP(fail , S) => fail .

In NTerm, N is "simplified" to the corresponding expression e. The rule Sequence replaces the PEG constructor e . e' with the corresponding semantic-level constructor COMP. The meaning of COMP is given by the last two rules: if the first component succeeds returning y, then the second expression e' is evaluated against the input y; if the first component fails, then the whole expression fails. The rules for the other constructors follow similarly:

  rl [Choice]  : G[e / e'] x => CHOICE(G[e] x , G[e'] x) .
  rl [Choice1] : CHOICE(x , S) => x .
  rl [Choice2] : CHOICE(fail , S) => S .
  rl [Star]  : G[e *] x => STAR(G[e] x , G[e *] x) .
  rl [Star1] : STAR(fail , G[e *] x) => x .
  rl [Star2] : STAR(y , G[e *] x) => STAR(G[e] y , G[e *] y) .
  rl [Negative] : G[! e] x => NEG(G[e] x , G[! e] x) .
  rl [Neg1] : NEG(fail , G[! e] x) => x .
  rl [Neg2] : NEG(y , S) => fail .

The first rule in each case reduces the current expression to the corresponding constructor; the second and third rules decide whether the expression fails or succeeds. For instance, Choice1 discards the second alternative if the first one succeeds, and Choice2 selects the second alternative if the first one fails.

Rewriting logic is an inherently concurrent formalism where rules can be applied at any position/subterm of a bigger expression. Hence, the attribute frozen is important to guarantee the adequacy of our specification: it indicates that the second argument (of sort State) of the operators COMP, CHOICE, STAR and NEG cannot be subject to reduction. In words, the state S' in CHOICE(S, S') is not reduced until the first component is completely reduced.

We shall use → to denote the rewrite relation induced by the above specification.

Theorem 3.2 (Adequacy). Let G be a grammar, e a parsing expression and x a string. Then the following holds: (1) G[e]x ⇒! y iff G[e]x →! y. (2) G[e]x ⇒! fail iff G[e]x →! fail.

Proof.
We shall prove (1) and (2) simultaneously, i.e., we shall show that G[e]x ⇒! S iff G[e]x →! S for S ∈ {fail, x}.

(⇒) We proceed by induction on the total number of rules (including side conditions) used in the derivation G[e]x ⇒! S. The result follows by noticing that, in ⇒, the conditions of the rules are satisfied by shorter derivations; by induction, one can show that the second and third rules in the respective cases of → mimic the same behavior.

(⇐) Assume that G[e]x →! S. There are no side conditions, but there are some extra intermediate steps due to the semantic-level constructors CHOICE, COMP, etc. and the rules Choice, Sequence, etc. We note that COMP(S, S') is not a normal form: if S is a normal form, then Seq1 or Seq2 can be used to continue reducing the term. Also, due to the frozen attribute, S' is never reduced in the scope of the COMP constructor. Similar observations apply to CHOICE, NEG and STAR. We then proceed by induction on the total number of steps needed to show G[e]x →! S. The base cases are easy. Consider the inductive case where the derivation starts with the rule Negative. Due to the above observations, we are in the following situation:

  G[!e] x → NEG(G[e] x, G[!e] x) →* NEG(S_c, G[!e] x) → S′

where S_c is a normal form and the rule applied on that state is either Neg1 or Neg2, depending on S_c. This means that there is a shorter derivation for G[e]x →! S_c, and the result follows by induction, applying not.1 or not.2 accordingly. □

From the above result, determinism and Theorem 3.1, we know that
PEG→ is a function. A more direct proof of that is also possible: simply inspect the rules and notice that all the left-hand sides are pairwise distinct and only the leftmost state can be reduced (due to the frozen attribute). The analyses we are currently working on (see §6 and §B) use symbolic techniques in Maude, such as narrowing, that do not support conditional rewrite rules. This is one of the reasons to prefer PEG→ over PEG⇒. Moreover, by avoiding conditional rules, PEG→ outperforms PEG⇒ for membership checking. As a simple benchmark, consider the following grammar recognizing the language 𝑎ⁿ𝑏ⁿ𝑐ⁿ: S <- ( &(R1 "c") ) ("a" +) R2 (! [.]) , R1 <- "a" (R1 ?) "b", R2 <- "b" (R2 ?) "c" In PEG⇒, the instance 𝑛 = … is recognized in 30 sec, and the instance 𝑛 = … in 4.1 min. Hence, this specification will be of little help for practical purposes. PEG→ recognizes the instance 𝑛 = … in less than one second. Although some improvements can be made, such as using indices to avoid carrying in each state the fragment of the string being processed, state-of-the-art parsers for PEGs (e.g., Mouse [26], Rats! [10] and Parboiled [22]) can certainly do much better than that. Our goal is not to build a parser but to provide a formal framework for PEGs and explore reasoning techniques for it. Hence, we prefer to keep the specification as simple and general as possible, to widen the spectrum of analyses that can be performed on it. The reader has probably noticed the simplicity of the formal specification, derived directly from the syntax and semantic rules presented in §2. This section builds on the previous rewrite theory to study backtracks in ordered choices. For that, we first introduce semantic rules for the cut operator in [20]. By studying formally the meaning of cuts, we propose a deeper cut operator (§4.2) that is able to save more computations.
In §4.3 we show how local and global/deeper cuts can be uniformly introduced in PEGs, and some extra optimizations are presented in §4.4. Finally, §4.5 reports some experiments on grammars annotated with cuts.
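To make the semantics of §2 concrete outside of Maude, the following is a minimal Python sketch of a PEG interpreter (a hypothetical illustration, not the rewrite theory itself): expressions are nested tuples and parse returns the remaining input on success or the sentinel FAIL, mirroring the states of the formal rules. It encodes the aⁿbⁿcⁿ benchmark grammar above.

```python
# Minimal PEG interpreter sketch (hypothetical; not the Maude specification).
# parse(g, e, x) returns the remaining input on success, or FAIL, mirroring
# the states y / fail of the semantics.
FAIL = object()

def parse(g, e, x):
    op = e[0]
    if op == "eps":                                   # empty expression
        return x
    if op == "chr":                                   # terminal symbol
        return x[1:] if x[:1] == e[1] else FAIL
    if op == "any":                                   # [.] : any character
        return x[1:] if x else FAIL
    if op == "nt":                                    # non-terminal
        return parse(g, g[e[1]], x)
    if op == "seq":                                   # e1 e2
        r = parse(g, e[1], x)
        return FAIL if r is FAIL else parse(g, e[2], r)
    if op == "choice":                                # prioritized choice e1 / e2
        r = parse(g, e[1], x)
        return parse(g, e[2], x) if r is FAIL else r
    if op == "star":                                  # e* : repeat; never fails
        while (r := parse(g, e[1], x)) is not FAIL:
            x = r
        return x
    if op == "not":                                   # !e : succeeds iff e fails
        return x if parse(g, e[1], x) is FAIL else FAIL
    if op == "and":                                   # &e = !!e
        return FAIL if parse(g, e[1], x) is FAIL else x
    raise ValueError(op)

def opt(e):                                           # e? = e / eps
    return ("choice", e, ("eps",))

# The grammar for a^n b^n c^n from the benchmark:
#   S  <- ( &(R1 "c") ) ("a" +) R2 (! [.])
#   R1 <- "a" (R1 ?) "b"        R2 <- "b" (R2 ?) "c"
G = {
    "S": ("seq", ("and", ("seq", ("nt", "R1"), ("chr", "c"))),
          ("seq", ("seq", ("chr", "a"), ("star", ("chr", "a"))),
           ("seq", ("nt", "R2"), ("not", ("any",))))),
    "R1": ("seq", ("chr", "a"), ("seq", opt(("nt", "R1")), ("chr", "b"))),
    "R2": ("seq", ("chr", "b"), ("seq", opt(("nt", "R2")), ("chr", "c"))),
}
```

On "aabbcc" the start expression consumes the whole input, while the choice a / ab on input "ab" commits to its first alternative and leaves "b" unread; the latter is precisely where the prioritized choice of PEGs diverges from the unordered choice of CFGs.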
The cut operator, inspired by Prolog's cut, was proposed in [20] with the aim of better controlling backtracks in PEGs. The idea is to annotate the grammar 𝐺 with cuts, leading to a modified grammar 𝐺′ in such a way that the languages recognized by 𝐺 and 𝐺′ are the same, but 𝐺′ may avoid some backtracks in choices during parsing. In fact, as shown in [20], this technique allows for dynamically reclaiming the unnecessary space for memoization in those branches. Cuts, as proposed in [20], cannot be introduced at arbitrary positions of a parsing expression. They only make sense when the backtracking mechanism needs to be controlled. Hence, there is a restricted syntax for them: 𝑒 ::= ... | 𝑒₁ ↑ 𝑒₂ / 𝑒₃ | (𝑒₁ ↑ 𝑒₂)∗ The intended meaning of the ↑ operator is: - In 𝑒₁ ↑ 𝑒₂ / 𝑒₃, if 𝑒₁ succeeds, 𝑒₂ is evaluated and 𝑒₃ is never considered, even if 𝑒₂ fails. - In (𝑒₁ ↑ 𝑒₂)∗, when 𝑒₁ fails, the entire expression succeeds. If 𝑒₂ fails (after the successful matching of 𝑒₁), the whole expression fails. Consider for instance the following grammar: S <- TIF ↑ "(" ... / TWHILE ↑ "(" ... / TFOR ... Clearly, if the token "if" was read from the input, a failure occurring right after that point does not need to backtrack to consider the other alternatives, which will inevitably also fail. The introduction of the cut guarantees that: (1) information about the other alternatives can be discarded once the rule TIF succeeds (thus saving space); and (2) there is no need to explore other alternatives after failure (thus failing faster). Adding cuts to a grammar needs to be done carefully. For instance, the languages generated by the expressions 𝑎 𝑒₁ / 𝑎 𝑒₂ and 𝑎 ↑ 𝑒₁ / 𝑎 𝑒₂ are not necessarily the same. Algorithms for adding cuts to grammars are proposed in [20]. Roughly, the
FIRST set is used to check whether the alternatives are disjoint. If this is the case, it is safe to introduce a cut. We can formally state the semantics of cuts in the scope of choices through the following rule: rl [Choice^] : CHOICE(G[↑ e] x, S) => G[e] x . This rule reflects the fact that, once the symbol ↑ is found in the context of an ordered choice, the second alternative (the variable S of type State) can be discarded. Let us analyze the case of repetition. Since this operator can be encoded using recursion, one may simply use the previous definition. However, in our framework, * is not a derived constructor and we are forced to give meaning to (𝑒 ↑ 𝑒′)∗. The first thing we note is that the expression (𝑒 𝑒′)∗ never fails, but (𝑒 ↑ 𝑒′)∗ may fail. For instance, (𝑎𝑏)∗ recognizes the string "abac" (returning "ac") while (𝑎 ↑ 𝑏)∗ fails to match the same string. This means that the use of cuts in the context of repetitions requires extra care. In fact, the grammar transformations proposed in [20] translate expressions of the form 𝑒 ∗ 𝑒′ into (!𝑒′ ↑ 𝑒)∗ 𝑒′ whenever 𝑒, 𝑒′ are both non-nullable expressions [25] (i.e., they do not recognize the empty string) and 𝑒, 𝑒′ are disjoint. Hence, cuts in repetitions appear in very controlled ways. We can give meaning to (𝑒 ↑ 𝑒′)∗ by dividing its execution into two steps: when processing 𝑒, failures cause the success of the whole expression; when processing 𝑒′, a failure must cause the failure of the whole expression. For that, besides the semantic-level operator STAR introduced in the previous section, we also consider the following one: op STAR^ : State State -> State [ctor frozen (2)] .
This operator will be used to keep track of the execution of the expression 𝑒′ as follows: rl [Star] : STAR(G[↑ e'] x', S) => STAR^(G[e'] x', S) . rl [Star^] : STAR^(y, G[e *] x) => STAR(G[e] y, G[e *] y) . rl [Star^] : STAR^(fail, S) => fail . The first rule makes the transition from
STAR to STAR^. This happens when the current expression is a cut. The second rule models the recursive behavior: if 𝑒′ succeeds, then we are back into the STAR state. The third rule reduces to fail if 𝑒′ fails. These rules formalize the behavior of cuts in agreement with the intended meaning in [20]. However, the treatment of cuts does not look uniform: the meaning of ↑ is given in the context of other parsing expressions and not in a general way. The next section shows that it is possible to give a more precise meaning to failures, cuts and backtracks. In doing that, we propose a deeper cut operator with a more elegant semantics that, in some cases, may avoid more backtracks than ↑. The cut operator ↑ acts locally. Take for instance the PEG A <- T1 ↑ e1 / T2 ↑ e2  B <- T3 ↑ e3 / T4 ↑ e4  C <- A / B where T1, T2, T3, T4 are lexical non-terminal symbols and the grammar satisfies the unique token prefix condition (Def. 2.4). Assume also that: the input starts with the string recognized by T1; the initial expression is C; and the expression e1 fails to match the input. Due to the semantics of ↑, the second alternative in A is discarded and A fails. However, such failure does not propagate to C and the alternative B is tried against the input. Since all the tokens are disjoint, after an attempt on T3 and T4, the expression C finally fails. The key point is: failures in sub-expressions are confined and they do not propagate to outer levels. In the following, we propose a new operator that acts as a cut but globally. We first extend the set of states (Def. 2.2) with the constant error, denoting the fact that an unrecoverable error has been thrown and therefore the global expression must fail: op error : -> State . Moreover, we add to the syntax of parsing expressions the operator ⇑ that, intuitively, throws an error: op throw : -> Exp . --- Representation of ⇑  op check : Exp -> Exp . --- Derived constructor  eq check(E) = E / throw .
The expression check(e) tries to match the expression e on the current input. If it fails, an error is thrown. As we saw, the definition of the cut operator ↑ is problematic, since its behavior depends on the context where it is evaluated. The operator ⇑ can appear in any context and its semantics will be given in a uniform way. In fact, the definition is quite simple: on any input, ⇑ raises an error: rl [throw] : G[throw] x => error . Since we have two failing states, namely fail and error, the set of rules in §3.2 must be extended accordingly: rl [SeqE] : COMP(error, S) => error . rl [ChoiceE] : CHOICE(error, S) => error . rl [StarE] : STAR(error, S) => error . rl [NegE] : NEG(error, G[! e] x) => x .
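The effect of these rules can be illustrated with a small self-contained Python sketch (hypothetical, mirroring but not reproducing the Maude theory): a second failing state ERROR propagates through sequence and choice contexts with no backtracking, while a negative predicate absorbs it, as rule NegE does.

```python
# Sketch of the two failing states (hypothetical Python rendering of the
# rules above): ERROR propagates like SeqE/ChoiceE, and !e turns it into
# success, as rule NegE does.
FAIL, ERROR = object(), object()

def parse(g, e, x):
    op = e[0]
    if op == "chr":
        return x[1:] if x[:1] == e[1] else FAIL
    if op == "throw":                         # the operator ⇑
        return ERROR
    if op == "nt":
        return parse(g, g[e[1]], x)
    if op == "seq":
        r = parse(g, e[1], x)
        if r is FAIL or r is ERROR:           # SeqE: errors propagate
            return r
        return parse(g, e[2], r)
    if op == "choice":
        r = parse(g, e[1], x)
        if r is ERROR:                        # ChoiceE: no second alternative
            return ERROR
        return parse(g, e[2], x) if r is FAIL else r
    if op == "not":                           # NegE: an error makes !e succeed
        r = parse(g, e[1], x)
        return x if r is FAIL or r is ERROR else FAIL
    raise ValueError(op)

def check(e):                                 # eq check(E) = E / throw
    return ("choice", e, ("throw",))

# A toy version of the idea of Example 4.1 (hypothetical single-character
# tokens): once the leading token 'i' is consumed, failure is global.
G = {"St": ("choice", ("seq", ("chr", "i"), check(("chr", "d"))),
            ("chr", "x"))}
```

On "iz" the check raises ERROR, the outer choice propagates it and the second alternative is never tried; on "x" the first alternative fails ordinarily and the second one succeeds.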
These definitions allow for the propagation of the error to the outermost context, as Example 4.1 below shows. From now on, we shall use
PEGe→ to denote the induced rewriting relation from the theory above. The new rules in natural-semantics style are in Figure 2. Example 4.1.
Consider the following grammar for labeled, goto and break statements: St <- labeledSt / jumpSt  labeledSt <- ID check(COLON statement) /
𝐺[⇑]𝑥 PEGe⇝ error (throw);  if 𝐺[𝑒₁]𝑥 PEGe⇝ error then 𝐺[𝑒₁ 𝑒₂]𝑥 PEGe⇝ error (seq.);  if 𝐺[𝑒₁]𝑥 PEGe⇝ error then 𝐺[𝑒₁ / 𝑒₂]𝑥 PEGe⇝ error (ord.);  if 𝐺[𝑒]𝑥 PEGe⇝ error then 𝐺[!𝑒]𝑥 PEGe⇝ 𝑥 (not.);  if 𝐺[𝑒]𝑥 PEGe⇝ error then 𝐺[𝑒∗]𝑥 PEGe⇝ error (rep.) Figure 2.
Global errors.
PEGe⇝ extends PEG⇝ in Fig. 1. CASE constantExp COLON statement  jumpSt <- GOTO check(ID SEMICOLON) / BREAK SEMICOLON
Let 𝐺′ be as 𝐺 but removing the check() annotations. Since the token expressions ID, CASE, GOTO and
BREAK are all disjoint, we can show that: (1)
G[St]x
PEGe → ∗ error implies
G'[St]x
PEG → ∗ fail . (2) G'[St]x
PEG → ∗ fail implies
G[St]x
PEGe→∗ 𝑙 ∈ {fail, error}. In (1), the PEGe→-derivation does not need to match the (useless) alternatives, thus failing in fewer (or equal) steps when compared to the corresponding PEG⇝-derivation. In the example above, the number of backtracks is reduced when processing syntactically invalid inputs. On valid inputs, the number of backtracks remains the same. When predicates are involved, it is also possible to save a few (unnecessary) backtracks. Example 4.2.
According to the ISO 7185 and ISO 10206 standards, Pascal allows comments opened with (* and closed with } . Consider the following grammar: comment <- open (!close [.]) * close  open <- "(" "*" / "{"  close <- "*" check(")") / "}" On input "{ comment *here* *)", the first and second "*" fail immediately (without trying to match "}") and !close succeeds faster. Of course, this does not save much effort. As another example, a typical rule for identifiers looks like this:
ID <- ! KEYW [a-zA-Z] ([a-zA-Z0-9_] *)
Some cuts can be added to the definition of reserved words, thus making the above predicate fail faster:
KEYW <- 'a' 'n' check('d') / 'a' check('s') / 'b' 'o' check('o' 'l') / 'b' 'r' check('e' 'a' 'k') ...
On input "bot", ID succeeds and only the first three choices of KEYW will be evaluated. Similar to ↑, a careless use of ⇑ may change the language recognized. Indeed, the situation is aggravated by the fact that errors propagate to outer levels. For instance, if St in Example 4.1 is extended with a third choice recognizing ID, a failure in labeledSt should not be propagated to St, since the new third alternative cannot be discarded. Hence, it is salutary to control the propagation of failures: a failure in labeledSt should avoid the (unnecessary) matching of jumpSt, but such failure must remain confined to these two alternatives. This local behavior (akin to the one of ↑) in combination with global failures will be explored in the next section. Now we introduce a catch mechanism to control the global behavior of ⇑ and confine errors when needed. The following definitions extend the syntax of expressions and states: op catch : Exp -> Exp . --- Catch operator (syntax)  op CATCH : State -> State . --- Semantics (on states) The intended meaning is the following. If the expression e succeeds, then catch(e) also succeeds. If e fails with final outcome 𝑙 ∈ {error, fail}, the expression catch(e) fails with output fail. This is formalized with the following rules: rl [Catch] : G[catch(e)] x => CATCH(G[e] x) . rl [Catch1] : CATCH(x) => x . rl [Catch2] : CATCH(fail) => fail . rl [Catch3] : CATCH(error) => fail . As in the previous cases, the first rule moves from the syntactic level to the semantic operator at the level of
States. The last three rules act on normal forms, materializing the intuition given above: errors are transformed into failures.
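The three CATCH rules can be rendered as a small self-contained Python sketch (hypothetical, not the Maude theory): an error raised inside e is confined by catch(e) and becomes a plain failure, so outer choices may still try their remaining alternatives.

```python
# Hypothetical sketch of catch (rules Catch1-3): error inside catch(e)
# becomes an ordinary fail; success and fail pass through unchanged.
FAIL, ERROR = object(), object()

def parse(g, e, x):
    op = e[0]
    if op == "chr":
        return x[1:] if x[:1] == e[1] else FAIL
    if op == "throw":
        return ERROR
    if op == "seq":
        r = parse(g, e[1], x)
        return r if (r is FAIL or r is ERROR) else parse(g, e[2], r)
    if op == "choice":
        r = parse(g, e[1], x)
        if r is ERROR:                    # errors skip the second alternative
            return ERROR
        return parse(g, e[2], x) if r is FAIL else r
    if op == "catch":                     # Catch1-3: error becomes fail
        r = parse(g, e[1], x)
        return FAIL if r is ERROR else r
    raise ValueError(op)

def check(e):                             # eq check(E) = E / throw
    return ("choice", e, ("throw",))

# Toy single-character stand-ins (hypothetical) for the statements of
# Example 4.3: a label then ':', a goto then ';', and an assignment 'x'.
labeled = ("seq", ("chr", "l"), check(("chr", ":")))
jump    = ("seq", ("chr", "g"), check(("chr", ";")))
S1      = ("choice", ("catch", ("choice", labeled, jump)), ("chr", "x"))
```

On input "l;" the error raised by check(':') skips jump (global behavior) but is confined by catch (local behavior), so S1 ends in an ordinary fail rather than a global error; on "x" both confined alternatives fail and the third alternative succeeds.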
Example 4.3.
Consider the production rules for labeled and jump statements in Example 4.1 and the rules below:
S1 <- catch(labeledSt / jumpSt) / assignSt  S2 <- catch(labeledSt) / jumpSt / assignSt  assignSt <- ID EQ expr SEMICOLON
Since the token ID appears in both assignSt and labeledSt, errors must be confined to guarantee that the check(.) annotations do not modify the language recognized. S1 and S2 are two alternative solutions. Consider the input string "x = 3 ;". In S1, labeledSt produces error and jumpSt is not evaluated (global behavior). This error is confined in S1 (local behavior) and assignSt successfully recognizes the input. A similar situation happens in S2 but, when labeledSt fails, jumpSt is also evaluated (and fails). Consider now the input "goto l :" (note the ":" instead of ";"). In S1, the failure of jumpSt is confined and assignSt is (unnecessarily) evaluated. In S2, such failure preempts the execution of assignSt and fails faster. Some other transformations on this grammar are possible to better control the shared ID token between labeledSt and assignSt. For instance, it is possible to join together assignments and label statements, with the clear disadvantage of making the grammar more difficult to read and understand. The use of catch expressions is common in programming languages, thus making the grammar above more comprehensible. if 𝐺[𝑒]𝑥 PEGec⇝ 𝑦 then 𝐺[catch(𝑒)]𝑥 PEGec⇝ 𝑦 (catch.1);  if 𝐺[𝑒]𝑥 PEGec⇝ fail then 𝐺[catch(𝑒)]𝑥 PEGec⇝ fail (catch.2);  if 𝐺[𝑒]𝑥 PEGec⇝ error then 𝐺[catch(𝑒)]𝑥 PEGec⇝ fail (catch.3) Figure 3.
Semantic rules for the catch mechanism. We shall use
PEGec→ to denote the resulting relation extending PEGe→ with the rules above. Figure 3 depicts the corresponding rules in natural-semantics style. Unlike ↑, the ⇑ operator has its own meaning, independent of the context where it appears. In particular, it does not need to be used only inside a choice operator. For instance, the three expressions ⇑, 𝑎⇑ and 𝑎⇑𝑏 are all valid expressions (whose language is empty). It is also worth noticing the difference in how the two cut operators deal with the remaining alternatives: when the expression check(e) / e' is evaluated, the second alternative e' is only discarded when e actually fails. Differently, the expression ↑𝑒 / 𝑒′ eagerly discards the second alternative, thus saving some space. One then may wonder whether it is possible to reconcile the idea of saving memory that motivated the introduction of cuts in [20] with the behavior proposed here for ⇑. It turns out that the rewriting logic framework can give us some ideas on how to do that, as described below. We first introduce an alternative version of check : op try : Exp -> Exp . --- Syntactic level  op TRY : State -> State . --- Semantic level Unlike check(.), try(.) is not a derived operator and hence its meaning needs to be specified: rl [Try] : G[try(e)] x => TRY(G[e] x) . rl [Try1] : TRY(x) => x . rl [Try2] : TRY(fail) => error . rl [Try3] : TRY(error) => error . Note the duality between
TRY and
CATCH : the former converts failures into errors and the latter maps errors into failures. Intuitively, try(e) succeeds if e succeeds and fails with error if e fails. We shall use PEGtc→ to denote the extension of PEGec→ with the rules for TRY. The additional rules are also depicted in Fig. 4. As a simple example, the languages recognized by the expressions "a" and try("a") are the same. However, the final failing states are different on input "b": try("a") ends with an error (that nobody caught). The following equivalences are an easy consequence of the definition of
PEGtc→ (and determinism): 1. catch(try(e)) ≡ e (cancellation)  2. catch(e) catch(e') ≡ catch(e.e') (distributivity) if 𝐺[𝑒]𝑥 PEGtc⇝ 𝑦 then 𝐺[try(𝑒)]𝑥 PEGtc⇝ 𝑦 (try.1);  if 𝐺[𝑒]𝑥 PEGtc⇝ fail then 𝐺[try(𝑒)]𝑥 PEGtc⇝ error (try.2);  if 𝐺[𝑒]𝑥 PEGtc⇝ error then 𝐺[try(𝑒)]𝑥 PEGtc⇝ error (try.3) Figure 4.
Semantic rules for the try mechanism. 3. catch(!e) ≡ !catch(e) (distributivity) Note that, in general, catch(.) does not distribute over choices. As a simple counterexample, catch(⇑ / 𝜖) always reduces to fail, while catch(⇑) / catch(𝜖) succeeds on any string. Note also that, even if catch(try(𝑒)) and 𝑒 recognize the same language (catch(𝑒) and try(𝑒) accept 𝑥 iff 𝑒 does), their failing states are different. Hence, 𝑒 / 𝑒′ is not necessarily equivalent to the expression catch(try(𝑒)) / 𝑒′ (i.e., ≡ is not a congruence). Now let us come back to the problem of saving the space needed to store backtracking information that will never be used. As explained in §3, equations in the equational theory can be used to "simplify" terms. The idea is to extend the equational theory so that we can add some structural rules that govern the state of the parser. The simplification proposed is the following: eq [simpl] : CHOICE(TRY(S), S') = TRY(S) . Here, the whole state S' can be ignored since it will never be evaluated. Note that this simplification is sound due to the semantics of CHOICE : if 𝑆 fails, then TRY(S) reduces to error and S' will never be evaluated. Note that such simplification applies only when TRY is in the immediate scope of a
CHOICE. Moreover, [simpl] may simplify several choices. For instance, in this theory,
CHOICE(CHOICE(TRY(S), S'), S'') is equal to
TRY(S) (discarding S' and S''). Let
PEGtc → ′ be the extension of PEGtc → with the equation above. Theorem 4.4.
G[e] x
PEGtc → ! S iff G[e] x
PEGtc'→! S . Proof. Observe that the term
TRY(S) either reduces to 𝑥 or fails with label error. In the first case, CHOICE(TRY(S), S') reduces to 𝑥 and, in the second case, it reduces to error. Hence, the result depends only on the outcome of TRY(S). □ In this section we present some benchmarks performed on grammars manually annotated with the cut constructors proposed here. When annotating a grammar, given a concatenation 𝑒₁ 𝑒₂, the general idea is to annotate the symbols in 𝑒₂, since those in 𝑒₁ match at least one input symbol (if 𝑒₁ is not nullable). When 𝑒₁ matches the input, we are in a (local or global) right path, so we can safely discard other alternative paths and avoid backtracking when 𝑒₂ fails. The Maude specification can be seen as an abstract machine whose derivation steps correspond, approximately, to the operations performed by a parser. We shall report the number of entry rules that need to be applied to reduce 𝐺[𝑒]𝑥 into a final state. Such a number corresponds to the number of times a PEG constructor is evaluated. For instance, if 𝐺 = {𝐴 ← 𝑎}, the expression 𝐺[𝐴𝑏] 𝑎𝑏 reduces to 𝜖 in 4 steps: once for sequential composition, twice for the terminal rule and once for the non-terminal rule. With this methodology, it is not surprising that, on valid inputs, the annotated grammar performs more steps, since entering try(𝑒) also counts as a step. We also report some time measures to show that it is feasible to use our formal framework for concrete experiments. JSON. The main non-terminal of the JSON grammar tries all the possible shapes for a value: value <- str / num / obj / arr / "true" / ...
Strings start with (simple) quotes, numbers with digits, objects with "{", arrays with "[", etc. Those tokens are unique and, once consumed, a failure in the remaining expression should make the whole expression, including value, fail. This allows us to introduce some cuts, for instance: obj <- '{' pair (',' pair)* try('}') / '{' try('}') Hence, an open curly bracket without the corresponding closing one causes a global failure. It is also possible to replace the second occurrence of pair with try(pair): once ',' is consumed, the parser cannot fail in recognizing a pair. We took some files from repositories benchmarking JSON parsers and produced random invalid files by word mutations [24]. More precisely, we randomly deleted some (at most 10) symbols ']', '}', ':' and ',' in the corresponding file. For each of the 9 (correct) files, we generated 10 invalid files. The results are in Table 1. The columns are: the size of the file; the sum of the steps in the 10 cases for the grammar and the annotated grammar; the percentage of reduced steps (1 − 𝐺𝑐𝑢𝑡/𝐺); the number of steps in the grammar and the annotated grammar when processing the original (valid) file; and the percentage of steps increased (1 − 𝐺𝑐𝑢𝑡/𝐺). Due to the format of JSON files, the same production rule is applied several times. Hence, if an error occurs towards the end of the file, the try(·) constructor will be invoked several times and, only at the end, will it save some backtracks. On a MacBook Pro (4 cores, 2.3 GHz, 8GB of RAM) running Maude 3.0, processing in batch the 9 valid files and the 90 invalid files with the two grammars takes 4.76s (average of 10 runs). Pallene.
Consider now the grammar for Pallene [11], a statically-typed programming language derived from Lua. It is interesting to see the granularity we can achieve with the cut mechanism proposed here. For instance, in the rules import <- 'local' try(name) try('=') 'import' ...  foreign <- 'local' try(name) try('=') 'foreign' ... 'import' does not raise an error when failing (and other alternatives are still available), while a failure in name is global. Compare this behavior with the annotation 𝑒 ↑ 𝑒′: after ↑, it is not possible to control which sub-expressions of 𝑒′ can be considered as normal failures (where alternatives cannot be discarded) or unrecoverable errors. We took 74 invalid small programs (< … bytes per file) that were proposed by the designers of Pallene as benchmarks to test some specific features of the language. We also processed 87 correct files from the same repository. The results are in Table 2. Maude took 2.67s to process the 322 tests (valid and invalid files with the two grammars).
Finally, we considered the grammar of C89. Here are some examples of annotations: enumerator <- ID '=' try(const-exp) / ID  case <- "case" try(const-exp ':' stat)
When enumerating the elements of an enum expression, if after a name there is an assignment operator ('='), then an expression must come after it. The second rule fails when the token case is consumed and any of the remaining expressions fail. We took fragments of the files trace.c, tree-diff.c, version.c and walker.c in the implementation of Git (https://github.com/git/git). For each case, we generated 10 different files by randomly deleting curly brackets, parentheses and semicolons. The results are in Table 1. In this case, Maude took 91.3s (avg. of 10 runs) to process the whole collection of files. Hutton introduced the use of a new parser combinator to make a distinction between an error and a failure during the parsing process [12]. The idea of defining multiple kinds of failures for PEGs was proposed in [13]. In this work, the authors introduced labeled failures as a mechanism to improve syntactic error reporting in PEGs by associating a different error message with each label. This first formalization provided a kind of choice operator that could handle sets of labeled failures. These labeled choices can also be used as a prediction mechanism, where a label indicates which alternative of a choice should be tried [14], allowing PEGs to simulate the LL(*) algorithm used previously by ANTLR [23]. However, this first labeled failures formalization makes it harder to recover from an error in its own context, since when a label is handled in a choice, the information about the error context is lost. Because of this, attempts to deal with error recovery in PEGs favored a semantics where the ⇑ operator itself handles the error [6, 16]. To recover from
Table 1.
Experiments with JSON and C89.
JSON
File (bytes) | Non-valid inputs (Grammar, Grammar+cuts, %) | Valid inputs (Grammar, Grammar+cuts, %)
276 | 9,545 | 7,339 | 23.1% | 2,096 | 2,103 | -0.3%
872 | 12,682 | 10,141 | 20.0% | 6,968 | 6,989 | -0.3%
1k | 34,149 | 29,092 | 14.8% | 8,276 | 8,301 | -0.3%
2k | 27,075 | 23,180 | 14.4% | 12,718 | 12,745 | -0.2%
4k | 46,611 | 42,614 | 8.6% | 22,993 | 23,042 | -0.2%
2k | 112,805 | 108,687 | 3.7% | 25,439 | 25,452 | -0.1%
6k | 61,323 | 57,369 | 6.4% | 39,700 | 39,802 | -0.3%
10k | 72,036 | 68,265 | 5.2% | 62,971 | 63,082 | -0.2%
14k | 117,316 | 113,892 | 2.9% | 90,022 | 90,174 | -0.2%
C89 (Git files)
File (lines) | Non-valid inputs (Grammar, Grammar+cuts, %) | Valid inputs (Grammar, Grammar+cuts, %)
trace (329) | 3,770,039 | 3,692,000 | 2.1% | 1,222,637 | 1,224,472 | -0.2%
tree (131) | 1,481,679 | 1,317,197 | 11.1% | 240,599 | 240,995 | -0.2%
version (35) | 826,840 | 798,336 | 3.4% | 165,720 | 165,978 | -0.2%
walker (144) | 4,177,590 | 4,051,591 | 3.0% | 787,573 | 789,148 | -0.2%
Table 2.
Experiments for Pallene
74 non-valid files | 124,130 | 102,056 | 17.8%
87 valid files | 270,783 | 272,611 | -0.7%
an error, the ⇑ operator tries to match a recovery expression (a regular parsing expression). The present work uses the idea of labeled failures to make a distinction between local errors and global errors, where both kinds of errors may avoid unnecessary backtracking when compared to a regular failure. The catch and try mechanisms can be simulated by using the error recovery semantics of the ⇑ operator. In the case of catch(·), the recovery expression of ⇑ should be an expression that always fails (e.g., !𝑎 𝑎), thus resulting in a regular failure. In the case of try(·), we simply do not provide a recovery expression for ⇑. Although ⇑ can simulate catch(.) and try(.), the use of these specific constructs helped us to show how to control local and global backtracks in PEGs, and to properly formalize the cut mechanism proposed in [20]. Rewriting logic has been extensively used for specifying and verifying different systems [18]. In the context of programming languages, it is worth mentioning the K Framework [27]. Symbolic techniques in rewriting are currently the focus of intensive research (see a survey in [7]). In fact, one of the inspirations of this work came from an example of CFGs reported in [1] (and also used in [7]). Roughly, narrowing (rewriting with logical variables, as in logic programming) was used to explore symbolically the state of programs and apply partial evaluation [5] (a program transformation) to improve the efficiency of Maude specifications. We have proposed a framework based on rewriting logic to formally study the behavior of PEGs. Relying on this machinery, we proposed a general view of cuts where local and global failures can be treated uniformly and with a clear semantics. Such operators have a pleasant similarity to the usual try/catch expressions in programming languages, thus making it easier to understand and predict their behavior when designing a grammar.
Our specification is not only a formal theory of the derivability relation in PEGs, but also an executable system. Based on it, we have tested some optimizations on grammars. In §B, we report on a preliminary attempt at using the rewriting logic theory developed here, together with symbolic techniques, to propose a derivative parser that can be used as a basis for other analyses. In particular, we show that it is possible to generate all the strings, up to a given length, from a grammar. This should provide more tools for PEG users to verify whether a grammar correctly models a given language of interest. We still need to refine this work and to further investigate how it compares to other works that explored the use of derivatives in PEGs [9, 21] and in CFGs [19]. We also foresee using the symbolic traces to generate positive and negative cases for a grammar [24]. It is also worth exploring whether the symbolic semantics in §B can be useful for
proving language equivalence on restricted fragments of PEGs (such a problem is undecidable in general [8]). The manual insertion of catch and try expressions in a grammar may be a tedious and error-prone task. Moreover, it may hinder the grammar's clarity. Fortunately, a good amount of these annotations can be done automatically. We are currently exploring different automatic labeling algorithms inspired by those in [20] and [6]. Our framework can be handy in proving the correctness of the resulting algorithms (i.e., that the language is preserved after the annotation). This approach based on automatic annotations should preserve the grammar's clarity while still avoiding some amount of unnecessary backtracking. Finally, one of the main reasons for introducing cuts in PEGs is to save memory in packrat parsers [20]. The cuts proposed here generalize this idea and avoid further backtracks. Since our specification does not build a memoization table, measuring memory usage when running Maude is pointless. Hence, it may be worth implementing global cuts in a packrat parser in order to evaluate the impact of the optimization in terms of memory consumption.
A Proofs of Adequacy Theorems
Theorem 3.1 (
PEG⇒! and PEG⇝ coincide). Proof.
Recall that strings (x) and fail are normal forms for PEG⇒. Also, if 𝐺[𝑒]𝑥 PEG⇝ 𝑆, then 𝑆 is either fail or some string. We shall show that 𝐺[𝑒]𝑥 PEG⇝ 𝑆 iff G[e]x
PEG⇒! S for any 𝑆. This discharges both (1) and (2). (⇒) We proceed by induction on the height of the derivation of 𝐺[𝑒]𝑥 PEG⇝ 𝑆. Assume a derivation of height 1. Hence, either 𝑆 = 𝑥 and the rules empty or term.1 were used; or 𝑆 = fail and term.2 or term.3 were used. In the case empty, 𝑥 = 𝑦, 𝑒 = 𝜖 and clearly G[emp]x
PEG⇒ x by using the rule empty. In the case term.1, 𝑥 = 𝑎𝑦 and 𝑒 = 𝑎. The corresponding term G[a] ay matches the left-hand side of
Terminal12, match(a,a) reduces to true and the whole term reduces to y as expected. The cases when 𝑆 = fail are similar. Assume now that 𝐺[𝑒]𝑥 PEG⇝ 𝑆 is proved with a derivation of height > 1. We have 9 cases. Let us consider some of them, since the others follow similarly. If var was used, 𝑒 = 𝐴 and there is a (shorter) derivation of 𝐺[𝑃(𝐴)]𝑥 PEG⇝ 𝑆. By induction, we know that G[P(A)]x
PEG⇒! S. Seeing 𝐺 as a set of rules, 𝐺 = 𝐺′ ∪ {𝐴 ← 𝑃(𝐴)}. If the current expression is A (a non-terminal), the only matching rule on the corresponding term t = (G', A <- P(A)) [A] x is NTerminal, which unifies 𝐺 = 𝐺′, 𝑁 = 𝐴 and 𝑒 = 𝑃(𝐴). Hence, 𝑡 PEG⇒ G[P(A)] x which, in turn, reduces to S. Assume now that the derivation ends with ord.2. By induction we know that G[e1]x
PEG ⇒ ! fail and also, G[e2]x
PEG ⇒ ! S . There are two rules that match e1 / e2 ( Choice1 and
Choice2 ). However, the side condition in
Choice1 does not hold. Using
Choice2 , we know that
G[e1 / e2]x
PEG⇒ G[e2]x which, in turn, reduces to S. The other cases follow similarly. (⇐) Assume that 𝑡 = G[e]x
PEG⇒! S. This means that there exists 𝑛 > 0 (since 𝐺[𝑒]𝑥 is not a normal form) and a derivation of the form 𝑡 = 𝑡₀ PEG⇒ 𝑡₁ PEG⇒ ···
PEG⇒ 𝑡ₙ. Due to the side conditions in the rules, each step in this derivation may include the application of several rules. Hence, we proceed by induction on 𝑚, where 𝑚 is the total number of rules applied (including side conditions) in the above derivation. If 𝑚 = 1 then 𝑛 = 1 and either empty, Terminal12 or Terminal3 were used. The needed derivation in
PEG⇝ results from applying the corresponding rule. If 𝑚 > 1, we have several cases depending on the rule applied on 𝑡. Assume that the derivation starts with Seq1. This means that 𝑒 = 𝑒₁.𝑒₂. Moreover, 𝑡₁ = 𝐺[𝑒₂]𝑦 and, by the side condition of the rule, G[e1] x => y. By induction, 𝐺[𝑒₂]𝑦 PEG⇝ 𝑆. Since y is a normal form, G[e1]x
PEG⇒! y with a smaller number of steps and, by induction, 𝐺[𝑒₁] 𝑥 PEG⇝ 𝑦. Now we use seq.1 to conclude this case. The other cases follow similarly. □

B Symbolic Executions and Derivatives
One useful technique to predict the behavior of a grammar is to automatically generate strings from it and check whether the results match the intuitive behavior. There are recent works implementing derivative parsers [3] for PEGs to accomplish this task [9, 21]. One of the main difficulties is precisely the backtracking mechanism in PEGs. Hence, the algorithms need to keep track of the different branches [21] and compute possible over-approximations [9] of different notions, such as testing whether an expression recognizes the empty string (which is undecidable in general [8]). This section shows that our formal specification can be adapted to implement a derivative tool. By using constraints, we give a symbolic and compact representation of the possible outputs (of a fixed length) derivable from a grammar.
Constrained Strings. The first step is to replace the input strings with sequences of constraints 𝑐₁.𝑐₂.... Intuitively, 𝑐ᵢ represents the set of characters allowed in the 𝑖-th position of the string. To formalize this idea, we rely on the concept of constraint systems, commonly used in constraint logic programming [28]. A constraint system provides a signature from which constraints can be constructed, as well as an entailment relation ⊢ specifying interdependencies between these constraints. A constraint represents a piece of partial information and 𝑐 ⊢ 𝑑 means that the information 𝑑 can be deduced from 𝑐. In the following definition, 𝑡 is a terminal symbol (TSymbol), 𝑎 is a character (TChar) and 𝑡𝑐 represents a character class (TExp). Given a terminal symbol 𝑡, we shall use 𝑑𝑜𝑚(𝑡) to denote the set of characters allowed by 𝑡 (e.g., 𝑑𝑜𝑚(“𝑥”) = {𝑥}, 𝑑𝑜𝑚([0-9]) = {0, 1, ..., 9}, etc.).

SLE ’20, November 16–17, 2020, Virtual, USA Sérgio Queiroz de Medeiros and Carlos Olarte
Definition B.1 (Constraint System). Constraints, usually ranged over by 𝑐, 𝑐′, etc., are built from:

𝑐 ::= tt | ff | 𝑡 | ∼𝑡 | 𝑐 ∧ 𝑐

The entailment relation ⊢ is the least relation closed under the rules of intuitionistic logic and the following axioms:

𝑡 ∧ 𝑡′ ⊢ ff whenever 𝑑𝑜𝑚(𝑡) ∩ 𝑑𝑜𝑚(𝑡′) = ∅
𝑡 ∧ ∼𝑡′ ⊢ ff whenever 𝑑𝑜𝑚(𝑡) ⊆ 𝑑𝑜𝑚(𝑡′)
𝑡 ⊢ 𝑡′ whenever 𝑑𝑜𝑚(𝑡) ⊆ 𝑑𝑜𝑚(𝑡′)

Constrained strings are sequences of constraints and 𝑐ⁿ denotes the string containing 𝑛 copies of 𝑐.

Let us give some intuitions. The entailment “3” ⊢ [0-9] holds since “3” is stronger (i.e., it constrains more) than [0-9]. When more constraints are added, the set of characters allowed decreases. Hence, conjunction corresponds to intersection of domains (𝑑𝑜𝑚(𝑐 ∧ 𝑐′) = 𝑑𝑜𝑚(𝑐) ∩ 𝑑𝑜𝑚(𝑐′)). ff is the strongest constraint (since ff ⊢ 𝑐 for any 𝑐) and it represents an inconsistent state (with empty domain). tt is the weakest constraint (i.e., 𝑐 ⊢ tt for any 𝑐) and it is equivalent to [.]. ∼𝑡 can be interpreted as the complement of the domain of 𝑡. Finally, note that the constraint “𝑥” ∧ [0-9] (resp. “3” ∧ ∼[0-9]) is inconsistent in virtue of the first (resp. second) axiom of ⊢.

The above definition gives rise to the expected signature:

--- Atomic constraints and constraints
sort AConstraint Constraint .
subsort TSymbol < AConstraint < Constraint .
op ~_ : TSymbol -> AConstraint .
ops ff tt : -> Constraint . --- True and False
op _/\_ : Constraint Constraint -> Constraint .

Our derivative parser will monotonically refine constrained strings. This means that, in each step, more information/constraints will be added. In order to keep the output as short/readable as possible, we define some simplifications on constraints. For example, if 𝑑 ⊢ 𝑐, then 𝑐 ∧ 𝑑 can be simplified to 𝑑 (since 𝑑 contains more information than 𝑐):

eq c /\ ff = ff . --- ff entails any c
eq c /\ tt = c . --- c entails tt
eq c /\ c = c .
--- idempotency
eq a /\ a' = if a == a' then a else ff fi .
eq a /\ ~ a' = if a == a' then ff else a fi .

For instance: “𝑎” ∧ [a-z] reduces to “𝑎”; “𝑎” ∧ “𝑏” reduces to ff; “𝑎” ∧ ∼“𝑏” reduces to “𝑎”; etc. More generally,

𝑐 ∧ 𝑐′ = 𝑐′ whenever 𝑑𝑜𝑚(𝑐′) ⊆ 𝑑𝑜𝑚(𝑐)
𝑐 ∧ 𝑐′ = ff whenever 𝑑𝑜𝑚(𝑐) ∩ 𝑑𝑜𝑚(𝑐′) = ∅

Now we define constrained strings:

sort CStr . subsort Constraint < CStr .
op _._ : Constraint CStr -> CStr [ctor right id: nil] .
eq ff . c . x = ff . --- Simplifying inconsistencies
eq c . ff . x = ff .
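To make the algebra above concrete, here is a small Python model (our own sketch, not part of the Maude specification): constraints become sets of allowed characters over a tiny stand-in alphabet, conjunction is intersection, entailment is set inclusion, and a constrained string containing ff collapses to ff. All names (atom, conj, entails, cstr_simplify) are hypothetical.

```python
# Hypothetical Python model of the constraint algebra: a constraint
# denotes the set of characters allowed at one position of the input.
TT = set("abcx.0123456789")   # small stand-in alphabet playing the role of tt

def atom(chars):              # e.g. atom("0123456789") models the class [0-9]
    return set(chars)

def neg(t):                   # ~t : complement of dom(t) w.r.t. the alphabet
    return TT - t

def conj(c, d):               # c /\ d : intersection of domains
    return c & d

def entails(c, d):            # c |- d  iff  dom(c) is included in dom(d)
    return c <= d

FF = set()                    # ff : empty domain, the inconsistent constraint

def cstr_simplify(s):         # a constrained string containing ff collapses to ff
    return [FF] if FF in s else s

# "a" /\ [a-z] reduces to "a";  "a" /\ "b" reduces to ff;  "3" |- [0-9]
assert conj(atom("a"), atom("abc")) == atom("a")
assert conj(atom("a"), atom("b")) == FF
assert entails(atom("3"), atom("0123456789"))
assert cstr_simplify([atom("0123456789"), FF, atom("a")]) == [FF]
```

The assertions mirror the simplification equations above: intersection of disjoint domains yields ff, and ff propagates through a constrained string.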
Note that, e.g., the constrained string [0-9] . ff . “𝑎” ... collapses into ff.

Derivative Rules.
The derivative parser is obtained by adjusting the specification in §3. First, states take the form
G[e] x :: y, meaning that the constrained string y was already processed and x is still being processed. Moreover, the final states take the form ok(x,y) (y was successfully consumed and the non-consumed input is x) and fail(x,y) (x could not be processed and failed). The main change occurs in the terminal rules:

rl [Terminal] : G[t] (c.x) :: y => ok(x , ins( c /\ t , y)) .
rl [Terminal] : G[t] (c.x) :: y => fail( (c /\ ~ t).x , y) .

Note that, unlike the theory in §3, here we have some non-determinism. Under input 𝑐.𝑥 (a constrained string) we have two possibilities: either the parser succeeds, consuming 𝑐 and adding to it the fact that the string must start with a character matching 𝑡; or it fails, and the string must start with a character outside 𝑑𝑜𝑚(𝑡). The other rules must be adjusted accordingly to carry the history of constraints accumulated so far. For instance, the rule for failures in choices becomes:

rl [Choice] : CHOICE( fail(x, y) , G[e'] x' :: y') =>
   G[e'] conj(x', concat(y, x)) :: y' .

This means that, if the choice failed, the second alternative 𝑒′ must be evaluated on a constrained string with the additional information accumulated during the failure of the first alternative 𝑒. This is the purpose of the function 𝑐𝑜𝑛𝑗 that simply applies point-wise conjunction on the elements of the strings.

Example B.2 (Symbolic outputs). For each expression, we show the resulting final states of the form ok(x,y). For readability, instead of ok(x,y) we write 𝑦 :: 𝑥 (𝑦 was consumed and 𝑥 was returned). Let us start with some simple cases:

- [0-9] 𝑎. Only one solution: [0-9].a :: tt. This means that any valid string must start with 𝑥 ∈ 𝑑𝑜𝑚([0-9]), continue with 𝑎 and then, any symbol is valid (for all 𝑥, 𝑥 ∈ 𝑑𝑜𝑚(tt)).
- !𝑎 𝑏 (𝑐 / 𝑑). Maude returns two solutions:
  (~ a /\ b).c :: tt
  (~ a /\ b).(~ c /\ d) :: tt
  which further simplify to b.c :: tt and b.d :: tt.
- !𝑎 𝑎: no solution.

Consider now the rule:
NUMBER <- [0-9]+ . ("." . ( ! "." . [0-9])+)? .
For strings of at most 3 elements, we have 7 solutions:

[0-9] :: (~ "." /\ ~ [0-9]).tt --- ex. "3ax"
[0-9].[0-9] :: (~ "." /\ ~ [0-9]) --- ex. "23x"
[0-9].[0-9].[0-9] --- ex. "123"
[0-9] :: ("." /\ ~ [0-9])."." --- ex. "1.."
[0-9] :: ("." /\ ~ [0-9]).(~ "." /\ ~ [0-9]) --- ex. "1.a"
[0-9].[0-9] :: ("." /\ ~ [0-9]) --- ex. "12."
[0-9].("." /\ ~ [0-9]).([0-9] /\ ~ ".") --- ex. "1.2"
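As a sanity check, the instance relation between a concrete string and a symbolic output can be sketched in Python (a hypothetical model of ours: character classes become sets, tt a small stand-in alphabet, and the name instance is our own):

```python
# Hypothetical Python check that a concrete string instantiates a
# symbolic output y :: x (consumed part y, non-consumed remainder x).
DIGITS = set("0123456789")
ANY = DIGITS | set("ax.")            # small stand-in alphabet for tt

def instance(string, cstr):
    # x is an instance of s iff every character lies in its position's domain
    return len(string) == len(cstr) and all(a in c for a, c in zip(string, cstr))

# First symbolic output above: [0-9] :: (~ "." /\ ~ [0-9]).tt -- e.g. "3ax"
first = [DIGITS, (ANY - {"."}) - DIGITS, ANY]
assert instance("3ax", first)        # "3" is a digit; "a" is neither "." nor a digit
assert not instance("33x", first)    # second character may not be a digit
```

The same check applies to each of the seven symbolic outputs, confirming the example strings given alongside them.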
After “::” we have the part of the string not consumed. The first output reads: the first character must be a number (and it is consumed); then, the second character can be neither a digit nor “.”; and the last element can be any symbol. Then, e.g., the string “3ax” is accepted, returning the suffix “ax”.

The PEG for aⁿbⁿcⁿ at the end of §3 returns only the expected strings as solutions. As already noticed in [9], the grammar for the same language proposed in [8] is incorrect: for a length of 6, our tool finds the following solutions:

a a b b c c
a a a a a a
a a a a b c

The above symbolic strings are generated by using the search facilities in Maude:

search [n] G[e] tt^n =>* ok(x',y') such that c .

meaning “compute the first 𝑛 states of the form ok(x',y') that satisfy the condition c and can be obtained by constraining the string tt^n”. Note that rewriting is not enough, since the theory is no longer deterministic and different paths must be considered in the Terminal rules. The search command implements a breadth-first search procedure and, therefore, no solution is lost. Needless to say, the search space may grow very quickly. Hence, either the condition 𝑐 is used to filter some of the solutions (e.g., compute the symbolic strings that start with a digit) or specific/localized rules of the grammar are analyzed independently.

We shall use sPEG→ to denote the induced rewrite relation using the rules above. For a constrained string 𝑠 = 𝑐₁.𝑐₂.··· and a string 𝑥 = 𝑎₁.𝑎₂..., we shall write 𝑥 ⪯ 𝑠 iff, for each 𝑖, 𝑎ᵢ ∈ 𝑑𝑜𝑚(𝑐ᵢ) (i.e., 𝑎ᵢ is a legal character for 𝑐ᵢ).

The next result shows that all the possible strings that can be generated from an expression 𝑒 are covered by the symbolic output and, moreover, all instances of a symbolic output are indeed valid outputs for 𝑒.

Theorem B.3 (Adequacy). For all 𝐺, 𝑒, 𝑥 and 𝑦:

Soundness: If 𝐺[𝑒] 𝑥𝑦 PEG⇝ 𝑦 then there exist 𝑠 and 𝑠′ s.t. 𝐺[𝑒] tt^|𝑥𝑦| sPEG→! ok(𝑠′, 𝑠) and 𝑥𝑦 ⪯ 𝑠 :: 𝑠′.
Completeness: If 𝐺[𝑒] tt^n sPEG→! ok(𝑠′, 𝑠) then, for all 𝑥𝑦 ⪯ 𝑠 :: 𝑠′, 𝐺[𝑒] 𝑥𝑦 PEG⇝ 𝑦.

Proof. We shall prove the correspondence between sPEG→ and PEG→. By Theorems 3.1 and 3.2, the result extends to PEG⇝. We shall consider an alternative version of PEG→ that, on failures, returns the non-consumed input. Hence, 𝐺[𝑒] 𝑥𝑦 PEG→ fail(𝑦) means that 𝑥 (a possibly empty string) was consumed and 𝑦 could not be recognized. This will simplify the arguments below. Our proof also considers the failing cases, i.e., we shall prove the following:

• Soundness: If 𝐺[𝑒] 𝑥𝑦 PEG→! 𝑦 then there exist 𝑠 and 𝑠′ s.t. 𝐺[𝑒] tt^|𝑥𝑦| sPEG→ ok(𝑠′, 𝑠) and 𝑥𝑦 ⪯ 𝑠 :: 𝑠′. Moreover, if 𝐺[𝑒] 𝑥𝑦 PEG→! fail(𝑦) then 𝐺[𝑒] tt^|𝑥𝑦| sPEG→ fail(𝑠′, 𝑠) and 𝑥𝑦 ⪯ 𝑠 :: 𝑠′.
• Completeness: If 𝐺[𝑒] tt^n sPEG→ ok(𝑠′, 𝑠) then, for all 𝑥𝑦 ⪯ 𝑠 :: 𝑠′, 𝐺[𝑒] 𝑥𝑦 PEG⇝ 𝑦. Moreover, if 𝐺[𝑒] tt^n sPEG→ fail(𝑠′, 𝑠) then, for all 𝑥𝑦 ⪯ 𝑠 :: 𝑠′, 𝐺[𝑒] 𝑥𝑦 PEG⇝ fail(𝑦).

The main difference between sPEG→ and PEG→ is in the terminal rules. Namely, sPEG→ considers two cases: either the string contains the needed terminal symbol 𝑡 or it does not. In the second case, the derivation fails and adds the constraint ∼𝑡.

Soundness. We proceed by induction on the length of the derivation of 𝐺[𝑒] 𝑥 PEG→! 𝑆. In the base case, either 𝑒 = 𝜖 or 𝑒 = 𝑎. In the case of a terminal symbol, we have two possible outcomes: 𝐺[𝑎] 𝑥 PEG→ 𝑥′ (and 𝑥 = 𝑎𝑥′) or 𝐺[𝑎] 𝑥 PEG→ fail(𝑥) (and 𝑥 = 𝑏𝑥′ for 𝑏 ≠ 𝑎, or 𝑥 = 𝜖). In the successful case, 𝑛 = |𝑥| ≥ 1 and 𝐺[𝑎] tt^|𝑥| has two possible outcomes: ok(tt^{n−1}, 𝑎) and fail((∼𝑎).tt^{n−1}, 𝑛𝑖𝑙). Note that 𝑎𝑥′ ⪯ 𝑎.tt^{n−1}. The case when 𝑥 = 𝑏𝑥′ fails is considered in the second symbolic output: 𝑏𝑥′ ⪯ (∼𝑎).tt^{n−1} (note that, if 𝑏 ≠ 𝑎, then 𝑏 ∈ 𝑑𝑜𝑚(∼𝑎)). For the inductive case, we have several subcases.
Consider the case when the derivation starts with Choice. Hence, 𝑒 = 𝑒₁/𝑒₂. There are two possible outcomes for 𝑒₁ and, by induction, both are instances of one of the symbolic outputs. The case when 𝑒₁ succeeds is immediate. Consider the case where 𝐺[𝑒₁] 𝑥𝑦𝑧 PEG→! fail(𝑦𝑧) and 𝐺[𝑒₂] 𝑥𝑦𝑧 PEG→! 𝑧. By induction, 𝐺[𝑒₁] tt^n sPEG→ fail(𝑠_𝑦𝑠_𝑧, 𝑠_𝑥), where 𝑠_𝑦 and 𝑠_𝑧 explain the failure of 𝑒₁ and 𝑠_𝑥 is in agreement with the part of the string consumed. We know that 𝑥 ⪯ 𝑠_𝑥 and 𝑦𝑧 ⪯ 𝑠_𝑦𝑠_𝑧. By induction, we also know that 𝐺[𝑒₂] tt^n sPEG→ ok(tt^|𝑧|, 𝑤_𝑥𝑤_𝑦). However, the semantics executes 𝑒₂ on tt^n ∧ 𝑠_𝑥𝑠_𝑦𝑠_𝑧 (and not on tt^n). Since the semantics only adds new constraints to the sequence of constraints, we can show that 𝐺[𝑒₂] tt^n ∧ 𝑠_𝑥𝑠_𝑦𝑠_𝑧 sPEG→ ok(𝑠_𝑧, (𝑤_𝑥 ∧ 𝑠_𝑥)(𝑤_𝑦 ∧ 𝑠_𝑦)). Since 𝑥 ⪯ 𝑠_𝑥 and 𝑥 ⪯ 𝑤_𝑥, then 𝑥 ⪯ 𝑠_𝑥 ∧ 𝑤_𝑥. Similarly for 𝑦 and 𝑧, and the result follows. The other cases are similar.

Completeness. We proceed by induction on the length of the derivation of 𝐺[𝑒] tt^n sPEG→! 𝑆. For the base case, consider a one-step derivation and assume that 𝑆 = ok(𝑠_𝑦, 𝑐). Hence, 𝑒 = 𝑡, 𝑐 = 𝑡 and 𝑠_𝑦 = tt^{n−1}. If 𝑥𝑦 ⪯ 𝑡 :: 𝑠_𝑦 then 𝑥 ∈ 𝑑𝑜𝑚(𝑡) and clearly 𝐺[𝑡] 𝑥𝑦 PEG→ 𝑦. If 𝑆 = fail((∼𝑡).𝑠_𝑦, 𝑛𝑖𝑙) then 𝑒 = 𝑡 and 𝑠_𝑦 = tt^{n−1}. If 𝑥𝑦 ⪯ (∼𝑡).𝑠_𝑦 then 𝑥 ∈ 𝑑𝑜𝑚(∼𝑡) and thus 𝑥 ∉ 𝑑𝑜𝑚(𝑡). This means that 𝑥 does not match 𝑡 and 𝐺[𝑡] 𝑥𝑦 PEG→ fail(𝑥𝑦).
For the inductive case, we have several subcases. Consider a derivation that starts with
Sequence: 𝐺[𝑒₁.𝑒₂] tt^n sPEG→! ok(𝑠_𝑧, 𝑠_𝑥𝑠_𝑦). By the definition of sPEG→, we know that 𝐺[𝑒₁] tt^n sPEG→! ok(𝑠_𝑦𝑠_𝑧, 𝑠_𝑥). Moreover, 𝐺[𝑒₂] 𝑠_𝑦𝑠_𝑧 sPEG→! ok(𝑠_𝑧, 𝑠_𝑦). By induction, we deduce 𝐺[𝑒₁] 𝑥𝑦𝑧 PEG→! 𝑦𝑧. From a similar observation about conjunction of sequences (as done in the proof of Soundness), we can also show that 𝐺[𝑒₂] 𝑦𝑧 PEG→! 𝑧. We then conclude 𝐺[𝑒₁.𝑒₂] 𝑥𝑦𝑧 PEG→! 𝑧. The other cases are similar. □

References

[1] María Alpuente, Angel Cuenca-Ortega, Santiago Escobar, and José Meseguer. 2020. A partial evaluation framework for order-sorted equational programs modulo axioms.
J. Log. Algebraic Methods Program. 110 (2020). https://doi.org/10.1016/j.jlamp.2019.100501
[2] Roberto Bruni and José Meseguer. 2006. Semantic Foundations for Generalized Rewrite Theories. Theoretical Computer Science. https://doi.org/10.1016/j.tcs.2006.04.012
[3] Janusz A. Brzozowski. 1964. Derivatives of Regular Expressions. J. ACM 11, 4 (1964), 481–494. https://doi.org/10.1145/321239.321249
[4] Manuel Clavel, Francisco Durán, Steven Eker, Patrick Lincoln, Narciso Martí-Oliet, José Meseguer, and Carolyn L. Talcott (Eds.). 2007.
All About Maude - A High-Performance Logical Framework, How to Specify, Program and Verify Systems in Rewriting Logic. Lecture Notes in Computer Science, Vol. 4350. Springer. https://doi.org/10.1007/978-3-540-71999-1
[5] Olivier Danvy, Robert Glück, and Peter Thiemann (Eds.). 1996. Partial Evaluation, International Seminar, Dagstuhl Castle, Germany, February 12-16, 1996, Selected Papers. Lecture Notes in Computer Science, Vol. 1110. Springer. https://doi.org/10.1007/3-540-61580-6
[6] Sérgio Queiroz de Medeiros, Gilney de Azevedo Alvez Junior, and Fabio Mascarenhas. 2020. Automatic syntax error reporting and recovery in parsing expression grammars. Sci. Comput. Program. https://doi.org/10.1016/j.scico.2019.102373
[7] Francisco Durán, Steven Eker, Santiago Escobar, Narciso Martí-Oliet, José Meseguer, Rubén Rubio, and Carolyn L. Talcott. 2020. Programming and symbolic computation in Maude. J. Log. Algebraic Methods Program.
110 (2020). https://doi.org/10.1016/j.jlamp.2019.100497 [8] Bryan Ford. 2004. Parsing expression grammars: a recognition-based syntactic foundation. In
Proceedings of the 31st ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, POPL 2004, Venice, Italy, January 14-16, 2004, Neil D. Jones and Xavier Leroy (Eds.). ACM, 111–122. https://doi.org/10.1145/964001.964011
[9] Tony Garnock-Jones, Mahdi Eslamimehr, and Alessandro Warth. 2018. Recognising and Generating Terms using Derivatives of Parsing Expression Grammars. CoRR abs/1801.10490 (2018). arXiv:1801.10490 http://arxiv.org/abs/1801.10490
[10] Robert Grimm. 2006. Better extensibility through modular syntax. In Proceedings of the ACM SIGPLAN 2006 Conference on Programming Language Design and Implementation, Ottawa, Ontario, Canada, June 11-14, 2006, Michael I. Schwartzbach and Thomas Ball (Eds.). ACM, 38–51. https://doi.org/10.1145/1133981.1133987
[11] Hugo Musso Gualandi and Roberto Ierusalimschy. 2018. Pallene: a statically typed companion language for lua. In Proceedings of the XXII Brazilian Symposium on Programming Languages, SBLP 2018, Sao Carlos, Brazil, September 20-21, 2018, Carlos Camarão and Martin Sulzmann (Eds.). ACM, 19–26. https://doi.org/10.1145/3264637.3264640
[12] Graham Hutton. 1992. Higher-order functions for parsing.
Journal of Functional Programming
2, 3 (1992), 323–343. https://doi.org/10.1017/S0956796800000411 [13] André Murbach Maidl, Fabio Mascarenhas, and Roberto Ierusal-imschy. 2013. Exception Handling for Error Reporting in Pars-ing Expression Grammars. In
Programming Languages - 17th Brazilian Symposium, SBLP 2013, Brasília, Brazil, October 3-4, 2013. Proceedings (Lecture Notes in Computer Science), André Rauber Du Bois and Phil Trinder (Eds.), Vol. 8129. Springer, 1–15. https://doi.org/10.1007/978-3-642-40922-6_1
[14] André Murbach Maidl, Fabio Mascarenhas, Sérgio Medeiros, and Roberto Ierusalimschy. 2016. Error reporting in Parsing Expression Grammars.
Sci. Comput. Program.
132 (2016), 129–140. https://doi.org/10.1016/j.scico.2016.08.004
[15] Fabio Mascarenhas, Sérgio Medeiros, and Roberto Ierusalimschy. 2014. On the relation between context-free grammars and parsing expression grammars.
Sci. Comput. Program.
89 (2014), 235–250.
https://doi.org/10.1016/j.scico.2014.01.012
[16] Sérgio Medeiros and Fabio Mascarenhas. 2018. Syntax error recovery in parsing expression grammars. In Proceedings of the 33rd Annual ACM Symposium on Applied Computing, SAC 2018, Pau, France, April 09-13, 2018, Hisham M. Haddad, Roger L. Wainwright, and Richard Chbeir (Eds.). ACM, 1195–1202. https://doi.org/10.1145/3167132.3167261
[17] José Meseguer. 1992. Conditional Rewriting Logic as a Unified Model of Concurrency.
Theoretical Computer Science
96, 1 (1992), 73–155. https://doi.org/10.1016/0304-3975(92)90182-F [18] José Meseguer. 2012. Twenty Years of Rewriting Logic.
The Journal of Logic and Algebraic Programming 81, 7-8 (Oct. 2012), 721–781. https://doi.org/10.1016/j.jlap.2012.06.003
[19] Matthew Might, David Darais, and Daniel Spiewak. 2011. Parsing with derivatives: a functional pearl. In Proceeding of the 16th ACM SIGPLAN international conference on Functional Programming, ICFP 2011, Tokyo, Japan, September 19-21, 2011, Manuel M. T. Chakravarty, Zhenjiang Hu, and Olivier Danvy (Eds.). ACM, 189–195. https://doi.org/10.1145/2034773.2034801
[20] Kota Mizushima, Atusi Maeda, and Yoshinori Yamaguchi. 2010. Packrat parsers can handle practical grammars in mostly constant space. In Proceedings of the 9th ACM SIGPLAN-SIGSOFT Workshop on Program Analysis for Software Tools and Engineering, PASTE’10, Toronto, Ontario, Canada, June 5-6, 2010, Sorin Lerner and Atanas Rountev (Eds.). ACM, 29–36. https://doi.org/10.1145/1806672.1806679
[21] Aaron Moss. 2020. Simplified Parsing Expression Derivatives. In Language and Automata Theory and Applications - 14th International Conference, LATA 2020, Milan, Italy, March 4-6, 2020, Proceedings (Lecture Notes in Computer Science), Alberto Leporati, Carlos Martín-Vide, Dana Shapira, and Claudio Zandron (Eds.), Vol. 12038. Springer, 425–436. https://doi.org/10.1007/978-3-030-40608-0_30
[22] Alexander A. Myltsev. 2019. parboiled2: a macro-based approach for effective generators of parsing expressions grammars in Scala.
CoRR abs/1907.03436 (2019). arXiv:1907.03436 http://arxiv.org/abs/1907.03436 [23] Terence Parr and Kathleen Fisher. 2011.
LL(*): the foundation of the ANTLR parser generator. In Proceedings of the 32nd ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI 2011, San Jose, CA, USA, June 4-8, 2011, Mary W. Hall and David A. Padua (Eds.). ACM, 425–436. https://doi.org/10.1145/1993498.1993548
[24] Moeketsi Raselimo, Jan Taljaard, and Bernd Fischer. 2019. Breaking parsers: mutation-based generation of programs with guaranteed syntax errors. In Proceedings of the 12th ACM SIGPLAN International Conference on Software Language Engineering, SLE 2019, Athens, Greece, October 20-22, 2019, Oscar Nierstrasz, Jeff Gray, and Bruno C. d. S. Oliveira (Eds.). ACM, 83–87.
Fundam. Inform.
93, 1-3 (2009), 325–336. https://doi.org/10.3233/FI-2009-0105 [26] Roman R. Redziejowski. 2015. Mouse: From Parsing Expressions to aPractical Parser. In
Concurrency Specification and Programming Workshop.
[27] Grigore Rosu and Traian-Florin Serbanuta. 2014. K Overview and SIMPLE Case Study.
Electron. Notes Theor. Comput. Sci.
304 (2014), 3–56. https://doi.org/10.1016/j.entcs.2014.05.002
[28] Vijay A. Saraswat and Martin C. Rinard. 1990. Concurrent Constraint Programming. In
Conference Record of the Seventeenth Annual ACM Symposium on Principles of Programming Languages, San Francisco, California, USA, January 1990, Frances E. Allen (Ed.). ACM Press, 232–245.