A Verified Packrat Parser Interpreter for Parsing Expression Grammars
Clement Blaudeau
Ecole Polytechnique, Palaiseau Cedex, France, [email protected]
Natarajan Shankar
Computer Science Laboratory, SRI International, Menlo Park, CA, USA, [email protected]
Abstract
Parsing expression grammars (PEGs) offer a natural opportunity for building verified parser interpreters based on higher-order parsing combinators. PEGs are expressive, unambiguous, and efficient to parse in a top-down recursive descent style. We use the rich type system of the PVS specification language and verification system to formalize the metatheory of PEGs and define a reference implementation of a recursive parser interpreter for PEGs. In order to ensure termination of parsing, we define a notion of a well-formed grammar. Rather than relying on an inductive definition of parsing, we use abstract syntax trees that represent the computational trace of the parser to provide an effective proof certificate for correct parsing and ensure that parsing properties including soundness and completeness are maintained. The correctness properties are embedded in the types of the operations so that the proofs can be easily constructed from local proof obligations. Building on the reference parser interpreter, we define a packrat parser interpreter as well as an extension that is capable of semantic interpretation. Both these parser interpreters are proved equivalent to the reference one. All of the parsers are executable. The proofs are formalized in mathematical terms so that similar parser interpreters can be defined in any specification language with a type system similar to PVS.

∗ This work was supported by the National Institute of Aerospace Award C18-201097-SRI, NSF Grant SHF-1817204, Ecole Polytechnique, and DARPA under agreement number HR001119C0075. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of NASA, NSF, DARPA, Ecole Polytechnique, or the U.S. Government. We thank the anonymous referees for their detailed comments and constructive feedback.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

CPP '20, January 20–21, 2020, New Orleans, LA, USA
© 2020 Copyright held by the owner/author(s). Publication rights licensed to ACM.
ACM ISBN 978-1-4503-7097-4/20/01...$15.00
https://doi.org/10.1145/3372885.3373836
CCS Concepts • Theory of computation → Grammars and context-free languages; Automated reasoning; • Software and its engineering → Syntax.

Keywords PVS, PEG grammar, packrat parsing, semantic parsing, verified parser, abstract syntax tree, well-formed grammars
ACM Reference Format:
Clement Blaudeau and Natarajan Shankar. 2020. A Verified Packrat Parser Interpreter for Parsing Expression Grammars. In Proceedings of the 9th ACM SIGPLAN International Conference on Certified Programs and Proofs (CPP '20), January 20–21, 2020, New Orleans, LA, USA. ACM, New York, NY, USA, 15 pages. https://doi.org/10.1145/3372885.3373836
Parsing is the process of extracting structure and information from a string of tokens according to a formal grammar [1]. For critical applications, parsing errors and lack of proper input validation can be a common source of vulnerability and inconsistency leading to numerous errors and attacks [6]. Parser issues can come from the failure to follow a formal grammar, and from the formal grammar itself. Grammars might, by design or accident, introduce complexity, nontermination, and nondeterminism. For example, the dangling else problem, where both if A then B and if A then B else C are well-formed, leads to an ambiguity in parsing an expression of the form if A then if B then C else D, where the else branch could be associated with either of the two conditionals. Ford [11] introduced Parsing Expression Grammars (PEGs) as a formalism for defining unambiguous grammars that can be parsed efficiently using a recursive descent scheme with parsing combinators for the basic grammar operators [7, 12, 13]. We present a PVS [21] formalization of PEGs with a rigorous treatment of well-formed grammars and a reference parser interpreter for such grammars. We also verify a packrat parser interpreter for PEGs and demonstrate that it is equivalent to the reference parser interpreter. Our parse tree representation can serve as an independent proof-of-parse for PEGs, making it possible to validate the result produced by the correct parsing, successful or unsuccessful, of an input string with reference to a given grammar, regardless of how the parsing was performed.
Parsing expression grammars were introduced as a pragmatic compromise between efficiency and expressiveness. These grammars are similar to context-free grammars in supporting terminal symbols, concatenation A ; B, and iteration A∗, but the choice operation A | B is replaced with a priority operation A / B where the grammar B is matched against the input only when the match on A fails. PEGs also include option (?), test (&), and negation (!) operations: A? either matches (and consumes) some prefix of the current input or succeeds without consuming any tokens; &A tests if there is a prefix of the input that matches A without consuming any tokens; and !A fails if some input prefix matches A. PEGs are unambiguous: there is at most one way to parse the input with respect to a given grammar. PEGs are greedy in that, compared to the context-free analog, a PEG grammar represents the longest parse. For example, parsing an input with respect to A∗ will consume as many instances of A from the input as possible. The test operations can be used to define test predicates that can look ahead into arbitrarily long prefixes of the input without actually consuming any input, making it possible to even capture certain non-context-free grammars such as a^n b^n c^n.

A PEG grammar consists of a set of productions mapping nonterminals to PEG expressions in the nonterminals. Not all PEGs are well-formed. For example, A∗ is not well-formed when A can match the empty string. A left-recursive grammar like A → A / a is not well-formed since the corresponding unfolding of the grammar might not converge. The parsing of well-formed PEGs can be directly implemented by a recursive descent parser in which each construct is supported by a parsing combinator. For example, the choice construct A / B is implemented by a combinator that takes a parser for A and a parser for B, and invokes the parser for B on the input only when the parser for A returns with failure.
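The combinator scheme described above can be illustrated outside PVS with a small Python sketch (the function names and the convention of returning the number of consumed characters, or None on failure, are ours, not the paper's):

```python
# Minimal PEG combinator sketch. A parser is a function (input, index) ->
# number of consumed characters, or None on failure.

def empty():                      # ϵ: succeeds, consumes nothing
    return lambda s, i: 0

def terminal(a):                  # [a]: consumes one matching character
    return lambda s, i: 1 if i < len(s) and s[i] == a else None

def seq(p, q):                    # e1 ; e2: q parses the remainder after p
    def parse(s, i):
        n = p(s, i)
        if n is None:
            return None
        m = q(s, i + n)
        return None if m is None else n + m
    return parse

def prior(p, q):                  # e1 / e2: q is tried only if p fails
    def parse(s, i):
        n = p(s, i)
        return q(s, i) if n is None else n
    return parse

def star(p):                      # e*: greedy; loops forever if p can succeed
    def parse(s, i):              # without consuming (the issue of Section 3)
        total = 0
        while True:
            n = p(s, i + total)
            if n is None:
                return total
            total += n
    return parse

def notp(p):                      # !e: succeeds (consuming nothing) iff e fails
    return lambda s, i: 0 if p(s, i) is None else None
```

For instance, `seq(star(terminal('a')), terminal('b'))` consumes all four characters of "aaab", and `prior` only ever tries its second argument after the first fails, which is the source of PEG unambiguity.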
This kind of parsing scheme can have exponential complexity through repeated parsing queries with the same nonterminal. Consider the following grammar [10]:

A ::= M + A / M
M ::= G ∗ M / G
G ::= ( A ) / int

Parsing an input, say ((( ∗ ) ∗ ) ∗ ) ∗ , triggers repeated parser calls for the nonterminal M. This duplication is avoided in packrat parsing by memoizing the results of the parse. The PVS formalizations can be accessed at https://github.com/SRI-CSL/PVSPackrat.

PEGs and PEG parsing are defined in Section 2, where we also discuss the termination problem for parsing with respect to arbitrary PEGs. In Section 3, we outline the well-formedness properties of PEGs that guarantee the termination of parsing. We present the verification of three parser interpreters for PEGs. The first is a reference parser interpreter for PEG grammars presented in Section 5. The second is a packrat parser interpreter described in Section 6. Section 7 describes a variant that allows the output parse tree to be customized according to semantic actions. Section 8 makes some brief concluding observations.
Related Work.
Though parsing is an important firewall between untrusted data and vulnerable applications, there have only been a few instances of verified parsers or parser interpreters/generators. The parser front-end was one of the few identifiable weaknesses of the verified CompCert compiler [18]. Barthwal and Norrish [2] present an SLR parser generator verified in HOL4 that produces an independently verifiable parse tree. Ridge [23] has verified the termination, soundness, and completeness of a recursive descent parser based on parsing combinators for context-free languages. The RockSalt checker of Morrisett, Tan, Tassarotti, Tristan, and Gan [20], for checking software-based fault isolation of native executable code in a browser, employs a regular expression parser for x86 instructions that has been verified to be sound in Coq. In subsequent work by Tan and Morrisett [24], certified encoder/decoder pairs are constructed from bidirectional grammars. Lopes, Ribeiro, and Camarão [19] have also verified a regular expression parser using Idris [5]. Bernardy and Jansson [3] have formalized Valiant's algorithm for parsing context-free languages in Agda [4]. Koprowski and Binsztok [15] present a Coq verification of a parser interpreter called TRX for PEGs. The CakeML compiler, which has been verified using HOL4, employs a similar verified PEG parser interpreter [16]. Lasser, Casinghino, Fisher, and Roux [17] have verified an LL(1) parser generator covering the generation of the lookup table and the stack-based parser. Jourdan, Pottier, and Leroy [14] define a generator for an independently and verifiably validatable parsing automaton for an LR(1) grammar. A number of papers address the verification and synthesis of decoder/encoder pairs [8, 22, 25–27] where the objective is to serialize and deserialize data.

Of the related projects, the TRX parser [15] and the CakeML parser [16] are closest to the one presented here.
We build on the TRX work, particularly in the treatment of termination for PEGs and in the definition of the non-packrat PEG parser. However, we define an executable check for grammar well-formedness in contrast to the inductive characterization in TRX. We also present a parse tree representation of both successful and failed parses. We show that these parses are unique for a given grammar, thus capturing both the soundness and completeness of the parser interpreter. The TRX verification only captures the soundness argument using an inductively defined parse derivation that captures the parse semantics. While parse trees have been used as proofs-of-parse in the work of Barthwal and Norrish [2] and Ridge [23], we have extended them to capture both success and failure. We also go beyond the reference PEG parser interpreter to verify two packrat parser interpreters, one without semantic actions and one with. The PVS proofs we present take advantage of predicate subtyping in PVS to craft a verification methodology that reduces the correctness of the grammar analyzer and the parser interpreters to small, local proof obligations with easy automated proofs instead of big theorems with manually generated proof structures. Our proof methodology is transferable to other parsing formats and parser generation algorithms.
Parsing Expression Grammars (PEGs) were introduced by Ford as a formalism for capturing grammars that correspond to greedy, unambiguous recursive descent parsing. To describe PEGs, we rely on operators resembling those in context-free grammars, with the notable exception of the prioritized choice operator. Here is a formal definition of PEGs, relying on two types: V_T, the type of terminals (in most cases bytes or characters, but applicable to any type instantiating V_T), and V_N, the type of nonterminals that are basically patterns. The set of grammar expressions ∆ is inductively defined as below.

∆ ::= ϵ    empty expression
  | [·]    any character
  | [a]    a terminal (a ∈ V_T)
  | A    a nonterminal (A ∈ V_N)
  | e1 ; e2    a sequence (e1, e2 ∈ ∆)
  | e1 / e2    a prioritized choice (e1, e2 ∈ ∆)
  | e∗    a greedy repetition (e ∈ ∆)
  | !e    a not-predicate (e ∈ ∆)    (1)

In addition to those basic operators, we have a few other operators that can be emulated by the basic ones:

1. [a − z], the range, equivalent to [a]/[b]/.../[z]
2. ["s"], the string, equivalent to [c1] ; [c2] ; ... ; [cn], where s = c1 c2 ... cn for tokens c1, ..., cn
3. e+, the plus operator, equivalent to e ; e∗
4. e?, the optional operator, equivalent to e / ϵ
5. &e, the and operator, equivalent to !!e

In practice these additional operators would certainly be used, but as they can be emulated by the basic ones, we can ignore them for the proofs in order to avoid redundant cases during the case analyses.
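The emulations above are purely syntactic rewrites. As an illustrative sketch (our own tuple encoding of grammar expressions, not the paper's PVS datatype), each derived operator elaborates into the eight core constructors:

```python
# Hypothetical tuple encoding: ('terminal', c), ('eps',), ('seq', e1, e2),
# ('prior', e1, e2), ('star', e), ('notP', e). Derived operators desugar
# into these core constructors, exactly as listed in the text.
from functools import reduce

def plus(e):               # e+  =  e ; e*
    return ('seq', e, ('star', e))

def opt(e):                # e?  =  e / ϵ
    return ('prior', e, ('eps',))

def andp(e):               # &e  =  !!e
    return ('notP', ('notP', e))

def string(s):             # "s" =  [c1] ; [c2] ; ... ; [cn]
    return reduce(lambda acc, c: ('seq', acc, ('terminal', c)),
                  s[1:], ('terminal', s[0]))

def char_range(lo, hi):    # [a-z] = [a] / [b] / ... / [z]
    cs = [('terminal', chr(c)) for c in range(ord(lo), ord(hi) + 1)]
    return reduce(lambda acc, t: ('prior', acc, t), cs[1:], cs[0])
```

Since the rewrites produce only core nodes, any proof by case analysis over the eight core constructors automatically covers the derived operators.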
Restriction to a Finite Set of Nonterminals.
Technically, we could consider V_N as any set, finite or not, and build results making no other assumptions. In practice, however, V_N is the set of patterns given by the creator of the grammar. So we can consider V_N finite and bounded by n + 1 = Card(V_N). In the following, V_N and ⟦0, n⟧ are used interchangeably.

peg: DATATYPE
BEGIN
  ϵ: ϵ?
  any: any?
  terminal(a: V_T): terminal?
  nonTerminal(A: V_N): nonTerminal?
  seq(e1, e2: peg): seq?
  prior(e1, e2: peg): prior?
  star(e: peg): star?
  notP(e: peg): notP?
END peg

Figure 1. PVS code for the PEG grammars formalized as an algebraic datatype with constructors, accessors, and recognizers

The Problem of Termination. Though PEGs are unambiguous [11], this does not ensure that parsing terminates. The most basic non-terminating PEG expression is ϵ∗, since parsing with it loops forever without consuming any input. It shows that greedy operators should not rely on expressions that can succeed without consuming tokens from the input. Also, the use of nonterminals can easily introduce non-terminating left-recursion. For example, with V_N = {A, B} and grammar production map P_exp from V_N to ∆ defined as:

P_exp(A) = B
P_exp(B) = A

the parsing would loop forever. This introduces the two main considerations regarding the termination of parsing: consumption of characters and prevention of infinite left-recursion. In the following section, we define the properties of a well-formed grammar that ensure termination of parsing regardless of the input.
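The left-recursion problem can be made concrete with a toy experiment (ours, not from the paper): a naive recursive-descent parse of the mutually recursive productions P_exp(A) = B and P_exp(B) = A never returns, and in Python the unbounded recursion is eventually cut off by the interpreter's stack limit.

```python
# Naive descent on the non-well-formed grammar A -> B, B -> A: parsing a
# nonterminal just parses its production body, so the call never bottoms out.
import sys

productions = {'A': 'B', 'B': 'A'}   # each body is just the other nonterminal

def naive_parse(nt, s, i):
    # no character is ever consumed, no base case is ever reached
    return naive_parse(productions[nt], s, i)

sys.setrecursionlimit(1000)
try:
    naive_parse('A', "abc", 0)
    terminated = True
except RecursionError:
    terminated = False
# terminated is False: the parse loops until the stack overflows
```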
We develop a new approach to well-formed PEGs from a computational point of view. Our focus is on constructive definitions that yield easy implementations. The notions of structural well-formedness and pattern well-formedness are, along with the use of abstract proof-of-parse trees, the major differences with respect to the verified PEG parser TRX [15].
To characterize terminating grammars, we first need to identify the relevant properties of grammar expressions that ensure termination. We then need to compute them. The most basic property is success without consumption (of tokens). However, the not-predicate, which succeeds if the expression fails, requires us to also know if expressions can fail. The last case is success with consumption of at least one token of the input. We introduce below a precise mathematical formalization of the notion of a grammar property that corresponds to the way it is designed and proved in PVS.
Formalization.
To translate those properties, we define three predicates:

P_⊥, P_0, P_> : ∆ −→ bool,

representing whether a parse based on the grammar can fail, succeed without consuming input, or succeed only by consuming input, respectively. We define P_≥(e) as P_0(e) ∨ P_>(e). The inductive definition relies on the rules shown in Figure 2.

This inductive approach is valid, matches the intuition, and can be easily formalized in PVS. However, we want an effective computational mechanism for checking the tree structure of a grammar expression that might contain loops. We can tackle this problem as follows: we first introduce the set of already known properties of nonterminals, which starts as the empty set. Then we try to compute all the new properties that can be obtained with current knowledge, and iterate until no new properties are found. The number of computed properties of nonterminals is non-decreasing, and bounded by 3 × |V_N|. When all the properties of each nonterminal are known, we can compute in a straightforward recursive way the properties of any grammar involving these nonterminals. We formalize this approach below by first introducing a function over the nonterminals, P : V_N −→ {known, unknown}³, that represents the known properties. Let P be the set of those predicates. In the actual implementation, we use bool for the representation of known/unknown. It is important to notice that in our model, a false does not mean that the property is false, but that it is unknown, so that only true yields useful information. We introduce an order: for all P, P′ ∈ P,

P ≤ P′ ⇔ ∀A ∈ V_N: P(A)(1) ⇒ P′(A)(1), P(A)(2) ⇒ P′(A)(2), P(A)(3) ⇒ P′(A)(3)

This order is trivially reflexive and transitive. We can now introduce a function that computes the properties of a grammar node based on current knowledge; the code is given in Figure 3.
g : ∆ × P −→ [bool, bool, bool], where

P_⊥ : ∆ −→ bool = g(·, P)(1)
P_0 : ∆ −→ bool = g(·, P)(2)
P_> : ∆ −→ bool = g(·, P)(3)

This structurally recursive function satisfies the rules from Figure 2. The termination measure pegMeasure(G) is just the size of G given by the number of nodes. We also introduce a function ρ that takes a nonterminal A and a set of properties P, computes the properties of A, and returns P extended with the new computed properties:

ρ : V_N × P −→ P
(A, P) ↦ { ρ(A, P)(A) = g(P_exp(A), P);  ρ(A, P)(B) = P(B)  (B ≠ A) }

Basically, we want to apply this function ρ a certain number of times (at most 3 × |V_N|) to get all reachable properties. But the problem is that this function is not monotonic: for example, when P(A) = (known, known, known) and P_exp(A) = ϵ, we have ρ(A, P) < P. Thus, we need to restrict the set of properties to what we call coherent properties: sets of properties that are not contradicted by themselves. We define that set as:

C = { P ∈ P | ∀A ∈ V_N, P ≤ ρ(A, P) }    (2)

Under this assumption of coherence, ρ is monotonic.

Lemma 3.1 (ρ is monotonic). ∀P, P′ ∈ C, ∀A ∈ V_N, P ≤ P′ ⇒ ρ(A, P) ≤ ρ(A, P′)    (3)

In the PVS implementation, we define the ρ function, as well as a coherent properties type, and prove the monotonicity result. Now that we have a clear formalization of the computation of grammar properties, we can focus on the way all the properties are recursively computed. Again, the main idea is that we only need the properties of the nonterminals to compute the properties of any grammar node with a simple recursive function.
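The recursive computation performed by g can be sketched in Python (an illustrative transcription of the case analysis, with the triple (can_fail, zero, consume) of booleans meaning "known to hold"; the PVS version appears in Figure 3):

```python
# Sketch of g: given known nonterminal properties P (a dict mapping each
# nonterminal name to a (can_fail, zero, consume) triple), compute the
# triple for an expression by structural recursion.
def g_props(e, P):
    tag = e[0]
    if tag == 'eps':                       # ϵ: succeeds without consuming
        return (False, True, False)
    if tag in ('any', 'terminal'):         # may fail (end/mismatch) or consume
        return (True, False, True)
    if tag == 'nonTerminal':               # look up current knowledge
        return P[e[1]]
    if tag == 'seq':
        f1, z1, c1 = g_props(e[1], P)
        f2, z2, c2 = g_props(e[2], P)
        return (f1 or ((z1 or c1) and f2),
                z1 and z2,
                (c1 and (z2 or c2)) or (z1 and c2))
    if tag == 'prior':
        f1, z1, c1 = g_props(e[1], P)
        f2, z2, c2 = g_props(e[2], P)
        return (f1 and f2, z1 or (f1 and z2), c1 or (f1 and c2))
    if tag == 'star':                      # e*: never fails; empty iff e fails
        f, z, c = g_props(e[1], P)
        return (False, f, c)
    if tag == 'notP':                      # !e: inverts failure and success
        f, z, c = g_props(e[1], P)
        return (c or z, f, False)
```

Note that ('star', ('eps',)) receives no properties at all: ϵ never fails, so ϵ∗ is neither known to fail nor to succeed, reflecting its nontermination.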
Example.
Consider the example with the following P_exp function:

P_exp(A) = [a]
P_exp(B) = !A / C
P_exp(C) = !B ; A
1. We have P_⊥(A) and P_>(A) immediately.
2. We then get P_⊥(!A) and P_0(!A), and P_0(!A) ⇒ P_0(B).
3. This gives us P_⊥(!B), and P_⊥(!B) ⇒ P_⊥(C).
4. Combining P_⊥(!A) and P_⊥(C), we get P_⊥(B).

In this example we can see that for computing the properties of B we need properties of C and vice-versa: the computation cannot be done in a single pass.

Computation Process. The computation process of the nonterminal properties is the following. (It is interesting to notice that far more optimal ways could be invented to tackle this computational problem, especially using graphs and memoization to avoid recomputing the same results over and over again. But first of all, the aim here is to have a verified solution; secondly, this computation is done once and for all, and does not affect the parsing. Thus optimality is not the main focus here.)
The rules of Figure 2, written premise ⊢ conclusion, are:

- ⊢ P_0(ϵ)
- ⊢ P_>([·]);  ⊢ P_⊥([·])
- a ∈ V_T ⊢ P_>([a]);  a ∈ V_T ⊢ P_⊥([a])
- P_⊥(e) ⊢ P_0(e∗);  P_>(e) ⊢ P_>(e∗)
- ⋆ ∈ {0, >, ⊥}, A ∈ V_N, P_⋆(P_exp(A)) ⊢ P_⋆(A)
- P_⊥(e1) ∨ [P_≥(e1) ∧ P_⊥(e2)] ⊢ P_⊥(e1 ; e2)
- P_0(e1) ∧ P_0(e2) ⊢ P_0(e1 ; e2)
- [P_>(e1) ∧ P_≥(e2)] ∨ [P_0(e1) ∧ P_>(e2)] ⊢ P_>(e1 ; e2)
- P_0(e1) ∨ [P_⊥(e1) ∧ P_0(e2)] ⊢ P_0(e1 / e2)
- P_>(e1) ∨ [P_⊥(e1) ∧ P_>(e2)] ⊢ P_>(e1 / e2)
- P_⊥(e1) ∧ P_⊥(e2) ⊢ P_⊥(e1 / e2)
- P_⊥(e) ⊢ P_0(!e);  P_≥(e) ⊢ P_⊥(!e)

Figure 2. Inductive rules for grammar properties

% Recursively computes the grammar properties of a peg object
% based on a current set of known properties for nonterminals
g_props(G, P): RECURSIVE [bool, bool, bool] =
  CASES G OF
    ϵ: (false, true, false),
    any: (true, false, true),
    terminal(a): (true, false, true),
    nonTerminal(A): (P`1(A), P`2(A), P`3(A)),
    seq(e1, e2):
      LET (e1_f, e1_0, e1_s) = g_props(e1, P) IN
      LET (e2_f, e2_0, e2_s) = g_props(e2, P) IN
      (e1_f ∨ ((e1_0 ∨ e1_s) ∧ e2_f),
       e1_0 ∧ e2_0,
       (e1_s ∧ (e2_0 ∨ e2_s)) ∨ (e1_0 ∧ e2_s)),
    prior(e1, e2):
      LET (e1_f, e1_0, e1_s) = g_props(e1, P) IN
      LET (e2_f, e2_0, e2_s) = g_props(e2, P) IN
      (e1_f ∧ e2_f,
       e1_0 ∨ (e1_f ∧ e2_0),
       e1_s ∨ (e1_f ∧ e2_s)),
    star(e): LET (e_f, e_0, e_s) = g_props(e, P) IN (false, e_f, e_s),
    notP(e): LET (e_f, e_0, e_s) = g_props(e, P) IN (e_s ∨ e_0, e_f, false)
  ENDCASES
MEASURE pegMeasure(G)

Figure 3. PVS implementation of the g function defined by case analysis. PVS allows the use of the Unicode characters for and, or, implies, iff.

1. Starting with the empty set of properties 0_C, we compute the properties of all the nonterminals, one by one, augmenting the set of properties as new ones are found.
2. Once that is done, we check if new properties have been found since the start of the nonterminal computation. If so, we restart the computation; otherwise, we return the result.

Formalization.
Next, we formalize the computation process. We define a sequence that translates the computation of properties for all the nonterminals, one by one. The property set on which the computation is made is written as a superscript, and the nonterminal on which we are trying to compute new properties is written as a subscript. So, to recompute the new properties for all nonterminals between 0 and n, we have:

r^P_A = r^{ρ(A,P)}_{A+1}    (A < n)
r^P_n = ρ(n, P)    (4)

Lemma 3.2 captures three useful properties entailed by this sequence.

Lemma 3.2 (Recomputing nonterminal properties increases knowledge).
∀(P, A) ∈ C × ⟦0, n⟧, P ≤ r^P_A
∀P, P′ ∈ C, ∀A ∈ ⟦0, n⟧, P ≤ P′ ⇒ r^P_A ≤ r^{P′}_A
∀(P, A) ∈ C × ⟦0, n − 1⟧, r^P_{A+1} ≤ r^P_A    (5)

The expected result of the computation is a set of properties that cannot be extended (because it already has all the reachable properties). We can define the set of properties that are such fixpoints of the computation:

F = { P ∈ C | P = r^P_0 }    (6)

Such a fixpoint can be reached in a bounded (by 3 × (n + 1)) number of steps by repeatedly applying the function ϕ(P) = r^P_0 starting with 0_C. We have the following results:

Theorem 3.3 (Fixpoint properties). Recomputing the properties of a nonterminal with a set of properties that is a fixpoint gives back the same result:
∀(P, A) ∈ F × V_N, P = ρ(A, P)    (7)
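The iterate-until-fixpoint process can be sketched in Python on the running example (our own encoding; the union with previously known properties is a pragmatic stand-in for the paper's restriction to coherent property sets):

```python
# Sketch: iterate property recomputation over all nonterminals until no
# new property appears. Grammar: A -> [a], B -> !A / C, C -> !B ; A.
def g_props(e, P):
    tag = e[0]
    if tag == 'eps':
        return (False, True, False)
    if tag in ('any', 'terminal'):
        return (True, False, True)
    if tag == 'nonTerminal':
        return P[e[1]]
    if tag == 'seq':
        f1, z1, c1 = g_props(e[1], P)
        f2, z2, c2 = g_props(e[2], P)
        return (f1 or ((z1 or c1) and f2),
                z1 and z2,
                (c1 and (z2 or c2)) or (z1 and c2))
    if tag == 'prior':
        f1, z1, c1 = g_props(e[1], P)
        f2, z2, c2 = g_props(e[2], P)
        return (f1 and f2, z1 or (f1 and z2), c1 or (f1 and c2))
    if tag == 'star':
        f, z, c = g_props(e[1], P)
        return (False, f, c)
    if tag == 'notP':
        f, z, c = g_props(e[1], P)
        return (c or z, f, False)

productions = {
    'A': ('terminal', 'a'),
    'B': ('prior', ('notP', ('nonTerminal', 'A')), ('nonTerminal', 'C')),
    'C': ('seq', ('notP', ('nonTerminal', 'B')), ('nonTerminal', 'A')),
}

P = {nt: (False, False, False) for nt in productions}   # nothing known yet
changed = True
while changed:                # knowledge only grows, so at most 3*|V_N| rounds
    changed = False
    for nt, body in productions.items():
        new = tuple(a or b for a, b in zip(P[nt], g_props(body, P)))
        if new != P[nt]:
            P[nt], changed = new, True
```

On this grammar the loop needs several rounds, since B's properties depend on C's and vice versa, matching the single-pass impossibility noted in the example.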
In the PVS implementation, we define the r function, as well as a fixpoint properties type. We prove the lemmas and the theorem. This gives us an effective way to compute the properties of a given set of nonterminals.

Those properties allow us to define the grammars that we call well-formed: grammars that structurally enforce the termination of the parsing. The main idea is to prevent the two kinds of loops that we mentioned: structural ones (ϵ∗) and pattern ones (P_exp(A) = B, P_exp(B) = A). Two approaches are needed:

1. Preventing structural aberrations is quite easy: we can just go through the whole tree and check that every time the star operator is used, it is applied to a grammar node that cannot succeed without consuming any input. Thanks to the work previously done on properties computation, this is an easy task.
2. Preventing pattern aberrations is a bit trickier, as we want to allow patterns to use other patterns in some instances while preventing them from doing so in others. The idea is the following: we assume there exists an order over the nonterminals where the grammar P_exp(A) for A can only employ strictly smaller nonterminals until it is clear that at least one character is consumed. For example, once the left branch of a seq is not of type P_0, the right branch can use any nonterminal.

A visual representation is given in Figure 4.
Given those remarks, we see that we can define a function ω that verifies both structural and pattern well-formedness of a grammar node that is a subterm of P_exp(A) for a certain A ∈ V_N. The function always checks structural well-formedness, but pattern well-formedness is only checked on certain branches, using a special argument δ that is true if we enforce pattern well-formedness and false if it is not needed. (This approach is actually equivalent to the inductive definition of well-formedness that can be found in [10]: if we have an order on nonterminals, then the inductive definition can follow that order, and vice-versa. For the scope of this paper, we assume that the user is able to provide the order for the nonterminals. We conjecture that if such an order actually exists, there exist ways to compute it that are more efficient than testing all possible orders. An approach using a graph of dependency between the nonterminals might be fruitful.)
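The two checks, with the δ flag threaded through sequences, can be sketched as follows (an illustrative Python version; `zero` stands for the precomputed property P_0, and `order` for the user-provided ranking of nonterminals, both assumptions of this sketch):

```python
# Sketch of ω(e, A, δ): structural star check everywhere, nonterminal
# ordering check only on branches where δ (pattern check enforced) is True.
def well_formed(e, A, delta, zero, order):
    tag = e[0]
    if tag in ('eps', 'any', 'terminal'):
        return True
    if tag == 'nonTerminal':              # B usable if check is off, or B < A
        return (not delta) or order[e[1]] < order[A]
    if tag == 'seq':                      # right branch keeps δ only if the
        return (well_formed(e[1], A, delta, zero, order) and     # left branch
                well_formed(e[2], A, delta and zero(e[1]), zero, order))
    if tag == 'prior':
        return (well_formed(e[1], A, delta, zero, order) and
                well_formed(e[2], A, delta, zero, order))
    if tag == 'star':                     # e* forbidden if e can succeed empty
        return well_formed(e[1], A, delta, zero, order) and not zero(e[1])
    if tag == 'notP':
        return well_formed(e[1], A, delta, zero, order)

# Toy stand-ins for the test below: only ϵ succeeds without consuming.
zero = lambda e: e[0] == 'eps'
order = {'A': 0}
```

For example, ϵ∗ is rejected, a bare self-reference A is rejected (A is not strictly smaller than itself), but [a] ; A is accepted, since after the terminal the δ flag is dropped.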
Figure 4. Representation of a well-formed grammar

ω : ∆ × V_N × bool −→ bool
(ϵ, A, δ) ↦ ⊤
([·], A, δ) ↦ ⊤
([a], A, δ) ↦ ⊤
(B, A, δ) ↦ δ ⇒ (B <_{V_N} A)
(e1 ; e2, A, δ) ↦ ω(e1, A, δ) ∧ ω(e2, A, δ ∧ P_0(e1))
(e1 / e2, A, δ) ↦ ω(e1, A, δ) ∧ ω(e2, A, δ)
(e∗, A, δ) ↦ ω(e, A, δ) ∧ (¬P_0(e))
(!e, A, δ) ↦ ω(e, A, δ)    (8)

The seq case is the most interesting: the new value for δ on the right branch is δ ∧ P_0(e1). We check for pattern well-formedness in the right branch only if we are supposed to do so (δ is true) and if the left branch can succeed without consuming any input (P_0(e1)).

Definition 3.4 (Well-formed grammars). We can define a well-formed set of grammars P_exp as satisfying:
∀A ∈ V_N, ω(P_exp(A), A, ⊤)    (9)

Such grammars ensure termination of the corresponding parser. In the PVS implementation, we define the ω function (named g_wf), as well as a well-formed set of nonterminals type.

In contrast with prior work [11, 15], we choose not to define parsing as a relationship between an expression, a string, and a result, but as a step of a parsing function that corresponds to the actual computation and that satisfies a certain number of properties. To do so, we define an output type that represents a computational path, which we call an abstract syntax tree: a parse tree that represents the full trace of the parse, covering both successful and failed branches. This representation of the computational path serves as an explicit proof of correctness (soundness and completeness) for our reference parser, making it easier to observe and explain both success and failure. Unlike the TRX verified PEG parser [15], we choose to use this structure and to avoid defining parsing through rules.
We believe that this approach is richer, as it provides explicit information on the computational path, and as it is easy to show that the rules of parsing are verified at each node of an Ast tree. An example of a lemma capturing a parsing rule that is verified by the parser is shown in Figure 8.
The implementation of those abstract syntax trees will need to satisfy a certain number of properties in order to represent an actual computational path. But as we can define properties only on existing objects, we start by creating a pre-Ast datatype, on which we will build the full Ast type. As we can expect, a pre-Ast depends on the type of terminals V_T and the type of nonterminals V_N, but also on the upper bound of the input, b ∈ N. (This can easily be replaced by a notion of end-of-stream character if the input bound is unknown.) We have nine cases, corresponding to nine constructors. Each constructor has its own arguments, but always requires s, e ∈ ⟦0, b⟧: the start and end of the string that was consumed by the subtree.

1. skip(s, e, G) with G ∈ ∆. This corresponds to all the cases where part of the grammar is skipped. For example, in the case of a prioritized choice, the second branch can be skipped if the first branch is already a success. For the moment, no condition is added on s and e, but later on we will obviously ask for e = s. We give the set of skips: S = { T ∈ A | ∃(s, e, G) ∈ N × N × ∆, T = skip(s, e, G) }.
2. ϵ(s, e) for the corresponding grammar node.
3. any(s, e, x) for the corresponding grammar node, with the consumed character stored as x.
4. terminal(s, e, a, x) for the corresponding grammar node, the expected character being a and the consumed one being x. In the success case, a = x, but in the failure case, we want to store exactly why it failed, so we store x ≠ a.
5. nonTerminal(s, e, A, T) for the corresponding grammar node, with T ∉ S being the tree for parsing the nonterminal A.
6. seq(s, e, T1, T2) with T1 ∉ S. T1 and T2 are supposed to correspond to the parsing of the sub-expressions e1 and e2.
7. prior(s, e, T1, T2) with T1 ∉ S. Same as seq.
8. star(s, e, T, Ts) with T ∉ S. To keep the star structure, we ask for Ts to be either of type star or skip. We define S⋆ as the set of star-like pre-Asts.
9. notP(s, e, T) with T ∉ S.

Here is the summary:

Definition 4.1 (Pre-Ast). We give the following inductive definition:

A[V_T, V_N, b] ::= skip(s, e, G), (G ∈ ∆)
  | ϵ(s, e)
  | any(s, e, x), (x ∈ V_T)
  | terminal(s, e, a, x), (a, x ∈ V_T)
  | nonTerminal(s, e, A, T), (A ∈ V_N, T ∉ S)
  | seq(s, e, T1, T2), (T1 ∉ S)
  | prior(s, e, T1, T2), (T1 ∉ S)
  | star(s, e, T, Ts), (T ∉ S, Ts ∈ S⋆ ∪ S)
  | notP(s, e, T), (T ∉ S)

where s, e ∈ ⟦0, b⟧.

As we define the pre-Ast, we see that we want to add extra conditions on the arguments of the constructors. But some of those conditions rely on a notion of failure/success, which we will formalize now.
Computing the failure or success requires a depth-first traversal of the tree. We consider three possible outcomes, {⊥, ⊤, u}, with u standing for undefined, and we call this the type of a pre-Ast. Here is a description of the function η that computes the type:

1. skip is always undefined: the failure/success should not depend on a skip.
2. ϵ is a success if e = s (meaning nothing was consumed). Otherwise it is undefined.
3. any is a success if e = s + 1, a failure if e = s (meaning the end of the string was reached and no character was consumed), and undefined otherwise.
4. terminal has the same conditions as any and adds the a = x condition.
5. nonTerminal is of the same type as its subtree.
6. seq is a success if both subtrees are successes, and a failure if T1 fails or if T1 succeeds and T2 fails. All other cases are undefined.
7. prior is a success if T1 is, or if T1 fails and T2 succeeds. If T1 and T2 both fail, it is a failure, and it is undefined otherwise.
8. star is always a success, as soon as T is not undefined. (This corresponds to the fact that until the search for a pattern e fails, we keep on searching: the star is greedy.) If T is a success, Ts must be a success too (and thus is not a skip).
9. notP is the opposite of the subtree type, where the opposite of undefined is also undefined.
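The traversal performed by η can be sketched for a fragment of the constructors (our own tuple encoding, with True/False/None standing for ⊤/⊥/u; the terminal and star cases, which involve more bookkeeping, are omitted):

```python
# Sketch of η on pre-Ast nodes encoded as tuples ('tag', s, e, subtrees...).
# Returns True (success), False (failure), or None (undefined).
def eta(t):
    tag = t[0]
    if tag == 'skip':                       # never determines failure/success
        return None
    if tag == 'eps':                        # ('eps', s, e)
        s, e = t[1], t[2]
        return True if e == s else None
    if tag == 'any':                        # ('any', s, e)
        s, e = t[1], t[2]
        if e == s + 1: return True          # one character consumed
        if e == s:     return False         # end of input, nothing consumed
        return None
    if tag == 'seq':                        # ('seq', s, e, T1, T2)
        r1, r2 = eta(t[3]), eta(t[4])
        if r1 is True and r2 is True: return True
        if r1 is False:               return False
        if r1 is True and r2 is False: return False
        return None
    if tag == 'prior':                      # ('prior', s, e, T1, T2)
        r1, r2 = eta(t[3]), eta(t[4])
        if r1 is True:                  return True
        if r1 is False and r2 is True:  return True
        if r1 is False and r2 is False: return False
        return None
    if tag == 'notP':                       # ('notP', s, e, T)
        r = eta(t[3])
        return None if r is None else not r
```

For instance, a prior node whose first subtree fails and whose second succeeds is itself a success, and a notP node over a failing subtree is a success, mirroring items 7 and 9 above.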
Definition 4.2 (Meaningful tree). A meaningful tree is a pre-AST tree that is either of type ⊥ or of type ⊤. If we write M for the set of meaningful trees, we have:

M = η⁻¹({⊥, ⊤})   (10)

Once we have a standalone notion of failure and success, we can add the other conditions to make sure the tree is well-formed. Well-formed trees are basically the ones corresponding to a real computational path. Here is a summary of those conditions:
1. ϵ, any, and terminal are well-formed if they are meaningful.
2. nonTerminal is well-formed if its subtree is and their bounds are equal: e = e_T and s = s_T.
3. seq: We require T1 to be well-formed, and the bounds of T1 and T2 to be a partition of the bounds of T: s = s1, e1 = s2, and e = e2. If T1 is a failure, the second part of the grammar is not visited, so T2 must be a skip (T2 ∈ S) and it should not consume anything (s2 = e2).
4. prior: We require T1 to be well-formed, and both T1 and T2 should start at s (s1 = s and s2 = s). If T1 is a success, then T2 must be a non-consuming skip, and the end must be that of T1 (e = e1). If T1 is a failure, the end bound must be that of T2, which is not allowed to be a skip (e = e2 and T2 ∉ S).
5. star: We require T0 to be well-formed, and the bounds should be a partition (as with seq). If T0 is a success, Ts must be a star, and if T0 is a failure, Ts must be a non-consuming skip.
6. notP is well-formed if its subtree is. It should not consume anything, so the bounds should be equal: s = e.
7. skip is always well-formed.

We write W for the set of well-formed trees. We have the following result.

Theorem 4.3 (Well-formed trees are meaningful). A well-formed tree is either a tree of success or a tree of failure:

W ⊂ M   (11)

In PVS, we define the astType? and astWellformed functions over the pre_ast type. We prove the theorem by induction.
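To make the bounds conditions concrete, here is a small Python sketch (our own tuple encoding, not the PVS one) of the check for a seq node: the bounds of T1 and T2 must partition those of the node, and a failed T1 forces T2 to be a non-consuming skip:

```python
# Sketch of the well-formedness bounds check for a seq node.
# A child node carries (kind, s, e); t1_failed says whether T1 failed.

def seq_well_formed(s, e, t1, t2, t1_failed):
    s1, e1 = t1[1], t1[2]
    s2, e2 = t2[1], t2[2]
    # the bounds of T1 and T2 must partition [s, e]: s = s1, e1 = s2, e = e2
    if not (s == s1 and e1 == s2 and e == e2):
        return False
    if t1_failed:
        # the second branch was never visited:
        # it must be a non-consuming skip node
        return t2[0] == "skip" and s2 == e2
    return True

print(seq_well_formed(0, 3, ("any", 0, 1), ("any", 1, 3), False))  # True
```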
Usually, as we mentioned and as in [11] and [15], parsing is defined as a relation between the inputs and the outputs that satisfies a certain number of derivation rules corresponding to all the parsing cases. Lemmas that link the properties of the grammar to the parsing relation are then proved to ensure that any algorithm following the parsing rules terminates. Those lemmas are the following: (1) if the parsing fails, the grammar is of type P⊥; (2) if the parsing succeeds without consuming any input, the grammar is of type P0; (3) if the parsing succeeds consuming at least one token, the grammar is of type P>. We chose not to follow this approach, but to include all the conditions and the lemmas in the type system of PVS. This has several advantages:
• The type system is verified by PVS, but once it is proven, it has no impact on the actual computation.
• We can rely on fewer axioms, as there is no need for a parsing relation to be defined. All the axioms of the parsing relation become proven properties of the parser, and such proofs are easy to do. (It is also possible to put less information into the type system and instead prove lemmas, but such proofs would need to be done by induction, covering all the possible cases each time. When using the type system, the corresponding induction cases are split into the type-check conditions.)

To prove that well-formed trees actually correspond to the parsing of a given input with a given grammar, we introduce two notions. We say that a well-formed tree T is true to a grammar G if we can rebuild G from T. (Here, we are not considering the full set of nonterminals, but only a grammar node. If this node uses nonterminals, the check of the tree recursively goes into those nonterminals. This notion thus depends on a given set of nonterminals P_exp.) We say that a well-formed tree T is true to an input I if the characters stored in the tree at their starting indices correspond to the input. This notion only applies to the part of the input covered by the tree, namely between the start and the end indices of the tree. This yields three results that can be proved by induction (see Figure 5 for the PVS code; the full code of the trueToGrammar and trueToInput functions is not given here, as it consists of very simple recursive checks, basically a case analysis on the constructors of the pre_ast datatype; the code can be seen on the github page):

Theorem 4.4 (Uniqueness for well-formed trees). We have:
• If a grammar G1 and a grammar G2 are both true to a given well-formed tree T, then G1 = G2.
• If an input I1 and an input I2 are both true to a given well-formed tree T, and if the starting and ending indices of the tree are (s, e), then ∀i ∈ ⟦s, e⟧, I1(i) = I2(i).
• If two well-formed trees T1 and T2 are both true to the same grammar G and to the same input I, and start at the same point (s(T1) = s(T2)), then T1 = T2.

This result is needed to prove that a parser is complete. Since the parser produces well-formed trees that are true to the grammar and input given as arguments, we get that the resulting tree is the only one possible. If we add the fact that well-formed trees are always either trees of success or of failure, we get that parsing expression grammars always either succeed or fail, and that the resulting tree is unique given the grammar and input. In the PVS implementation, we define the trueToGrammar and trueToInput functions and prove the uniqueness results.
% Grammar uniqueness :
unique_grammar : LEMMA
  ∀ (T, G1, G2, P_exp):
    (trueToGrammar(T, G1, P_exp) ∧ trueToGrammar(T, G2, P_exp)) ⇒ (G1 = G2)

% Input uniqueness :
unique_input : LEMMA
  ∀ ((T: ast), inp1, inp2):
    (trueToInput(T, inp1) ∧ trueToInput(T, inp2)) ⇒
      (∀ (i: below(e(T))): (i >= s(T)) ⇒ inp1(i) = inp2(i))

% Main uniqueness theorem :
unique_tree : LEMMA
  ∀ ((T1, T2 : ast), inp, G, P_exp):
    (trueToInput(T1, inp) ∧ trueToInput(T2, inp) ∧
     trueToGrammar(T1, G, P_exp) ∧ trueToGrammar(T2, G, P_exp) ∧
     s(T1) = s(T2)) ⇒ T1 = T2
Figure 5.
PVS implementation of the uniqueness results.
Now that we have a well-defined notion of computational path, along with well-formed grammars, we can define a parser interpreter function that is surprisingly simple.
The peg_parser function is intended to be an interpreter for parsers of any PEG. It takes the following input arguments:
• P_exp, the interpretation of nonterminals, which must be well-formed.
• A ∈ V_N, the current nonterminal.
• G ⊑ P_exp(A), the current grammar node.
• inp, the input string, represented as an array of characters.
• b, the bound of the parsing, less than or equal to the length of the input.
• s, the starting index for the current node.
• s_T, the starting index at which the parsing of the current nonterminal started. We require s_T ≤ s, and if s = s_T (which means nothing was consumed since the start of the current nonterminal), G must be strongly well-formed. Indeed, as we saw, we allow subexpressions of a given node P_exp(A) to be only structurally well-formed after consumption of a character.

The output type captures much of the complexity of the parsing steps, chiefly ensuring that trees consume input or fail coherently with the grammar. The output type is the subset of well-formed trees T that satisfy the following conditions (the implementation is given in Figure 6). The last three conditions ensure that the actual result of the parsing corresponds to a property of the grammar.
• s(T) = s.
• If G is a star, then T must be a star tree.
• T ∉ S: T is not a skip.
• T is true to the grammar G.
• T is true to the input I.
• (η(T) = ⊤ ∧ e(T) = s(T)) ⇒ P0(G).
• (η(T) = ⊤ ∧ e(T) > s(T)) ⇒ P>(G).
• (η(T) = ⊥) ⇒ P⊥(G).

% Output type : bounds the tree and the grammar
output(P_exp : WF_nT,
       A: below(V_N_b),
       G: {e : Δ | subterm(e, P_exp(A))},
       inp: input,
       s: inp_bound,
       s_T: {k : upto(s) | (k = s) ⇒ g_wf(G, A, P_0c?(P_exp), strong)}) : TYPE =
  {T : ast |
    % T starts at the intended position
    (s(T) = s) ∧
    % T is true to the grammar and the input
    (trueToGrammar(T, G, P_exp)) ∧ (trueToInput(T, inp)) ∧
    % If the parsing succeeds without consuming something, G is of type P_0
    (((astType?(T) = success) ∧ (e(T) = s)) ⇒ P_0c?(P_exp)(G)) ∧
    % If the parsing succeeds consuming something, G is of type P_s
    (((astType?(T) = success) ∧ (e(T) > s)) ⇒ P_sc?(P_exp)(G)) ∧
    % If the parsing fails, G is of type P_f
    ((astType?(T) = failure) ⇒ P_fc?(P_exp)(G))}
Figure 6. Output type definition. The output function produces a type based on its arguments. The type is expressed by comprehension over the ast type of well-formed trees.

Those properties ensure that at every step of the computation, the properties needed for termination are preserved. Thanks to all the previous work on grammars and ASTs, the parser itself is surprisingly simple (see Figure 7). The types contain most of the valuable information.
Termination is proved using a strictly decreasing lexicographic order on the 4-tuple:

(b − s_T, b − s, A, |G|)

At each step, we either:
1. go down the grammar, so |G| decreases;
2. use a strictly lower nonterminal, so A decreases;
3. in the case of a star in the grammar, keep the same grammar node, current nonterminal, and s_T, while s increases (because if e⋆ is well-formed, we have ¬P0(e));
4. reach a nonTerminal node greater than the current one, in which case the recursive call is made with s_T ← s. Here s must be strictly greater than s_T, ensuring that at least one token was consumed before using a greater nonterminal.

parsing(P_exp : WF_nT,          % Set of nonterminals
        A: below(V_N_b),        % Current nonterminal
        G: {e : Δ | subterm(e, P_exp(A))},  % Grammar node
        inp: input,             % Input array
        s: inp_bound,           % Starting index
        % Starting index at the beginning of the parsing of the current nonterminal:
        s_T: {k : upto(s) | (k = s) ⇒ g_wf(G, A, P_0c?(P_exp), strong)}) :
  RECURSIVE output(P_exp, A, G, inp, s, s_T) =
  CASES G OF
    ϵ : ϵ(s, s),
    any: any(s, min(s+1, bound), inp(s)),
    terminal(a): terminal(s, min(s+1, bound), a, inp(s)),
    nonTerminal(B):
      let T_B = parsing(P_exp, B, P_exp(B), inp, s, s) in
      nonTerminal(s, e(T_B), B, T_B),
    seq(e1, e2):
      let T1 = parsing(P_exp, A, e1, inp, s, s_T) in
      if (astType?(T1) = failure) then seq(s, e(T1), T1, skip(e(T1), e(T1), e2))
      else
        let T2 = parsing(P_exp, A, e2, inp, e(T1), s_T) in
        seq(s, e(T2), T1, T2)
      endif,
    prior(e1, e2):
      let T1 = parsing(P_exp, A, e1, inp, s, s_T) in
      if (astType?(T1) = success) then prior(s, e(T1), T1, skip(s, s, e2))
      else
        let T2 = parsing(P_exp, A, e2, inp, s, s_T) in
        prior(s, e(T2), T1, T2)
      endif,
    star(e):
      let T0 = parsing(P_exp, A, e, inp, s, s_T) in
      if (astType?(T0) = failure) then star(s, s, T0, skip(e(T0), e(T0), star(e)))
      else
        let Ts = parsing(P_exp, A, star(e), inp, e(T0), s_T) in
        star(s, e(Ts), T0, Ts)
      endif,
    notP(e): let T = parsing(P_exp, A, e, inp, s, s_T) in notP(s, s, T)
  ENDCASES
  MEASURE lex4(bound - s_T, bound - s, A, pegMeasure(G))

Figure 7. Code for the parser interpreter. The CASES...OF syntax is used to do a case analysis on the possible constructors of the PEG datatype. As for every recursive function in PVS, we provide the measure ensuring termination (lex4 stands for lexicographic ordering on 4 elements).

As we mentioned, this reference parser satisfies the rules defined by [11] by design. This was ensured by the structure of well-formed AST trees, but we can also prove it a posteriori. It is almost always trivial, as expanding the definition of the parsing function is enough to show the property. An example is shown in Figure 8. In the actual implementation, the full set of rules is proven.
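The recursive structure of this interpreter can also be illustrated outside PVS. The following Python sketch uses our own encoding and names, carries none of the typing guarantees of the PVS version, and assumes a well-formed grammar (so that star bodies always consume); like the pre-AST, the returned trace records the computational path:

```python
# Sketch of the reference interpreter's recursive structure in plain Python.
# Grammar nodes: ("eps",), ("any",), ("term", c), ("nt", name),
# ("seq", g1, g2), ("prior", g1, g2), ("star", g), ("notp", g).
# parse returns (succeeded, end_index, trace).

def parse(rules, g, inp, s):
    kind = g[0]
    if kind == "eps":
        return True, s, ("eps", s, s)
    if kind == "any":
        if s < len(inp):
            return True, s + 1, ("any", s, s + 1, inp[s])
        return False, s, ("any", s, s, None)
    if kind == "term":
        if s < len(inp) and inp[s] == g[1]:
            return True, s + 1, ("term", s, s + 1, g[1])
        return False, s, ("term", s, s, g[1])
    if kind == "nt":
        ok, e, t = parse(rules, rules[g[1]], inp, s)
        return ok, e, ("nt", s, e, g[1], t)
    if kind == "seq":
        ok1, e1, t1 = parse(rules, g[1], inp, s)
        if not ok1:
            return False, e1, ("seq", s, e1, t1, ("skip",))
        ok2, e2, t2 = parse(rules, g[2], inp, e1)
        return ok2, e2, ("seq", s, e2, t1, t2)
    if kind == "prior":
        ok1, e1, t1 = parse(rules, g[1], inp, s)
        if ok1:
            return True, e1, ("prior", s, e1, t1, ("skip",))
        ok2, e2, t2 = parse(rules, g[2], inp, s)
        return ok2, e2, ("prior", s, e2, t1, t2)
    if kind == "star":
        ok0, e0, t0 = parse(rules, g[1], inp, s)
        if not ok0:                       # greedy: stop when the body fails
            return True, s, ("star", s, s, t0, ("skip",))
        _, er, tr = parse(rules, g, inp, e0)
        return True, er, ("star", s, er, t0, tr)
    if kind == "notp":
        ok, _, t = parse(rules, g[1], inp, s)
        return (not ok), s, ("notp", s, s, t)

rules = {"AB": ("seq", ("term", "a"), ("star", ("term", "b")))}
ok, e, tree = parse(rules, ("nt", "AB"), "abbb", 0)
print(ok, e)  # True 4
```

In the PVS development, the conditions this sketch leaves implicit (bounds, skips, grammar properties) are enforced by the output type.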
As we can see in the output type, the produced trees are true to the input and to the grammar. Thanks to the uniqueness results, this ensures that the only proof of parse or proof of failure that exists for such a pair of input and grammar is the one found by the parser.
parsing_correctness_prior : LEMMA
  FORALL (P_exp : WF_nT,
          (A : V_N),
          (e1, e2 : {G : Δ | subterm(G, P_exp(A))}),
          (inp : input),
          (s : upto(bound)),
          (s_T: {k : upto(s) | (k = s) ⇒ g_wf(prior(e1, e2), A, P_0c?(P_exp), strong)})) :
    subterm(prior(e1, e2), P_exp(A)) ⇒
      LET T1 = parsing(P_exp, A, e1, inp, s, s_T) IN
      LET T2 = parsing(P_exp, A, e2, inp, s, s_T) IN
      LET T  = parsing(P_exp, A, prior(e1, e2), inp, s, s_T) IN
        ((astType?(T1) = ⊤) ⇒ (astType?(T) = ⊤ ∧ e(T) = e(T1))) ∧
        ((astType?(T1) = ⊥) ⇒ (astType?(T) = astType?(T2) ∧ e(T) = e(T2)))

parsing_correctness_notP : LEMMA
  FORALL (P_exp : WF_nT,
          (A : V_N),
          (e : {G : Δ | subterm(G, P_exp(A))}),
          (inp : input),
          (s : upto(bound)),
          (s_T: {k : upto(s) | (k = s) ⇒ g_wf(notP(e), A, P_0c?(P_exp), strong)})) :
    subterm(notP(e), P_exp(A)) ⇒
      LET T_n = parsing(P_exp, A, e, inp, s, s_T) IN
      LET T   = parsing(P_exp, A, notP(e), inp, s, s_T) IN
        ((astType?(T_n) = ⊥) ⇒ (astType?(T) = ⊤ ∧ e(T) = s)) ∧
        ((astType?(T_n) = ⊤) ⇒ (astType?(T) = ⊥))

Figure 8. Example of lemmas showing that the parser satisfies the axioms of parsing. In the PVS implementation, the full set of axioms is proven.

The natural extension of a PEG parser is to build a packrat parser (see [10]) upon the reference parser. As we already have a reference parser, it is easy to build an efficient packrat parser that relies only on the reference one (through the type system). The output tree can also be compacted on the fly, removing useless failing branches, while ensuring that the output is equivalent to that of the reference parser. This allows us to use the reference parser as a ghost reference while never actually calling it. This whole section is directly inspired by the work of Ford [10].
The key to transforming the worst-case exponential complexity of parsing into a linear one comes from the following observations:
• Searching for a pattern A at a starting position s is independent from previously done parsing; it relies only on A and s.
• There are at most n patterns, and at most b starting positions at which we can search.

We deduce that if we store the results of parsing nonterminals at given starting positions when we compute them (laziness), the parser will be called at most b × n times. As a recursive call is made in constant time, we end up with a complexity bounded by O(b × n).

We use a PVS structure to store results as they are computed, as shown in Figure 9. The type of those objects is based on the reference parser, ensuring that we only store results that we could obtain by calling the parser with the same parameters. We then modify the parsing function to return both the AST and the record of computed results. When we parse a nonterminal, the result is either already known (and directly returned), or we compute it and update the record of results. This modification is shown in Figure 10.

% Datatype used for memoization of partial parses
saved_result : DATATYPE
BEGIN
  unknown : unknown?
  known(T : ast) : known?
END saved_result

% Results type : ensures that every partial parse corresponds to the reference parser
results(P_exp : WF_nT, inp: input) : TYPE =
  {r : [V_N -> [inp_bound -> saved_result]] |
    FORALL (A : V_N, s : upto(bound)) :
      known?(r(A)(s)) ⇒ T(r(A)(s)) = parsing(P_exp, A, P_exp(A), inp, s, s)}
Figure 9.
The result structure used to store intermediate results as they are computed. A datatype with two constructors is used. The results function produces a type by comprehension over the set of functions that take a nonterminal and a starting index and return a stored result. Such a stored result must be equal to the result of the reference parser called on the same arguments.
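The memoization idea can be sketched concretely. In the following Python sketch (a toy encoding of ours, not the PVS development), results for a nonterminal at a position are cached in a dictionary keyed by (A, s), so each pair is computed at most once:

```python
# Sketch of packrat memoization: the result of parsing a nonterminal A at a
# position s depends only on (A, s), so it can be cached.

def packrat(rules, inp):
    memo = {}                               # (nonterminal, pos) -> (ok, end)

    def parse(g, s):
        kind = g[0]
        if kind == "term":
            if s < len(inp) and inp[s] == g[1]:
                return True, s + 1
            return False, s
        if kind == "nt":
            key = (g[1], s)
            if key not in memo:             # compute once, then reuse
                memo[key] = parse(rules[g[1]], s)
            return memo[key]
        if kind == "seq":
            ok1, e1 = parse(g[1], s)
            if not ok1:
                return False, e1
            return parse(g[2], e1)
        if kind == "prior":
            ok1, e1 = parse(g[1], s)
            return (ok1, e1) if ok1 else parse(g[2], s)
        raise ValueError(kind)

    return parse, memo

# A grammar where the same nonterminal is re-parsed at the same position:
rules = {"X": ("term", "x")}
g = ("prior", ("seq", ("nt", "X"), ("term", "a")),
              ("seq", ("nt", "X"), ("term", "b")))
parse, memo = packrat(rules, "xb")
print(parse(g, 0))   # (True, 2): the second alternative succeeds
print(len(memo))     # 1: X at position 0 was parsed once and reused
```

In the PVS version, the results type additionally proves that every cached entry equals what the reference parser would compute; this sketch only implements the caching itself.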
As we mentioned, the reference parser can be used as a reference, stating that the results of the packrat parser are the same as if the reference parser had been called with the same arguments. This is illustrated in Figure 11. Again, those conditions are typing conditions, which means that once they are proved through the type-checking condition system, they do not impact the actual computation. (In the reference parser implementation, the use of the astType function makes a call non-constant, but this can easily be changed by modifying the parser so that it also outputs the type of the returned tree. This is done in the compacted semantic trees of Section 6, as failure trees are reduced to failure nodes, making type-checking of trees trivial.)
PP ’20, January 20–21, 2020, New Orleans, LA, USA Clement Blaudeau and Natarajan Shankar
CASES G OF
  ...
  nonTerminal(B):
    (CASES res(B)(s) OF
       known(T_B): (nonTerminal(s, e(T_B), B, T_B), res),
       unknown:
         let (T_B, resB) = packrat_parser(P_exp, B, P_exp(B), inp, s, s, res) in
         (nonTerminal(s, e(T_B), B, T_B),
          resB WITH [B := resB(B) WITH [s := known(T_B)]])
     ENDCASES),
  ...
ENDCASES
Figure 10.
Parsing a nonterminal, checking if the result is already known. If the result is unknown, we compute it and extend the function using the WITH [.. := ..] syntax.

packrat_parser(P_exp : WF_nT, A: below(V_N_b),
               G: {e : Δ | subterm(e, P_exp(A))},
               inp: input, s: upto(bound),
               s_T: {k: upto(s) | (k = s) ⇒ g_wf(G, A, P_0c?(P_exp), strong)},
               res: results(P_exp, inp)) : RECURSIVE
  [{T : pre_ast | T = parsing(P_exp, A, G, inp, s, s_T)},
   results(P_exp, inp)]

Figure 11. The packrat parser returns a tree that is the same as the one the reference parser would produce, and a function that satisfies the results type.
In this section we introduce a modification of the parser that allows the user to specify the data structure of the parsed result according to their specific needs. Indeed, the ASTs we defined are interesting as effective proofs of parse (especially since they provide a complete description of the computational path of the parser), but in practice, users are more interested in data structures specific to their grammar rather than to the underlying PEG operators. For example, an HTML parser should output a DOM tree, not the cumbersome full AST of the parser. In order to modify the parser to create those semantic trees, we need to introduce modifications to the AST type, specify how the user provides the constructors for their data structure, and explain how we maintain the equivalence with the reference parser.
To allow the user to produce custom data structures, we need to introduce a new type for those data structures: V_S. In order to simplify the definitions, we do not distinguish between the subtypes the user may be using in the data structure. We only consider that V_S is the superset of all user types. We call elements of V_S semantic values. When designing a grammar, the user creates nonterminals for each pattern in the intended input. Therefore, the meaningful unit that can be transformed into a semantic value is the result of the parsing of a nonterminal. To replace nonterminal ASTs by their semantic values, we need to add a new constructor to the AST type to store such values. This is shown in Figure 12. Following the same idea, we also introduce a new failure node to replace failing trees that are useless to the user.

pre_ast : DATATYPE
  ...
  % s : start index
  % e : end index
  % A : nonterminal that was interpreted as a semantic value
  % S : semantic value stored
  semantic(s, e: below, A: V_N, S: V_S): semantic?
  % s : start index
  % e : end index
  fail(s, e): fail?
END pre_ast
Figure 12. New constructors for the AST type. These two new constructors are used to compact the tree.
Now that we can store semantic values (which correspond to nonterminal subtrees), we can replace those subtrees by their computed semantic values. This transformation is called the semantic interpretation of a tree. It can be done on a fully computed AST produced by the parser, or on the fly while parsing. The user provides a function that transforms a nonterminal node (P_inp : AST → V_S), and we do a depth-first traversal of the tree to recursively replace those nodes by their semantic values. To compact the tree, the semantic interpretation also replaces failing subtrees by a simple fail node. The implementation is given in Figure 13. A tree that has no nonterminal nodes (only semantic ones) and no failing branches (except for fail nodes) is called a semantic compacted tree. A semantic parser will simply compute the semantic interpretation of each nonterminal on the fly during parsing.

(Even though the function can compute semantic interpretations on full trees, the idea is to do it on the fly, and to use the function on full trees only to prove equivalence with the reference parser. At runtime, in order to store the smallest possible trees, which can get very big for complex grammars, we compact subtrees as soon as possible by taking their semantic interpretations. A few conditions are added to the output type for the sake of simplification later on in the parser.)
% Function that transforms a tree into its semantic interpretation, recursively
s_inp(P_inp : semantic_interp, T: pre_ast) : RECURSIVE
  {T' : semanticTree |
    ((skip?(T) ⇔ skip?(T')) ∧
     (star?(T) ⇔ star?(T')) ∧
     ((astType?(T) = failure) ⇔ fail?(T')) ∧
     (s(T') = s(T)) ∧ (e(T') = e(T)) ∧
     (astWellformed?(T) ⇒ (astWellformed?(T') ∧ (astType?(T') = astType?(T)))))} =
  IF (astType?(T) = failure) THEN fail(s(T), e(T))
  ELSE
    CASES T OF
      ϵ(s, e): T,
      any(s, e, x): T,
      terminal(s, e, x, y): T,
      nonTerminal(s, e, A, T): semantic(s, e, A, P_inp(A, s_inp(P_inp, T))),
      semantic(s, e, A, S): T,
      seq(s, e, T1, T2): seq(s, e, s_inp(P_inp, T1), s_inp(P_inp, T2)),
      prior(s, e, T1, T2): prior(s, e, s_inp(P_inp, T1), s_inp(P_inp, T2)),
      star(s, e, T0, Ts): star(s, e, s_inp(P_inp, T0), s_inp(P_inp, Ts)),
      notP(s, e, T): notP(s, e, s_inp(P_inp, T)),
      skip(s, e, G): T
    ENDCASES
  ENDIF
  MEASURE astMeasure(T)

Figure 13. Recursive semantic interpretation of a tree.

A representation of the semantic interpretation of a simple arithmetic expression is given in Figure 14.
The notion of equivalence with the reference parser becomes more subtle here, as compacted semantic trees are not equal to the trees produced by the reference parser. The main idea is the following: we want to ensure that when information is condensed in a compacted tree, we get the same information as if we called the reference parser on the same input and then compacted the resulting tree. Here is the detailed definition of equivalence:
1. For all nodes that are neither fail nor semantic nodes, the reference and compacted trees should be the same.
2. For a fail(s, e) node, we need to check that if we called the reference parser at the same starting point, we would get a failing tree ending at the same index e.
3. For a semantic(s, e, A, S) node, we need to check that the semantic interpretation of the output of the reference parser would be equal to the value S stored here.

These conditions on the output of the semantic parser interpreter can be captured directly in the type system. The output type of the semantic parser interpreter is the following (using the s_inp function):

{T: pre_ast | T = s_inp(P_inp, parsing(P_exp, A, G, inp, s, s_T))}

Example.
The PVS suite provides an interactive interface to execute the verified code: PVSio [9]. Using this system, we can test the parser on real examples. We crafted a very simple arithmetic expression parser for test purposes: the expression is represented as an array of ASCII values (1*2+(3-4/5) is represented as the sequence [49, 42, 50, 43, 40, 51, 45, 52, 47, 53, 41]). The interaction with the system is shown in Figure 15. We can see that the semantic parser does produce a compacted tree that corresponds to the reference one, and that modifying the input changes the result accordingly.

Extracting data from input streams representing programs, text, documents, images, and video is a complex task. Parsers for these data formats transform the input data streams into actionable data while rejecting incorrect inputs. Parsing is supported by a rich body of theory, but it is also central to practice. Many software vulnerabilities arise from poorly designed grammars, ambiguous inputs, and incorrect parsing. Parsing expression grammars are a widely used class of expressive grammars that support efficient and unambiguous packrat parsing through memoization.
Figure 14.
Representation of the semantic interpretation of a simple arithmetic expression. Numbers are parsed as lists of digits (grey boxes).

% Parsing of the arithmetic expression 1*2+(3-4/5)
% (encoded in ascii: 49 42 50 43 40 51 45 52 47 53 41)
Figure 15.
Example of using the semantic parser for simple arithmetic expressions.

We have used PVS to formalize the metatheory of PEGs and derived correct parser interpreters for these grammars supporting memoization and semantic actions. The proofs have been mechanically verified in PVS by taking advantage of the expressiveness of the PVS type system. A significant part of the metatheory covers the analysis of grammars constructed from PEG expressions to check whether a parse based on the expression could fail, could succeed without consuming input, or could succeed only by consuming input. This analysis is used to establish the termination of a reference parser interpreter for PEGs. We also define a parse tree representation that captures the trace of the parser on an input. This representation serves as evidence for the parsing of the input with respect to a given grammar. We establish uniqueness results demonstrating that there is exactly one grammar true to a given parse tree, and exactly one parse tree for a given input with respect to a given grammar. These uniqueness results demonstrate that the reference parser is complete: it either returns a successful parse tree, and there is a unique such tree, or a failed parse tree, and there is no other parse tree that parses the same input with success or failure.

In future work, we plan to conduct empirical studies of the performance of the parsers, explore the use of efficient data structures for parsing, develop a systematic methodology for the derivation of correct-by-construction parser interpreters and generators for other grammar formats, and experiment with the integration of the generated parsers into security-critical applications. Our work is a preliminary step toward powerful correct-by-construction parser interpreters and generators for expressive grammars and data format descriptions.
References

[1] Alfred V. Aho, Ravi Sethi, and Jeffrey D. Ullman. 1986. Compilers: Principles, Techniques, and Tools. Addison-Wesley.
[2] Aditi Barthwal and Michael Norrish. 2009. Verified, executable parsing. In Programming Languages and Systems, 18th European Symposium on Programming, ESOP 2009 (Lecture Notes in Computer Science), Giuseppe Castagna (Ed.), Vol. 5502. Springer, 160–174.
[3] Jean-Philippe Bernardy and Patrik Jansson. 2016. Certified context-free parsing: A formalisation of Valiant's algorithm in Agda. Logical Methods in Computer Science 12, 2 (2016), 1–28. https://doi.org/10.2168/LMCS-12(2:6)2016
[4] Ana Bove, Peter Dybjer, and Ulf Norell. 2009. A brief overview of Agda - A functional language with dependent types. In Theorem Proving in Higher Order Logics, 22nd International Conference, TPHOLs 2009 (Lecture Notes in Computer Science), Stefan Berghofer, Tobias Nipkow, Christian Urban, and Makarius Wenzel (Eds.), Vol. 5674. Springer, 73–78. https://doi.org/10.1007/978-3-642-03359-9_6
[5] Edwin Brady. 2013. Idris, a general-purpose dependently typed programming language: Design and implementation. J. Funct. Program. 23, 5 (2013), 552–593. https://doi.org/10.1017/S095679681300018X
[6] Sergey Bratus, Lars Hermerschmidt, Sven M. Hallberg, Michael E. Locasto, Falcon Momot, Meredith L. Patterson, and Anna Shubina. 2017. Curing the vulnerable parser: Design patterns for secure input handling. ;login: 42, 1 (2017), 33–39.
[7] William H. Burge. 1975. Recursive Programming Techniques. Addison-Wesley, Reading, MA.
[8] Benjamin Delaware, Sorawit Suriyakarn, Clément Pit-Claudel, Qianchuan Ye, and Adam Chlipala. 2019. Narcissus: Correct-by-construction derivation of decoders and encoders from binary formats. PACMPL 3, ICFP (2019), 82:1–82:29. https://doi.org/10.1145/3341686
[9] Aaron Dutle, César A. Muñoz, Anthony Narkawicz, and Ricky W. Butler. 2015. Software validation via model animation. In Tests and Proofs - 9th International Conference, TAP 2015 (Lecture Notes in Computer Science), Jasmin Christian Blanchette and Nikolai Kosmatov (Eds.), Vol. 9154. Springer, 92–108. https://doi.org/10.1007/978-3-319-21215-9_6
[10] Bryan Ford. 2002. Packrat parsing: Simple, powerful, lazy, linear time, functional pearl. In Proceedings of the Seventh ACM SIGPLAN International Conference on Functional Programming (ICFP '02), Mitchell Wand and Simon L. Peyton Jones (Eds.). ACM, New York, NY, USA, 36–47. https://doi.org/10.1145/581478.581483
[11] Bryan Ford. 2004. Parsing expression grammars: A recognition-based syntactic foundation. In Proceedings of the 31st ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, POPL 2004, Neil D. Jones and Xavier Leroy (Eds.). ACM, 111–122. https://doi.org/10.1145/964001.964011
[12] Richard A. Frost and John Launchbury. 1989. Constructing natural language interpreters in a lazy functional language. Comput. J. 32, 2 (1989), 108–121. https://doi.org/10.1093/comjnl/32.2.108
[13] Graham Hutton. 1989. Parsing using combinators. In Functional Programming, Proceedings of the 1989 Glasgow Workshop (Workshops in Computing), Kei Davis and John Hughes (Eds.). Springer, 353–370.
[14] Jacques-Henri Jourdan, François Pottier, and Xavier Leroy. 2012. Validating LR(1) parsers. In Programming Languages and Systems - 21st European Symposium on Programming, ESOP 2012 (Lecture Notes in Computer Science), Helmut Seidl (Ed.), Vol. 7211. Springer, 397–416.
[15] Adam Koprowski and Henri Binsztok. 2011. TRX: A formally verified parser interpreter. Logical Methods in Computer Science 7, 2 (2011), 1–26. https://doi.org/10.2168/LMCS-7(2:18)2011
[16] Ramana Kumar, Magnus O. Myreen, Michael Norrish, and Scott Owens. 2014. CakeML: A verified implementation of ML. In The 41st Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, POPL '14, Suresh Jagannathan and Peter Sewell (Eds.). ACM, 179–192. https://doi.org/10.1145/2535838.2535841
[17] Sam Lasser, Chris Casinghino, Kathleen Fisher, and Cody Roux. 2019. A verified LL(1) parser generator. In Interactive Theorem Proving, ITP 2019 (LIPIcs), John Harrison, John O'Leary, and Andrew Tolmach (Eds.), Vol. 141. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 24:1–24:18. https://doi.org/10.4230/LIPIcs.ITP.2019.24
[18] Xavier Leroy. 2009. Formal verification of a realistic compiler. Commun. ACM 52, 7 (2009), 107–115. http://doi.acm.org/10.1145/1538788.1538814
[19] Raul Lopes, Rodrigo Geraldo Ribeiro, and Carlos Camarão. 2016. Certified derivative-based parsing of regular expressions. In Programming Languages - 20th Brazilian Symposium, SBLP 2016 (Lecture Notes in Computer Science), Fernando Castor and Yu David Liu (Eds.), Vol. 9889. Springer, 95–109. https://doi.org/10.1007/978-3-319-45279-1_7
[20] Greg Morrisett, Gang Tan, Joseph Tassarotti, Jean-Baptiste Tristan, and Edward Gan. 2012. RockSalt: Better, faster, stronger SFI for the x86. In ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI '12, Jan Vitek, Haibo Lin, and Frank Tip (Eds.). ACM, 395–404. https://doi.org/10.1145/2254064.2254111
[21] Sam Owre, John Rushby, Natarajan Shankar, and Friedrich von Henke. 1995. Formal verification for fault-tolerant architectures: Prolegomena to the design of PVS. IEEE Transactions on Software Engineering 21, 2 (Feb. 1995), 107–125. PVS home page: http://pvs.csl.sri.com.
[22] Tahina Ramananandro, Antoine Delignat-Lavaud, Cédric Fournet, Nikhil Swamy, Tej Chajed, Nadim Kobeissi, and Jonathan Protzenko. 2019. EverParse: Verified secure zero-copy parsers for authenticated message formats. In 28th USENIX Security Symposium, Nadia Heninger and Patrick Traynor (Eds.). USENIX Association, 1465–1482.
[23] Tom Ridge. 2011. Simple, functional, sound and complete parsing for all context-free grammars. In Certified Programs and Proofs - First International Conference, CPP 2011 (Lecture Notes in Computer Science), Jean-Pierre Jouannaud and Zhong Shao (Eds.), Vol. 7086. Springer, 103–118.
[24] Gang Tan and Greg Morrisett. 2018. Bidirectional grammars for machine-code decoding and encoding. J. Autom. Reasoning 60, 3 (2018), 257–277. https://doi.org/10.1007/s10817-017-9429-1
[25] Mark Tullsen, Lee Pike, Nathan Collins, and Aaron Tomb. 2018. Formal verification of a Vehicle-to-Vehicle (V2V) messaging system. In Computer Aided Verification - 30th International Conference, CAV 2018 (Lecture Notes in Computer Science), Hana Chockler and Georg Weissenbacher (Eds.), Vol. 10982. Springer, 413–429. https://doi.org/10.1007/978-3-319-96142-2_25
[26] Marcell van Geest and Wouter Swierstra. 2017. Generic packet descriptions: Verified parsing and pretty printing of low-level data. In Proceedings of the 2nd ACM SIGPLAN International Workshop on Type-Driven Development, TyDe@ICFP 2017, Sam Lindley and Brent A. Yorgey (Eds.). ACM, 30–40. https://doi.org/10.1145/3122975.3122979
[27] Qianchuan Ye and Benjamin Delaware. 2019. A verified protocol buffer compiler. In
Proceedings of the 8th ACM SIGPLAN InternationalConference on Certified Programs and Proofs, CPP 2019, Cascais, Portugal,January 14-15, 2019 , Assia Mahboubi and Magnus O. Myreen (Eds.).ACM, 222–233. https://doi.org/10.1145/3293880.3294105https://doi.org/10.1145/3293880.3294105