Synthesizing Context-free Grammars from Recurrent Neural Networks (Extended Version)
Daniel M. Yellin (IBM, Givatayim, Israel, [email protected]) and Gail Weiss (Technion, Haifa, Israel, [email protected])

⋆ This is an extended version of a paper that will appear in the 27th International Conference on Tools and Algorithms for the Construction and Analysis of Systems (TACAS 2021).
Abstract.
We present an algorithm for extracting a subclass of the context free grammars (CFGs) from a trained recurrent neural network (RNN). We develop a new framework, pattern rule sets (PRSs), which describe sequences of deterministic finite automata (DFAs) that approximate a non-regular language. We present an algorithm for recovering the PRS behind a sequence of such automata, and apply it to the sequences of automata extracted from trained RNNs using the L* algorithm. We then show how the PRS may be converted into a CFG, enabling a familiar and useful presentation of the learned language.
Extracting the learned language of an RNN is important to facilitate understanding of the RNN and to verify its correctness. Furthermore, the extracted CFG can augment the RNN in classifying correct sentences, as the RNN's predictive accuracy decreases when the recursion depth and distance between matching delimiters of its input sequences increases.

Keywords: Model Extraction · Learning Context Free Grammars · Finite State Machines · Recurrent Neural Networks
1 Introduction

Recurrent Neural Networks (RNNs) are a class of neural networks adapted to sequential input, enjoying wide use in a variety of sequence processing tasks. Their internal process is opaque, prompting several works into extracting interpretable rules from them. Existing works focus on the extraction of deterministic or weighted finite automata (DFAs and WFAs) from trained RNNs [19,6,27,3]. However, DFAs are insufficient to fully capture the behavior of RNNs, which are known to be theoretically Turing-complete [21], and for which there exist architecture variants such as LSTMs [14] and features such as stacks [9,24] or attention [4] increasing their practical power.
Fig. 1. Overview of the steps in the algorithm to synthesize the hidden language L: from an RNN trained on the (hidden) target L, generate a sequence of DFAs converging to L, infer and filter PRS rules from the sequence, and convert them into an inferred CFG for L.
Several recent investigations explore the ability of different RNN architectures to learn Dyck, counter, and other non-regular languages [20,5,28,22], with mixed results. While the data indicates that RNNs can generalize and achieve high accuracy, they do not learn hierarchical rules, and generalization deteriorates as the distance or depth between matching delimiters becomes dramatically larger [20,5,28]. Sennhauser and Berwick conjecture that "what the LSTM has in fact acquired is sequential statistical approximation to this solution" instead of "the 'perfect' rule-based solution" [20]. Similarly, Yu et al. conclude that "the RNNs can not truly model CFGs, even when powered by the attention mechanism" [28].

Goal of this paper
We wish to extract a CFG from a trained RNN. Our motivation is two-fold: first, extracting a CFG from the RNN is important to facilitate understanding of the RNN and to verify its correctness. Second, the learned CFG may be used to augment or generalise the rules learned by the RNN, whose own predictive ability decreases as the depth of nested structures and distance between matching constructs in the input sequences increases [5,20,28]. Our technique can synthesize the CFG based upon training data with relatively short distance and small depth. As pointed out in [13], a fixed precision RNN can only learn a language of fixed depth strings (in contrast to an idealized infinite precision RNN that can recognize any Dyck language [16]). Our goal is to find the CFG that not only explains the finite language learnt by the RNN, but generalizes it to strings of unbounded depth and distance.
Our approach
Our method builds on the DFA extraction work of Weiss et al. [27], which uses the L* algorithm [2] to learn the DFA of a given RNN. The L* algorithm operates by generating a sequence of DFAs, each one a hypothesis for the target language, and interacting with a teacher, in our case the RNN, to improve them. Our main insight is that we can view these DFAs as increasingly accurate approximations of the target CFL. We assume that each hypothesis improves on its predecessor by applying an unknown rule that recursively increases the distance and embedded depth of sentences accepted by the underlying CFL. In this light, synthesizing the CFG responsible for the language learnt by the RNN becomes the problem of recovering these rules. A significant issue we must also address is that the DFAs produced are often inexact or not as we expect, either due to the failure of the RNN to accurately learn the language, or as an artifact of the L* algorithm.

We propose the framework of pattern rule sets (PRSs) for describing such rule applications, and present an algorithm for recovering a PRS from a sequence of DFAs. We also provide a method for converting a PRS to a CFG, translating our extracted rules into familiar territory. We test our method on RNNs trained on several PRS languages. Pattern rule sets are expressive enough to cover several variants of the Dyck languages, which are prototypical CFLs: the Chomsky–Schützenberger representation theorem shows that any context-free language can be expressed as a homomorphic image of a Dyck language intersected with a regular language [17]. To the best of our knowledge, this is the first work on synthesizing a CFG from a general RNN. (Though some works extract push-down automata [24,9] from RNNs augmented with an external stack (Sec. 8), their techniques do not apply to plain RNNs.)

Contributions
The main contributions of this paper are:
– Pattern Rule Sets (PRSs), a framework for describing a sequence of DFAs approximating a CFL.
– An algorithm for recovering the PRS generating a sequence of DFAs, that may also be applied to noisy DFAs elicited from an RNN using L*.
– An algorithm converting a PRS to a CFG.
– An implementation of our technique, and an evaluation of its success on recovering various CFLs from trained RNNs. (The implementation for this paper, and a link to all trained RNNs, is available at https://github.com/tech-srl/RNN_to_PRS_CFG.)

The overall steps in our technique are given in Figure 1. The rest of this paper is as follows. Section 2 provides basic definitions used in the paper, and Section 3 introduces Patterns, a restricted form of DFAs. Section 4 defines Pattern Rule Sets (PRSs), the main construct of our research. Section 5 gives an algorithm to recover a PRS from a sequence of DFAs, even in the presence of noise, and Section 6 gives an algorithm to convert a PRS into a CFG. Section 7 presents our experimental results, Section 8 discusses related research and Section 9 outlines directions for future research. Appendices B and C provide proofs of the correctness of the algorithms given in the paper, as well as results relating to the expressibility of a PRS.
2 Preliminaries

Definition 1 (Deterministic Finite Automaton). A deterministic finite automaton (DFA) over an alphabet Σ is a 5-tuple ⟨Σ, q_0, Q, F, δ⟩ such that Q is a finite set of states, q_0 ∈ Q is the initial state, F ⊆ Q is a set of final (accepting) states and δ : Q × Σ → Q is a (possibly partial) transition function.
Unless stated otherwise, we assume each DFA's states are unique to itself, i.e., for any two DFAs A, B – including two instances of the same DFA – Q_A ∩ Q_B = ∅. A DFA A is said to be complete if δ is complete, i.e., the value δ(q, σ) is defined for every (q, σ) ∈ Q × Σ. Otherwise, it is incomplete.

We define the extended transition function δ̂ : Q × Σ* → Q and the language L(A) accepted by A in the typical fashion. We also associate a language with intermediate states of A: L(A, q_1, q_2) ≜ {w ∈ Σ* | δ̂(q_1, w) = q_2}. The states from which no sequence w ∈ Σ* is accepted are known as the sink reject states.
Definition 2 (Sink Reject States). The sink reject states of a DFA A = ⟨Σ, q_0, Q, F, δ⟩ are the maximal set Q_R ⊆ Q satisfying: Q_R ∩ F = ∅, and for every q ∈ Q_R and σ ∈ Σ, either δ(q, σ) ∈ Q_R or δ(q, σ) is not defined.

Incomplete DFAs are partial representations of complete DFAs, where every unspecified transition is shorthand for a transition to a sink reject state. All definitions for complete DFAs are extended to incomplete DFAs A by considering their completion: the DFA A^C obtained by connecting a (possibly new) sink reject state to all its missing transitions. For each DFA, we take note of the transitions which cannot be removed even in its partial representations.
Definition 3 (Defined Tokens). Let A = ⟨Σ, q_0, Q, F, δ⟩ be a complete DFA with sink reject states Q_R. For every q ∈ Q, its defined tokens are def(A, q) ≜ {σ ∈ Σ | δ(q, σ) ∉ Q_R}. When the DFA A is clear from context, we write def(q).
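To make Definitions 1–3 concrete, here is a minimal Python sketch (our own illustration, not the paper's implementation; all identifiers are ours) of an incomplete DFA with an extended transition function, sink-reject-state detection, and defined-token computation:

```python
from itertools import product

class DFA:
    """Incomplete DFA: missing transitions implicitly lead to a sink reject state."""
    def __init__(self, sigma, q0, states, finals, delta):
        self.sigma = set(sigma)
        self.q0, self.states, self.finals = q0, set(states), set(finals)
        self.delta = dict(delta)  # (state, token) -> state

    def delta_hat(self, q, word):
        """Extended transition function; None means the run fell into the implicit sink."""
        for tok in word:
            if (q, tok) not in self.delta:
                return None
            q = self.delta[(q, tok)]
        return q

    def accepts(self, word):
        return self.delta_hat(self.q0, word) in self.finals

    def sink_reject_states(self):
        """Definition 2: the maximal set of non-final states that cannot escape the set."""
        candidates = self.states - self.finals
        changed = True
        while changed:
            changed = False
            for q, tok in product(list(candidates), self.sigma):
                tgt = self.delta.get((q, tok))
                if tgt is not None and tgt not in candidates:
                    candidates.discard(q)   # q has a defined transition out of the set
                    changed = True
        return candidates

    def defined_tokens(self, q):
        """Definition 3: def(A, q), the tokens that do not lead into a sink reject state."""
        sinks = self.sink_reject_states()
        return {tok for tok in self.sigma
                if (q, tok) in self.delta and self.delta[(q, tok)] not in sinks}
```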
We now introduce terminology that will help us discuss merging automata states.

Definition 4 (Set Representation of δ). A (possibly partial) transition function δ : Q × Σ → Q may be equivalently defined as the set S_δ = {(q, σ, q′) | δ(q, σ) = q′}. We use δ and S_δ interchangeably.
Definition 5 (Replacing a State). For a transition function δ : Q × Σ → Q, state q ∈ Q, and new state q_n ∉ Q, we denote by δ[q ← q_n] : Q′ × Σ → Q′ the transition function over Q′ = (Q \ {q}) ∪ {q_n} and Σ that is identical to δ except that it redirects all transitions into or out of q to be into or out of q_n.

Dyck Languages: A Dyck language of order N is expressed by the grammar D ::= ε | L_i D R_i | D D with start symbol D, where for each 1 ≤ i ≤ N, L_i and R_i are matching left and right delimiters. A common methodology for measuring the complexity of a Dyck word is to measure its maximum distance (number of characters) between matching delimiters and embedded depth (number of unclosed delimiters) [20]. While L_i and R_i are single characters in a Dyck language, we generalize and refer to Regular Expression Dyck (RE-Dyck) languages as languages expressed by the same CFG, except that each L_i and each R_i derive some regular expression.
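For illustration, the following helper (ours; counting distance inclusively of both endpoints is one common convention) computes the two complexity measures for a Dyck word over single-character delimiter pairs:

```python
def dyck_metrics(word, pairs=None):
    """Return (max_distance, max_depth) of a Dyck word.

    distance: characters spanned by a delimiter and its match, endpoints included;
    depth:    maximum number of simultaneously unclosed delimiters.
    """
    pairs = pairs or {"(": ")", "[": "]"}
    closers = set(pairs.values())
    stack, max_dist, max_depth = [], 0, 0
    for i, ch in enumerate(word):
        if ch in pairs:                      # left delimiter: remember its position
            stack.append((ch, i))
            max_depth = max(max_depth, len(stack))
        elif ch in closers:                  # right delimiter: must match the stack top
            left, j = stack.pop()
            assert pairs[left] == ch, "not a Dyck word"
            max_dist = max(max_dist, i - j + 1)
    assert not stack, "not a Dyck word"
    return max_dist, max_depth

assert dyck_metrics("([()])") == (6, 3)
```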
Regular Expressions: We present regular expressions as is standard, for example: {a|b}·c refers to the language consisting of one of a or b, followed by c.

3 Patterns

Patterns are DFAs with a single exit state q_X in place of a set of final states, and with no cycles on their initial or exit states unless q_0 = q_X. In this paper we express patterns in incomplete representation, i.e., they have no explicit sink-reject states.

Definition 6 (Patterns). A pattern p = ⟨Σ, q_0, Q, q_X, δ⟩ is a DFA A_p = ⟨Σ, q_0, Q, {q_X}, δ⟩ satisfying: L(A_p) ≠ ∅, and either q_0 = q_X, or def(q_X) = ∅ and L(A_p, q_0, q_0) = {ε}. If q_0 = q_X then p is called circular, otherwise, it is non-circular.

Note that our definition does not rule out a cycle in the middle of a non-circular pattern, but only one that traverses the initial or exit states. All the definitions for DFAs apply to patterns through A_p. We denote each pattern p's language L_p ≜ L(p), and if it is marked by some superscript i, we refer to all of its components with superscript i: p^i = ⟨Σ, q_0^i, Q^i, q_X^i, δ^i⟩.

We can compose two non-circular patterns p^1, p^2 by merging the exit state of p^1 with the initial state of p^2, creating a new pattern p satisfying L_p = L_{p^1} · L_{p^2}.
Definition 7 (Serial Composition). Let p^1, p^2 be two non-circular patterns. Their serial composite is the pattern p^1 ◦ p^2 = ⟨Σ, q_0^1, Q, q_X^2, δ⟩ in which Q = Q^1 ∪ Q^2 \ {q_X^1} and δ = δ^1[q_X^1 ← q_0^2] ∪ δ^2. We call q_0^2 the join state of this operation.

If we additionally merge the exit state of p^2 with the initial state of p^1, we obtain a circular pattern p which we call the circular composition of p^1 and p^2. This composition satisfies L_p = {L_{p^1} · L_{p^2}}*.
Definition 8 (Circular Composition). Let p^1, p^2 be two non-circular patterns. Their circular composite is the circular pattern p^1 ◦_c p^2 = ⟨Σ, q_0^1, Q, q_0^1, δ⟩ in which Q = Q^1 ∪ Q^2 \ {q_X^1, q_X^2} and δ = δ^1[q_X^1 ← q_0^2] ∪ δ^2[q_X^2 ← q_0^1]. We call q_0^2 the join state of this operation.

Figure 2 shows 3 examples of serial and circular compositions of patterns. Patterns do not carry information about whether or not they have been composed from other patterns. We maintain such information using pattern pairs.

Definition 9 (Pattern Pair). A pattern pair is a pair ⟨P, P_c⟩ of pattern sets, such that P_c ⊂ P and for every p ∈ P_c there exists exactly one pair p^1, p^2 ∈ P satisfying p = p^1 ⊙ p^2 for some ⊙ ∈ {◦, ◦_c}. We refer to the patterns p ∈ P_c as the composite patterns of ⟨P, P_c⟩, and to the rest as its base patterns.

Fig. 2. Examples of the composition operator: (i) serial ◦, (ii) cyclic ◦_c, (iii) serial ◦. (Legend: initial state, exit state, join state.)
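A sketch of Definitions 5, 7 and 8 in code (ours): a pattern is represented as a tuple (q0, Q, qX, delta) with delta as a dict, and, per the paper's convention, the two composed patterns are assumed to have disjoint state sets.

```python
def replace_state(delta, q, qn):
    """delta[q <- qn] (Definition 5): redirect all transitions into or out of q."""
    sub = lambda s: qn if s == q else s
    return {(sub(src), tok): sub(tgt) for (src, tok), tgt in delta.items()}

def serial(p1, p2):
    """p1 ∘ p2 (Definition 7): merge p1's exit state with p2's initial state,
    which becomes the join state of the composition."""
    q01, Q1, qX1, d1 = p1
    q02, Q2, qX2, d2 = p2
    return (q01, (Q1 - {qX1}) | Q2, qX2,
            {**replace_state(d1, qX1, q02), **d2})

def circular(p1, p2):
    """p1 ∘_c p2 (Definition 8): additionally merge p2's exit state with p1's
    initial state; the result is circular (initial state == exit state)."""
    q01, Q1, qX1, d1 = p1
    q02, Q2, qX2, d2 = p2
    return (q01, (Q1 - {qX1}) | (Q2 - {qX2}), q01,
            {**replace_state(d1, qX1, q02), **replace_state(d2, qX2, q01)})

# e.g., with p_a = ("a0", {"a0", "a1"}, "a1", {("a0", "a"): "a1"}) and a similar
# p_b, serial(p_a, p_b) accepts exactly "ab", and circular(p_a, p_b) accepts ("ab")*.
```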
Every instance ˆp of a pattern p in a DFA A is uniquely defined by p, A, and ˆp's initial state in A. If p is a composite pattern with respect to some pattern pair ⟨P, P_c⟩, the join state of its composition within A is also uniquely defined.
Definition 10 (Pattern Instances). Let A = ⟨Σ, q_0^A, Q^A, F, δ^A⟩ be a DFA, p = ⟨Σ, q_0, Q, q_X, δ⟩ be a pattern, and ˆp = ⟨Σ, q_0′, Q′, q_X′, δ′⟩ be a pattern ‘inside’ A, i.e., Q′ ⊆ Q^A and δ′ ⊆ δ^A. We say that ˆp is an instance of p in A if ˆp is isomorphic to p.

A pattern instance ˆp in a DFA A is uniquely determined by its structure and initial state: (p, q).
Definition 11. For every pattern pair ⟨P, P_c⟩ we define the function join as follows: for each composite pattern p ∈ P_c, DFA A, and initial state q of an instance ˆp of p in A, join(p, q, A) returns the join state of ˆp with respect to its composition in ⟨P, P_c⟩.

4 Pattern Rule Sets

For any infinite sequence S = A_1, A_2, ... of DFAs satisfying L(A_i) ⊂ L(A_{i+1}) for all i, we define the language of S as the union of the languages of all these DFAs: L(S) = ∪_i L(A_i). Such sequences may be used to express CFLs such as the language L = {a^n b^n | n ∈ N} and the Dyck language of order N.

In this work we take a finite sequence A_1, A_2, ..., A_n of DFAs, and assume it is a (possibly noisy) finite prefix of an infinite sequence of approximations for a language, as above. We attempt to reconstruct the language by guessing how the sequence may continue. To allow such generalization, we must make assumptions about how the sequence is generated. For this we introduce pattern rule sets.

Pattern rule sets (PRSs) create sequences of DFAs with a single accepting state. Each PRS is built around a pattern pair ⟨P, P_c⟩, and each rule application involves the connection of a new pattern instance to the current DFA A_i, at the join state of a composite pattern inserted whole at some earlier point in the DFA's creation. In order to define where a pattern can be inserted into a DFA, we introduce an enabled instance set I.

Definition 12 (Enabled DFA). An enabled DFA over a pattern pair ⟨P, P_c⟩ is a tuple ⟨A, I⟩ such that A = ⟨Σ, q_0, Q, F, δ⟩ is a DFA and I ⊆ P_c × Q marks enabled instances of composite patterns in A.

Intuitively, for every enabled DFA ⟨A, I⟩ and (p, q) ∈ I, we know: (i) there is an instance of pattern p in A starting at state q, and (ii) this instance is enabled; i.e., we may connect new pattern instances to its join state join(p, q, A).

We now formally define pattern rule sets and how they are applied to create enabled DFAs, and so sequences of DFAs.
Definition 13 (Pattern Rule Set). A PRS P is a tuple ⟨Σ, P, P_c, R⟩ where ⟨P, P_c⟩ is a pattern pair over the alphabet Σ and R is a set of rules. Each rule has one of the following forms, for some p, p^1, p^2, p^3, p^I ∈ P, with p^1 and p^2 non-circular:
(1) ⊥ ↠ p^I
(2) p ↠_c (p^1 ⊙ p^2) ◦= p^3, where p = p^1 ⊙ p^2 for ⊙ ∈ {◦, ◦_c}, and p^3 is circular
(3) p ↠_s (p^1 ◦ p^2) ◦= p^3, where p = p^1 ◦ p^2 and p^3 is non-circular

A PRS is used to derive sequences of enabled DFAs as follows: first, a rule of type (1) is used to create an initial enabled DFA D_1 = ⟨A_1, I_1⟩. Then, for any ⟨A_i, I_i⟩, each of the rule types defines options to graft new pattern instances onto states in A_i, with I_i determining which states are eligible to be expanded in this way. The first DFA is simply the p^I from a rule of type (1). If p^I is composite, then it is also enabled.

Definition 14 (Initial Composition). D_1 = ⟨A_1, I_1⟩ is generated from a rule ⊥ ↠ p^I as follows: A_1 = A_{p^I}, and I_1 = {(p^I, q_0^I)} if p^I ∈ P_c and otherwise I_1 = ∅.

Let D_i = ⟨A_i, I_i⟩ be an enabled DFA generated from some given PRS P = ⟨Σ, P, P_c, R⟩, and denote A_i = ⟨Σ, q_0, Q, F, δ⟩. Note that for A_1, |F| = 1, and we will see that F is unchanged by all further rule applications. Hence we denote F = {q_f} for all A_i.

Rules of type (1) extend A_i by grafting a circular pattern to q_0, and then enabling that pattern if it is composite.

Definition 15 (Rules of type (1)). A rule ⊥ ↠ p^I with circular p^I may extend ⟨A_i, I_i⟩ at the initial state q_0 of A_i iff def(q_0) ∩ def(q_0^I) = ∅. This creates the DFA A_{i+1} = ⟨Σ, q_0, Q ∪ Q^I \ {q_0^I}, F, δ ∪ δ^I[q_0^I ← q_0]⟩. If p^I ∈ P_c then I_{i+1} = I_i ∪ {(p^I, q_0)}, else I_{i+1} = I_i.

Rules of type (2) graft a circular pattern p^3 = ⟨Σ, q_0^3, Q^3, q_0^3, δ^3⟩ onto the join state q_j of an enabled pattern instance ˆp in A_i, by merging q_0^3 with q_j. In doing so, they also enable the patterns composing ˆp, provided they themselves are composite patterns.

Fig. 3. Structure of the DFA after applying a rule of type (2) or type (3): (i) p^1 ◦ p^2 ↠_c (p^1 ◦ p^2) ◦= p^3, (ii) p^1 ◦_c p^2 ↠_c (p^1 ◦_c p^2) ◦= p^3, (iii) and (iv) p^1 ◦ p^2 ↠_s (p^1 ◦ p^2) ◦= p^3. (Legend: initial state, exit state, join state; new transitions are those added to the successor DFA.)
Definition 16 (Rules of type (2)). A rule p ↠_c (p^1 ⊙ p^2) ◦= p^3 may extend ⟨A_i, I_i⟩ at the join state q_j = join(p, q, A_i) of any instance (p, q) ∈ I_i, provided def(q_j) ∩ def(q_0^3) = ∅. This creates ⟨A_{i+1}, I_{i+1}⟩ as follows: A_{i+1} = ⟨Σ, q_0, Q ∪ Q^3 \ {q_0^3}, F, δ ∪ δ^3[q_0^3 ← q_j]⟩, and I_{i+1} = I_i ∪ {(p^k, q^k) | p^k ∈ P_c, k ∈ {1, 2, 3}}, where q^1 = q and q^2 = q^3 = q_j.

For an application of r = p ↠_c (p^1 ⊙ p^2) ◦= p^3, consider the languages L_L and L_R leading into and ‘back from’ the considered instance (p, q): L_L = L(A_i, q_0, q) and L_R = L(A_i, q_X^{(p,q)}, q_f), where q_X^{(p,q)} is the exit state of (p, q). Since L_L · L_p · L_R ⊆ L(A_i), we now also have L_L · L_{p^1} · L_{p^3} · L_{p^2} · L_R ⊆ L(A_{i+1}) (and moreover, L_L · (L_{p^1} · L_{p^3} · L_{p^2})* · L_R ⊆ L(A_{i+1}) if p is circular). Example applications of rule (2) are shown in Figures 3(i) and 3(ii).

For non-circular patterns we also wish to insert an optional L_{p^3} between L_{p^1} and L_{p^2}, but this time we must avoid connecting the exit state q_X^3 to q_j lest we loop over p^3 multiple times. We therefore duplicate the outgoing transitions of q_j in p^1 ◦ p^2 to the inserted state q_X^3 so that they may act as the connections back into the main DFA.

Definition 17 (Rules of type (3)). A rule p ↠_s (p^1 ◦ p^2) ◦= p^3 may extend ⟨A_i, I_i⟩ at the join state q_j = join(p, q, A_i) of any instance (p, q) ∈ I_i, provided def(q_j) ∩ def(q_0^3) = ∅. This creates ⟨A_{i+1}, I_{i+1}⟩ as follows: A_{i+1} = ⟨Σ, q_0, Q ∪ Q^3 \ {q_0^3}, F, δ ∪ δ^3[q_0^3 ← q_j] ∪ C⟩ where C = {(q_X^3, σ, δ(q_j, σ)) | σ ∈ def(p^2, q_0^2)}, and I_{i+1} = I_i ∪ {(p^k, q^k) | p^k ∈ P_c, k ∈ {1, 2, 3}} where q^1 = q and q^2 = q^3 = q_j.

We call the set C connecting transitions. An application of this rule is depicted in Diagram (iii) of Figure 3, where the transition labeled ‘c’ in this diagram is a member of C from our definition.

Multiple applications of rules of type (3) to the same instance ˆp will create several equivalent states in the resulting DFAs, as all of their exit states will have the same connecting transitions. These states are merged in a minimized representation, as depicted in Diagram (iv) of Figure 3.
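The grafting operations of Definitions 16 and 17 can be sketched as follows (our illustration, reusing replace_state and the DFA sketch above; preconditions such as the def-disjointness check are left to the caller):

```python
def graft_circular(dfa, qj, p3):
    """Rule type (2), Definition 16: merge the circular pattern p3's initial
    state into the join state qj of an enabled composite instance."""
    q03, Q3, qX3, d3 = p3               # circular pattern: qX3 == q03
    dfa.states |= Q3 - {q03}
    dfa.delta.update(replace_state(d3, q03, qj))

def graft_serial(dfa, qj, p3, p2_tokens):
    """Rule type (3), Definition 17: p3 is non-circular, so its exit state qX3
    survives, and receives the connecting transitions C: copies of qj's
    outgoing transitions on p2's tokens, leading back into the main DFA."""
    q03, Q3, qX3, d3 = p3
    # C is computed from the DFA's transitions *before* p3 is grafted in.
    connecting = {(qX3, tok): dfa.delta[(qj, tok)] for tok in p2_tokens}
    dfa.states |= Q3 - {q03}
    dfa.delta.update(replace_state(d3, q03, qj))
    dfa.delta.update(connecting)
```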
We now formally define the language defined by a PRS. This is the language that we will assume a given finite sequence of DFAs is trying to approximate.

Definition 18 (DFAs Generated by a PRS). We say that a PRS P generates a DFA A, denoted A ∈ G(P), if there exists a finite sequence of enabled DFAs ⟨A_1, I_1⟩, ..., ⟨A_i, I_i⟩ obtained only by applying rules from P, for which A = A_i.
Definition 19 (Language of a PRS). The language of a PRS P is the union of the languages of the DFAs it can generate: L(P) = ∪_{A ∈ G(P)} L(A).

4.1 Examples

EXAMPLE 1:
Let p^a and p^b be the patterns accepting ‘a’ and ‘b’ respectively. Consider the rule set R_ab with two rules, ⊥ ↠ p^a ◦ p^b and p^a ◦ p^b ↠_s (p^a ◦ p^b) ◦= (p^a ◦ p^b). This rule set creates only one sequence of DFAs. Once the first rule creates the initial DFA, by continuously applying the second rule, we obtain the infinite sequence of DFAs each satisfying L(A_i) = {a^j b^j : 1 ≤ j ≤ i}, and so L(R_ab) = {a^i b^i : i > 0}. Figure 2(i) presents A_1, while A_2 and A_3 appear in Figure 4(i). Note that we can substitute any non-circular patterns for p^a and p^b, creating the language {x^i y^i : i > 0} for any pair of non-circular pattern regular expressions x and y.
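As a concrete check of Example 1, the following (our own construction, using the DFA sketch from Section 2) builds A_n directly, mirroring n − 1 applications of the second rule of R_ab, and verifies L(A_3) on small strings:

```python
def a_n_b_n_dfa(n):
    """A_n of Example 1: accepts { a^j b^j : 1 <= j <= n }."""
    delta = {}
    for j in range(n):                       # read up to n a's
        delta[(("a", j), "a")] = ("a", j + 1)
    for j in range(1, n + 1):                # first b after j a's
        delta[(("a", j), "b")] = ("b", j - 1)
    for k in range(1, n):                    # remaining b's count down to 0
        delta[(("b", k), "b")] = ("b", k - 1)
    states = {("a", j) for j in range(n + 1)} | {("b", k) for k in range(n)}
    return DFA({"a", "b"}, ("a", 0), states, {("b", 0)}, delta)

A3 = a_n_b_n_dfa(3)
assert all(A3.accepts("a" * j + "b" * j) for j in (1, 2, 3))
assert not any(A3.accepts(w) for w in ("", "abab", "aaaabbbb", "aab"))
```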
Fig. 4. DFA sequences for R_ab and R_Dyck

EXAMPLE 2:
Let p^(, p^), p^[, and p^] be the non-circular patterns accepting ‘(’, ‘)’, ‘[’, and ‘]’ respectively. Let p^1 = p^( ◦_c p^) and p^2 = p^[ ◦_c p^]. Let R_Dyck be the PRS containing rules ⊥ ↠ p^1, ⊥ ↠ p^2, p^1 ↠_c (p^( ◦_c p^)) ◦= p^1, p^1 ↠_c (p^( ◦_c p^)) ◦= p^2, p^2 ↠_c (p^[ ◦_c p^]) ◦= p^1, and p^2 ↠_c (p^[ ◦_c p^]) ◦= p^2. R_Dyck defines the Dyck language of order 2. Figure 4(ii) shows one of its possible DFA sequences.

EXAMPLE 3:
Let p^0 and p^1 be the patterns that accept the characters “0” and “1” respectively, p^00 = p^0 ◦ p^0 and p^11 = p^1 ◦ p^1. Let R_pal consist of the rules ⊥ ↠ p^00, ⊥ ↠ p^11, p^00 ↠_s (p^0 ◦ p^0) ◦= p^00, p^00 ↠_s (p^0 ◦ p^0) ◦= p^11, p^11 ↠_s (p^1 ◦ p^1) ◦= p^00, and p^11 ↠_s (p^1 ◦ p^1) ◦= p^11. L(R_pal) is exactly the language of even-length palindromes over the alphabet {0, 1}.
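To see R_pal in action, this small sketch (ours) unrolls its rules: every further rule application inserts ‘00’ or ‘11’ at the join state, i.e., the center of the word, and only even-length palindromes arise. Conversely, peeling an even-length palindrome two characters at a time from its center shows every such word is eventually reached.

```python
def unroll_r_pal(rounds):
    """Words accepted after up to `rounds` center insertions of '00' or '11'."""
    words = {"00", "11"}                 # the two rules of type (1)
    frontier = set(words)
    for _ in range(rounds):
        frontier = {w[:len(w) // 2] + ins + w[len(w) // 2:]
                    for w in frontier for ins in ("00", "11")}
        words |= frontier
    return words

assert all(w == w[::-1] and len(w) % 2 == 0 for w in unroll_r_pal(3))
```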
Note. Consider a DFA A accepting (among others) the palindrome s = 01100110, derived from R_pal. If we were to consider A without the context of its enabled pattern instances I, we could apply p^00 ↠_s (p^0 ◦ p^0) ◦= p^11 to the ‘first’ instance of p^00 in A, creating a DFA accepting the string 0111100110, which is not a palindrome. This illustrates the importance of the notion of enabled patterns in our framework.

5 Recovering a PRS from a Sequence of DFAs

We have shown how a PRS can generate a sequence of DFAs that can define, in the limit, a non-regular language. However, we are interested in the dual problem: given a sequence of DFAs generated by a PRS P, can we reconstruct P? Coupled with an L* extraction of DFAs from a trained RNN, solving this problem will enable us to extract a PRS language from an RNN, provided the L* extraction also follows a PRS pattern (as we often find it does).

We present an algorithm for this problem, and show its correctness in Section 5.1. We note that in practice the DFAs we are given are not “perfect”; they contain noise that deviates from the PRS. We therefore augment this algorithm in Section 5.2, allowing it to operate smoothly even on imperfect DFA sequences created from RNN extraction.

In the following, for each pattern instance ˆp in A_i, we denote by p the pattern that it is an instance of. Additionally, for each consecutive DFA pair A_i and A_{i+1}, we refer by ˆp^3 to the new pattern instance in A_{i+1}.

Main steps of inference algorithm.
Given a sequence of DFAs A_1 · · · A_n, the algorithm infers P = ⟨Σ, P, P_c, R⟩ in the following stages:
1. Discover the initial pattern instance ˆp^I in A_1. Insert p^I into P and mark ˆp^I as enabled. Insert the rule ⊥ ↠ p^I into R.
2. For each i, 1 ≤ i ≤ n − 1:
(a) Find the new pattern instance ˆp^3 in A_{i+1} that extends A_i.
(b) If ˆp^3 starts at the initial state q_0 of A_{i+1}, then it is an application of a rule of type (1). Insert p^3 into P and mark ˆp^3 as enabled, and add the rule ⊥ ↠ p^3 to R.
(c) Otherwise (ˆp^3 does not start at q_0), find the unique enabled pattern ˆp = ˆp^1 ⊙ ˆp^2 in A_i s.t. ˆp^3's initial state q_j is the join state of ˆp. Add p^1, p^2, and p^3 to P and p to P_c, and mark ˆp^1, ˆp^2, and ˆp^3 as enabled. If ˆp^3 is non-circular add the rule p ↠_s (p^1 ◦ p^2) ◦= p^3 to R, otherwise add the rule p ↠_c (p^1 ⊙ p^2) ◦= p^3 to R.
3. Define Σ to be the set of symbols used by the patterns P.

Once we know the newly created pattern ˆp^I or ˆp^3 (step 1 or 2a) and the pattern ˆp that it is grafted onto (step 2c), creating the rule is straightforward. We elaborate below on how the algorithm accurately finds these patterns.

Discovering new patterns ˆp^I and ˆp^3. The first pattern p^I is easily discovered; it is A_1, the first DFA. To find those patterns added in subsequent DFAs, we need to isolate the pattern added between A_i and A_{i+1}, by identifying which states in A_{i+1} = ⟨Σ, q_0′, Q′, F′, δ′⟩ are ‘new’ relative to A_i = ⟨Σ, q_0, Q, F, δ⟩. From the PRS definitions, we know that there is a subset of states and transitions in A_{i+1} that is isomorphic to A_i:

Definition 20 (Existing States and Transitions). For every q′ ∈ Q′, we say that q′ exists in A_i, with parallel state q ∈ Q, iff there exists a sequence w ∈ Σ* such that q = δ̂(q_0, w), q′ = δ̂′(q_0′, w), and neither is a sink reject state. Additionally, for every q_1′, q_2′ ∈ Q′ with parallel states q_1, q_2 ∈ Q, we say that (q_1′, σ, q_2′) ∈ δ′ exists in A_i if (q_1, σ, q_2) ∈ δ.

We refer to the states and transitions in A_{i+1} that do not exist in A_i as the new states and transitions of A_{i+1}, denoting them Q_N ⊆ Q′ and δ_N ⊆ δ′ respectively. By construction of PRSs, each state in A_{i+1} has at most one parallel state in A_i, and marking A_{i+1}'s existing states can be done in one simultaneous traversal of the two DFAs, using any exploration that covers all the states of A_i.

The new states are a new pattern instance ˆp^3 in A_{i+1}, excluding its initial and possibly its exit state. The initial state of ˆp^3 is the existing state q_s′ ∈ Q′ \ Q_N that has outgoing new transitions. The exit state q_X′ of ˆp^3 is identified by the following Exit State Discovery algorithm:
1. If q_s′ has incoming new transitions, then ˆp^3 is circular: q_X′ = q_s′ (Fig. 3(i),(ii)).
2. Otherwise ˆp^3 is non-circular. If ˆp^3 is the first (with respect to the DFA sequence) non-circular pattern to have been grafted onto q_s′, then q_X′ is the unique new state whose transitions into A_{i+1} are the connecting transitions from Definition 17 (Fig. 3(iii)).
3. If there is no such state then ˆp^3 is not the first non-circular pattern grafted onto q_s′.
In this case, q_X′ is the unique existing state q_X′ ≠ q_s′ with new incoming transitions but no new outgoing transitions (Fig. 3(iv)).

Finally, the new pattern instance is ˆp^3 = ⟨Σ, q_s′, Q_p, q_X′, δ_p⟩, where Q_p = Q_N ∪ {q_s′, q_X′} and δ_p is the restriction of δ_N to the states of Q_p.

Discovering the pattern ˆp. Once we have found the pattern ˆp^3 in step 2a, we need to find the pattern ˆp to which it has been grafted. We begin with some observations:
1. The join state of a composite pattern is always different from its initial and exit states (edge states): we cannot compose circular patterns, and there are no ‘null’ patterns.
2. For every two enabled pattern instances ˆp, ˆp′ ∈ I_i, ˆp ≠ ˆp′, exactly 2 options are possible: either (a) every state they share is an edge state of at least one of them, or (b) one (ˆp_s) is contained entirely in the other (ˆp_c), and the containing pattern ˆp_c is a composite pattern with join state q_j such that q_j is either one of ˆp_s's edge states, or q_j is not in ˆp_s at all.

Together, these observations imply that no two enabled pattern instances in a DFA can share a join state. We prove the second observation in Appendix A.

Finding the pattern ˆp onto which ˆp^3 has been grafted is now straightforward. Denoting q_j as the parallel of ˆp^3's initial state in A_i, we seek the enabled composite pattern instance (p, q) ∈ I_i for which join(p, q, A_i) = q_j. If none is present, we seek the only enabled instance (p, q) ∈ I_i that contains q_j as a non-edge state, but is not yet marked as a composite. (Note that if two enabled instances share a non-edge state, we must already know that the containing one is a composite, otherwise we would not have found and enabled the other.)
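The discovery step can be sketched in code (ours, on the DFA sketch from Section 2): a simultaneous traversal implements Definition 20, after which the pattern head, the existing state with outgoing new transitions, is read off directly.

```python
from collections import deque

def mark_new(A_i, A_next):
    """Definition 20: pair each state of A_{i+1} with its parallel state in A_i,
    then return the new states and transitions of A_{i+1} (sink reject states
    never appear, since the incomplete representation omits them)."""
    parallel = {A_next.q0: A_i.q0}
    queue = deque([(A_next.q0, A_i.q0)])
    while queue:
        qn, q = queue.popleft()
        for tok in A_i.sigma:
            t, tn = A_i.delta.get((q, tok)), A_next.delta.get((qn, tok))
            if t is not None and tn is not None and tn not in parallel:
                parallel[tn] = t
                queue.append((tn, t))
    new_states = A_next.states - set(parallel)
    new_trans = {(src, tok): tgt for (src, tok), tgt in A_next.delta.items()
                 if src in new_states or tgt in new_states
                 or (parallel[src], tok) not in A_i.delta
                 or A_i.delta[(parallel[src], tok)] != parallel[tgt]}
    heads = {src for (src, tok) in new_trans if src not in new_states}
    return parallel, new_states, new_trans, heads  # |heads| == 1 per single rule
```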
5.1 Correctness of the Inference Algorithm

A PRS P = ⟨Σ, P, P_c, R⟩ is a minimal generator (MG) of a sequence of DFAs S = A_1, A_2, ..., A_n iff it is sufficient and necessary for that sequence, i.e.: 1. it generates S, 2. removing any rule r ∈ R would render P insufficient for generating S, and 3. removing any element from Σ, P, P_c would make P no longer a PRS.

Lemma 1. Given a finite sequence of DFAs, the minimal generator of that sequence, if it exists, is unique.
Theorem 1. Let A_1, A_2, ..., A_n be a finite sequence of DFAs that has a minimal generator P. Then the PRS Inference Algorithm will discover P.
Given a sequence of DFAs generated by the rules of PRS P , the inference algorithmgiven above will faithfully infer P (Section 5.1). In practice however, we will wantto apply the algorithm to a sequence of DFAs extracted from a trained RNNusing the L ∗ algorithm (as in [27]). Such a sequence may contain noise: artifactsfrom an imperfectly trained RNN, or from the behavior of L ∗ (which does notnecessarily create PRS-like sequences). The major deviations are incorrect patterncreation, simultaneous rule applications, and slow initiation. ynthesizing Context-free Grammars from Recurrent Neural Networks 13 Incorrect pattern creation
Incorrect pattern creation. Either due to inaccuracies in the RNN classification, or as artifacts of the L* process, incorrect patterns are often inserted into the DFA sequence. Fortunately, the incorrect patterns that get inserted are somewhat random and so rarely repeat, and we can discern between the ‘legitimate’ and ‘noisy’ patterns being added to the DFAs using a voting and threshold scheme.

The vote for each discovered pattern p ∈ P is the number of times it has been inserted as the new pattern between a pair of DFAs A_i, A_{i+1} in S. We set a threshold for the minimum vote a pattern needs to be considered valid, and only build rules around the connection of valid patterns onto the join states of other valid patterns. To do this, we modify the flow of the algorithm: before discovering rules, we first filter incorrect patterns. We modify step 2 of the algorithm, splitting it into two phases:
Phase 1: Mark the inserted patterns between each pair of DFAs, and compute their votes. Add to P those whose vote is above the threshold.
Phase 2: Consider each DFA pair A_i, A_{i+1} in order. If the new pattern in A_{i+1} is valid, and its initial state's parallel state in A_i also lies in a valid pattern, then synthesize the rule adding that pattern according to the original algorithm in Section 5. Whenever a pattern is discovered to be composite, we add its composing patterns as valid patterns to P.

A major obstacle to our research was producing a high quality sequence of DFAs faithful to the target language, as almost every sequence produced has some noise. The voting scheme greatly extended the reach of our algorithm.
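A sketch of the voting scheme (ours; it assumes discovered patterns are keyed by some canonical form, so that isomorphic instances compare equal):

```python
from collections import Counter

def valid_patterns(inserted_patterns, threshold=2):
    """Phase 1: 'inserted_patterns' holds, for each DFA pair (A_i, A_{i+1}),
    the canonical form of the new pattern discovered between them. Patterns
    with fewer than `threshold` votes are considered noise and dropped
    (threshold=2 is the value used in the paper's experiments)."""
    votes = Counter(inserted_patterns)
    return {p for p, v in votes.items() if v >= threshold}
```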
Simultaneous rule applications. In the theoretical framework, A_{i+1} differs from A_i by applying a single PRS rule, and therefore q_s′ and q_X′ are uniquely defined. L* however does not guarantee such minimal increments between DFAs. In particular, it may apply multiple PRS rules between two subsequent DFAs, extending A_i with several patterns. To handle this, we expand the initial and exit state discovery methods given in Section 5:
1. Mark the new states and transitions Q_N and δ_N as before.
2. Identify the set of new pattern instance initial states (pattern heads): the set H ⊆ Q′ \ Q_N of states in A_{i+1} with outgoing new transitions.
3. For each pattern head q′ ∈ H, compute the relevant sets δ_N|q′ ⊆ δ_N and Q_N|q′ ⊆ Q_N of new transitions and states: the members of δ_N and Q_N that are reachable from q′ without passing through any existing transitions.
4. For each q′ ∈ H, restrict to Q_N|q′ and δ_N|q′ and compute q_X′ and ˆp^3 as before.

If A_{i+1}'s new patterns have no overlap and do not create an ambiguity around join states (e.g., do not both connect into instances of a single pattern whose join state has not yet been determined), then they may be handled independently and in arbitrary order. They are used to discover rules and then enabled, as in the original algorithm.

Simultaneous but dependent rule applications – such as inserting a pattern and then grafting another onto its join state – are more difficult to handle, as it is not always possible to determine which pattern was grafted onto which. However, there is a special case which appeared in several of our experiments (languages L13 and L14 of Section 7) for which we developed a technique as follows. Suppose we discover a rule r: p ↠_s (p^l ◦ p^r) ◦= p^3, and p^3 contains a cycle c around some internal state q_j. If later another rule inserts a pattern p^n at the state q_j, we understand that p^3 is in fact a composite pattern, with p^3 = p^1 ◦ p^2 and join state q_j. However, as patterns do not contain cycles at their edge states, c cannot be a part of either p^1 or p^2. We conclude that the addition of p^3 was in fact a simultaneous application of two rules: r′: p ↠_s (p^l ◦ p^r) ◦= p′ and r″: p′ ↠_c (p^1 ◦ p^2) ◦= c, where p′ is p^3 without the cycle c, and we update our PRS and our DFAs' enabled pattern instances accordingly. The case when p^3 is circular is handled similarly.
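Steps 2–3 above, partitioning the new material among the pattern heads, can be sketched as follows (ours, continuing the mark_new sketch):

```python
def restrict_to_head(head, new_states, new_trans):
    """Collect the new states and transitions reachable from `head` without
    passing through any existing transition (steps 2-3 of the expansion)."""
    reach, trans, frontier = set(), {}, [head]
    while frontier:
        q = frontier.pop()
        for (src, tok), tgt in new_trans.items():
            if src == q and (src, tok) not in trans:
                trans[(src, tok)] = tgt
                if tgt in new_states and tgt not in reach:
                    reach.add(tgt)
                    frontier.append(tgt)
    return reach, trans   # exit-state discovery then runs on this restriction
```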
Slow initiation. Ideally, A_1 would directly supply an initial rule ⊥ ↠ p^I to our PRS. In practice, we found that the first couple of DFAs generated by L* – which deal with extremely short sequences – have completely incorrect structure, and it takes the algorithm some time to stabilise. Ultimately we solve this by leaving discovery of the initial rules to the end of the algorithm, at which point we have a set of ‘valid’ patterns that we are sure are part of the PRS. From there we examine the last DFA A_n generated in the sequence, note all the enabled instances (p^I, q_0) at its initial state, and generate a rule ⊥ ↠ p^I for each of them. Note however that this technique will not recognise patterns p^I that do not also appear as an extending pattern p^3 elsewhere in the sequence (and therefore do not meet the threshold).

6 Converting a PRS to a CFG

We present an algorithm to convert a given PRS to a context free grammar (CFG), making the rules extracted by our algorithm more accessible.
A restriction: Let P = ⟨Σ, P, P_c, R⟩ be a PRS. For simplicity, we restrict the PRS so that every pattern p can only appear on the LHS of rules of type (2) or only on the LHS of rules of type (3), but cannot appear on the LHS of both types of rules. Similarly, we assume that for each rule ⊥ ↠ p^I, the RHS patterns p^I are all circular or all non-circular. (This restriction is natural: Dyck grammars and all of the examples in Sections 4.1 and 7.3 conform to it.) In Appendix C.1 we show how to create a CFG without this restriction.

We will create a CFG G = ⟨Σ, N, S, Prod⟩, where Σ, N, S, and Prod are the terminals (alphabet), non-terminals, start symbol and productions of the grammar. Σ is the same alphabet as that of P, and we take S as a special start symbol. We now describe how we obtain N and Prod.

For every pattern p ∈ P, let G_p = ⟨Σ_p, N_p, Z_p, Prod_p⟩ be a CFG describing L(p). Recall that P_c are the composite patterns. Let P_Y ⊆ P_c be those patterns that appear on the LHS of a rule of type (2) (↠_c). Create the non-terminal C_S and for each p ∈ P_Y, create an additional non-terminal C_p. We set N = {S, C_S} ∪ ⋃_{p∈P} N_p ∪ ⋃_{p∈P_Y} {C_p}.

Let ⊥ ↠ p^I be a rule in P. If p^I is non-circular, create a production S ::= Z_{p^I}. If p^I is circular, create the productions S ::= C_S, C_S ::= C_S C_S and C_S ::= Z_{p^I}. For each rule p ↠_s (p^1 ◦ p^2) ◦= p^3 create a production Z_p ::= Z_{p^1} Z_{p^3} Z_{p^2}. For each rule p ↠_c (p^1 ◦ p^2) ◦= p^3 create the productions Z_p ::= Z_{p^1} C_p Z_{p^2}, C_p ::= C_p C_p, and C_p ::= Z_{p^3}. Let Prod′ be all the productions defined by the above process. We set
Prod = ⋃_{p∈P} Prod_p ∪ Prod′.
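As a worked example (ours), consider R_ab from Example 1: the patterns are p^a, p^b and the composite p = p^a ◦ p^b, and the rules are ⊥ ↠ p and p ↠_s (p^a ◦ p^b) ◦= p. No rule of type (2) appears, so P_Y = ∅ and no C_p non-terminals arise. Taking Z_{p^a} ::= a, Z_{p^b} ::= b and Z_p ::= a b as the grammars G_{p^a}, G_{p^b} and G_p, the construction adds S ::= Z_p (from ⊥ ↠ p, with p non-circular) and Z_p ::= Z_{p^a} Z_p Z_{p^b} (from the rule of type (3)). The resulting grammar

S ::= Z_p
Z_p ::= a b | Z_{p^a} Z_p Z_{p^b}
Z_{p^a} ::= a
Z_{p^b} ::= b

generates exactly {a^n b^n : n > 0} = L(R_ab).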
Theorem 2. Let G be the CFG constructed from P by the procedure given above. Then L(P) = L(G).

The proof is given in Appendix C.
The class of languages expressible by a PRS. Every RE-Dyck language (Section 2.2) can be expressed by a PRS. But the converse is not true; an RE-Dyck language requires that any delimiter pair can be embedded in any other delimiter pair, while a PRS grammar provides more control over which delimiters can be embedded in which other delimiters. For instance, the language L12 of Section 7.3 contains 2 pairs of delimiters and only includes strings in which the first delimiter pair is embedded in the second delimiter pair and vice versa. L12 is expressible by a PRS but is not a Dyck language. Hence the class of PRS languages is more expressive than the Dyck languages and is contained in the class of CFLs. But not every CFL can be expressed by a PRS. See Appendix C.3.
Succinctness
The construction above does not necessarily yield a minimal CFG G equivalent to P. For a PRS defining the Dyck language of order 2 – which can be expressed by a CFG with 4 productions and one non-terminal – our construction yields a CFG with 10 non-terminals and 12 productions. In general, the extra productions can be necessary to provide more control over what delimiter pairs can be nested in other delimiter pairs, as described above. However, when these productions are not necessary, we can often post-process the generated CFG to remove unnecessary productions. See Appendix C.2 for the CFGs generated for the Dyck language of order 2 and for the language of alternating delimiters.

7 Experimental Results

We test the algorithm on several PRS-expressible context free languages, attempting to extract them from trained RNNs using the process outlined in Figure 1. For each language, we create a probabilistic CFG generating it, train an RNN on samples from this grammar, extract a sequence of DFAs from the RNN, and apply our PRS inference algorithm. Finally, we convert the extracted PRS back to a CFG, and compare it to our target CFG.

In all of our experiments, we use a vote-threshold such that patterns with fewer than 2 votes are not used to form any PRS rules (Section 5.2). Using no threshold significantly degraded the results by including too much noise, while higher thresholds often caused us to overlook correct patterns and rules.

We obtain a sequence of DFAs for a given CFG using only positive samples [11,1] by training a language-model RNN (LM-RNN) on these samples and then extracting DFAs from it with the aid of the L* algorithm [2], as described in [27]. To apply L* we must treat the LM-RNN as a binary classifier. We set an ‘acceptance threshold’ t and define the RNN's language as the set of sequences s satisfying: 1. the RNN's probability for an end-of-sequence token after s is greater than t, and 2. at no point during s does the RNN pass through a token with probability < t. This is identical to the concept of locally t-truncated support defined in [13]. (Using the LM-RNN's probability for the entire sequence has the flaw that this decreases for longer sequences.)

To create the samples for the RNNs, we write a weighted version of the CFG, in which each non-terminal is given a probability over its rules. We then take N samples from the weighted CFG according to its distribution, split them into train and validation sets, and train an RNN on the train set until the validation loss stops improving. In our experiments, we used N = 10,000.

Sometimes, neither L* nor the RNN abstraction considers long sequences, and equivalence is reached between the L* hypothesis and the RNN abstraction despite neither being equivalent to the ‘true’ language of the RNN. In these cases we push the extraction a little further using two methods: first, if the RNN abstraction contains only a single state, we make an arbitrary initial refinement by splitting 10 hidden dimensions, and restart the extraction. If this is also not enough, we sample the RNN according to its distribution, in the hope of finding a counterexample to return to L*. The latter approach is not ideal: sampling the RNN may return very long sequences, effectively increasing the next DFA by many rule applications. In other cases, the extraction is long, and slows down as the extracted DFAs grow. We place a time limit of 1,000 seconds (∼17 minutes) on the extraction.
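The acceptance-threshold check can be sketched as follows (ours; `next_distribution` is a hypothetical LM interface returning the conditional next-token distribution for a prefix):

```python
def rnn_accepts(lm, seq, t, eos="<eos>"):
    """Treat an LM-RNN as a binary classifier with acceptance threshold t:
    accept iff no token of seq falls below probability t, and the
    end-of-sequence token has probability above t after the full sequence
    (the 'locally t-truncated support' of [13])."""
    prefix = []
    for tok in seq:
        if lm.next_distribution(prefix)[tok] < t:
            return False     # seq leaves the locally t-truncated support
        prefix.append(tok)
    return lm.next_distribution(prefix)[eos] > t
```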
7.3 Languages and Results

We experiment on 15 PRS-expressible languages L1–L15, grouped into 3 classes (the implementation needs some expansions to fully apply to complex multi-composed patterns, but is otherwise complete and works on all languages described here):
1. Languages of the form X^n Y^n, for various regular expressions X and Y. In particular, the languages L1 through L6 are X^n Y^n for, respectively: (X,Y) = (a,b), (a|b,c|d), (ab|cd,ef|gh), (ab,cd), (abc,def), and (ab|c,de|f).
2. Dyck and RE-Dyck languages, excluding the empty sequence. In particular, languages L7 through L9 are the Dyck languages (excluding ε) of order 2 through 4, and L10 and L11 are RE-Dyck languages of order 1 with the delimiters (L,R) = (abcde,vwxyz) and (ab|c,de|f) respectively.
3. Variations of the Dyck languages, again excluding the empty sequence. L12 is the language of alternating single-nested delimiters, generating only sequences of the sort ([([])]) or [([])]. L13 and L14 are Dyck-1 and Dyck-2 with additional neutral tokens a,b,c that may appear multiple times anywhere in the sequence. L15 is like L13 except that the neutral additions are the token d and the sequence abc, e.g.: (abc()())d is in L15, but a(bc()())d is not.

Table 1 shows the results. The 2nd column shows the number of DFAs extracted from the RNN. The 3rd and 4th columns present the number of patterns found by the algorithm before and after applying vote-thresholding to remove noise. The 5th column gives the minimum and maximum votes received by the final patterns. (We count only patterns introduced as a new pattern p^3 in some A_{i+1}; if p = p^1 ◦ p^2, but p^1 is not introduced independently as a new pattern, we do not count it.) The 6th column notes whether the algorithm found a correct CFG, according to our manual inspection. For languages where our algorithm only missed or included 1 or 2 valid/invalid productions, we label it as partially correct.

LG   DFAs  Init Pats  Final Pats  Min/Max Votes  CFG Correct
L1   18    1          1           16/16          Correct
L2   16    1          1           14/14          Correct
L3   14    6          4           2/4            Incorrect
L4   –     –          –           –              –
L5   10    2          1           7/7            Correct
L6   22    9          4           3/16           Incorrect
L7   24    2          2           11/11          Correct
L8   22    5          4           2/9            Partial
L9   30    6          4           5/8            Correct
L10  –     –          –           –              –
L11  24    6          3           5/12           Incorrect
L12  28    2          2           13/13          Correct
L13  –     –          –           –              –
L14  17    5          2           5/7            Correct
L15  13    6          4           3/6            Incorrect

Table 1. Results of experiments on DFAs extracted from RNNs

Alternating Patterns.
Our algorithm struggled on the languages L3, L6, and L11, which contained patterns whose regular expressions had alternations (such as ab|cd in L3, and ab|c in L6 and L11). Investigating their DFA sequences uncovered that the L* extraction had ‘split’ the alternating expressions, adding their parts to the DFAs over multiple iterations. For example, in the sequence generated for L3, ef appeared in one DFA without gh alongside it. The next DFA corrected this mistake, but the inference algorithm could not piece together these two separate steps into a single rule. It will be valuable to expand the algorithm to these cases.

Simultaneous Applications.
Originally our algorithm failed to accurately generate L13 and L14 due to simultaneous rule applications. However, using the technique described in Section 5.2 we were able to correctly infer these grammars. Still, more work is needed to handle simultaneous rule applications in general. Additionally, sometimes a very large counterexample was returned to L*, creating a large increase in the DFAs: the 9th iteration of one extraction introduced almost 30 new states. The algorithm does not manage to infer anything meaningful from these nested, simultaneous applications.

Missing Rules.
For the Dyck languages L7–L9, the inference algorithm was mostly successful. However, due to the large number of possible delimiter combinations, some patterns and nesting relations did not appear often enough in the DFA sequences. As a result, for L8, some productions were missing in the generated grammar. L8 also created one incorrect production due to noise in the sequence (one erroneous pattern was generated two times). When we raised the threshold to require more than 2 occurrences for a pattern to be considered valid, we no longer generated this incorrect production.

RNN Noise. In L15, the extracted DFAs for some reason always forced a single character d to be included between every pair of delimiters. Our inference algorithm of course maintained this peculiarity. It correctly allowed the optional embedding of “abc” strings. But due to noisy (incorrect) generated DFAs, the patterns generated did not maintain balanced parentheses.

8 Related Work

Training RNNs to recognize Dyck Grammars.
Recently there has been a surge of interest in whether RNNs can learn Dyck languages [5,20,22,28]. While these works report very good results on learning the language for sentences of similar distance and depth as the training set, with the exception of [22], they report significantly less accuracy for out-of-sample sentences.

Sennhauser and Berwick [20] use LSTMs, and show that in order to keep the error rate within a 5 percent tolerance, the number of hidden units must grow exponentially with the distance or depth of the sequences. (However, see [13], which shows that for a Dyck grammar of k pairs of delimiters that generates sentences of maximum depth m, only 3m⌈log k⌉ − m hidden memory units are required, with experimental results that confirm this theoretical bound in practice.) They also found that out-of-sample results were not very good. They conclude that LSTMs cannot learn rules, but rather use statistical approximation. Bernardy [5] experimented with various RNN architectures. When they test their RNNs on strings that are at most double the length of those in the training set, they found that for out-of-sample strings, the accuracy varies from about 60 to above 90 percent. The fact that the LSTM has more difficulty in predicting closing delimiters in the middle of a sentence than at the end leads Bernardy to conjecture that for closing parentheses the RNN is using a counting mechanism, but has not truly learnt the Dyck language (its CFG). Skachkova, Trost and Klakow [22] experiment with Elman RNN, GRU and LSTM architectures. They provide a mathematical model for the probability of a particular symbol in the i-th position of a Dyck sentence. They experiment with how well the models predict the closing delimiter, for which they find varying results per architecture. However, for LSTMs, they find nearly perfect accuracy across words with large distances and embedded depth. Yu, Vu and Kuhn [28] compare the three works above and argue that the task of predicting a closing bracket of a balanced Dyck word, as performed in [22], is a poor test for checking if the RNN learnt the language, as it can be simply computed by a counter. In contrast, their carefully constructed experiments give a prefix of a Dyck word and train the RNN to predict the next valid closing bracket. They experiment with an LSTM using 4 different models, and show that the generator-attention model [18] performs the best, and is able to generalize quite well at the tagging task. However, when using RNNs to complete the entire Dyck word, while the generator-attention model does quite well with in-domain tests, it degrades rapidly with out-of-domain tests. They also conclude that RNNs do not really learn the CFG underlying the Dyck language. These experimental results are reinforced by the theoretical work in [13], which remarks that no finite precision RNN can learn a Dyck language of unbounded depth, and gives precise bounds on the memory required to learn a Dyck language of bounded depth.

In contrast to these works, our research tries to extract the CFG from the RNN. We discover these rules based upon DFAs synthesized from the RNN using the algorithm in [27]. Because we can use a short sequence of DFAs to extract the rules, and because the first DFAs in the sequence describe Dyck words with increasing but limited distance and depth, we are able to extract the CFG perfectly, even when the RNN does not generalize well. Moreover, we show that our approach generalizes to more complex types of delimiters, and to Dyck languages with expressions between delimiters.

Extracting DFAs from RNNs.
There have been many approaches to extract higher level representations from a neural network (NN) to facilitate comprehension and verify correctness. One of the oldest approaches is to extract rules from a NN [25,12]. In order to model state, there have been various approaches to extract FSAs from RNNs [19,15,26]. We base our work on [27]; its ability to generate sequences of DFAs that increasingly better approximate the CFL is critical to our method.

Unlike DFA extraction, there has been relatively little research on extracting a CFG from an RNN. One exception is [24], which develops a Neural Network Pushdown Automaton (NNPDA) framework, a hybrid system augmenting an RNN with external stack memory. The RNN reads the top of the stack as added input, and optionally pushes to or pops the stack after each new input symbol. They show how to extract a push-down automaton from a NNPDA; however, their technique relies on the PDA-like structure of the inspected architecture. In contrast, we extract CFGs from RNNs without stack augmentation.
Learning CFGs from samples. There is a wide body of work on learning CFGs from samples. An overview is given in [10], and a survey of work on grammatical inference applied to software engineering tasks can be found in [23]. Clark et al. study algorithms for learning CFLs given only positive examples [11]. In [7], Clark and Eyraud show how one can learn a subclass of CFLs called CF substitutable languages. There are many languages that can be expressed by a PRS but are not substitutable, such as x^n b^n. However, there are also substitutable languages that cannot be expressed by a PRS (wxw^R – see Appendix C.3). In [8], Clark, Eyraud and Habrard present Contextual Binary Feature Grammars; however, this class does not include Dyck languages of arbitrary order. None of these techniques deal with noise in the data, which is essential to learning a language from an RNN. While we have focused on practical learning of CFLs, theoretical limits on learning based upon positive examples are well known; see [11,1].

9 Conclusions and Future Research

Currently, for each experiment, we train the RNN on that language and then apply the PRS inference algorithm on a single DFA sequence generated from that RNN. Perhaps the most substantial improvement we can make is to extend our technique to learn from multiple DFA sequences. We can train multiple RNNs (each one based upon a different architecture if desired) and generate DFA sequences for each one. We can then run the PRS inference algorithm on each of these sequences, and generate a CFG based upon rules that are found in a significant number of the runs. This would require care to guarantee that the final rules form a cohesive CFG. It would also address the issue that not all rules are expressed in a single DFA sequence, and that some grammars may have rules that are executed only once per word of the language.

Our work generates CFGs for generalized Dyck languages, but it is possible to generalize PRSs to express a greater range of languages. Work will be needed to extend the PRS inference algorithm to reconstruct grammars for all context-free and perhaps even some context-sensitive languages.
A Observation on PRS-Generated Sequences
We present and prove an observation on PRS-generated sequences used for deriving the PRS-inference algorithm (Section 5).
Lemma 2. Let ⟨A_i, I_i⟩ be a PRS-generated enabled DFA. Then for every two enabled pattern instances ˆp, ˆp′ ∈ I_i, ˆp ≠ ˆp′, exactly 2 options are possible: 1. every state they share is the initial or exit state (edge state) of at least one of them, or 2. one (ˆp_s) is contained entirely in the other (ˆp_c), and ˆp_c is a composite pattern with join state q_j such that either q_j is one of ˆp_s's edge states, or q_j is not in ˆp_s at all.

Proof. We prove by induction. For ⟨A_1, I_1⟩, |I_1| ≤ 1, and so the claim holds trivially. Assume now that it holds for ⟨A_i, I_i⟩.

Applying a rule of type (1) adds only one new instance ˆp^I to I_{i+1}, which shares only its initial state with the existing patterns, and so option 1 holds. Rules of type (2) and (3) add up to three new enabled instances, ˆp^1, ˆp^2, and ˆp^3, to I_{i+1}. ˆp^3 only shares its edge states with A_i, and so option (1) holds between ˆp^3 and all existing instances ˆp′ ∈ I_i, as well as the new ones ˆp^1 and ˆp^2 if they are added (as their states are already contained in A_i).

We now consider the case where ˆp^1 and ˆp^2 are also newly added (i.e. ˆp^1, ˆp^2 ∉ I_i). We consider a pair ˆp^i, ˆp′ where i ∈ {1, 2}. As ˆp^1 and ˆp^2 only share their join states with each other, and both are completely contained in ˆp such that ˆp's join state is one of their edge states, the lemma holds for each pairing of ˆp^i with ˆp′ ∈ {ˆp^1, ˆp^2, ˆp^3}. We move to ˆp′ ≠ ˆp^1, ˆp^2, ˆp^3. Note that (i) ˆp′ cannot be contained in ˆp, as we are only now splitting ˆp into its composing instances, and (ii), if ˆp shares any of its edge states with ˆp^i, then it must also be an edge state of ˆp^i (by construction of composition).

As ˆp^i is contained in ˆp, the only states that can be shared by ˆp^i and ˆp′ are those shared by ˆp and ˆp′. If ˆp, ˆp′ satisfy option 1, i.e., they only share edge states, then this means any states shared by ˆp′ and ˆp^i are edge states of ˆp′ or ˆp. Clearly, ˆp′ edge states continue to be ˆp′ edge states. As for each of ˆp's edge states, by (ii), it is either not in ˆp^i, or necessarily an edge state of ˆp^i. Hence, if ˆp, ˆp′ satisfy option 1, then ˆp^i, ˆp′ do too. Otherwise, by the assumption on ⟨A_i, I_i⟩, option 2 holds between ˆp′ and ˆp, and from (i) ˆp′ is the containing instance. As ˆp^i composes ˆp, then ˆp′ also contains ˆp^i. Moreover, by definition of option 2, the join state of ˆp′ is either one of ˆp's edge states or not in ˆp at all, and so from (ii) the same holds for ˆp^i.

B Correctness of the Inference Algorithm
B Correctness of the Inference Algorithm

Lemma 1.
Given a finite sequence of DFAs, the minimal generator of that sequence, if it exists, is unique.
Proof:
Say that there exist two MGs, $\mathcal{P}_1 = \langle \Sigma_1, P_1, P_{c1}, R_1 \rangle$ and $\mathcal{P}_2 = \langle \Sigma_2, P_2, P_{c2}, R_2 \rangle$, that generate the sequence $A_1, A_2, \cdots, A_n$. Certainly $\Sigma_1 = \Sigma_2 = \bigcup_{i \in [n]} \Sigma_{A_i}$.

We show that $R_1 = R_2$. Say that the first time MG1 and MG2 differ from one another is in explaining which rule is used when expanding from $A_i$ to $A_{i+1}$. Since MG1 and MG2 agree on all rules used to expand the sequence prior to $A_{i+1}$, they agree on the set of patterns enabled in $A_i$. If this expansion is adding a pattern $p_I$ originating at the initial state of the DFA, then it can only be explained by a single rule $\bot \leadsto p_I$, and so the explanation of MG1 and MG2 is identical. Hence the expansion must be created by a rule of type (2) or (3). Since the newly added pattern instance $\hat{p}_1$ is uniquely identifiable in $A_{i+1}$, $\mathcal{P}_1$ and $\mathcal{P}_2$ must agree on the pattern $p_1$ that appears on the RHS of the rule explaining this expansion. $\hat{p}_1$ is inserted at some state $q_j$ of $A_i$. $q_j$ must be the join state of an enabled pattern instance $\hat{p}$ in $A_i$. But this join state uniquely identifies that pattern: as noted in Section 5, no two enabled patterns in an enabled DFA share a join state. Hence $\mathcal{P}_1$ and $\mathcal{P}_2$ must agree that the pattern $p = p_2 \circ p_3$ is the LHS of the rule, and they therefore agree that the rule is $p \leadsto_s (p_2 \circ p_3) \circ\!\!= p_1$ if $p_1$ is non-circular, or $p \leadsto_c (p_2 \odot p_3) \circ\!\!= p_1$ if $p_1$ is circular. Hence $R_1 = R_2$.

Since $\mathcal{P}_1$ ($\mathcal{P}_2$) is an MG, it must be that $p \in P_1$ ($p \in P_2$) iff $p$ appears in a rule in $R_1$ ($R_2$). Since $R_1 = R_2$, $P_1 = P_2$. Furthermore, a pattern $p \in P_{c1}$ iff it appears on the LHS of a rule. Therefore $P_{c1} = P_{c2}$.
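The uniqueness argument can be phrased operationally. The following is a minimal sketch (our illustration, not the paper's implementation), assuming enabled composite instances are indexed by their join states, so the insertion state alone determines the rule:

    def explain_expansion(enabled, q_j, p1, p1_circular):
        # enabled maps each enabled composite pattern p to its
        # (join_state, p2, p3) decomposition; in an enabled DFA no two
        # patterns share a join state, so at most one entry matches q_j.
        matches = [(p, dec) for p, dec in enabled.items() if dec[0] == q_j]
        assert len(matches) == 1, "join states are unique in an enabled DFA"
        p, (_, p2, p3) = matches[0]
        kind = 'c' if p1_circular else 's'   # type (2) vs. type (3)
        return (kind, p, p2, p3, p1)

    # Hypothetical usage: the host p = p2 o p3 has join state 'q3';
    # inserting a non-circular p1 there admits only the type (3) rule.
    enabled = {'p': ('q3', 'p2', 'p3'), 'r': ('q7', 'r2', 'r3')}
    print(explain_expansion(enabled, 'q3', 'p1', p1_circular=False))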
Theorem 1.
Let $A_1, A_2, \ldots, A_n$ be a finite sequence of DFAs that has a minimal generator $\mathcal{P}_1$. Then the PRS Inference Algorithm will discover $\mathcal{P}_1$.

Proof:
This proof mimics the proof of the Lemma above. In this case $\mathcal{P}_1 = \langle \Sigma_1, P_1, P_{c1}, R_1 \rangle$ is the MG for this sequence and $\mathcal{P}_2 = \langle \Sigma_2, P_2, P_{c2}, R_2 \rangle$ is the PRS discovered by the PRS inference algorithm.

We need to show that the PRS inference algorithm faithfully follows the steps above for $\mathcal{P}_2$. This is straightforward by comparing the steps of the inference algorithm with the steps above for $\mathcal{P}_1$. One subtlety is to show that the PRS inference algorithm correctly identifies the new pattern $\hat{p}_1$ in $A_{i+1}$ extending $A_i$. The algorithm easily finds all the newly inserted states and transitions in $A_{i+1}$. All of the states, together with the initial state, must belong to the new pattern. However not all transitions necessarily belong to the pattern. The Exit State Discovery algorithm of Section 5 correctly differentiates between new transitions that are part of the inserted pattern and those that are connecting transitions (the set $C$ of Definition 17). Hence the algorithm correctly finds the new pattern in $A_{i+1}$.
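The diff step that precedes Exit State Discovery can be sketched as follows. This is our illustration under assumed encodings, not the paper's code: DFAs are (states, transitions) pairs with transitions given as (src, symbol, dst) triples. The function only separates the additions of $A_{i+1}$; deciding which of them are connecting transitions (the set $C$ of Definition 17) is left to Exit State Discovery, which is not reproduced here.

    def diff_dfas(old, new):
        # Split A_{i+1}'s additions into new states, added transitions
        # touching a new state, and added transitions lying entirely
        # between pre-existing states.
        old_states, old_trans = old
        new_states, new_trans = new
        added_states = new_states - old_states
        added_trans = new_trans - old_trans
        touching_new = {t for t in added_trans
                        if t[0] in added_states or t[2] in added_states}
        between_old = added_trans - touching_new
        return added_states, touching_new, between_old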
C The expressibility of a PRS

We present a proof of Theorem 2, showing that the CFG created from a PRS expresses the same language.
Theorem 2.
Let $G$ be the CFG constructed from $\mathcal{P}$ by the procedure given in Section 6. Then $L(\mathcal{P}) = L(G)$.

Proof:
Let $s \in L(\mathcal{P})$. Then there exists a sequence of DFAs $A_1 \cdots A_m$ generated by $\mathcal{P}$ s.t. $s \in L(A_m)$. We will show that $s \in L(G)$. W.l.o.g. we assume that each DFA in the sequence is necessary; i.e., if the rule application to $A_i$ creating $A_{i+1}$ were absent, then $s \notin L(A_m)$. We will use the notation $\hat{p}$ to refer to a specific instance of a pattern $p$ in $A_i$ for some $i$ ($1 \leq i \leq m$), and we adopt from Section 4 the notion of enabled pattern instances. So, for instance, if we apply a rule $p \leadsto_s (p_2 \circ p_3) \circ\!\!= p_1$, where $p = p_2 \circ p_3$, to an instance $\hat{p}$ of $p$ in $A_i$, then $A_{i+1}$ will contain a new path through the enabled pattern instances $\hat{p}_2$, $\hat{p}_1$ and $\hat{p}_3$.

A p-path (short for pattern-path) through a DFA $A_i$ is a path $\rho = q_0 \xrightarrow{p_1} q_1 \xrightarrow{p_2} \cdots q_{t-1} \xrightarrow{p_t} q_t$, where $q_0$ and $q_t$ are the initial and final states of $A_i$ respectively, and for each transition $q_j \xrightarrow{p_{j+1}} q_{j+1}$, $q_j$ ($0 \leq j \leq t-1$) is the initial state of an enabled pattern instance of type $p_{j+1}$ and $q_{j+1}$ is the final state of that pattern instance. A state may appear multiple times in the path if there is a cycle in the DFA and that state is traversed multiple times. If $\hat{p}$ is an enabled circular pattern and the path contains a cycle that traverses that instance of $p$, and only that instance, multiple times consecutively, it is only represented once in the path, since that cycle is completely contained within that pattern; a p-path cannot contain consecutive self-loops $q_j \xrightarrow{p} q_j \xrightarrow{p} q_j$. $Pats(\rho) = p_1 p_2 \cdots p_t$ denotes the patterns of the instances traversed along the path $\rho$.

We say that a p-path $\rho = q_0 \xrightarrow{p_1} q_1 \xrightarrow{p_2} \cdots q_{t-1} \xrightarrow{p_t} q_t$ through $A_m$ is an acceptor (of $s$) iff $s = s_1 \cdots s_t$ and $s_i \in L(p_i)$ for all $i$ ($1 \leq i \leq t$). DFAs earlier in the sequence are not acceptors as they contain patterns that have not yet been expanded. But we can "project" the final p-path onto a p-path in an earlier DFA. We do so with the following definition of a p-cover:

– If a path $\rho$ is an acceptor, then it is a p-cover.
– Let $p$ be a pattern and let $A_{i+1}$ be obtained from $A_i$ by application of the rule $p \leadsto_s (p_2 \circ p_3) \circ\!\!= p_1$ or $p \leadsto_c (p_2 \odot p_3) \circ\!\!= p_1$ to $\hat{p}$ in $A_i$, obtaining a sub-path $q_1 \xrightarrow{p_2} q_2 \xrightarrow{p_1} q_3 \xrightarrow{p_3} q_4$ through instances $\hat{p}_2$, $\hat{p}_1$ and $\hat{p}_3$. Furthermore, say that the p-path $\rho^{(i+1)}$ through $A_{i+1}$ is a p-cover. Then the path $\rho^{(i)}$ through $A_i$ is a p-cover, where $\rho^{(i)}$ is obtained from $\rho^{(i+1)}$ by replacing each occurrence of $q_1 \xrightarrow{p_2} q_2 \xrightarrow{p_1} q_3 \xrightarrow{p_3} q_4$ in $\rho^{(i+1)}$ traversing $\hat{p}_2$, $\hat{p}_1$ and $\hat{p}_3$ by the single transition $q_1 \xrightarrow{p} q_4$ traversing $\hat{p}$ in $\rho^{(i)}$. (If $p$ is circular then $q_1 = q_4$.) If this results in consecutive self loops $q_1 \xrightarrow{p} q_1 \xrightarrow{p} q_1$ we collapse them into a single cycle, $q_1 \xrightarrow{p} q_1$.
– Let $A_{i+1}$ be obtained by applying a rule $\bot \leadsto p_I$ to $A_i$, obtaining an instance of $\hat{p}_I$, where $p_I$ is a circular pattern (Defn. 15). Furthermore, say that the p-path $\rho^{(i+1)}$ through $A_{i+1}$ is a p-cover. Then the path $\rho^{(i)}$ through $A_i$ is a p-cover, where $\rho^{(i)}$ is obtained from $\rho^{(i+1)}$ by replacing each occurrence of $q_0 \xrightarrow{p_I} q_0$ traversing $\hat{p}_I$ by the single state $q_0$.

Hence we can associate with each $A_i$, $1 \leq i \leq m$, a unique p-cover $\rho^{(i)}$.

Let $T$ be a partial derivation tree for the CFG $G$, where every branch of the tree terminates with a non-terminal $Z_p$ for some pattern $p$. We write $\hat{Z}_p$ for a particular instance of $Z_p$ in $T$. $Leaves(T)$ is the list of patterns obtained by concatenating all the leaves (left-to-right) in $T$ and replacing each leaf $Z_{p_k}$ by the pattern $p_k$.

We claim that for each $A_i$ with p-cover $\rho^{(i)}$ there exists a partial derivation tree $T^{(i)}$ such that $Pats(\rho^{(i)}) = Leaves(T^{(i)})$. We show this by induction. For the base case, consider $A_1$, which is formed by application of a rule $\bot \leadsto p_I$. By construction of $G$, there exists a production $S ::= Z_{p_I}$. $\rho^{(1)} = s_0 \xrightarrow{p_I} s_f$, where $s_0$ and $s_f$ are the initial and final states of $p_I$ respectively; let $T^{(1)}$ be the tree formed by application of the production $S ::= Z_{p_I}$, creating the instance $\hat{Z}_{p_I}$. Hence $Pats(\rho^{(1)}) = p_I = Leaves(T^{(1)})$. For the inductive step assume that for $A_i$ there exists $T^{(i)}$ s.t. $Pats(\rho^{(i)}) = Leaves(T^{(i)})$.
Say that $A_{i+1}$ is formed from $A_i$ by applying the rule $p \leadsto_c (p_2 \odot p_3) \circ\!\!= p_1$ (of type (2)) or $p \leadsto_s (p_2 \circ p_3) \circ\!\!= p_1$ (of type (3)) to an instance $\hat{p}$ of $p$ in $A_i$, where the initial state of $\hat{p}$ is $q_1$ and its final state is $q_4$ ($q_1 = q_4$ if $p$ is circular) and there is a sub-path in $A_i$ of the form $q_1 \xrightarrow{p} q_4$. After applying this rule there is an additional sub-path $q_1 \xrightarrow{p_2} q_2 \xrightarrow{p_1} q_3 \xrightarrow{p_3} q_4$ in $A_{i+1}$ traversing $\hat{p}_2$, $\hat{p}_1$ and $\hat{p}_3$. We consider two cases:

Case 1. $p$ is non-circular. The sub-path $q_1 \xrightarrow{p} q_4$ may appear multiple times in $\rho^{(i)}$ even though $p$ is non-circular, since it may be part of a larger cycle. Consider one of these instances where $q_1 \xrightarrow{p} q_4$ gets replaced by $q_1 \xrightarrow{p_2} q_2 \xrightarrow{p_1} q_3 \xrightarrow{p_3} q_4$ in $\rho^{(i+1)}$. Say that this instance of $\hat{p}$ is represented by pattern $p$ at position $u$ in $Pats(\rho^{(i)})$. In $\rho^{(i+1)}$, the sub-list of patterns $p_2, p_1, p_3$ will replace $p$ at that position (position $u$). By induction there is a pattern $p$ in $Leaves(T^{(i)})$ at position $u$; let $\hat{Z}_p$ be the non-terminal instance in $T^{(i)}$ corresponding to that pattern $p$. If the rule being applied is of type (3) then, by construction of $G$, there exists a production $Z_p ::= Z_{p_2} Z_{p_1} Z_{p_3}$. We produce $T^{(i+1)}$ by extending $T^{(i)}$ at that instance of $Z_p$ by applying that production to $\hat{Z}_p$. If the rule is of type (2), then we produce $T^{(i+1)}$ by extending $T^{(i)}$ at that instance of $Z_p$ by applying the productions $Z_p ::= Z_{p_2} C_p Z_{p_3}$ and $C_p ::= Z_{p_1}$, which exist by the construction of $G$. Hence both $Pats(\rho^{(i+1)})$ and $Leaves(T^{(i+1)})$ will replace $p$ at position $u$ by $p_2, p_1, p_3$. We do this for each traversal of $\hat{p}$ in $\rho^{(i)}$ that gets replaced in $\rho^{(i+1)}$ by the traversal of $\hat{p}_2$, $\hat{p}_1$, and $\hat{p}_3$. By doing so, $Pats(\rho^{(i+1)}) = Leaves(T^{(i+1)})$.

Case 2: $p$ is circular. This is similar to the previous case except this time, since $p$ is circular, we may need to replace a single sub-path $q_1 \xrightarrow{p} q_1$ corresponding to an instance of $\hat{p}$ in $\rho^{(i)}$ by multiple explicit cycles as defined by $\rho^{(i+1)}$. Each cycle will either traverse $q_1 \xrightarrow{p} q_1$ or the longer sub-path $q_1 \xrightarrow{p_2} q_2 \xrightarrow{p_1} q_3 \xrightarrow{p_3} q_1$. Say that there exists an instance $\hat{p}$ represented by pattern $p$ at position $u$ in $Pats(\rho^{(i)})$ that gets replaced in $\rho^{(i+1)}$ by explicit cycles; i.e., $\rho^{(i+1)}$ replaces $q_1 \xrightarrow{p} q_1$ traversing $\hat{p}$ in $\rho^{(i)}$ with a new sub-path $\sigma$ in $\rho^{(i+1)}$ containing $x$ cycles $q_1 \xrightarrow{p_2} q_2 \xrightarrow{p_1} q_3 \xrightarrow{p_3} q_1$ interspersed with $y$ cycles $q_1 \xrightarrow{p} q_1$, where $p = p_2 \circ p_3$. (Per the definition of a p-path, there cannot be two consecutive instances of these latter cycles.) Hence in total $\sigma$ may enter and leave $q_1$ a total of $z = x + y$ times. By induction there is a pattern $p$ in $Leaves(T^{(i)})$ at position $u$; let $\hat{Z}_p$ be the non-terminal instance in $T^{(i)}$ corresponding to that pattern $p$. By construction of $G$, since $p$ is circular, the parent of $\hat{Z}_p$ is an instance $\hat{C}_{p'}$ of the non-terminal $C_{p'}$ for some pattern $p'$ and there exist productions $C_{p'} ::= C_{p'} C_{p'}$ and $C_{p'} ::= Z_p$. Using these productions we replace this single instance $\hat{C}_{p'}$ by $z$ copies of $C_{p'}$. If the $j$th cycle of $\sigma$ is $q_1 \xrightarrow{p} q_1$ then we have the $j$th instance of $C_{p'}$ derive $Z_p$ without any further derivations. If the $j$th cycle is $q_1 \xrightarrow{p_2} q_2 \xrightarrow{p_1} q_3 \xrightarrow{p_3} q_1$, then we also have the $j$th instance of $C_{p'}$ derive $Z_p$. However, if the rule being applied is of type (3) then that instance of $Z_p$ derives $Z_{p_2} Z_{p_1} Z_{p_3}$. If it is of type (2) then that instance of $Z_p$ derives $Z_{p_2} C_p Z_{p_3}$ and $C_p$ derives $Z_{p_1}$.
Hence both $Pats(\rho^{(i+1)})$ and $Leaves(T^{(i+1)})$ will replace $p$ at position $u$ by $x$ copies of $p_2, p_1, p_3$ intermixed with $y$ copies of $p$. We do this for each traversal of $\hat{p}$ in $\rho^{(i)}$ that gets expanded in $\rho^{(i+1)}$ by application of this rule. By doing so, $Pats(\rho^{(i+1)}) = Leaves(T^{(i+1)})$.

To complete the inductive step, we need to consider the case when $A_{i+1}$ is formed from $A_i$ by applying a rule $\bot \leadsto p_{I'}$, where $p_{I'}$ is circular, per Defn. 15. This will insert $p_{I'}$ into $Pats(\rho^{(i+1)})$ at a point when $\rho^{(i)}$ is at the initial state $q_0$. Say that there exists a sub-path $\sigma = q_0 \xrightarrow{p_1} q_1 \xrightarrow{p_2} \cdots q_e \xrightarrow{p_e} q_0$ in $\rho^{(i)}$. Then the application of this rule may add the sub-path $q_0 \xrightarrow{p_{I'}} q_0$ either at the beginning or end of $\sigma$ in $\rho^{(i+1)}$. W.l.o.g. assume it gets inserted at the end of this sub-path, and $p_e$ occurs at position $u$. Then $Pats(\rho^{(i+1)})$ will extend $Pats(\rho^{(i)})$ by inserting $p_{I'}$ at position $u+1$. Since $\sigma$ is a cycle, starting and ending at $q_0$, there must be an instance $\hat{C}_S$ of $C_S$ in $T^{(i)}$, where $C_S$ is derived by one or more productions of the form $S ::= C_S$ and $C_S ::= C_S C_S$. Furthermore, $\hat{C}_S$ derives a sub-tree $T_1$ s.t. $Leaves(T_1) = Pats(\sigma)$. By construction of $G$, there exists a production $C_S ::= Z_{p_{I'}}$. We apply the production $C_S ::= C_S C_S$ to $\hat{C}_S$ so that the first child $C_S$ derives $T_1$ as in $T^{(i)}$. At the second instance we apply the production $C_S ::= Z_{p_{I'}}$. Hence $p_{I'}$ will appear at position $u+1$ in $T^{(i+1)}$. We repeat this for each cycle involving $q_0$ in $\rho^{(i)}$ that gets extended by the pattern $p_{I'}$ in $\rho^{(i+1)}$. By doing so, $Pats(\rho^{(i+1)}) = Leaves(T^{(i+1)})$. A similar argument holds if $p_{I'}$ is added at the first position in $Pats(\rho^{(i+1)})$.

Hence we have shown that $Pats(\rho^{(m)}) = Leaves(T^{(m)})$. Let $Pats(\rho^{(m)}) = p_1 \cdots p_t$. Since $\rho^{(m)}$ is an acceptor for $s$, it must be that there exist $s_j \in \Sigma^+$ ($1 \leq j \leq t$) s.t. $s_j \in L(p_j)$ and $s = s_1 \cdots s_t$. But since $Leaves(T^{(m)}) = Z_{p_1} \cdots Z_{p_t}$ and each $Z_{p_j}$ can derive $s_j$, we can complete the derivation of $T^{(m)}$ to derive $s$. This shows that $s \in L(\mathcal{P}) \implies s \in L(G)$. The converse is also true and can be shown by a similar technique, so we leave the proof to the reader.
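As a schematic illustration of the first inductive step (a worked instance of ours, not drawn from Section 6's text): suppose $A_1$ is generated by $\bot \leadsto p_I$ and $A_2$ is obtained by applying the type (3) rule $p \leadsto_s (p_2 \circ p_3) \circ\!\!= p_1$ with $p = p_I$. Then the p-covers and trees line up as

\[
Pats(\rho^{(1)}) = p_I = Leaves(T^{(1)}),
\qquad
Pats(\rho^{(2)}) = p_2\,p_1\,p_3 = Leaves(T^{(2)}),
\]

where $T^{(2)}$ extends $T^{(1)}$ by applying $Z_{p_I} ::= Z_{p_2} Z_{p_1} Z_{p_3}$ at $\hat{Z}_{p_I}$.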
C.1 Constructing a CFG from an unrestricted PRS

The construction of Section 6 assumed a restriction that a pattern $p$ cannot appear on the LHS of both a rule of type (2) and a rule of type (3); i.e., we cannot have two rules of the form $p \leadsto_c (p_2 \odot p_3) \circ\!\!= p'$ and $p \leadsto_s (p_2 \circ p_3) \circ\!\!= p_1$. If we were to allow both of these rules then one could construct a path through a DFA instance that first traverses an instance of $p_2$, then traverses instances of the circular pattern $p'$ any number of times, then traverses an instance of $p_1$, and then traverses $p_3$. However the current grammar does not allow such constructions; the non-terminal $Z_p$ can either derive $Z_{p_2}$ followed by $Z_{p_1}$ followed by $Z_{p_3}$ or, in place of $Z_{p_1}$, any number of instances of $C_p$ that in turn derive $Z_{p'}$.

Hence to remove this restriction, we modify the constructed CFG. Following Section 6, for every pattern $p \in P$, $G_p$ is the CFG with start symbol $Z_p$ and non-terminals $N_p$. $P_Y$ are the patterns appearing on the LHS of some rule of type (2). Given the PRS $\mathcal{P} = \langle \Sigma, P, P_C, R \rangle$ we create a CFG $G = (\Sigma, N, S, Prod)$, where $N = \{S, C_S, C'_S\} \cup \bigcup_{p \in P} N_p \cup \bigcup_{p \in P_Y} \{C_p, C'_p\}$.

Create the productions $S ::= C'_S$, $S ::= C_S C'_S$ and $C_S ::= C_S C_S$. Let $\bot \leadsto p_I$ be a rule in $\mathcal{P}$. Create the production $C'_S ::= Z_{p_I}$. If $p_I$ is circular, create the additional production $C_S ::= Z_{p_I}$.

For each rule $p \leadsto_c (p_2 \odot p_3) \circ\!\!= p_1$ or $p \leadsto_s (p_2 \circ p_3) \circ\!\!= p_1$ create the productions $Z_p ::= Z_{p_2} C'_p Z_{p_3}$ and $C'_p ::= Z_{p_1}$. For each rule $p \leadsto_c (p_2 \odot p_3) \circ\!\!= p_1$ create the additional productions $Z_p ::= Z_{p_2} C_p C'_p Z_{p_3}$, $C_p ::= C_p C_p$, and $C_p ::= Z_{p_1}$. Let $Prod'$ be all the productions defined by the process just given.
$Prod = \bigcup_{p \in P} Prod_p \cup Prod'$.
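This production-building process is mechanical enough to state as code. The sketch below is our illustration, not the paper's implementation; it assumes rules are encoded as tuples (('init', pI, circular) for type (1), and (kind, p, p2, p3, p1) with kind 'c' or 's' for types (2) and (3)), and it emits only the top-level productions described above, omitting the per-pattern grammars $G_p$.

    def build_productions(rules):
        # Nonterminals are plain strings: S, C_S, C'_S, Z_p, C_p, C'_p.
        Z = lambda p: "Z_" + p
        prods = [("S", ["C'_S"]), ("S", ["C_S", "C'_S"]),
                 ("C_S", ["C_S", "C_S"])]
        for rule in rules:
            if rule[0] == 'init':                     # type (1) rule
                _, pI, circular = rule
                prods.append(("C'_S", [Z(pI)]))
                if circular:
                    prods.append(("C_S", [Z(pI)]))
            else:                                     # types (2) and (3)
                kind, p, p2, p3, p1 = rule
                prods.append((Z(p), [Z(p2), "C'_" + p, Z(p3)]))
                prods.append(("C'_" + p, [Z(p1)]))
                if kind == 'c':                       # p1 circular: repetition
                    prods.append((Z(p), [Z(p2), "C_" + p, "C'_" + p, Z(p3)]))
                    prods.append(("C_" + p, ["C_" + p, "C_" + p]))
                    prods.append(("C_" + p, [Z(p1)]))
        return prods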
C.2 Example of a CFG generated from a PRS

The following is the CFG generated for the Dyck language of order 2 ($L$ of Section 7.3). As previously noted, the terminals, in this case "(", ")", "[" and "]", are actually represented as base patterns.

S ::= SC
SC ::= SC SC | P1 | P2
P1 ::= ( P1C )
P1C ::= P1C P1C | P1 | P2
P2 ::= [ P2C ]
P2C ::= P2C P2C | P1 | P2
To illustrate that the extra non-terminals generated by the algorithm are not always necessary, the following is the generated CFG for alternating delimiters ($L$ of Section 7.3).

S ::= P1 | P2
P1 ::= ( P2 )
P2 ::= [ P1 ]
C.3 Limitations on the expressibility of a PRS
Not every CFL is expressible by a PRS. In particular, let $\Sigma$ be some alphabet, $w \in \Sigma^*$, $w^R$ be the reverse of $w$, $x$ a symbol not in $\Sigma$, and $L_R = \{wxw^R : w \in \Sigma^*\}$, the infinite language of palindromes of odd length. $L_R$ is a CFL but is not expressible by a PRS. Every word in $L_R$ contains a single $x$.

Assume there exists a PRS $\mathcal{P}$ s.t. $L(\mathcal{P}) = L_R$. $\mathcal{P}$ contains a finite number of initial rules $\bot \leadsto p_I$. Every word recognized by $A_1 = A_{p_I}$ must be of the form $wxw^R$ and therefore traverses a straight path $\rho_w$ from $q_0$ to $q_f$ in $A_1$. Hence only a finite subset of $L_R$ is recognized from these initial rules and there must be at least one rule that has an initial pattern $p_I$ on its LHS. Applying this rule to $A_1$ will create a new DFA $A_2$ with a new pattern $p$ grafted onto some state in $A_1$. This creates a new path from $q_0$ to $q_f$ in $A_2$ of the form $\rho = \rho_1 p \rho_2$ for some $w$, where $\rho_w = \rho_1 \rho_2$. Since $\rho_w$ recognizes $wxw^R$, $x$ is a symbol recognized along the path $\rho_1$ or $\rho_2$. Assume $x$ is recognized along the path $\rho_1$; i.e., $\rho_1$ recognizes the string $wxu$, $\rho_2$ recognizes the string $v$, and $uv = w^R$. Then $wxu\alpha v \in L(A_2) \subseteq L(\mathcal{P})$, where $\alpha \in L(p)$ and $|\alpha| \geq 1$. But $|w| < |u\alpha v|$ and therefore $wxu\alpha v \notin L_R$. A similar argument holds if $x$ is recognized along the path $\rho_2$. We therefore conclude that no such $\mathcal{P}$ recognizing $L_R$ exists.

It is interesting to note that the language $L_{pal} = \{ww^R : w \in \Sigma^*\}$ is expressible by a PRS (Section 4.1), as is $L_R \cup L_{pal}$.
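To make the argument concrete (our example, not from the text): take $\Sigma = \{a, b\}$ and $w = ab$, so $wxw^R = abxba \in L_R$. Grafting a pattern $p$ with some $\alpha \in L(p)$, $|\alpha| \geq 1$, say $\alpha = a$, immediately after the $x$ (so $u = \varepsilon$, $v = ba$) yields

\[
wxu\alpha v = abxaba \in L(A_2), \qquad \text{but} \qquad abxaba \notin L_R,
\]

since $|w| = 2 < |u\alpha v| = 3$: the string is no longer a palindrome around its single $x$.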
References
1. Angluin, D.: Inductive inference of formal languages from positive data. Inf. Control. 45(2), 117–135 (1980), https://doi.org/10.1016/S0019-9958(80)90285-5
2. Angluin, D.: Learning regular sets from queries and counterexamples. Inf. Comput. 75(2), 87–106 (1987)
8. Clark, A., Eyraud, R., Habrard, A.: A polynomial algorithm for the inference of context free languages. In: Clark, A., Coste, F., Miclet, L. (eds.) Grammatical Inference: Algorithms and Applications, 9th International Colloquium, ICGI 2008, Proceedings. Lecture Notes in Computer Science, vol. 5278, pp. 29–42. Springer (2008). https://doi.org/10.1007/978-3-540-88009-7_3
9. Das, S., Giles, C.L., Sun, G.: Learning context-free grammars: Capabilities and limitations of a recurrent neural network with an external stack memory. In: Conference of the Cognitive Science Society. pp. 791–795. Morgan Kaufmann Publishers (1992)
10. D'Ulizia, A., Ferri, F., Grifoni, P.: A survey of grammatical inference methods for natural language learning. Artif. Intell. Rev. 36(1), 1–27 (2011). https://doi.org/10.1007/s10462-010-9199-1
11. Gold, E.M.: Language identification in the limit. Information and Control 10(5), 447–474 (May 1967), https://doi.org/10.1016/S0019-9958(67)91165-5
12. Hailesilassie, T.: Rule extraction algorithm for deep neural networks: A review. International Journal of Computer Science and Information Security (IJCSIS) 14(7) (2016)
14. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997). https://doi.org/10.1162/neco.1997.9.8.1735
15. Jacobsson, H.: Rule extraction from recurrent neural networks: A taxonomy and review. Neural Computation 17(6), 1223–1263 (2005). https://doi.org/10.1162/0899766053630350
16. Korsky, S.A., Berwick, R.C.: On the Computational Power of RNNs. CoRR abs/1906.06349 (2019), http://arxiv.org/abs/1906.06349
17. Kozen, D.C.: The Chomsky–Schützenberger theorem. In: Automata and Computability, pp. 198–200. Springer Berlin Heidelberg, Berlin, Heidelberg (1997)
18. Luong, T., Pham, H., Manning, C.D.: Effective approaches to attention-based neural machine translation. In: Màrquez, L., Callison-Burch, C., Su, J., Pighin, D., Marton, Y. (eds.) Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, EMNLP 2015. pp. 1412–1421. The Association for Computational Linguistics (2015). https://doi.org/10.18653/v1/d15-1166
19. Omlin, C.W., Giles, C.L.: Extraction of rules from discrete-time recurrent neural networks. Neural Networks 9(1), 41–52 (1996). https://doi.org/10.1016/0893-6080(95)00086-0
20. Sennhauser, L., Berwick, R.: Evaluating the ability of LSTMs to learn context-free grammars. In: Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP. pp. 115–124. Association for Computational Linguistics (Nov 2018). https://doi.org/10.18653/v1/W18-5414
21. Siegelmann, H.T., Sontag, E.D.: On the Computational Power of Neural Nets. J. Comput. Syst. Sci. 50(1), 132–150 (1995). https://doi.org/10.1006/jcss.1995.1013
22. Skachkova, N., Trost, T., Klakow, D.: Closing brackets with recurrent neural networks. In: Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP. pp. 232–239. Association for Computational Linguistics (Nov 2018). https://doi.org/10.18653/v1/W18-5425
23. Stevenson, A., Cordy, J.R.: A survey of grammatical inference in software engineering. Sci. Comput. Program. 96(P4), 444–459 (Dec 2014). https://doi.org/10.1016/j.scico.2014.05.008
24. Sun, G., Giles, C.L., Chen, H.: The neural network pushdown automaton: Architecture, dynamics and training. In: Giles, C.L., Gori, M. (eds.) Adaptive Processing of Sequences and Data Structures, International Summer School on Neural Networks. Lecture Notes in Computer Science, vol. 1387, pp. 296–345. Springer (1997). https://doi.org/10.1007/BFb0054003
25. Thrun, S.: Extracting rules from artificial neural networks with distributed representations. In: Tesauro, G., Touretzky, D.S., Leen, T.K. (eds.) Advances in Neural Information Processing Systems 7, NIPS Conference, 1994. pp. 505–512. MIT Press (1994), http://papers.nips.cc/paper/924-extracting-rules-from-artificial-neural-networks-with-distributed-representations
26. Wang, Q., Zhang, K., Liu, X., Giles, C.L.: Connecting first and second order recurrent networks with deterministic finite automata. CoRR abs/1911.04644 (2019), http://arxiv.org/abs/1911.04644