Fast semantic parsing with well-typedness guarantees
Matthias Lindemann, Jonas Groschwitz, and Alexander Koller
Department of Language Science and Technology, Saarland University
{mlinde|jonasg|koller}@coli.uni-saarland.de
Abstract
AM dependency parsing is a linguistically principled method for neural semantic parsing with high accuracy across multiple graphbanks. It relies on a type system that models semantic valency but makes existing parsers slow. We describe an A* parser and a transition-based parser for AM dependency parsing which guarantee well-typedness and improve parsing speed by up to 3 orders of magnitude, while maintaining or improving accuracy.
1 Introduction

Over the past few years, the accuracy of neural semantic parsers which parse English sentences into graph-based semantic representations has increased substantially (Dozat and Manning, 2018; Zhang et al., 2019; He and Choi, 2020; Cai and Lam, 2020). Most of these parsers use a neural model which can freely predict node labels and edges, and most of them are tailored to a specific type of graphbank.

Among the high-accuracy semantic parsers, the
AM dependency parser of Groschwitz et al. (2018) stands out in that it implements the Principle of Compositionality from theoretical semantics in a neural framework. By parsing into AM dependency trees, which represent the compositional structure of the sentence and evaluate deterministically into graphs, this parser can abstract away surface details of the individual graphbanks. It was the first semantic parser which worked well across multiple graphbanks, and set new states of the art on several of them (Lindemann et al., 2019).

However, the commitment to linguistic principles comes at a cost: the AM dependency parser is slow. A key part of the parser is that AM dependency trees must be well-typed according to a type system which ensures that the semantic valency of each word is respected. Existing algorithms compute all items along a parsing schema that encodes the type constraints; they parse e.g. the AMRBank at less than three tokens per second.

In this paper, we present two fast and accurate parsing algorithms for AM dependency trees. We first present an A* parser which searches through the parsing schema of Groschwitz et al.'s "projective parser" efficiently (§4). We extend the supertag-factored heuristic of Lewis and Steedman's (2014) A* parser for CCG with a heuristic for dependency edge scores. This parser achieves a speed of up to 2200 tokens/s on semantic dependency parsing (Oepen et al., 2015), at no loss in accuracy. On AMR corpora (Banarescu et al., 2013), it achieves a speedup of 10x over previous work, but still does not exceed 30 tokens/second.

We therefore develop an entirely new transition-based parser for AM dependency trees, inspired by the stack-pointer parser of Ma et al. (2018) for syntactic dependency parsing (§5). The key challenge here is to adhere to complex symbolic constraints – the AM algebra's type system – without running into dead ends. This is hard for a greedy transition system and in other settings requires expensive workarounds, such as backtracking. We ensure that our parser avoids dead ends altogether. We define two variants of the transition-based parser, which choose types for words either before predicting the outgoing edges or after, and introduce a neural model for predicting transitions. In this way, we guarantee well-typedness with $O(n^2)$ parsing complexity, achieve a speed of several thousand tokens per second across all graphbanks, and even improve the parsing accuracy over previous AM dependency parsers by up to 1.6 points F-score.

2 Related work

In transition-based parsing, a dependency tree is built step by step using nondeterministic transitions. A classifier is trained to choose transitions deterministically (Nivre, 2008; Kiperwasser and Goldberg, 2016). Transition-based parsing has also been used for constituency parsing (Dyer et al., 2016) and graph parsing (Damonte et al., 2017). We build most directly upon the top-down parser of Ma et al. (2018). Unlike most other transition-based parsers, our parser implements hard symbolic constraints in order to enforce well-typedness. Such constraints can lead transition systems into dead ends, requiring the parser to backtrack (Ytrestøl, 2011) or return partial analyses (Zhang and Clark, 2011). Our transition system carefully avoids dead ends.
Shi and Lee (2018) take hard valency constraints into account in chart-based syntactic dependency parsing, avoiding dead ends by relaxing the constraints slightly in practice.

A* parsing is a method for speeding up agenda-based chart parsers, which takes items off the agenda based on a heuristic estimate of completion cost. A* parsing has been used successfully for PCFGs (Klein and Manning, 2003), TAG (Bladier et al., 2019), and other grammar formalisms. Our work is based most closely on the CCG A* parser of Lewis and Steedman (2014).

Most approaches that produce semantic graphs (see Koller et al. (2019) for an overview) model distributions over graphs directly (Dozat and Manning, 2018; Zhang et al., 2019; He and Choi, 2020; Cai and Lam, 2020), while others make use of derivation trees that compositionally evaluate to graphs (Groschwitz et al., 2018; Chen et al., 2018; Fancellu et al., 2019; Lindemann et al., 2019). AM dependency parsing belongs to the latter category.

3 AM dependency parsing

We begin by sketching the AM dependency parser of Groschwitz et al. (2018).
Groschwitz et al. (2018) use AM dependency trees to represent the compositional structure of a semantic graph. Each token is assigned a graph constant representing its lexical meaning; dependency labels correspond to operations of the Apply-Modify (AM) algebra (Groschwitz et al., 2017; Groschwitz, 2019), which combine graphs into bigger ones.

Figure 1: Elementary as-graphs $G_\mathrm{want}$, $G_\mathrm{writer}$, $G_\mathrm{sleep}$, and $G_\mathrm{soundly}$.

Fig. 2 illustrates how an AM dependency tree (a) evaluates to a graph (b), based on the graph constants in Fig. 1. Each graph constant is an as-graph, which means it has special node markers called sources, drawn in red, as well as a root
marked in bold. These markers are used to combine graphs with the algebra's operations. For instance, the MOD$_M$ operation in Fig. 2a combines the head $G_\mathrm{sleep}$ with its modifier $G_\mathrm{soundly}$ by plugging the root of $G_\mathrm{sleep}$ into the M-source of $G_\mathrm{soundly}$, see (c). That is, $G_\mathrm{soundly}$ has now modified $G_\mathrm{sleep}$ and (c) is our graph for sleep soundly. The other operation of the AM algebra, APP, models argument application. For example, the APP$_O$ operation in Fig. 2a plugs the root of (c) into the O-source of $G_\mathrm{want}$. Note that because $G_\mathrm{want}$ and (c) both have an S-source, APP$_O$ merges these nodes, see (d). The APP$_S$ operation then fills this S-source with $G_\mathrm{writer}$, attaching the graph with its root, to obtain the final graph in (b).
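The two operations can be made concrete with a small sketch. The following Python toy encoding of as-graphs (a node dictionary, an edge list, a root, and a source map) is our own illustration, not the authors' implementation, and it omits the request annotations introduced below:

```python
from dataclasses import dataclass

@dataclass
class AsGraph:
    nodes: dict    # node id -> label (None for unfilled source placeholders)
    edges: list    # (parent id, edge label, child id)
    root: str
    sources: dict  # source name, e.g. "S" -> node id

def _merge(g, h, fuse):
    """Union of g and h, fusing h-node u with g-node fuse[u] where given;
    assumes g's node ids never start with 'h:'."""
    ren = {u: fuse.get(u, "h:" + u) for u in h.nodes}
    nodes = dict(g.nodes)
    for u, lab in h.nodes.items():
        nodes[ren[u]] = nodes.get(ren[u]) or lab  # keep whichever side has a label
    return nodes, g.edges + [(ren[a], l, ren[b]) for a, l, b in h.edges]

def app(alpha, head, arg):
    """APP_alpha: plug the root of arg into the alpha-source of head;
    sources present in both graphs are merged (the S-sources in Fig. 2d)."""
    fuse = {arg.root: head.sources[alpha]}
    fuse.update({n: head.sources[s] for s, n in arg.sources.items()
                 if s in head.sources})
    nodes, edges = _merge(head, arg, fuse)
    return AsGraph(nodes, edges, head.root,
                   {s: n for s, n in head.sources.items() if s != alpha})

def mod(beta, head, modifier):
    """MOD_beta: plug the root of head into the beta-source of modifier;
    the result keeps the head's root and sources."""
    fuse = {modifier.sources[beta]: head.root}
    fuse.update({n: head.sources[s] for s, n in modifier.sources.items()
                 if s != beta and s in head.sources})
    nodes, edges = _merge(head, modifier, fuse)
    return AsGraph(nodes, edges, head.root, dict(head.sources))

# sleep soundly, as in Fig. 2c:
g_sleep = AsGraph({"e": "sleep-01", "x": None}, [("e", "ARG0", "x")], "e", {"S": "x"})
g_soundly = AsGraph({"m": "sound", "h": None}, [("h", "manner", "m")], "m", {"M": "h"})
print(mod("M", g_sleep, g_soundly).edges)  # sleep-01 now carries a manner edge
```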
Types. The $[S]$ annotation at the O-source of $G_\mathrm{want}$ is a request as to what the type of the O-argument of $G_\mathrm{want}$ should be. The type of an as-graph is the set of its sources with their request annotations, so the request $[S]$ means that the source set of the argument must be $\{S\}$. Because this is true for (c), the AM dependency tree is well-typed; otherwise the tree could not be evaluated to a graph. Thus, the graph constants lexically specify the semantic valency of each word as well as reentrancies due to e.g. control, like in this example.

If an as-graph has no sources, we say it has the empty type $[\,]$; if a source in a graph printed here has no annotation, it is assumed to have the empty request (i.e. its argument must have no sources). We write $\tau_G$ for the type of an as-graph $G$, and $\mathrm{req}_\alpha(\tau)$ for the request at source $\alpha$ of type $\tau$. For example, $\mathrm{req}_O(\tau_{G_\mathrm{want}}) = [S]$ and $\mathrm{req}_S(\tau_{G_\mathrm{want}}) = [\,]$. If an AM dependency (sub-)tree evaluates to a graph $G$, we call $\tau_G$ its term type. For example, the sub-tree in Fig. 2a rooted in sleep has term type $[S]$, since it evaluates to (c).

Below, we will build AM dependency trees by adding the outgoing edges of a node one by one; we track types there with the following notation. If $\tau_1$ and $\tau_2$ are the term types of AM dependency trees $t_1$, $t_2$ and $\ell$ is an operation of the AM algebra, we write $\ell(\tau_1, \tau_2)$ for the term type of the tree constructed by adding $t_2$ as an $\ell$-child to $t_1$, i.e. by adding an $\ell$-edge from the root of $t_1$ to the root of $t_2$ (if that tree is well-typed). Intuitively, one can see this as combining a graph of type $\tau_1$ (the head) with an argument or modifier of type $\tau_2$ using operation $\ell$; the result then has type $\ell(\tau_1, \tau_2)$.

(Note that when evaluating an AM dependency tree, the AM algebra restricts operation orders to ensure that every AM dependency tree evaluates to a unique graph. For instance, in Fig. 2, the APP$_O$ edge out of "wants" is always tacitly evaluated before the APP$_S$ edge. For details on this, we refer to Groschwitz et al. (2018) and Groschwitz (2019).)

Figure 2: (a) An AM dependency tree with its evaluation result (b), along with two partial results (c) and (d).

Groschwitz et al. (2018) approach graph parsing as first predicting a well-typed AM dependency tree for a sentence $w_1, \dots, w_n$ and then evaluating it deterministically to obtain the graph. They train a supertag- and edge-factored model, which predicts a supertag cost $c(G, i)$ for assigning a graph constant $G$ to the token $w_i$, as well as an edge cost $c(i \xrightarrow{\ell} j)$ for each potential edge from word $w_i$ to $w_j$ with label $\ell$. Tokens which are not part of the AM dependency tree, like the and to in Fig. 2a, are treated as if they were assigned the special graph constant $\bot$ and an incoming IGNORE edge $0 \xrightarrow{\textsc{ignore}} i$, where $0$ represents an artificial root. The root of the AM dependency tree (wants in the example) is modeled as having an incoming edge $0 \xrightarrow{\textsc{root}} i$.

An algorithm for AM dependency parsing searches for the well-typed AM dependency tree which minimizes the sum of supertag and edge costs. Finding the lowest-cost well-typed AM dependency tree for a given sentence is NP-complete. Groschwitz et al. define two approximate parsing algorithms, the 'fixed tree decoder' that fixes an unlabeled dependency tree first, and the 'projective decoder'. Our A* parser is based on the projective decoder and we focus on it here.
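The search objective that the decoders below optimize is easy to state explicitly. A minimal sketch, with hypothetical cost tables c_tag and c_edge standing in for the neural model's predictions:

```python
BOT = "<bot>"  # the special constant assigned to ignored tokens

def tree_cost(n, constants, edges, root, c_tag, c_edge):
    """constants: token -> graph constant (tokens 1..n, BOT if ignored);
    edges: list of (head, label, dependent); c_tag[(G, i)] and
    c_edge[(i, label, j)] stand in for the model's supertag and edge costs."""
    head_of = {j: (i, lab) for i, lab, j in edges}
    cost = 0.0
    for i in range(1, n + 1):
        cost += c_tag[(constants.get(i, BOT), i)]
        if i == root:
            cost += c_edge[(0, "ROOT", i)]    # 0 is the artificial root
        elif i in head_of:
            h, lab = head_of[i]
            cost += c_edge[(h, lab, i)]
        else:
            cost += c_edge[(0, "IGNORE", i)]  # ignored token: BOT constant + IGNORE edge
    return cost
```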
Projective decoder. The projective decoder circumvents the NP-completeness by searching for the best projective well-typed AM dependency tree. It derives parsing items using the schema (Shieber et al., 1995) shown in Fig. 3. (Originally only the fixed tree decoder used IGNORE and ROOT edge scores; we extend the projective decoder here for consistency.)

$$\frac{s = c(G, i) \quad G \neq \bot}{([i, i+1], i, \tau_G) : s}\ \text{Init}$$

$$\frac{([i, k], r, \tau) : s \quad s' = c(\bot, k) + c(0 \xrightarrow{\textsc{ignore}} k)}{([i, k+1], r, \tau) : s + s'}\ \text{Skip-R}$$

$$\frac{([i, k], r, \tau) : s \quad s' = c(\bot, i-1) + c(0 \xrightarrow{\textsc{ignore}} i-1)}{([i-1, k], r, \tau) : s + s'}\ \text{Skip-L}$$

$$\frac{([i, j], r_1, \tau_1) : s_1 \quad ([j, k], r_2, \tau_2) : s_2 \quad \tau = \ell(\tau_1, \tau_2) \text{ defined} \quad s_3 = c(r_1 \xrightarrow{\ell} r_2)}{([i, k], r_1, \tau) : s_1 + s_2 + s_3}\ \text{Arc-R}[\ell]$$

$$\frac{([i, j], r_1, \tau_1) : s_1 \quad ([j, k], r_2, \tau_2) : s_2 \quad \tau = \ell(\tau_2, \tau_1) \text{ defined} \quad s_3 = c(r_2 \xrightarrow{\ell} r_1)}{([i, k], r_2, \tau) : s_1 + s_2 + s_3}\ \text{Arc-L}[\ell]$$

$$\frac{([1, n+1], r, [\,]) : s \quad s' = c(0 \xrightarrow{\textsc{root}} r)}{([0, n+1], r, [\,]) : s + s'}\ \text{Root}$$

Figure 3: Rules for the projective and A* decoder.
Each item encodes properties of a partial AM dependency tree and has the form $([i, k], r, \tau) : s$, where $[i, k] = \{j \mid i \leq j < k\}$ is the span of word indices covered by the item, $r$ is the index of the head word, $\tau$ the type and $s$ the cost. The Init rule assigns a supertag $G$ to a word $w_i$. The Skip-R and Skip-L rules extend a span without changing the dependency derivation, effectively skipping a word by assigning it the $\bot$ supertag and drawing the corresponding IGNORE edge. Finally the Arc-R and Arc-L rules, for an AM operation $\ell$, combine two items covering adjacent spans by drawing an edge with label $\ell$ between their heads. Once the full chart is computed, i.e. all items are explored, a Viterbi algorithm yields the highest-scoring well-typed AM dependency tree.

The projective decoder has an asymptotic runtime of $O(n^5)$ in the sentence length $n$.
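To illustrate the items and the type check in the Arc rules, here is a sketch under a deliberately flat approximation of types (a type is reduced to its set of source names; the real algebra also matches requests, and MOD imposes further conditions on the modifier's type):

```python
from typing import NamedTuple, FrozenSet, Optional

Type = FrozenSet[str]   # flat stand-in for an AM type

class Item(NamedTuple):
    i: int        # item ([i, k], r, tau) : cost covers tokens i .. k-1
    k: int
    r: int        # head of the partial AM dependency tree
    tau: Type
    cost: float

def op_result(label: str, head: Type, arg: Type) -> Optional[Type]:
    """ell(tau1, tau2): APP_alpha consumes the alpha-source of the head;
    MOD_beta leaves the head type unchanged. None if undefined."""
    kind, _, src = label.partition("_")
    if kind == "APP":
        return head - {src} if src in head else None
    return head if kind == "MOD" else None

def arc_right(left: Item, right: Item, label: str, edge_cost: float) -> Optional[Item]:
    """Arc-R[ell] of Fig. 3: adjacent spans, the left head takes the right item
    as an ell-child; Arc-L is symmetric with the roles of the heads swapped."""
    assert left.k == right.i, "items must cover adjacent spans"
    tau = op_result(label, left.tau, right.tau)
    if tau is None:
        return None
    return Item(left.i, right.k, left.r, tau,
                left.cost + right.cost + edge_cost)
```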
Notation and terminology. Below, we assume that we obtained three fixed non-empty finite sets in training: a set $\Omega$ of types; a set $C$ of graph constants (the graph lexicon) such that $\Omega$ is the set of types of the graphs in $C$; and a set $L$ of operations, including ROOT and IGNORE. We write $S$ for the set of sources occurring in $C$ and assume that for every source $s \in S$, APP$_s \in L$. We write $\mathrm{Dom}(f)$ for the domain of a partial function $f$, i.e. the set of objects for which $f$ is defined.

4 A* parser

While the AM dependency parser yields strong accuracies across multiple graphbanks, Groschwitz et al.'s algorithms are quite slow in practice. For instance, the projective parser needs several hours to parse each test set in §6, which seriously limits its practical applicability. In this section, we will speed the projective parser up through A* search.
Our A* parser maintains an agenda of parse items of the projective parser. The agenda is initialized with the items produced by the Init rule. Then we iterate over the agenda. In each step, we take the item $I$ from the front of the agenda and apply the rules Skip-L and Skip-R to it. We also attempt to combine $I$ with all previously discovered items, organized in a parse chart, using the Arc-L and Arc-R rules. All items thus generated are added to the agenda and the chart. Parsing ends once we either take a goal item $([0, n+1], r, [\,])$ from the agenda, or (unsuccessfully) when the agenda becomes empty.

A* parsers derive their efficiency from their ability to order the items on the agenda effectively. They sort the agenda in ascending order of estimated cost $f(I) = c(I) + h(I)$, where $c$ is the cost derived for the item $I$ by the parsing rules in Fig. 3 and $h(I)$ is an outside estimate. The quantity $h(I)$ estimates the difference in cost between $I$ and the lowest-cost well-typed AM dependency tree $t$ that contains $I$. An outside heuristic is admissible if it is optimistic with respect to cost, i.e. $f(I) \leq c(t)$; in this case the parser is provably optimal, i.e. the first goal item which is dequeued from the agenda describes the lowest-cost parse tree. Tighter outside estimates lead to fewer items being taken off the agenda and thus to faster runtimes.

A first trivial, but admissible baseline lets $h(I) = 0$ for all items $I$. This ignores the outside part and orders items purely on their past cost. We could obtain a better outside heuristic by following Lewis and Steedman (2014) and summing up the cost of the lowest-cost supertag for each token outside of the item, i.e.

$$h([i, k], r, \tau) = \sum_{j \notin [i, k]} \min_G c(G, j).$$

This heuristic is admissible because each token will have some supertag selected (perhaps $\bot$) in a complete AM dependency tree, and its cost will be equal or higher than that of the best supertag.

Both of these outside heuristics ignore the fact that the cost of a tree consists not only of the cost for the supertags, but also of the cost for the edges. We can obtain tighter heuristics by taking the edges into account. Observe first that the parse item determines the supertags and edges within its substring, and has designated one of the tokens as the root of the subtree it represents. For all tokens outside of the span of the item, the best parse tree will assign both a supertag to the token (potentially $\bot$) and an incoming edge (potentially with edge label ROOT or IGNORE). Thus, we obtain an admissible edge-based heuristic by adding the lowest-cost incoming edge for each outside token as follows:

$$h([i, k], r, \tau) = \sum_{j \notin [i, k]} \Big( \min_G c(G, j) + \min_{o \xrightarrow{\ell} j} c(o \xrightarrow{\ell} j) \Big)$$

Observe finally that the edge-based heuristic is still overly optimistic, in that it assumes that arbitrarily many nodes in the tree may have incoming ROOT edges (when it needs to be exactly one), and that the choices of IGNORE and $\bot$ are independent (when a node should have an incoming IGNORE edge if and only if its supertag is $\bot$). We can optimize it into the ignore-aware outside heuristic by restricting the min operations so they respect these constraints.
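The agenda loop itself is standard A*; a sketch, assuming items shaped as in the previous listing (the function names and the prefix-sum trick are ours):

```python
import heapq, itertools

def a_star(init_items, expand, h, is_goal):
    """Agenda loop of Section 4. expand(item, chart) applies Skip-L/R and
    Arc-L/R against the chart; if h is admissible, the first goal item
    dequeued is optimal."""
    tie = itertools.count()  # break priority ties between items
    agenda = [(it.cost + h(it), next(tie), it) for it in init_items]
    heapq.heapify(agenda)
    chart = []
    while agenda:
        _, _, item = heapq.heappop(agenda)
        if is_goal(item):
            return item
        chart.append(item)
        for new in expand(item, chart):
            heapq.heappush(agenda, (new.cost + h(new), next(tie), new))
    return None  # agenda exhausted: no well-typed parse

def supertag_heuristic(n, best_tag):
    """Outside estimate of Lewis and Steedman (2014): best_tag[j] = min_G c(G, j)
    for tokens 1..n, precomputed. The edge-based variant would additionally add
    each outside token's cheapest incoming edge."""
    prefix = [0.0]
    for j in range(1, n + 1):
        prefix.append(prefix[-1] + best_tag[j])
    # sum over j outside [item.i, item.k): total minus the inside portion
    return lambda item: prefix[n] - (prefix[item.k - 1] - prefix[item.i - 1])
```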
5 Transition-based parsing

As we will see in §6, the A* parser is very efficient on the DM, PAS, and PSD corpora but still slow on EDS and AMR. Therefore, we develop a novel transition-based algorithm for AM dependency parsing. Inspired by the syntactic dependency parser of Ma et al. (2018), it builds the dependency tree top-down, starting at the root and recursively adding outgoing edges to nodes. However, for AM dependency parsing we face an additional challenge: we must assign a type to each node and ensure that the overall AM dependency tree is well-typed.

We will first introduce some notation (§5.1), then introduce three versions of our parsing schema (§5.2-§5.4), give theoretical guarantees (§5.5) and define the neural model (§5.6).

5.1 Apply sets
The transition-based parser chooses a graph constant $G_i$ for each token $w_i$; we call its type, $\tau_{G_i}$, the lexical type $\lambda$ of $w_i$. As we add outgoing edges to $i$, each outgoing APP$_\alpha$ operation consumes the $\alpha$-source of the lexical type. To produce a well-typed AM dependency tree of term type $\tau$, the sources of outgoing APP edges at $i$ must correspond to exactly the apply set $A(\lambda, \tau)$, which is defined as the set $O = \{o_1, \dots, o_n\}$ of sources such that

$$\textsc{App}_{o_n}(\dots \textsc{App}_{o_2}(\textsc{App}_{o_1}(\lambda, \tau_1), \tau_2), \dots, \tau_n) = \tau$$

for some types $\tau_1, \dots, \tau_n$. That is, the apply set $A(\lambda, \tau)$ is the set of sources we need to consume to turn $\lambda$ into $\tau$.

Note that there are pairs of types for which no such set of sources exists; e.g. the apply set $A([\,], [s])$ is not defined. In that case, we say that $[s]$ is not apply-reachable from $[\,]$; the term type must always be apply-reachable from the lexical type in a well-typed tree.

5.2 A first transition system

We are now ready to define a first version of the transition system for our parser. The parser builds a dependency tree top-down and manipulates parser configurations to track parsing decisions and ensure well-typedness.

A parser configuration $\langle E, T, A, G, S \rangle$ consists of four partial functions $E, T, A, G$ that map each token $i$ to the following:

E(i): the labeled incoming edge of $i$, written $j \xrightarrow{\ell} i$, where $j$ is the head and $\ell$ the label;
T(i): the set of possible term types at $i$;
A(i): the sources of outgoing APP edges at $i$, i.e. which sources of the apply set we have covered;
G(i): the graph constant at $i$.

These functions are partial, i.e. they may be undefined for some nodes. $S$ is a stack of nodes that potentially still need children; we call the node on top of $S$ the active node.

The initial configuration is $\langle \emptyset, \emptyset, \emptyset, \emptyset, \emptyset \rangle$. A goal configuration has an empty stack and for all tokens $i$, it holds either that $i$ is ignored and thus has no incoming edge, or that for some type $\tau$ and graph $G$, $T(i) = \{\tau\}$, $G(i) = G$ and $A(i) = A(\tau_G, \tau)$, i.e. $A(i)$ must be the apply set for the lexical type $\tau_G$ and the term type $\tau$. There must be at least one token that is not ignored.

The transition rules below read as follows: everything above the line denotes preconditions on when the transition can be applied; for example, that $T$ must map node $i$ to some set $T$ of types. The transition rule then updates the configuration by adding what is specified below the line. An example run is shown in Fig. 4.
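Before stating the rules, the following sketch fixes a toy representation for these definitions: a lexical type maps each source to its request (nested only one level, and apply_set does not check requests), and a Configuration bundles the four partial functions and the stack. All names are our own illustration:

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, List, Optional, Set, Tuple

FlatType = FrozenSet[str]       # a term type: the set of its open sources
LexType = Dict[str, FlatType]   # lexical type: source -> request, e.g. [s, o[s]]:
G_WANT: LexType = {"s": frozenset(), "o": frozenset({"s"})}

def apply_set(lam: LexType, tau: FlatType) -> Optional[Set[str]]:
    """A(lam, tau): the APP sources that turn lam into tau, or None if tau
    is not apply-reachable from lam (requests are not compared here)."""
    return set(lam) - tau if tau <= set(lam) else None

@dataclass
class Configuration:
    E: Dict[int, Tuple[int, str]] = field(default_factory=dict)  # incoming edge of i
    T: Dict[int, List[FlatType]] = field(default_factory=dict)   # possible term types
    A: Dict[int, Set[str]] = field(default_factory=dict)         # APP sources used at i
    G: Dict[int, str] = field(default_factory=dict)              # graph constant at i
    S: List[int] = field(default_factory=list)                   # stack; S[-1] is active

def is_goal(c: Configuration, lex: Dict[str, LexType]) -> bool:
    """Goal: empty stack, at least one finished token, and every token with an
    incoming edge has a unique term type whose apply set is exactly covered."""
    return not c.S and bool(c.G) and all(
        i in c.G and len(c.T[i]) == 1
        and c.A.get(i) == apply_set(lex[c.G[i]], c.T[i][0])
        for i in c.E)
```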
INIT. An INIT($i$) transition is always the first transition and makes $i$ the root of the tree: starting from the initial configuration, it adds the edge $0 \xrightarrow{\textsc{root}} i$, sets $T(i) = \{[\,]\}$, and pushes $i$ onto the stack $S$. Fixing the term type as $[\,]$ ensures that the overall evaluation result has no unfilled sources left.

CHOOSE. If we have not yet chosen a graph constant for the active node, we assign one with the CHOOSE($\tau, G$) transition: it requires that $T$ maps the active node $i$ to some set $T$ of types and that $i \notin \mathrm{Dom}(G)$; it then sets $T(i) = \{\tau\}$, $A(i) = \emptyset$ and $G(i) = G$, leaving the stack unchanged. This transition may only be applied if the specific term type $\tau \in T$ is apply-reachable from the newly selected lexical type $\tau_G$. The CHOOSE operation is the only operation allowed when the active node does not have a graph constant yet; therefore, it always determines the lexical type of $i$ first, before any outgoing edges are added.

APPLY. Once the term type $\tau$ and graph $G$ of the active node $i$ have been chosen, the APPLY($\alpha, j$) operation can draw an APP$_\alpha$ edge to a node $j$ that has no incoming edge, adding $j$ to the stack: it requires $j \notin \mathrm{Dom}(E)$, $T(i) = \{\tau\}$, $A(i) = A$ and $G(i) = G$; it then adds the edge $i \xrightarrow{\textsc{app}_\alpha} j$, sets $T(j) = \{\mathrm{req}_\alpha(\tau_G)\}$, updates $A(i) = A \cup \{\alpha\}$ and pushes $j$ onto the stack. Here $\alpha$ must be a source in the apply set $A(\tau_G, \tau)$ but not in $A(i)$, i.e. be a source of $G$ that still needs to be filled. Fixing the term type of $j$ ensures the type restriction of the APP$_\alpha$ operation.

MODIFY. In contrast to outgoing APP edges, which are determined by the apply set, we can add arbitrary outgoing MOD edges to the active node $i$. This is done with the transition MODIFY($\beta, j$), which draws a MOD$_\beta$ edge to a token $j$ that has no incoming edge, also adding $j$ to the stack: it requires $j \notin \mathrm{Dom}(E)$, $T(i) = \{\tau\}$, $A(i) = A$ and $G(i) = G$; it then adds the edge $i \xrightarrow{\textsc{mod}_\beta} j$, sets $T(j) = T'$ and pushes $j$ onto the stack. We require that $T'$ is the set of all types $\tau' \in \Omega$ such that all sources in $\tau'$ (except $\beta$) including their requests are already present in $\tau_G$, and $\mathrm{req}_\beta(\tau') = [\,]$, reflecting constraints on the MOD operation in Groschwitz (2019).
E T A G S
Transition ∅ ∅ ∅ ∅ [] ROOT −−−→ wants wants (cid:55)→ { [ ] } NIT wants (cid:55)→ ∅ wants (cid:55)→ G want HOOSE [ ] , (cid:104) G want , [ s, o [ s ]] (cid:105) wants A PP s −−−→ writer writer (cid:55)→ { [ ] } wants (cid:55)→ { s } PPLY s, 25 writer (cid:55)→ ∅ writer (cid:55)→ G writer HOOSE [ ] , (cid:104) G writer , [ ] (cid:105) OP wants A PP o −−−→ sleep sleep (cid:55)→ { [ s ] } wants (cid:55)→ { s, o } PPLY o, 58 sleep (cid:55)→ ∅ sleep (cid:55)→ G sleep HOOSE [ s ] , (cid:104) G sleep , [ s ] (cid:105) sleep MOD m −−−−→ soundly soundly (cid:55)→ { [ m ] , [ s, m ] } ODIFY m, 610 soundly (cid:55)→ { [ m ] } soundly (cid:55)→ ∅ soundly (cid:55)→ G soundly HOOSE [ m ] , (cid:104) G soundly , [ m ] (cid:105) [] × P OP Figure 4: Derivation with LTF of the AM dependency tree in Fig. 2. The steps show only what changed for E , T , A and G ; the stack S is shown in full. The chosen graph constants are annotated with their lexical types. P OP . The P OP transition decides that an activenode that has all of its A PP edges will not receiveany further outgoing edges, and removes it fromthe stack. i (cid:55)→ { τ } i (cid:55)→ A ( τ G , τ ) i (cid:55)→ G σ | iσ While the above parser guarantees well-typednesswhen it completes, it can still get stuck. This is be-cause when we C
5.3 Avoiding dead ends

While the above parser guarantees well-typedness when it completes, it can still get stuck. This is because when we CHOOSE a term type $\tau$ and lexical type $\lambda$ for a node, we must perform APPLY transitions for all sources in their apply set $A(\lambda, \tau)$ to reach a goal configuration. But every APPLY transition adds an incoming edge to a token that did not have one before; if our choices for lexical and term types require more APPLY transitions than there are tokens without incoming edge left, the parser cannot reach a goal configuration.

To avoid this situation, we track for each configuration $c$ the difference $W_c - O_c$ of the number $W_c$ of tokens without an incoming edge and the number $O_c$ of APPLY transitions we 'owe' to fill all sources. $O_c$ is obtained by summing across all tokens $i$ the number $O_c(i)$ of APP children $i$ still needs. To generalize to cases in §5.4 where we may not yet know the graph constant for $i$, we let $K_c(i) = \{\tau_{G_c(i)}\}$ if $i \in \mathrm{Dom}(G_c)$ and $K_c(i) = \Omega$ otherwise. That is, if the graph constant $G_c(i)$ is not yet defined, we assume we can choose it freely later. Then we can define

$$O_c(i) = \min_{\lambda \in K_c(i),\, \tau \in T_c(i)} |A(\lambda, \tau) - A_c(i)|,$$

i.e. $O_c(i)$ is the minimal number of sources we need in addition to the ones already covered in $A_c(i)$ in order to cover the apply set $A(\lambda, \tau)$, assuming we choose the lexical type $\lambda$ and term type $\tau$ optimally within the current constraints. If $T$ or $A$ is not defined for $i$, we let $O_c(i) = 0$.

Finally, given a type $\tau$, an upper bound $n$, and a set $A$ of already-covered sources, we let $\mathrm{PossL}(\tau, A, n)$ be the set of lexical types $\lambda$ such that $A \subseteq A(\lambda, \tau)$ and we can reach $\tau$ from $\lambda$ with APP operations for the sources in $A$ and at most $n$ additional APP operations, i.e. $|A(\lambda, \tau) - A| \leq n$.

We prevent dead ends (see §5.5) by requiring that CHOOSE($\tau, G$) can only be applied to a configuration $c$ if $\tau_G \in \mathrm{PossL}(\tau, \emptyset, W_c - O_c)$. Then $\tau$ is apply-reachable from $\tau_G$ with at most $W_c - O_c$ APPLY transitions; this is exactly as many as we can spare. The MODIFY transition reduces the number of tokens that have no incoming edge without performing an APPLY transition, so we only allow it when we have tokens 'to spare', i.e. $W_c - O_c \geq 1$.
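In the flat-type setting of the earlier sketches, this bookkeeping is a few lines (omega_lex is a hypothetical set of lexical types standing in for $\Omega$):

```python
def words_left(c, n):
    """W_c: tokens 1..n that have no incoming edge yet."""
    return sum(1 for i in range(1, n + 1) if i not in c.E)

def owed(c, lex, omega_lex):
    """O_c: summed over tokens with T and A defined, the minimal number of
    APP children still owed, optimizing over K_c(i) and T_c(i)."""
    total = 0
    for i, taus in c.T.items():
        if i not in c.A:
            continue                                      # O_c(i) = 0 otherwise
        lams = [lex[c.G[i]]] if i in c.G else omega_lex   # K_c(i)
        total += min((len(a - c.A[i])
                      for lam in lams for tau in taus
                      for a in [apply_set(lam, tau)] if a is not None),
                     default=0)
    return total

def poss_l(tau, covered, budget, omega_lex):
    """PossL(tau, A, n): lexical types lam with A ⊆ A(lam, tau) and at most
    `budget` additional APP operations needed."""
    out = []
    for lam in omega_lex:
        a = apply_set(lam, tau)
        if a is not None and covered <= a and len(a - covered) <= budget:
            out.append(lam)
    return out
```

The guard on CHOOSE($\tau, G$) is then, in these terms, that lex[G] is in poss_l(tau, set(), words_left(c, n) - owed(c, lex, omega_lex), omega_lex).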
5.4 Choosing the lexical type last

The lexical type first transition system chooses the graph constant for a token early, and then chooses outgoing APP edges that fit the lexical type. But of course the decisions on lexical type and outgoing edges interact. Thus we also consider a transition system in which the lexical type is chosen after deciding on the outgoing edges.
APPLY and MODIFY. We modify the APPLY and MODIFY operations from §5.3 such that they no longer assign term types to children and do not push the child on the stack. This allows the transition system to add outgoing edges to the active node $i$ without committing to types. The APPLY($\alpha, j$) transition becomes: it requires $j \notin \mathrm{Dom}(E)$, $T(i) = T$ and $A(i) = A$; it then adds the edge $i \xrightarrow{\textsc{app}_\alpha} j$ and updates $A(i) = A \cup \{\alpha\}$, leaving the stack unchanged.

Because we do not yet know the types for $i$ and thus neither the apply set $A(\lambda, \tau)$, we cannot directly check that this APPLY transition will not lead to a dead end. Instead, we check if there are possible types $\tau$ and $\lambda$ with $\alpha$ in their apply set, by requiring that $\bigcup_{\tau \in T} \mathrm{PossL}(\tau, A \cup \{\alpha\}, W_c - 1)$ is non-empty (it is $W_c - 1$ to account for the edge we are about to add). We also keep the restriction that $\alpha \notin A$, to avoid duplicate APP$_\alpha$ edges.

The MODIFY($\beta, j$) transition becomes: it requires $j \notin \mathrm{Dom}(E)$; it then adds the edge $i \xrightarrow{\textsc{mod}_\beta} j$, leaving the stack unchanged. Again, we only allow it when we have tokens 'to spare', i.e. $W_c - O_c \geq 1$.
FINISH. We then replace CHOOSE and POP with a single transition FINISH($G$), which selects an appropriate graph constant $G$ for the active node $i$ and pops $i$ off the stack, such that no more edges can be added. Suppose $i$ has outgoing edges $i \xrightarrow{\textsc{app}_{\alpha_k}} j_k$ and $i \xrightarrow{\textsc{mod}_{\beta_k}} l_k$, with $T(i) = T$ and $A(i) = A$. FINISH($G$) is allowed if $A(\tau_G, \tau) = A$ for some $\tau \in T$; it fixes this $\tau$ as the term type, i.e. sets $T(i) = \{\tau\}$ and $G(i) = G$. In addition, FINISH pushes the child nodes $j_k$ of all $s \geq 0$ outgoing APP edges onto the stack, fixes their term types as $T_k = \{\mathrm{req}_{\alpha_k}(\tau_G)\}$ (like in APPLY of §5.2) and sets $A(j_k) = \emptyset$. Similarly, FINISH also pushes the child nodes $l_k$ of all $r \geq 0$ outgoing MOD edges onto the stack, with $A(l_k) = \emptyset$, and computes their term type sets $T'_k$ as in the MODIFY rule of §5.2. We push the children in the reverse order of when they were created, so that they are popped off the stack in the order the edges were drawn.

Finally, since CHOOSE no longer exists, we must set $A(i) = \emptyset$ during INIT. An example run is shown in Appendix F.
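Continuing the sketches above, a hedged rendering of FINISH; it relies on Python dictionaries preserving insertion order to recover the creation order of the edges:

```python
def finish(c, G, lex, omega):
    """LTL's FINISH: fix the constant of the active node last, then release its
    children onto the stack in reverse creation order."""
    i = c.S.pop()
    lam = lex[G]                        # flat lexical type: source -> request
    fitting = [t for t in c.T[i] if apply_set(lam, t) == c.A[i]]
    assert fitting, "A(tau_G, tau) must equal the APP edges drawn at i"
    c.T[i], c.G[i] = [fitting[0]], G
    kids = [(j, lab) for j, (h, lab) in c.E.items() if h == i]  # creation order
    for j, lab in kids:
        kind, _, src = lab.partition("_")
        if kind == "APP":
            c.T[j] = [lam[src]]         # as in APPLY of Section 5.2
        else:                           # MOD child: T' as in MODIFY of Section 5.2
            c.T[j] = [t for t in omega if src in t and t - {src} <= set(lam)]
        c.A[j] = set()
    for j, _ in reversed(kids):         # reverse push => popped in creation order
        c.S.append(j)
```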
5.5 Formal guarantees

We state the main correctness results here; proofs are in Appendix G. We assume for all types $\lambda \in \Omega$ and all sources $\alpha \in S$ that the type $\mathrm{req}_\alpha(\lambda)$ is also in $\Omega$, and that for every source $\beta$ with MOD$_\beta \in L$, the type $[\beta]$ is in $\Omega$. This allows us to select lexical types that do not require unexpected APP children.

Theorem 5.1 (Soundness).
Every goal configuration derived by LTF or LTL corresponds to a well-typed AM dependency tree.
Theorem 5.2 (Completeness).
For every well-typed AM dependency tree $t$, there are sequences of LTF and LTL transitions that build $t$.

Theorem 5.3 (No dead ends).
Every configuration derived by LTF or LTL can be completed to a goal configuration.
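The completeness proof in Appendix G is constructive: Algorithm 1 there emits an explicit transition sequence for any well-typed tree. A sketch of that oracle for LTF, over a hypothetical tree record (the field names are ours):

```python
def h_ltf(node):
    """Canonical LTF sequence for a well-typed subtree, mirroring Algorithm 1
    of Appendix G. `node` has fields index, constant, term_type,
    app_children: [(alpha, child)] and mod_children: [(beta, child)]."""
    yield ("CHOOSE", node.term_type, node.constant)
    for alpha, child in node.app_children:
        yield ("APPLY", alpha, child.index)
        yield from h_ltf(child)                 # finish the argument subtree
    for beta, child in node.mod_children:
        yield ("MODIFY", beta, child.index)
        yield from h_ltf(child)                 # finish the modifier subtree
    yield ("POP",)

def transitions(root):
    yield ("INIT", root.index)
    yield from h_ltf(root)
```

Applied to the tree of Fig. 2, this reproduces exactly the derivation of Fig. 4.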
5.6 Neural model

We train a neural model to predict LTF and LTL transitions, by extending Ma et al.'s stack-pointer model with means to predict graph constants and term types. We phrase AM dependency parsing as finding the most likely sequence $d^{(1)}, \dots, d^{(N)}$ of LTF or LTL transitions given an input sentence $x$, factorized as follows:

$$P_\theta(d^{(1)}, \dots, d^{(N)} \mid x) = \prod_{t=1}^{N} P_\theta(d^{(t)} \mid d^{(<t)}, x)$$

In LTF, when the CHOOSE transition is not allowed, we can draw an edge or POP. We score the target $j$ of the outgoing edge with the attention score $a^{(t)}_j$ and model the probability for POP with an artificial word at position 0 (using an attention score $a^{(t)}_0$). In other words, we have

$$P_\theta(\ell, j \mid h^{(t)}) = a^{(t)}_j \cdot P_\theta(\ell \mid h^{(t)}, \mathrm{tos} \to j) \qquad P_\theta(\textsc{Pop} \mid h^{(t)}) = a^{(t)}_0$$

where we score the edge label $\ell$ with a softmax:

$$P_\theta(\ell \mid h^{(t)}, \mathrm{tos} \to j) = \mathrm{softmax}(\mathrm{MLP}_\mathrm{lbl}([h^{(t)}, s_j]))_\ell.$$

In this situation, CHOOSE has probability 0. In LTL, we must decide between drawing an edge and FINISH; we score edges as in LTF and replace the probabilities for CHOOSE and POP with

$$P_\theta(\textsc{Finish}(G) \mid h^{(t)}) = a^{(t)}_0 \, P_\theta(G \mid h^{(t)})$$

where $P_\theta(G \mid h^{(t)})$ is as above.

Training. The training objective is MLE of $\theta$ on a corpus of AM dependency trees. There are usually multiple transition sequences that lead to the same AM dependency tree, so we follow Ma et al. and determine a canonical sequence by visiting the children in an inside-out manner.

Inference. During inference, we first decide whether we have to CHOOSE. If not, we divide each transition into two greedy decisions: we first decide, based on $a^{(t)}_i$, whether to FINISH/POP or whether to add an edge (and where); second we find the graph constant (in case of FINISH) or the edge label. To ensure well-typedness, we set the probability of forbidden transitions to 0.

Run-time complexity. The run-time complexity of the parser is $O(n^2)$: $O(n)$ transitions, each of which requires evaluating attention over $n$ tokens. The code is available at https://github.com/coli-saar/am-transition-parser.
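To illustrate the masked greedy inference, here is a sketch of a single LTL decoding step in PyTorch; the tensor names and shapes are our own illustration, not the released model's API:

```python
import torch

def greedy_step(attn, label_logits, const_logits, ok_targets, ok_consts):
    """One greedy LTL decision (Section 5.6): position 0 scores FINISH,
    position j > 0 scores an edge to token j. Boolean masks implement
    'probability 0 for forbidden transitions'."""
    a = torch.softmax(attn.masked_fill(~ok_targets, float("-inf")), dim=-1)
    j = int(a.argmax())
    if j == 0:  # FINISH: also choose a graph constant the type system allows
        g = const_logits.masked_fill(~ok_consts, float("-inf")).argmax()
        return ("FINISH", int(g))
    return ("EDGE", j, int(label_logits[j].argmax()))
```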
6 Evaluation

Data. We evaluate on the DM, PAS, and PSD graphbanks from the SemEval 2015 shared task on Semantic Dependency Parsing (SDP, Oepen et al. (2015)), the EDS corpus (Flickinger et al., 2017) and the AMRBank releases LDC2015E86, LDC2017T10 and LDC2020T02 (Banarescu et al., 2013). We use the AM dependency tree decompositions of these corpora from Lindemann et al. (2019) (L'19 for short) as training data, as well as their pre- and post-processing pipeline (including the AMR post-processing bugfix published after submission). We use the same hyperparameters and hardware for all experiments (see Appendices B and C).

Baselines. We compare against the fixed tree and projective decoders of Groschwitz et al. (2018), using costs computed by the model of L'19. For the projective decoder we train with the edge existence loss recommended by Groschwitz et al. (2018). The models use pretrained BERT embeddings (Devlin et al., 2019) without finetuning.

Table 1 compares the parsing accuracy of the A* parser (with the cost model of the projective parser) across the six graphbanks (averaged over 4 training runs of the model), with the Init rule restricted to the six lowest-cost graph constants per token. We only report one accuracy for A* because A* search is optimal, and thus the accuracies with different admissible heuristics are the same. As expected, the accuracy is on par with L'19's parser; it is slightly degraded on DM, EDS and AMR, perhaps because these graphbanks require non-projective AM dependency trees for accurate parsing.

Table 1: Semantic parsing accuracies (id = in-domain test set; ood = out-of-domain test set). ♠ marks models using BERT. L'19 are results of Lindemann et al. (2019) with fixed tree decoder (incl. post-processing bugfix). FG'20 is Fernández-González and Gómez-Rodríguez (2020).

Parsing times are shown in Table 2 as tokens per second. We limit the number of items that can be dequeued from the agenda to one million per sentence. This makes two sentences per AMR test set unparseable; they are given dummy single-node graphs for the accuracy evaluation. The A* parser is significantly faster than L'19's fixed-tree decoder; even more so than the projective decoder on which it is based, with a 10x to 1000x speedup. Each SDP test set is parsed in under a minute. The speed of the A* parser is very sensitive to the accuracy of the supertagging model: if the parser takes many supertags for a token off the agenda before it finds the goal item for a well-typed tree, it will typically dequeue many items altogether. On the SDP corpora, the supertagging accuracy on the dev set is above 90%; here even the trivial heuristic is fast because it simply dequeues the best supertag for most tokens. On AMR, the supertagging accuracy drops to 78%; as a consequence, the A* parser is slower overall, and the more informed heuristics yield a higher speedup. EDS is an outlier, in that the supertagging accuracy is 94%, but the parser still dequeues almost three supertags per token on average. Why this is the case bears further study.

 | DM | PAS | PSD | EDS | A15 | A17 | A20
projective, L'19♠ costs | 3 | 2 | 4 | 4 | < | < | <
fixed tree♠ | 710 | 97 | 265 | 542 | < | < | <
A*♠, trivial | 706 | 2096 | 1235 | 105 | < | < | <
A*♠, edge-based | 725 | 2105 | 1421 | 129 | < | < | <
A*♠, ignore-aware | 712 | 2167 | 1318 | 136 | < | < | <
LTL♠, CPU, greedy | 1,094 | 913 | 1,126 | 968 | 879 | 962 | 865
LTL♠, beam=3 | 241 | 203 | 231 | 217 | 217 | 205 | 198
LTF♠, CPU, greedy | 852 | 791 | 688 | 673 | 563 | 424 | 514
LTF♠, beam=3 | 145 | 123 | 96 | 108 | 100 | 76 | 78

Table 2: Avg. parsing speed in tokens/s on test sets. < indicates where parsing was interrupted due to timeout.

To evaluate the transition-based parser, we extract the graph lexicon and the type set $\Omega$ from the training and development sets such that $\Omega$ includes all lexical types and term types used. We establish the assumptions of §5.5 by automatically adding up to 14 graph constants per graphbank, increasing the graph lexicon by less than 1%.

The LTL parser is accurate with greedy search and parses each test set in under a minute on the CPU and within 20 seconds on the GPU (see Lindemann (2020) for the GPU implementation). Since the BERT embeddings take considerable time to compute, parsing without BERT leads to a parsing speed of up to 10,000 tokens per second (see Appendix A).
With beam search, LTL considerably outperforms L'19 on AMR, matching the accuracy of the fast parser of Zhang et al. (2019) on AMR 17 while outperforming it by up to 3.3 points F-score on DM. On the other graphbanks, LTL is on par with L'19. When evaluated without BERT, LTL outperforms L'19 by more than 1 point F-score on most graphbanks (see Appendix A).

The LTF parser is less accurate than LTL. Beam search reduces or even closes the gap, perhaps because it can select a better graph constant from the beam after selecting edges.

Note that accuracy drops drastically for a variant of LTL which does not enforce type constraints ("LTL, no types") because up to 50% of the predicted AM dependency trees are not well-typed and cannot be evaluated to a graph. The neural model does not learn to reliably construct well-typed trees by itself; the type constraints are crucial.

Overall, the accuracy of LTL is very similar to L'19, except for AMR where LTL is better. We investigated this difference in performance on AMR 17 and found that LTL achieves higher precision but its recall is worse for longer sentences. (For both parsers we model the dependence of recall on sentence length with linear regression; the slopes of the two models are significantly different.) We suspect this is because LTL is not explicitly penalized for leaving words out of the dependency tree and thus favors shorter transition sequences.

7 Conclusion

We have presented two fast and accurate algorithms for AM dependency parsing: an A* parser which optimizes Groschwitz et al.'s projective parser, and a novel transition-based parser which builds an AM dependency tree top-down while avoiding dead ends.

The parsing speed of the A* parser differs dramatically for the different graphbanks. In contrast, the parsing speed with the transition systems is less sensitive to the graphbank and faster overall. The transition systems also achieve higher accuracy.

In future work, one could make the A* parser more accurate by extending it to non-projective dependency trees, especially on DM, EDS and AMR. The transition-based parser could be made more accurate by making bottom-up information available to its top-down choices, e.g. with Cai and Lam's (2020) "iterative inference" method. It would also be interesting to see if our method for avoiding dead ends can be applied to other formalisms with complex symbolic restrictions.

Acknowledgments. We thank the anonymous reviewers and the participants of the DELPH-IN Summit 2020 for their helpful feedback and comments. We thank Rezka Leonandya for his work on an earlier version of the A* parser. This research was funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation), project KO 2916/2-2.

References

Laura Banarescu, Claire Bonial, Shu Cai, Madalina Georgescu, Kira Griffitt, Ulf Hermjakob, Kevin Knight, Philipp Koehn, Martha Palmer, and Nathan Schneider. 2013. Abstract Meaning Representation for sembanking. In Proceedings of the 7th Linguistic Annotation Workshop and Interoperability with Discourse.

Tatiana Bladier, Jakub Waszczuk, Laura Kallmeyer, and Jörg Hendrik Janke. 2019. From partial neural graph-based LTAG parsing towards full parsing. Computational Linguistics in the Netherlands Journal, 9:3–26.

Jan Buys and Phil Blunsom. 2017. Robust incremental neural semantic graph parsing. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics.

Deng Cai and Wai Lam. 2020. AMR parsing via graph-sequence iterative inference. In Proceedings of the ACL.
Shu Cai and Kevin Knight. 2013. Smatch: an evaluation metric for semantic feature structures. In Proceedings of the ACL.

Yufei Chen, Weiwei Sun, and Xiaojun Wan. 2018. Accurate SHRG-based semantic parsing. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 408–418, Melbourne, Australia. Association for Computational Linguistics.

Marco Damonte, Shay B. Cohen, and Giorgio Satta. 2017. An incremental parser for Abstract Meaning Representation. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.

Timothy Dozat and Christopher D. Manning. 2018. Simpler but more accurate semantic dependency parsing. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics.

Rebecca Dridan and Stephan Oepen. 2011. Parser evaluation using elementary dependency matching. In Proceedings of the 12th International Conference on Parsing Technologies, pages 225–230.

Chris Dyer, Adhiguna Kuncoro, Miguel Ballesteros, and Noah A. Smith. 2016. Recurrent neural network grammars. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.

Federico Fancellu, Sorcha Gilroy, Adam Lopez, and Mirella Lapata. 2019. Semantic graph parsing with recurrent neural network DAG grammars. In Proceedings of the EMNLP-IJCNLP.

Daniel Fernández-González and Carlos Gómez-Rodríguez. 2020. Transition-based semantic dependency parsing with pointer networks. In Proceedings of the ACL.

Dan Flickinger, Jan Hajič, Angelina Ivanova, Marco Kuhlmann, Yusuke Miyao, Stephan Oepen, and Daniel Zeman. 2017. Open SDP 1.2. LINDAT/CLARIN digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University.

Jonas Groschwitz. 2019. Methods for taking semantic graphs apart and putting them back together again. Ph.D. thesis, Macquarie University and Saarland University.

Jonas Groschwitz, Meaghan Fowlie, Mark Johnson, and Alexander Koller. 2017. A constrained graph algebra for semantic parsing with AMRs. In Proceedings of the 12th International Conference on Computational Semantics (IWCS).

Jonas Groschwitz, Matthias Lindemann, Meaghan Fowlie, Mark Johnson, and Alexander Koller. 2018. AMR dependency parsing with a typed semantic algebra. In Proceedings of the ACL.

Han He and Jinho Choi. 2020. Establishing strong baselines for the new decade: Sequence tagging, syntactic and semantic parsing with BERT. In The Thirty-Third International Flairs Conference.

Eliyahu Kiperwasser and Yoav Goldberg. 2016. Simple and accurate dependency parsing using bidirectional LSTM feature representations. Transactions of the Association for Computational Linguistics, 4:313–327.

Dan Klein and Christopher D. Manning. 2003. A* parsing: fast exact Viterbi parse selection. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology.

Alexander Koller, Stephan Oepen, and Weiwei Sun. 2019. Graph-based meaning representations: Design and processing. Tutorial at ACL 2019.

Mike Lewis and Mark Steedman. 2014. A* CCG parsing with a supertag-factored model. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 990–1000, Doha, Qatar. Association for Computational Linguistics.
Matthias Lindemann. 2020. Fast transition-based AM dependency parsing with well-typedness guarantees. MSc thesis, Saarland University.

Matthias Lindemann, Jonas Groschwitz, and Alexander Koller. 2019. Compositional semantic parsing across graphbanks. In Proceedings of the ACL.

Xuezhe Ma, Zecong Hu, Jingzhou Liu, Nanyun Peng, Graham Neubig, and Eduard Hovy. 2018. Stack-pointer networks for dependency parsing. In Proceedings of the ACL.

Joakim Nivre. 2008. Algorithms for deterministic incremental dependency parsing. Computational Linguistics, 34(4):513–553.

Stephan Oepen, Marco Kuhlmann, Yusuke Miyao, Daniel Zeman, Silvie Cinková, Dan Flickinger, Jan Hajič, and Zdeňka Urešová. 2015. SemEval 2015 task 18: Broad-coverage semantic dependency parsing. In Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015).

Tianze Shi and Lillian Lee. 2018. Valency-augmented dependency parsing. In Proceedings of the EMNLP.

Stuart Shieber, Yves Schabes, and Fernando Pereira. 1995. Principles and implementation of deductive parsing. Journal of Logic Programming, 24(1–2):3–36.

Gisle Ytrestøl. 2011. Optimistic backtracking: a backtracking overlay for deterministic incremental parsing. In Proceedings of the ACL 2011 Student Session. Association for Computational Linguistics.

Sheng Zhang, Xutai Ma, Kevin Duh, and Benjamin Van Durme. 2019. Broad-coverage semantic parsing as transduction. In Proceedings of the EMNLP-IJCNLP.

Yue Zhang and Stephen Clark. 2011. Shift-reduce CCG parsing. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1, HLT '11, USA.

A Additional experiments and dev accuracies

Table 3 shows the results of further experiments (means and standard deviations over 4 runs). Models that do not use BERT use GloVe embeddings of size 200. Note that we use the pre- and post-processing of Lindemann et al. (2019) in the most recent version, which fixes a bug in AMR post-processing (see https://github.com/coli-saar/am-parser).

For each model trained, Table 4 shows the performance of one run on the development set. Table 6 shows F-scores of different versions of Smatch on the AMR test sets. See also Appendix E.

B Hardware and parsing experiments

All parsing experiments were performed on Nvidia Tesla V100 graphics cards and Intel Xeon Gold 6128 CPUs running at 3.40 GHz.

We measure run-time as the sum of the GPU time and the CPU time on a single core for all approaches. When computing scores for A*, we use a batch size of 512 for all graphbanks but AMR, where we use a batch size of 128. We use a batch size of 64 for LTL and LTF for parsing on the CPU. The transition probabilities are computed on the GPU and then transferred to the main memory. In the parsing experiments with LTL where the transition system is run on the GPU as well, we use a batch size of 512, except for AMR, for which we use a batch size of 256.

The A* algorithm is implemented in Java and was run on the GraalVM 20 implementation of the JVM.

We run the projective parser and the fixed tree parser of Groschwitz et al. (2018) with the 6 best supertags. When parsing with the fixed tree parser is not completed with $k$ supertags within 30 minutes, we retry with $k - 1$ supertags. If $k = 0$, we use a dummy graph with a single node.

LTL and LTF are implemented in Python and run on CPython version 3.8.
C Hyperparameters and training details

C.1 Scores for A*

We obtain the scores by training the parser of Lindemann et al. (2019). Since Groschwitz et al. (2018) argue that a hinge loss such as the one that L'19 use might not be well-suited for the projective parser, we replaced it by the log-likelihood loss of Groschwitz et al. (2018). The development metric based on which the model is chosen is the arithmetic mean between supertagging accuracy and labeled attachment score.

We follow Lindemann et al. (2019) in the hyperparameters, with two exceptions: we use a batch size of 32 instead of 64 because of memory constraints, and we add a character CNN to the model to make it more comparable with the model of the transition systems; see below for its hyperparameters. In order to tease apart the impact of the character CNN, we include the performance of a L'19 model with the character CNN in Table 3. Differences are within one standard deviation of the results obtained with the original architecture used in L'19.

Table 3: Full details of accuracies of parsers we have trained (id = in-domain test set; ood = out-of-domain test set). ♠ marks models using BERT. L'19 is Lindemann et al. (2019) with fixed tree decoder.

Table 4: Results on development sets. ♠ marks models using BERT. L'19 is Lindemann et al. (2019) with fixed tree decoder.

 | DM | PAS | PSD | EDS | AMR 15 | AMR 17 | AMR 20
LTL, greedy | 1,180 | 1,128 | 1,288 | 1,154 | 1,121 | 1,162 | 1,148
LTL, beam=3 | 257 | 229 | 243 | 224 | 234 | 222 | 210
LTF, greedy | 957 | 908 | 755 | 752 | 672 | 578 | 572
LTF, beam=3 | 153 | 126 | 97 | 113 | 104 | 91 | 81

Table 5: Avg. parsing speed of transition systems in tokens/s on (in-domain) test sets without BERT. For results with BERT, see main paper.

Table 6: Results on AMR test sets with different versions of Smatch. L'19 F is the version that was used by Lindemann et al.
(2019) and new F is version 1.0.4.

C.2 LTL and LTF

We set the hyperparameters manually without extensive hyperparameter search, mostly following Ma et al. (2018). We followed Lindemann et al. (2019) for the number of hidden units and dropout in the MLPs for predicting graph constants and for the size of embeddings.

We follow Lindemann et al. (2019) in splitting the prediction of a graph constant into predicting a delexicalized graph constant and a lexical label.

We train all LTL and LTF models for 100 epochs with Adam using a batch size of 64. We follow Ma et al. (2018) in the settings of Adam's $\beta_1, \beta_2$ and the initial learning rate. We don't perform weight decay or gradient clipping. In experiments with GloVe, we use the vectors of dimensionality 200 (6B.200d) and fine-tune them. Following Ma et al. (2018), we employ a character CNN with 50 filters of window size 3.

We use the large version of BERT and average the layers. The weights for the average are learned but we do not fine-tune BERT itself.

For the second encoding of the input sentence, $s'$, we use a single-layer bidirectional LSTM when using BERT and a two-layer bidirectional LSTM when using GloVe. On top of $x'$ we perform variational dropout, as well as on top of $s$ and $s'$. The other hyperparameters are listed in Tables 7, 8 and 9. The numbers of parameters of the LTL and LTF models are in Table 10.

Training an LTL or LTF model with BERT took at most 24 hours, and about 10 hours for AMR 15. Training with GloVe is usually two or three hours shorter.

POS | 32
Characters | 100
NE embedding | 16

Table 7: Dimensionality of embeddings used in all experiments.

All LSTMs: hidden size (per direction) 512; layer dropout 0.33; recurrent dropout 0.33; encoder LSTM layers used for $s$: –
MLPs before bilinear attention: layers 1; hidden units 512; activation elu; dropout 0.33
Edge label model: layers 1; hidden units 256; activation tanh; dropout 0.33

Table 8: Hyperparameters of LTL and LTF.

Layers 1; hidden units 1024; activation tanh; dropout 0.4

Table 9: Hyperparameters used in MLPs for predicting delexicalized constants, term types and lexical labels.

 | LTL (GloVe) | LTL (BERT) | LTF (GloVe) | LTF (BERT) | L'19 (GloVe) | L'19 (BERT)
DM | 67.39 | 61.77 | 69.59 | 63.97 | 19.19 | 8.76
PAS | 66.71 | 61.05 | 68.90 | 63.24 | 18.54 | 8.05
PSD | 73.95 | 68.15 | 76.40 | 70.60 | 25.84 | 15.15
EDS | 70.35 | 65.98 | 72.59 | 68.23 | 21.52 | 12.97
AMR 15 | 71.49 | 68.34 | 73.88 | 70.73 | 22.07 | 15.34
AMR 17 | 76.42 | 71.60 | 78.86 | 74.04 | 27.84 | 18.61
AMR 20 | 82.63 | 75.56 | 85.13 | 78.06 | 35.16 | 22.33

Table 10: Number of trainable parameters (including GloVe embeddings) in millions.

D Data

We use the AM dependency trees of Lindemann et al. (2019) as training data, along with their preprocessing. See their supplementary materials for more details. For completeness, Table 11 shows the number of AM dependency trees in the training sets as well as the number of sentences and tokens in the test sets. Note that the heuristic approach cannot obtain AM dependency trees for all graphs in the training data, but nothing is left out of the test data.

We use the standard splits on all data sets into training/dev/test, again following Lindemann et al. (2019). PAS, PSD and AMR are licensed by LDC but the DM and EDS data can be downloaded from http://hdl.handle.net/11234/1-1956.

 | Training (Sentences) | Training (AM dep. trees) | Test (Sentences) | Test (Tokens)
DM | 35,657 | 31,349 | 1,410 | 33,358
PAS | 35,657 | 31,796 | 1,410 | 33,358
PSD | 35,657 | 32,807 | 1,410 | 33,358
EDS | 33,964 | 25,680 | 1,410 | 32,306
AMR 15 | 16,833 | 15,472 | 1,371 | 28,458
AMR 17 | 36,521 | 33,406 | 1,371 | 28,458
AMR 20 | 55,635 | 51,515 | 1,898 | 36,928

Table 11: Data statistics after preprocessing. Test set is in-domain for SDP.

E Evaluation metrics

DM, PAS and PSD We compute labeled F-score with the evaluation toolkit that was developed for the shared task: https://github.com/semantic-dependency-parsing/toolkit.
EDS We evaluate with Smatch (Cai and Knight, 2013), in this implementation due to its high speed: github.com/Oneplus/tamr/tree/master/amr_aligner/smatch, and EDM (Dridan and Oepen, 2011) in the implementation of Buys and Blunsom (2017): https://github.com/janmbuys/DeepDeepParser. We follow Lindemann et al. (2019) in using Smatch as development metric.

AMR We evaluate with Smatch in the original implementation: https://github.com/snowblink14/smatch. In the main paper, we report results with Smatch 1.0.4, which are somewhat better than with earlier versions. This also applies to the results of Lindemann et al. (2019). Table 6 shows results with the Smatch version that was originally used in Lindemann et al. (2019) (commit ad7e65 from August 2018).

F Example for LTL

Fig. 5 shows an example of a derivation with LTL, analogous to the one in Fig. 4.

G Proofs

The proofs given here follow exactly Lindemann (2020). The transition systems LTF and LTL are designed in such a way that they enjoy three particularly important properties: soundness, completeness and the lack of dead ends. In this section, we phrase those guarantees in formal terms, determine which assumptions are needed and prove the guarantees. It will turn out that significant assumptions are only needed to guarantee that there are no dead ends.

Throughout this section we assume the type system of Groschwitz (2019), where types are formally defined as DAGs with sources as nodes, and requests being defined via the edges.

The definition of a goal condition is quite strict but it can be shown that for LTF and LTL simpler conditions are equivalent:

Lemma G.1. Let $c$ be a configuration derived by LTF. $c$ is a goal configuration if and only if $S_c$ is empty and $G_c$ is defined for some $i$.

Proof. ($\Rightarrow$) This follows trivially from the definition of a goal condition.

($\Leftarrow$) We have to validate that for each token $l$, either $l$ is ignored and thus has no incoming edge, or that for some type $\tau$ and graph $G$, $T_c(l) = \{\tau\}$, $G_c(l) = G$ and $A_c(l) = A(\tau_G, \tau)$. Additionally, there must be at least one token $j$ such that $T_c(j) = \{\tau\}$, $G_c(j) = G$ and $A_c(j) = A(\tau_G, \tau)$. We first show that this latter condition holds for the token $i$ for which $G_c$ is defined. Notice that $i$ must have been on the stack and a CHOOSE transition has been applied. Since it is not on the stack anymore, a POP transition has been applied in some configuration $c'$ where $i$ was the active node. This means that $A_c(i) = A_{c'}(i) = A(\tau_{G_c(i)}, \tau)$ with $T_c(i) = T_{c'}(i) = \{\tau\}$, and thus $i$ fulfills its part for $c$ being a goal configuration.

We assumed that $c$ was derived by LTF, so let $s$ be an arbitrary transition sequence that derives $c$ from the initial state (there might be multiple). We can divide the tokens in the sentence into two groups, based on whether they have ever been on the stack over the course of $s$:

• let $j$ be an arbitrary token such that there is a state $c'$ produced by a prefix of the transition sequence $s$ where $j$ is on the stack.
Here, the same argument holds as above: since $j$ is no longer on the stack, a POP transition must have been applied, which implies that $A_c(j) = A(\tau_{G_c(j)}, \tau)$ with $T_c(j) = \{\tau\}$.

• let $j$ be an arbitrary token such that there is no state $c'$ produced by a prefix of the transition sequence $s$ where $j$ is on the stack. Clearly, such a token $j$ does not have an incoming edge and thus also fulfills its part.

Lemma G.2. Let $c$ be a configuration derived by LTL. $c$ is a goal configuration if and only if $S_c$ is empty and $G_c$ is defined for some $i$.

Proof. The proof is analogous to the proof of Lemma G.1.

G.1 Soundness

An important property of the transition systems is that they are sound, that is, every AM dependency tree they derive is well-typed.

Theorem G.3 (Soundness). For every goal configuration $c$ derived by LTF or LTL, the AM dependency tree described by $c$ is well-typed.

Here, "described by" means that we can read off the AM dependency tree from the set of edges $E_c$ and graph constants $G_c$. We do not need any additional assumptions to prove this theorem. Before we can prove the theorem we first need the following lemma:

Lemma G.4. In every configuration $c$ derived by LTF or LTL, token $i$ has an APP$_\alpha$ child if and only if $\alpha \in A_c(i)$.

Step | Transition | Changes to $E$, $T$, $A$, $G$ | Stack $S$
1 | | (initial configuration) | []
2 | INIT(wants) | $E \mathrel{+}= 0 \xrightarrow{\textsc{root}}$ wants; $T$(wants) $= \{[\,]\}$; $A$(wants) $= \emptyset$ | [wants]
3 | APPLY($s$, 2) | $E \mathrel{+}=$ wants $\xrightarrow{\textsc{app}_s}$ writer; $A$(wants) $= \{s\}$ | [wants]
4 | APPLY($o$, 5) | $E \mathrel{+}=$ wants $\xrightarrow{\textsc{app}_o}$ sleep; $A$(wants) $= \{s, o\}$ | [wants]
5 | FINISH($\langle G_\mathrm{want}, [s, o[s]] \rangle$) | $G$(wants) $= G_\mathrm{want}$; $T$(writer) $= \{[\,]\}$; $A$(writer) $= \emptyset$; $T$(sleep) $= \{[s]\}$; $A$(sleep) $= \emptyset$ | [sleep, writer]
6 | FINISH($\langle G_\mathrm{writer}, [\,] \rangle$) | $G$(writer) $= G_\mathrm{writer}$ | [sleep]
7 | MODIFY($m$, 6) | $E \mathrel{+}=$ sleep $\xrightarrow{\textsc{mod}_m}$ soundly | [sleep]
8 | FINISH($\langle G_\mathrm{sleep}, [s] \rangle$) | $G$(sleep) $= G_\mathrm{sleep}$; $T$(soundly) $= \{[m], [s, m]\}$ | [soundly]
9 | FINISH($\langle G_\mathrm{soundly}, [m] \rangle$) | $T$(soundly) $= \{[m]\}$; $G$(soundly) $= G_\mathrm{soundly}$ | []

Figure 5: Derivation with LTL of the AM dependency tree in Fig. 2. The steps show only what changed for $E$, $T$, $A$ and $G$; the stack $S$ is shown in full. The chosen graph constants are annotated with their lexical types.

Proof. The APPLY($\alpha, j$) transitions in LTF and LTL always add an $\alpha$-source to $A_c(i)$ and simultaneously add an APP$_\alpha$ edge. There are no other ways to add a source to $A_c(i)$ or to create an APP$_\alpha$ edge.

To prove the theorem, first observe that LTF and LTL only derive trees. Well-typedness then follows from applying the following lemma to the root of the tree in the goal configuration $c$:

Lemma G.5. Let $c$ be a goal configuration derived by LTF or LTL and $i$ be a token with $T_c(i) = \{\tau\}$. Then the subtree rooted in $i$ is well-typed and has type $\tau$.

Proof. By structural induction over the subtrees.

Base case. Since $i$ has no children, it has no APP children in particular, making $A_c(i) = \emptyset$ by Lemma G.4. By definition of the goal configuration, $A_c(i) = A(\tau_{G_c(i)}, \tau)$. Combining this with $A_c(i) = \emptyset$, we deduce that $\tau_{G_c(i)} = \tau$ using the definition of the apply set.

Induction step. Let $i$ be a node with APP children $a_1, \dots, a_n$, attached with the edges APP$_{\alpha_1}, \dots,$ APP$_{\alpha_n}$, respectively. Let $i$ also have MOD children $m_1, \dots, m_k$, attached with the edges MOD$_{\beta_1}, \dots,$ MOD$_{\beta_k}$, respectively.
Induction step: Let i be a node with APP children a_1, …, a_n, attached with the edges APP_{α_1}, …, APP_{α_n}, respectively, and let i also have MOD children m_1, …, m_k, attached with the edges MOD_{β_1}, …, MOD_{β_k}, respectively. Let λ = τ_{G_c(i)} be the lexical type at i, and {τ} = T_c(i). By the definition of the apply set, i reaches term type τ from λ if we can show for all APP children:

(i) i has an APP_α child if and only if α ∈ A(λ, τ);
(ii) if a is an APP_α child of i, then it has the term type req_α(λ).

(i) follows from the goal condition A_c(i) = A(λ, τ) and Lemma G.4.

(ii): The only way the edge i —APP_α→ a can be created is by the APPLY(α, a) transitions with i on top of the stack. Both transition systems enforce T_c(a) = {req_α(λ)}. Using the inductive hypothesis on a, it follows that a evaluates to a graph of type req_α(λ).

Although the MOD children of i cannot alter the term type of i, they could make the subtree rooted in i ill-typed. That is, for any MOD_β child m that evaluates to a graph of type τ′ by the inductive hypothesis, we have to show that τ′ − β ⊆ λ ∧ req_β(τ′) = [ ]. The MOD_β edge was created by a MODIFY(β, m) transition. The MODIFY(β, m) transition (in the case of LTF) or the next FINISH transition (in the case of LTL) resulted in a configuration c′ in which the term types of m were restricted in exactly that way: T_{c′}(m) = {τ ∈ Ω | τ − β ⊆ λ ∧ req_β(τ) = [ ]}. In the derivation from c′ to c, a CHOOSE (LTF) or FINISH (LTL) transition must have been applied while m was on top of the stack (because the MOD_β edge was created and c is a goal configuration), which resulted in T_c(m) = {τ′} with τ′ ∈ T_{c′}(m). This means that the well-typedness condition indeed also holds for τ′.
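As a concrete instance of the induction step, consider the node for wants in the derivation of Fig. 5 (our worked example, using only the types shown there), with lexical type λ = [s, o[s]] and term type τ = [ ]:

\[
\begin{aligned}
A\bigl([s,o[s]],\,[\,]\bigr) &= \{s, o\}
  &&\text{the APP}_s\text{ child \emph{writer} and the APP}_o\text{ child \emph{sleep},}\\
\mathrm{req}_s\bigl([s,o[s]]\bigr) &= [\,]
  &&\text{the term type chosen for \emph{writer} by }\textsc{Finish}(\langle G_{\mathit{writer}}, [\,]\rangle),\\
\mathrm{req}_o\bigl([s,o[s]]\bigr) &= [s]
  &&\text{the term type chosen for \emph{sleep} by }\textsc{Finish}(\langle G_{\mathit{sleep}}, [s]\rangle).
\end{aligned}
\]

The MOD condition holds as well: the MOD_m child soundly of sleep has type τ′ = [m], and indeed [m] − m = [ ] ⊆ [s] and req_m([m]) = [ ].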
G.2 Completeness

Theorem G.6 (Completeness). For every well-typed AM dependency tree t, there are valid sequences of LTF and LTL transitions that build exactly t.

We do not need any additional assumptions to prove this theorem. The proof is constructive: for any well-typed AM dependency tree t, Algorithms 1 and 2 give transition sequences that, when prefixed with an appropriate INIT operation, generate t. We show this by proving the following lemma for LTF:

Lemma G.7. Let t be a well-typed AM dependency tree with term type τ whose root is r, and let c be a configuration derived by LTF with

(i) τ ∈ T_c(r),
(ii) r is on top of S_c,
(iii) W_c − O_c ≥ |t| − 1, i.e. W_c − O_c is at least the number of nodes in t without the root,
(iv) i ∉ Dom(G_c) for all nodes i of t, and
(v) i ∉ Dom(E_c) for all nodes i ≠ r of t.

Then H_LTF(c, t) (Algorithm 1) constructs, with valid LTF transitions, a configuration c′ such that

(a) c′ contains the edges of t,
(b) G_{c′}(i) = G_i, where G_i is the constant at i in t,
(c) S_{c′} is the same as S_c but without r on top, i.e. S_c = S_{c′} | r,
(d) W_{c′} = W_c − (|t| − 1), and
(e) for all j that are not nodes of t, none of A, G, T, E changes, e.g. A_{c′}(j) = A_c(j).

The lemma basically says that we can insert t as a subtree into a configuration c with LTF transitions. Conditions (i) and (ii) say that we have already put the root of t on top of the stack and can thus now start to add the rest of t. Condition (iii) says that there are enough words left in the sentence to fit t into c, where the −1 comes from the fact that the root of t is already on the stack and has an incoming edge. Conditions (iv) and (v) ensure that the part of the configuration where we want to put the subtree is still empty.

Theorem G.6 for LTF then follows from applying the lemma to the whole tree t and the configuration obtained after the initial INIT transition. This yields a configuration with an empty stack, which is a goal configuration (see Lemma G.1).

Before we approach the proof of Lemma G.7, we need to show the following:

Lemma G.8. Let c be a configuration derived by LTF. If, for any token i, i ∉ D(G_c), then i ∉ D(A_c).

Proof. We show the contraposition: if, for any token i, i ∈ D(A_c), then i ∈ D(G_c). The CHOOSE transition defines A_c for i and defines G_c for i at the same time. There is no transition that can remove i from D(G_c).

Proof of Lemma G.7. By structural induction over t.

Base case: Let i be on top of the stack in S_c; t is a leaf with graph constant G, so W_c − O_c ≥ |t| − 1 = 0. H_LTF returns the sequence CHOOSE(τ_G, G), POP. It is easily seen that this sequence, if valid, yields a configuration c′ where T_{c′}(i) = {τ_G}, G_{c′}(i) = G and A_{c′}(i) = A(τ_G, τ_G) = ∅. c′ also contains all edges of t (there are none).

In order for CHOOSE(τ_G, G) to be applicable, it must hold that τ_G ∈ T_c(i) (holds by (i)), that i ∉ D(G_c) (holds by (iv)), and that τ_G ∈ PossL(τ_G, ∅, W_c − O_c), which is equivalent to |A(τ_G, τ_G)| ≤ W_c − O_c. Since A(τ_G, τ_G) = ∅ and W_c − O_c ≥ 0, this holds. The transition CHOOSE(τ_G, G) yields an intermediate configuration c_1 with A_{c_1}(i) = A(τ_G, τ_G) = ∅, so we can perform POP, which gives us the configuration c′. Since we have not drawn any edge, W_{c′} = W_c = W_c − (1 − 1) = W_c − (|t| − 1). Note that these transitions have not changed any of A, G, T, E for j ≠ i.
Induction step: Let i be on top of the stack in S_c and let i in t have APP children a_1, …, a_n, attached with the edges APP_{α_1}, …, APP_{α_n}, respectively, where n might be 0. Let i in t also have MOD children m_1, …, m_k, attached with the edges MOD_{β_1}, …, MOD_{β_k}, respectively, where k might be 0 as well. Let G be the constant of i in t and τ its term type. By well-typedness of t and the definition of the apply set, we have A(τ_G, τ) = {α_1, …, α_n}.

H_LTF(t, c) returns the sequence in Fig. 6, where c′_0 is the configuration after CHOOSE(τ, G), c_1 the one after APPLY(α_1, a_1), and so on:

\[
\begin{aligned}
c \;&\xrightarrow{\textsc{Choose}(\tau, G)}\; c'_0
\;\xrightarrow{\textsc{Apply}(\alpha_1, a_1)}\; c_1
\;\xrightarrow{H_{\mathrm{LTF}}(a_1, c_1)}\; c'_1
\;\cdots\; c'_{n-1}
\;\xrightarrow{\textsc{Apply}(\alpha_n, a_n)}\; c_n
\;\xrightarrow{H_{\mathrm{LTF}}(a_n, c_n)}\; c'_n \\
&\xrightarrow{\textsc{Modify}(\beta_1, m_1)}\; c_{n+1}
\;\xrightarrow{H_{\mathrm{LTF}}(m_1, c_{n+1})}\; c'_{n+1}
\;\cdots\;
\xrightarrow{H_{\mathrm{LTF}}(m_k, c_{n+k})}\; c'_{n+k}
\;\xrightarrow{\textsc{Pop}}\; c' \qquad (1)
\end{aligned}
\]

Figure 6: Transition sequence returned by H_LTF(t, c) in the induction step.

For now, let us assume that conditions (i)–(v) are fulfilled for a_1, …, a_n, m_1, …, m_k and their respective configurations, and that the sequence is valid; we will verify this below. We can then apply the inductive hypothesis to all children, which means that c′ contains the edges present in the subtrees a_1, …, a_n, m_1, …, m_k, and that for all nodes j that are descendants of one of a_1, …, a_n, m_1, …, m_k it holds that G_{c′}(j) = G_j, because H_LTF applied to some child of t will make this assignment and such an assignment can never be changed in LTF. Assuming the above transition sequence is valid, it is obvious that it also adds the edges from i to a_1, …, a_n, m_1, …, m_k with the correct labels (consequence (a)) and makes the assignment G_{c′}(i) = G_i using CHOOSE(τ, G) (consequence (b)).

Now we go over the transition sequence in Fig. 6 and check that the transitions can be applied, that the conditions (i)–(v) hold, and what happens to the stack.

First, in order for CHOOSE(τ, G) to be applicable, it must hold that τ ∈ T_c(i) (holds by (i)), that i ∉ D(G_c) (holds by (iv)), and that τ_G ∈ PossL(τ, ∅, W_c − O_c), which is equivalent to |A(τ_G, τ)| ≤ W_c − O_c. Since A(τ_G, τ) = {α_1, …, α_n} and W_c − O_c ≥ |t| − 1 ≥ |{α_1, …, α_n}| = n, this holds. This yields a configuration c′_0 where T_{c′_0}(i) = {τ} and A_{c′_0}(i) = ∅.

Next, we use the transition APPLY(α_1, a_1). This is allowed because α_1 ∈ A(τ_G, τ) (see above), α_1 ∉ A_{c′_0}(i) and a_1 ∉ D(E_{c′_0}) (condition (v)). We get a new configuration c_1 where A_{c_1}(i) = {α_1}, T_{c_1}(a_1) = {req_{α_1}(τ_G)} and S_{c_1} = S_c | a_1. We now justify why the inductive hypothesis can be used for a_1 and c_1:

By well-typedness of t, we know that T_{c_1}(a_1) = {req_{α_1}(τ_G)} = {τ_{a_1}}, where τ_{a_1} is the term type of a_1 (condition (i)). From the step before, a_1 is on top of the stack in S_{c_1} (condition (ii)). We use the fact that j ∉ Dom(G_{c_1}) for all nodes j of t and j ∉ Dom(E_{c_1}) for all nodes j ≠ i (our conditions (iv) and (v)) to justify that conditions (iv) and (v) are also met for a_1. What is left to verify is that W_{c_1} − O_{c_1} ≥ |a_1| − 1. First, note that W_{c_1} = W_c − 1 because of the APP_{α_1} edge. We can decompose O_{c_1} as O_{c_1} = O_c − O_c(i) + O_{c_1}(i), because we have only changed G and A for i and for no other token. O_c(i) = 0 by Lemma G.8 and i ∉ D(G_c) (condition (iv)). We can also see that O_{c_1}(i) = n − 1 by the definition of O(·), taking into account that we have drawn the APP_{α_1} edge and thus A_{c_1}(i) = {α_1}. This means that

\[
W_{c_1} - O_{c_1} = (W_c - 1) - (O_c + n - 1) = W_c - O_c - n .
\]

From condition (iii), we know that W_c − O_c ≥ |t| − 1. Since t consists of the node i, the n children a_j and the remaining |a_j| − 1 nodes of each subtree a_j, we have

\[
|t| \;\ge\; 1 + n + \sum_{j=1}^{n} (|a_j| - 1),
\quad\text{equivalently}\quad
|t| - 1 \;\ge\; n + \sum_{j=1}^{n} (|a_j| - 1). \qquad (2)
\]

Plugging this together, we get

\[
W_{c_1} - O_{c_1} = W_c - O_c - n \;\ge\; \sum_{j=1}^{n} (|a_j| - 1) \;\ge\; |a_1| - 1 .
\]

After H_LTF(a_1, c_1) we obtain a configuration c′_1. We have just argued that the inductive hypothesis applies to H_LTF(a_1, c_1), so we can use it and find that we are in a nearly identical situation as before APPLY(α_1, a_1): by consequence (c), the stack is S_{c′_1} = S_c, i.e. in S_{c′_1} the top of the stack is i again. What has changed is W_{c′_1} − O_{c′_1}, and of course A_{c′_1}(i) = {α_1}, which was empty before. We can now apply APPLY(α_2, a_2) and continue.

Let us consider the general case for H_LTF(a_l, c_l) with 1 ≤ l ≤ n, where we are in c_l arriving from APPLY(α_l, a_l). At this point, we know:

(i) T_{c_l}(a_l) = {τ_{a_l}}, where τ_{a_l} is the term type of a_l (by the preceding APPLY);
(ii) a_l is on top of the stack (i was on top by the inductive hypothesis for l′ < l, and APPLY(α_l, a_l) pushed a_l).

In effect, conditions (i) and (ii) of the inductive hypothesis for H_LTF(a_l, c_l) are met. Conditions (iv) and (v) for a_l are fulfilled by our assumptions (iv) and (v) because a_l is a subtree of i.
What remains to be checked is W_{c_l} − O_{c_l} ≥ |a_l| − 1. We can calculate W_{c_l} = W_c − l − Σ_{j=1}^{l−1}(|a_j| − 1), where the summation over j comes from the inductive hypothesis for the children j < l and the −l comes from the APPLY transitions we have performed. O_{c_l} is simply O_{c_l} = O_c + n − l, because the CHOOSE transition resulted in O_{c′_0} = O_c + n and we have already drawn l APP edges. Plugging this together, we get

\[
\begin{aligned}
W_{c_l} - O_{c_l} &= W_c - l - \sum_{j=1}^{l-1} (|a_j| - 1) - (O_c + n - l) \\
&\ge (|t| - 1) - n - \sum_{j=1}^{l-1} (|a_j| - 1) \\
&\ge \sum_{j=l}^{n} (|a_j| - 1) \;\ge\; |a_l| - 1 ,
\end{aligned}
\]

where the first inequality replaces W_c − O_c by |t| − 1 (condition (iii)) and the second replaces |t| − 1 using Eq. 2.

A similar line of reasoning can be used to justify the use of the inductive hypothesis for H_LTF(m_1, c_{n+1}), …, H_LTF(m_k, c_{n+k}).

Note that by applying the inductive hypothesis to all children, we know that i is always on top of the stack again after each call to H_LTF. This justifies the final POP transition, because at that point A_{c′_{n+k}}(i) = A(τ_G, τ). Consequence (c) follows from this POP. We did not change any of E, A, T, G outside of the subtree rooted in i (consequence (e)); this follows from the inductive hypotheses of the children and the fact that i was always on top of the stack when we performed any transition.

To determine W_{c′}, note that we have drawn n + k edges and that, for each child ch ∈ {a_1, …, a_n, m_1, …, m_k}, we know by the inductive hypothesis that H_LTF has drawn |ch| − 1 edges. In total, we have

\[
\begin{aligned}
W_{c'} &= W_c - \Bigl[\,\sum_{j=1}^{n} (|a_j| - 1) + \sum_{j=1}^{k} (|m_j| - 1)\Bigr] - (n + k) \\
&= W_c - \Bigl[\,\sum_{j=1}^{n} |a_j| + \sum_{j=1}^{k} |m_j| - (n + k)\Bigr] - (n + k) \\
&= W_c - (|t| - 1),
\end{aligned}
\]

where the last step makes use of the fact that |t| = 1 + Σ_{j=1}^n |a_j| + Σ_{j=1}^k |m_j|.

For LTL, the same principle applies with a near-identical lemma, which additionally asks that A_c(r) = ∅ for the root r of t. The procedure to construct the transition sequence is shown in Algorithm 2; a Python sketch of both procedures follows below.

Algorithm 1 Generate LTF transitions for an AM dependency tree
function H_LTF(c, t)
    Let t have graph constant G and term type τ
    c ← CHOOSE(τ, G)(c)
    for APP_α child a of t do
        c ← APPLY(α, a)(c)
        c ← H_LTF(c, a)
    end for
    for MOD_β child m of t do
        c ← MODIFY(β, m)(c)
        c ← H_LTF(c, m)
    end for
    c ← POP(c)
    return c
end function

Algorithm 2 Generate LTL transitions for an AM dependency tree
function H_LTL(c, t)
    Let t have graph constant G
    for APP_α child a of t do
        c ← APPLY(α, a)(c)
    end for
    for MOD_β child m of t do
        c ← MODIFY(β, m)(c)
    end for
    c ← FINISH(G)(c)
    Let t_1, …, t_n be the children of t on the stack in c
    for i ∈ 1, …, n do
        c ← H_LTL(c, t_i)
    end for
    return c
end function
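The following transliteration of Algorithms 1 and 2 into Python is our own sketch, not the authors' implementation: instead of threading configurations through the transitions, it simply emits the oracle transition sequence for a given tree. Tree is a hypothetical container type, and children are identified by their constants for readability (the paper uses token positions).

```python
# Sketch: emit the oracle transition sequences of Algorithms 1 and 2
# for a well-typed AM dependency tree (simplification: transitions are
# returned as tuples instead of being applied to configurations).

from dataclasses import dataclass, field


@dataclass
class Tree:
    constant: str        # graph constant G at this node
    term_type: str       # term type tau of the subtree
    app_children: list = field(default_factory=list)  # (alpha, Tree) pairs
    mod_children: list = field(default_factory=list)  # (beta, Tree) pairs


def h_ltf(t: Tree) -> list:
    """Algorithm 1: LTF chooses the constant first, then recurses."""
    seq = [("CHOOSE", t.term_type, t.constant)]
    for alpha, a in t.app_children:
        seq.append(("APPLY", alpha, a.constant))
        seq.extend(h_ltf(a))
    for beta, m in t.mod_children:
        seq.append(("MODIFY", beta, m.constant))
        seq.extend(h_ltf(m))
    seq.append(("POP",))
    return seq


def h_ltl(t: Tree) -> list:
    """Algorithm 2: LTL draws all outgoing edges first, FINISHes,
    and only then descends into the children (here: in edge order)."""
    seq = [("APPLY", alpha, a.constant) for alpha, a in t.app_children]
    seq += [("MODIFY", beta, m.constant) for beta, m in t.mod_children]
    seq.append(("FINISH", t.constant))
    for _, child in t.app_children + t.mod_children:
        seq.extend(h_ltl(child))
    return seq


if __name__ == "__main__":
    # The tree of Fig. 2/5: wants -APP_s-> writer, -APP_o-> sleep -MOD_m-> soundly
    soundly = Tree("G_soundly", "[m]")
    sleep = Tree("G_sleep", "[s]", mod_children=[("m", soundly)])
    writer = Tree("G_writer", "[]")
    wants = Tree("G_want", "[]", [("s", writer), ("o", sleep)])
    # Matches the transition order of Fig. 5 (after the initial INIT):
    print(h_ltl(wants))
```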
G.3 No dead ends

For both LTF and LTL, the following theorem guarantees that we can always obtain a complete analysis for a sentence:

Theorem G.9 (No dead ends). If c is a configuration derived by LTF or LTL, then there is a valid sequence of transitions that brings c to a goal configuration c′.

Together with the soundness theorem (Theorem G.3), which says that every goal configuration corresponds to a well-typed AM dependency tree, this means that we can always finish a derivation to obtain a well-typed AM dependency tree, no matter what the sentence is or how the transitions are scored. The proof of Theorem G.9 is constructive both for LTF and LTL and is given below. In both cases, we first prove a lemma showing that there are always "enough" words left.

Theorem G.9 only holds if we make a few assumptions that are mild in practice. Recall that we assumed that we are given a set of graph constants C that can draw source names from a set S, a set of types Ω, and a set of edge labels L. We now make the following assumptions about their relationships explicit:

Assumption 1. For all types λ ∈ Ω, there is a constant G ∈ C with type τ_G = λ.

Assumption 2. For all types λ ∈ Ω and all source names α ∈ S, if req_α(λ) is defined, then req_α(λ) ∈ Ω.

Assumption 3. If MOD_α ∈ L, then [α] ∈ Ω.

Assumption 4. For all source names α ∈ S, APP_α ∈ L.

Assumption 5. There are no constraints imposed on which graph constants can be assigned to a particular word.

These assumptions are almost perfectly met in practice; see the main paper.

In the proof of Theorem G.9, we want to use the fact that [ ] ∈ Ω; this follows from the assumptions above:

Lemma G.10. The empty type [ ] ∈ Ω.

Proof. Assumption 2 says that for all types λ ∈ Ω and all sources α ∈ S, the type req_α(λ) (if defined) is also a member of Ω. Since types are formally DAGs, each type τ is either empty (that is, [ ]) or has a node n without outgoing edges. In the latter case, req_n(τ) = [ ], so starting from any τ ∈ Ω we obtain [ ] ∈ Ω by Assumption 2.
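For instance (our worked example, assuming the type [s, o[s]] from Fig. 5 is in Ω), a single application of Assumption 2 already reaches the empty type:

\[
[s,\,o[s]] \in \Omega \;\Longrightarrow\; \mathrm{req}_s\bigl([s,o[s]]\bigr) = [\,] \in \Omega ,
\]

since the node s has no outgoing edges in this type's DAG.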
G.3.1 LTF

We first prove a lemma showing that there are always at most as many sources still to be filled as there are words without incoming edges.

Lemma G.11. For all configurations c derived with LTF, O_c ≤ W_c.

Proof. By structural induction over the derivation.

Base case: The initial state c does not define A for any token, thus O_c(i) = 0 for all i. The number of words without incoming edges satisfies W_c ≥ 0. Therefore, Σ_i O_c(i) = O_c ≤ W_c.

Induction step: The inductive hypothesis is O_c ≤ W_c; the goal is O_{c′} ≤ W_{c′}, where c′ derives in one step from c. The derivation step from c to c′ is one of the following:

INIT(i): After INIT, A_{c′} is not defined for any token, thus O_{c′} = 0 ≤ W_{c′}.

POP: This transition only changes the stack, which does not affect O, so O_{c′}(i) = O_c(i) for all i and W_{c′} = W_c. The inductive hypothesis applies.

CHOOSE(τ, G): Let i be the active node. No edge was created, thus W_{c′} = W_c. For all j ≠ i, O_{c′}(j) = O_c(j). We can thus write O_{c′} as O_{c′} = O_c − O_c(i) + O_{c′}(i). Since CHOOSE(τ, G) was applicable in c, we know that i ∉ D(G_c). By Lemma G.8 and the definition of O(·), we have O_c(i) = 0, so

\[
O_{c'} = O_c + O_{c'}(i). \qquad (3)
\]

We now look into the value of O_{c′}(i). Since CHOOSE was applied, we know that G_{c′}(i) = G, A_{c′}(i) = ∅, and that τ_G ∈ PossL(τ, ∅, W_c − O_c), which simplifies to |A(τ_G, τ)| ≤ W_c − O_c. From this it follows that O_{c′}(i) = min_{λ′ ∈ {τ_G}, τ′ ∈ T_{c′}(i)} |A(λ′, τ′) − A_{c′}(i)| ≤ W_c − O_c. Substituting this for O_{c′}(i) in Eq. 3, we get O_{c′} = O_c + O_{c′}(i) ≤ O_c + W_c − O_c = W_c = W_{c′}.

APPLY(α, j): Let i be the active node. Since an edge into j was created by the transition, W_{c′} + 1 = W_c. We decompose O_{c′} again: O_{c′} = O_c − O_c(i) + O_{c′}(i). Since APPLY could be performed, we know that T_c and G_c are defined for i; let us denote T_c(i) = {τ} and G_c(i) = G. Thus, O_c(i) = |A(τ_G, τ) − A_c(i)|. Since the precondition of APPLY said that α ∉ A_c(i), and APPLY has the effect that A_{c′}(i) = A_c(i) ∪ {α}, we know that O_{c′}(i) = |A(τ_G, τ) − (A_c(i) ∪ {α})| < O_c(i). This means that O_{c′} < O_c. Using the inductive hypothesis O_c ≤ W_c and W_{c′} + 1 = W_c, we get O_{c′} < O_c ≤ W_{c′} + 1, which means that O_{c′} ≤ W_{c′}.

MODIFY(β, j): Let i be the active node. In Section 5.3, we made the restriction that MODIFY is only applicable if

\[
W_c - O_c \ge 1. \qquad (4)
\]

The transition created an edge, which means that W_{c′} = W_c − 1. O_{c′} depends on G_{c′}, A_{c′} and T_{c′}. The only thing that changed from c to c′ is that T_{c′} is now defined for j. However, A_{c′} is still not defined for j, so O_{c′}(j) = O_c(j) = 0. This means O_{c′} = O_c. Substituting this into Eq. 4 and rearranging, we get O_{c′} ≤ W_{c′}.

We now show that there are no dead ends by showing that for any configuration c derived by LTF, we can construct a valid sequence of transitions such that the stack becomes empty; by Lemma G.1, the resulting configuration is a goal configuration. We empty the stack by repeatedly applying Algorithm 3. In line 18 of the algorithm, we compute the sources that still have to be filled in order to pop i off the stack; we assume an arbitrary order on them, and o_j refers to one particular source in o. The symbol ⊕ denotes concatenation.

Algorithm 3 Complete LTF sequence
1:  function C_LTF(c)
2:      if c = ⟨∅, ∅, ∅, ∅, ∅⟩ then
3:          Let G ∈ C with τ_G = [ ]
4:          return INIT(1), CHOOSE([ ], G), POP
5:      end if
6:      if S_c = [ ] then
7:          return [ ]
8:      end if
9:      Let i be the top of S_c
10:     if O_c(i) = 0 then
11:         if i ∈ Dom(G_c) then
12:             return POP
13:         else
14:             Let G ∈ C with τ_G ∈ T_c(i)
15:             return CHOOSE(τ_G, G), POP
16:         end if
17:     end if
18:     Let o = A(τ_{G_c(i)}, τ) − A_c(i), where T_c(i) = {τ}
19:     Let ρ_j = req_{o_j}(τ_{G_c(i)}) for 1 ≤ j ≤ |o|
20:     Let a_1, …, a_{|o|} be tokens without incoming edges
21:     s = [ ]
22:     for a_j ∈ a_1, …, a_{|o|} do
23:         Let G be a constant of type ρ_j
24:         s = s ⊕ APPLY(o_j, a_j), CHOOSE(ρ_j, G), POP
25:     end for
26:     return s ⊕ POP
27: end function

Lemma G.12. For any configuration c, C_LTF(c) (Algorithm 3) generates a valid sequence s of LTF transitions such that |S_{c′}| < |S_c| or |S_{c′}| = 0, and there is a token i for which G_{c′}(i) is defined, where c′ is the configuration obtained by applying s to c.

Proof. First, we show that G_{c′} is defined for some i in c′. We make a case distinction based on the line in which the algorithm returns. If it returns in line 4, 12 or 15, it is obvious that G_{c′} is defined for some i. If it returns in line 26, then o is non-empty because O_c(i) > 0; if o is non-empty, we use a CHOOSE transition in the for-loop. The remaining case is returning in line 7. Note that in this case the stack is empty but c is not the initial configuration (otherwise, we would have returned in line 4), so an INIT transition must have been applied, which pushes a token onto the stack. Since the stack is now empty, a POP transition must have been applied, which is only applicable if G is defined for the item on top of the stack. Consequently, G_{c′} is defined for some i.

Further, note that every path through Algorithm 3 either reduces the size of the stack (one more POP transition than tokens pushed to the stack by APPLY) or keeps it effectively empty.

C_LTF is constructed in such a way that the transition sequence is valid. However, there are a few critical points:

• In line 3, we assume the existence of a graph constant G ∈ C with τ_G = [ ]. This follows from Lemma G.10 and Assumption 1.
• In line 14, it is assumed that there exists a graph constant G ∈ C with τ_G ∈ T_c(i). This graph constant always exists: either T_c(i) is a request (if i has an incoming APP edge), and then by Assumptions 1 and 2 there is a graph constant G ∈ C, or T_c(i) is a set of types resulting from a MODIFY transition, and then the existence of a suitable graph constant G with type τ_G ∈ T_c(i) follows from Assumptions 1 and 3. Assumption 5 makes explicit that there are no further constraints on how we choose G.
• In line 20, it is assumed that there exist |o| tokens without incoming edges. This is true because |o| = O_c(i) ≤ Σ_j O_c(j) = O_c, and by Lemma G.11 it follows that |o| ≤ W_c, showing that there are indeed enough tokens without incoming edges.
• In line 24, it is assumed that APP_{o_j} ∈ L for every source o_j; this is guaranteed by Assumption 4.

In summary, we can turn any configuration c derived by LTF into a goal configuration by repeatedly applying C_LTF until the (finite) stack is empty; by Lemma G.1, the resulting configuration is a goal configuration.
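In code, this completion procedure is just a loop (our sketch; config, is_goal and c_ltf are hypothetical stand-ins for the configuration interface and Algorithm 3):

```python
def complete_ltf(config):
    """Our sketch of the completion loop from the proof of Theorem G.9
    (LTF case).  `config` and `c_ltf` (Algorithm 3) are hypothetical
    stand-ins.  The loop terminates because every round either strictly
    shrinks the finite stack or leaves it empty (Lemma G.12), and it
    stops in a goal configuration by Lemma G.1."""
    while not config.is_goal():           # stack empty and some G defined
        for transition in c_ltf(config):  # one round of Algorithm 3
            config = transition.apply(config)
    return config
```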
G.3.2 LTL

The proof works similarly. We first prove a similar lemma stating that if i is the active node, then O_c(i) ≤ W_c, and then construct a function C_LTL (see Algorithm 4) that produces a valid sequence of transitions which we apply repeatedly to reach a goal configuration.

Lemma G.13. Let c be a configuration derived by LTL. If i is the active node in c, then O_c(i) ≤ W_c.

Proof. By structural induction over the derivation.

Base case: In the initial state, the stack S_c is empty, making the antecedent of the implication false for all i and thus the implication true.

Induction step: The inductive hypothesis is: if i is the active node in c, then O_c(i) ≤ W_c. The goal is: if i is the active node in c′, then O_{c′}(i) ≤ W_{c′}, where c′ derives in one step from c. The applied transition is one of the following:

INIT(i): The previous configuration c must be the initial configuration. Now i is the active node in c′, with A_{c′}(i) = ∅, T_{c′}(i) = {[ ]}, and G_{c′} not defined for i. Then O_{c′}(i) = min_{λ∈Ω} |A(λ, [ ]) − A_{c′}(i)|. Note that the empty type [ ] ∈ Ω by Lemma G.10 and that A([ ], [ ]) = ∅. Choosing λ = [ ], we get O_{c′}(i) = 0. INIT(i) created an edge into i, so W_{c′} = W_c − 1. Since a sentence consists of at least one word (W_c ≥ 1), we have O_{c′}(i) = 0 ≤ W_{c′}.

APPLY(α, j): Let i be the active node in c. Then, by construction of APPLY(α, j), it remains the active node in c′. After the transition, T_{c′}(i) = T_c(i) and A_{c′}(i) = A_c(i) ∪ {α}. Thus, O_{c′}(i) can be written as

O_{c′}(i) = min_{λ′∈Ω, τ′∈T_c(i)} |A(λ′, τ′) − (A_c(i) ∪ {α})| .

Since APPLY(α, j) was applicable, its preconditions must be fulfilled, i.e. there exist λ ∈ Ω and τ ∈ T_c(i) with λ ∈ PossL(τ, A_c(i) ∪ {α}, W_c − 1). Expanding the definition of PossL, we get

A_c(i) ∪ {α} ⊆ A(λ, τ) ∧ |A(λ, τ) − (A_c(i) ∪ {α})| ≤ W_c − 1

for some λ ∈ Ω and τ ∈ T_c(i).
If we now choose λ′ = λ and τ′ = τ in O_{c′}(i), we get O_{c′}(i) ≤ |A(λ, τ) − (A_c(i) ∪ {α})| ≤ W_c − 1. Since W_{c′} = W_c − 1, it holds that O_{c′}(i) ≤ W_{c′}.

MODIFY(β, j): Let i be the active node; it also remains the active node in c′. The transition consumes a word, that is, W_{c′} = W_c − 1. However, it can only be applied if W_c − O_c ≥ 1. Since O_c is obtained by summing over all tokens, O_c(i) ≤ O_c. We get O_c(i) ≤ O_c ≤ W_c − 1 = W_{c′}. Finally, O_{c′}(i) = O_c(i), because none of A, G, T changed for i during the MODIFY(β, j) transition.

FINISH(G): Let i be the active node after the transition, that is, in c′. The FINISH transition presupposes that i has an incoming edge. We distinguish two cases based on the label:

• i has an incoming APP_α edge. Then we have T_{c′}(i) = {req_α(τ_G)} and G_{c′} undefined for i. Then O_{c′}(i) = min_{λ∈Ω} |A(λ, req_α(τ_G))|. By Assumption 2, req_α(τ_G) ∈ Ω, and by the definition of the apply set, A(λ, λ) = ∅ for all types λ, so in particular also for req_α(τ_G), which makes O_{c′}(i) = 0.

• i has an incoming MOD_β edge. By Assumption 3, we know that [β] ∈ Ω, and [β] ∈ T_{c′}(i) holds by construction of FINISH(G). Expanding the definition of O_{c′}(i), we get O_{c′}(i) = min_{λ∈Ω, τ′∈T_{c′}(i)} |A(λ, τ′)|. By choosing λ = [β] = τ′, we get O_{c′}(i) = 0.

Since O_{c′}(i) = 0, it also holds that O_{c′}(i) ≤ W_c = W_{c′}.

Algorithm 4 Complete LTL sequence
1:  function C_LTL(c)
2:      if c = ⟨∅, ∅, ∅, ∅, ∅⟩ then
3:          Let G ∈ C with τ_G = [ ]
4:          return INIT(1), FINISH(G)
5:      end if
6:      if S_c = [ ] then
7:          return [ ]
8:      end if
9:      Let i be the top of S_c
10:     Let λ, τ be the minimizers of O_c(i) = min_{λ∈Ω, τ∈T_c(i)} |A(λ, τ) − A_c(i)|
11:     if O_c(i) = 0 then
12:         Let G ∈ C with τ_G = λ
13:         return FINISH(G)
14:     end if
15:     Let o = A(λ, τ) − A_c(i)
16:     Let ρ_j = req_{o_j}(λ) for 1 ≤ j ≤ |o|
17:     Let a_1, …, a_{|o|} be tokens without incoming edges
18:     s = [ ]
19:     for a_j ∈ a_1, …, a_{|o|} do
20:         s = s ⊕ APPLY(o_j, a_j)
21:     end for
22:     s = s ⊕ FINISH(G), where τ_G = λ
23:     return s
24: end function

Lemma G.14. For a sentence with n words, a valid LTL transition sequence can contain at most n FINISH transitions.

Proof. By contradiction. Assume there is a valid transition sequence s that contains m > n FINISH transitions. Since FINISH can only be applied when some token is on the stack, and there are more FINISH transitions than there are tokens, FINISH must have been applied twice with the same active node i. Since FINISH removes the active node from the stack, i must have been pushed twice. This means that i has two incoming edges. When the second incoming edge was drawn into i, the condition i ∉ D(E) was violated, which contradicts the assumption that the transition sequence s is valid.

Lemma G.15. Let c be a configuration derived by an LTL transition sequence s that contains j FINISH transitions. Then C_LTL(c) (Algorithm 4) generates a valid sequence s′ of LTL transitions that either leads to a goal configuration c′, or s ⊕ s′ contains j + 1 FINISH transitions.

Proof. We first show the main claim and then verify that the generated transition sequence s′ is valid. We make a case distinction on the content of the stack in c′.
S_{c′} is empty: We show that c′ is a goal configuration. In order to apply Lemma G.2, we have to show that G_{c′} is defined for some token i. There is only one path through Algorithm 4 that does not assign a graph constant to a token: returning in line 7. Returning in line 7 means that the stack is empty but the state is not the initial state, so something has already been removed from the stack with a FINISH transition. Consequently, G is defined for some i.

S_{c′} is not empty: Since the stack is not empty, the algorithm returns in line 13 or in line 23. Clearly, the transition sequence that the algorithm returns then contains a FINISH transition. Together with the j FINISH transitions that have been performed up to the configuration c, this makes j + 1 FINISH transitions.

Algorithm 4 is constructed such that it only produces valid transition sequences. However, there are a few critical points:

• Line 3 assumes the existence of a graph constant G ∈ C with τ_G = [ ]. This follows from Lemma G.10 and Assumption 1. Assumption 5 explicitly allows us to assign G to any token.
• Line 10 assumes that λ ranges over all of Ω and that i ∈ D(A_c) and i ∈ D(T_c). This is true because i is on top of the stack: A and T are always defined for the active node in LTL, and G is never defined for the active node in LTL.
• Lines 12 and 22 assume the existence of a graph constant G ∈ C of type τ_G = λ ∈ Ω, which is guaranteed by Assumption 1. Assumption 5 explicitly allows us to assign G to any token.
• Line 17 assumes that there are at least |o| tokens without incoming edges (W_c ≥ |o|). This is indeed the case, because |o| = O_c(i) and O_c(i) ≤ W_c by Lemma G.13.
• Line 20 assumes that APP_{o_j} ∈ L. This is guaranteed by Assumption 4.

We can construct the transition sequence for which Theorem G.9 asks by repeatedly applying C_LTL to a given configuration c. Lemma G.15 shows that applying C_LTL to a configuration either results in a goal configuration or increases the number of FINISH transitions by one. Lemma G.14 tells us that there is an upper bound on how many times we can increase the number of FINISH transitions, so after finitely many applications of C_LTL we must reach a goal configuration.