Neural Proof Nets
Konstantinos Kogkalidis and
Michael Moortgat and
Richard Moot
Utrecht Institute of Linguistics OTS, Utrecht University
LIRMM, Université de Montpellier, CNRS
{k.kogkalidis,m.j.moortgat}@uu.nl, [email protected]
Abstract
Linear logic and the linear λ-calculus have a long standing tradition in the study of natural language form and meaning. Among the proof calculi of linear logic, proof nets are of particular interest, offering an attractive geometric representation of derivations that is unburdened by the bureaucratic complications of conventional proof-theoretic formats. Building on recent advances in set-theoretic learning, we propose a neural variant of proof nets based on Sinkhorn networks, which allows us to cast parsing as the problem of extracting syntactic primitives and permuting them into alignment. Our methodology induces a batch-efficient, end-to-end differentiable architecture that actualizes a formally grounded yet highly efficient neuro-symbolic parser. We test our approach on Æthel, a dataset of type-logical derivations for written Dutch, where it manages to correctly transcribe raw text sentences into proofs and terms of the linear λ-calculus with high accuracy.

There is a broad consensus among grammar formalisms that the composition of form and meaning in natural language is a resource-sensitive process, with the words making up a phrase contributing exactly once to the resulting whole. The sentence "the Mad Hatter offered" is ill-formed because of a lack of grammatical material, "offer" being a ditransitive verb; "the Cheshire Cat grinned Alice a cup of tea", on the other hand, is ill-formed because of an excess of material, which the intransitive verb "grin" cannot accommodate.

Given the resource-sensitive nature of language, it comes as no surprise that Linear Logic (Girard, 1987), in particular its intuitionistic version ILL, plays a central role in current logic-based grammar formalisms. Abstract Categorial Grammars and Lambda Grammars (de Groote, 2001; Muskens, 2001) use ILL "as-is" to characterize an abstract level of grammatical structure from which surface form and semantic interpretation are obtained by means of compositional translations. Modern typelogical grammars in the tradition of the Lambek Calculus (Lambek, 1958), e.g. Multimodal TLG (Moortgat, 1996), Displacement Calculus (Morrill, 2014), Hybrid TLG (Kubota and Levine, 2020), refine the type language to account for syntactic aspects of word order and constituency; ILL here is the target logic for semantic interpretation, reached by a homomorphism relating types and derivations of the syntactic calculus to their semantic counterparts.

A common feature of the aforementioned formalisms is their adoption of the parsing-as-deduction method: determining whether a phrase is syntactically well-formed is seen as the outcome of a process of logical deduction. This logical deduction automatically gives rise to a program for meaning composition, thanks to the remarkable correspondence between logical proof and computation known as the Curry-Howard isomorphism (Sørensen and Urzyczyn, 2006), a natural manifestation of the syntax-semantics interface. The Curry-Howard λ-terms associated with derivations are neutral with respect to the particular semantic theory one wants to adopt, accommodating both the truth-conditional view of formal semantics and the vector-based distributional view (Muskens and Sadrzadeh, 2018), among others.

Despite their formal appeal, grammars based on variants of linear logic have fallen out of favour within the NLP community, owing to a scarcity of large-scale datasets, but also due to difficulties in aligning them with the established high-performance neural toolkit.
Seeking to bridge the gap between formal theory and applied practice, we focus on the proof nets of linear logic, a lean graphical calculus that does away with the bureaucratic symbol-manipulation overhead characteristic of conventional proof-theoretic presentations (§2). Integrating proof nets with recent advances in neural processing, we propose a novel approach to linear logic proof search that eliminates issues commonly associated with higher-order types and hypothetical reasoning, while greatly reducing the computational costs of structure manipulation, backtracking and iterative processing that burden standard parsing techniques (§3).

Our proposed methodology relies on two key components. The first is an encoder/decoder-based supertagger that converts raw text sentences into linear logic judgements by dynamically constructing contextual type assignments, one primitive symbol at a time. The second is a bi-modal encoder that contextualizes the generated judgement in conjunction with the input sentence. The contextualized representations are fed into a Sinkhorn layer, tasked with finding the valid permutation that brings primitive symbol occurrences into alignment. The architecture induced is trained on labeled data, and assumes the role of a formally grounded yet highly accurate parser, which transforms raw text sentences into linear logic proofs and computational terms of the simply typed linear λ-calculus, further decorated with dependency annotations that allow reconstruction of the underlying dependency graph (§4).

We briefly summarize the logical background we are assuming, starting with ILL⊸, the implication-only fragment of ILL, then moving on to the dependency-enhanced version ILL⊸,♦,□ which we employ in our experimental setup.

Formulas (or types) of ILL⊸ are inductively defined according to the grammar below:

\[ T ::= A \mid T_1 \multimap T_2 \]

Formula A is taken from a finite set of atomic formulas A ⊂ T; a complex formula T₁ ⊸ T₂ is the type signature of a transformation that applies on T₁ ∈ T and produces T₂ ∈ T, consuming the argument in the process. This view of formulas as non-renewable resources makes ILL⊸ the logic of linear functions. We refer to Wadler (1993) for a gentle introduction.
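To make the grammar concrete, the types of ILL⊸ can be rendered as a small recursive datatype. The sketch below is our own illustration (not the authors' implementation), and also encodes the order function O used further on (O(A) = 0; O(T₁ ⊸ T₂) = max(O(T₁) + 1, O(T₂))):

```python
from dataclasses import dataclass
from typing import Union

@dataclass(frozen=True)
class Atom:
    name: str                # an atomic formula, e.g. "NP" or "S"

@dataclass(frozen=True)
class Arrow:
    argument: "Type"         # T1: the resource consumed
    result: "Type"           # T2: the resource produced

Type = Union[Atom, Arrow]

def order(t: Type) -> int:
    """O(A) = 0; O(T1 -o T2) = max(O(T1) + 1, O(T2))."""
    if isinstance(t, Atom):
        return 0
    return max(order(t.argument) + 1, order(t.result))

# NP -o NP -o S (read right-associative): a transitive verb type, of order 1
tv = Arrow(Atom("NP"), Arrow(Atom("NP"), Atom("S")))
assert order(tv) == 1
```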
We can present the inference rules of ILL⊸ together with the associated linear λ-terms in Natural Deduction format. Judgements are sequents of the form x₁ : T₁, ..., xₙ : Tₙ ⊢ M : C. The antecedent left of the turnstile is a typing environment (or context), a sequence of variables xᵢ, each given a type declaration Tᵢ. These variables serve as the parameters of a program M of type C that corresponds to the proof of the sequent. Proofs are built from axioms x : T ⊢ x : T with the aid of two rules of inference:

\[ \frac{\Gamma \vdash M : T_1 \multimap T_2 \qquad \Delta \vdash N : T_1}{\Gamma, \Delta \vdash (M\,N) : T_2} \;\multimap E \quad (1) \]

\[ \frac{\Gamma, x : T_1 \vdash M : T_2}{\Gamma \vdash \lambda x.M : T_1 \multimap T_2} \;\multimap I \quad (2) \]

(1) is the elimination of the implication and models function application; it proposes that if from some context Γ one can derive a program M of type T₁ ⊸ T₂, and from context Δ one can derive a program N of type T₁, then from the multiset union Γ, Δ one can derive a term (M N) of type T₂. (2) is the introduction of the implication and models function abstraction; it proposes that if from a context Γ together with a type declaration x : T₁ one can derive a program term M of type T₂, then from Γ alone one can derive the abstraction λx.M, denoting a linear function of type T₁ ⊸ T₂.

To obtain a grammar based on ILL⊸, we consider the logic in combination with a lexicon, assigning one or more type formulas to the words of the language. In this setting, the proof of a sequent x₁ : T₁, ..., xₙ : Tₙ ⊢ M : C constitutes an algorithm to compute a meaning M of type C, given by substituting parameters xᵢ with lexical meanings wᵢ. In the type lexicon, atomic types are used to denote syntactically autonomous, stand-alone units (words and phrases); e.g. NP for noun-phrase, S for sentence, etc. Function types are assigned to incomplete expressions, e.g. NP ⊸ S for an intransitive verb consuming a noun-phrase to produce a sentence, NP ⊸ NP ⊸ S for a transitive verb (read ⊸ as right-associative), etc. Higher-order types, i.e. types of order greater than 1 (the order of an atomic type equals zero; for function types, O(T₁ ⊸ T₂) = max(O(T₁) + 1, O(T₂))), denote functions that apply to functions; these give the grammar access to hypothetical reasoning, in virtue of the implication introduction rule. Combined with parametric polymorphism, higher-order types eschew the need for phantom syntactic nodes, enabling straightforward derivations for apparent non-linear phenomena involving long-range dependencies, elliptical conjunctions, wh-movement and the like.

[Figure 1 derivation tree omitted; recoverable lexical type assignments: is :: ♦predc ADJ ⊸ ♦su NP ⊸ S_main; eeuwenoud :: ADJ; die :: ♦body(♦obj PRON ⊸ S_sub) ⊸ □mod(NP ⊸ NP); volgen :: ♦obj PRON ⊸ ♦su PRON ⊸ S_sub; ze :: PRON; de :: □det(N ⊸ NP); strategie :: N. Derived judgement: De strategie die ze volgen is eeuwenoud ⊢ (is (eeuwenoud)^predc) ((die (λx^obj. (volgen x) (ze)^su)^body)_mod ((de)_det strategie))^su : S_main]

Figure 1: Example derivation and Curry-Howard λ-term for the phrase De strategie die ze volgen is eeuwenoud ("The strategy that they follow is ancient") from Æthel sample dpc-ind-001645-nl-sen.p.12.s.1_1, showcasing how hypothetical reasoning enables the derivation of an object-relative clause (note how the instantiation of variable x of type PRON followed by its subsequent abstraction creates an argument for the higher-order function assigned to "die"). Judgement premises and rule names have been omitted for brevity's sake.

For our experimental setup, we will be utilizing the Æthel dataset, a Dutch corpus of type-logical derivations (Kogkalidis et al., 2020). Non-commutative categorial grammars in the tradition of Lambek (1958) attempt to directly capture syntactic fine-structure by making a distinction between left- and right-directed variants of the implication. In order to deal with the relatively free word order of Dutch and contrary to the former, Æthel's type system sticks to the directionally non-committed ⊸ for function types, but compensates with two strategies for introducing syntactic discrimination. First, the atomic type inventory distinguishes between major clausal types S_sub, S_v1, S_main, based on the positioning of their verbal head (clause final, clause initial, verb second, respectively). Secondly, function types are enhanced with dependency information, expressed via a family of unary modalities ♦_d, □_m, with dependency labels d, m drawn from disjoint sets of complement vs adjunct markers. The new constructors produce types ♦_d A ⊸ B, used to denote the head of a phrase B that selects for a complement A and assigns it the dependency role d, and types □_m (A ⊸ B), used to denote adjuncts, i.e. non-head functions that project the dependency role m upon application. Following dependency grammar tradition, determiners and modifiers are treated as non-head functions.

The type enhancement induces a dependency marking on the derived λ-term, reflecting the introduction/elimination of the ♦, □ constructors; each dependency domain has a unique head, together with its complements and possible adjuncts, denoted by superscripts and subscripts, respectively. Figure 1 provides an example derivation and the corresponding λ-term. A shallow dependency graph can be trivially reconstructed by traversal of the decorated λ-term, recursively establishing labeled edges along the path from a phrasal head to the head of each of its dependants while skipping abstractions; see Figure 4 for an example.

[Figure 4 graph omitted; it spans De strategie die ze volgen is eeuwenoud with labeled edges ROOT, predc, su, det, mod, body, su.]
Figure 4: Shallow graph for the term of Figure 1.
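To illustrate the traversal just described, here is a rough sketch of edge extraction from a decorated term. The term representation is our own simplification (the dataset's head/adjunct bookkeeping is richer), so treat the names and structure as hypothetical:

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple, Union

@dataclass
class Const:
    word: str                    # a lexical constant

@dataclass
class Var:
    name: str                    # an abstracted (hypothetical) variable

@dataclass
class App:
    fun: "Term"
    arg: "Term"
    dep: Optional[str] = None    # dependency label decorating this application

@dataclass
class Abs:
    var: str
    body: "Term"

Term = Union[Const, Var, App, Abs]

def head_of(t: Term) -> Optional[str]:
    """Lexical head of a term: follow the function spine, skipping abstractions."""
    if isinstance(t, Const):
        return t.word
    if isinstance(t, Abs):
        return head_of(t.body)
    if isinstance(t, App):
        return head_of(t.fun)
    return None  # a bare variable heads no dependency domain

def edges(t: Term) -> List[Tuple[str, str, str]]:
    """Collect labeled (head, role, dependant) edges from a decorated term."""
    if isinstance(t, App):
        out = edges(t.fun) + edges(t.arg)
        if t.dep is not None:
            head, dependant = head_of(t.fun), head_of(t.arg)
            if head is not None and dependant is not None:
                out.append((head, t.dep, dependant))
        return out
    if isinstance(t, Abs):
        return edges(t.body)
    return []
```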
Despite their clear computational interpretation (Girard et al., 1988; Troelstra and Schwichtenberg, 2000; Sørensen and Urzyczyn, 2006), proofs in natural deduction format are arduous to obtain; reasoning with hypotheticals necessitates a mixture of forward and backward chaining search strategies. The sequent calculus presentation, on the other hand, permits exhaustive proof search via pure backward chaining, but does so at the cost of spurious ambiguity. Moreover, both the above assume a tree-like proof structure, which hinders their parallel processing and impairs compatibility with neural methods. As an alternative, we turn our attention towards proof nets (Girard, 1987), a graphical representation of linear logic proofs that captures hypothetical reasoning in a purely geometric manner. Proof nets may be seen as a parallelized version of the sequent calculus or a multi-conclusion version of natural deduction and combine the best of both worlds, allowing for flexible and easily parallelized proof search while maintaining the 1-to-1 correspondence with the terms of the linear λ-calculus.

[Figure 2 link diagrams omitted.]

Figure 2: Links for linear logic proof nets. Left/right: positive/negative implication. Center: axiom link.

To define ILL proof nets, we first need the auxiliary notion of polarity. We assign positive polarity to resources we have, negative polarity to resources we seek. Logically, a formula with negative polarity appears in conclusion position (right of the turnstile), whereas formulas with positive polarity appear in premise position (left of the turnstile). Given a formula and its polarity, the polarity of its subformulas is computed as follows: for a positive formula T₁ ⊸ T₂, T₁ is negative and T₂ is positive, whereas for a negative formula T₁ ⊸ T₂, T₁ is positive and T₂ is negative.

With respect to proof search, proof nets present a simple but general setup as follows. (1) Begin by writing down the formula decomposition tree for all formulas in a sequent P₁, ..., Pₙ ⊢ C, keeping track of polarity information; the result is called a proof frame. (2) Find a perfect matching between the positive and negative atomic formulas; the result is called a proof structure. (3) Finally, verify that the proof structure satisfies the correctness condition; if so, the result is a proof net.

Formula decomposition is fully deterministic, with the decomposition rules shown in Figure 2. There are two logical links, denoting positive and negative occurrences of an implication (corresponding to the elimination and introduction rules of natural deduction, respectively). A third rule, called the axiom link, connects two equal formulas of opposite polarity.

To transform a proof frame into a proof structure, we first need to check the count invariance property, which requires an equal count of positive and negative occurrences for every atomic type, and then connect atoms of opposite polarity. In principle, we can connect any positive atom to any negative atom when both are of the same type; the combinatorics of proof search lies, therefore, in the axiom connections (the number of possible proof structures scales factorially with the number of atoms). Not all proof structures are, however, proof nets. Validating the correctness of a proof net can be done in linear time (Guerrini, 1999; Murawski and Ong, 2000); a common approach is to attempt a traversal of the proof net, ensuring that all nodes are visited (connectedness) and no loops exist (acyclicity) (Danos and Regnier, 1989).

[Figure 3 proof net diagram omitted.]

Figure 3: Proof net corresponding to the natural deduction derivation of Figure 1, with modal markings in place of implication arrows. Atomic types at the fringe of the formula decomposition trees are marked with superscript indices denoting their position for ease of identification. During decoding, the proof frame is flattened as the linear sequence: [[SOS], □det, N, NP, [SEP], N, [SEP], ♦body, ♦su, PRON, S_sub, □mod, NP, NP, [SEP], PRON, [SEP], ♦obj, ...]
There is an apparent tension here between finding just a matching of atomic formulas (which is trivial once we satisfy the count invariance) and finding the correct matching, which produces not only a proof net, but also the preferred semantic reading of the sentence. Deciding the provability of a linear logic sequent is an NP-complete problem (Lincoln, 1995), even in the simplest case where formulas are restricted to order 1 (Kanovich, 1994). Figure 3 shows the proof net equivalent to the derivation of Figure 1.

To sidestep the complexity inherent in the combinatorics of linear logic proof search, we investigate proof net construction from a neural perspective. First, we will need to convert a sentence into a proof frame, i.e. the decomposition of a logical judgement of the form P₁, ..., Pₙ ⊢ C, with Pᵢ the type of word i and C the goal type to be derived. Having obtained a correct proof frame, the problem boils down to establishing axiom links between the sets of positive and negative atoms and verifying their validity according to the correctness criteria. We address each of these steps via a functionally independent neural module, and define Neural Proof Nets as their composition.
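Before moving to the neural modules, the polarity bookkeeping of §2.3 is simple enough to spell out. A minimal sketch of ours, reusing the Atom/Arrow datatype from the earlier snippet:

```python
from collections import Counter
from typing import List, Tuple

def polarized_atoms(t: "Type", positive: bool) -> List[Tuple[str, bool]]:
    """Signed atomic occurrences: the argument of an implication flips polarity."""
    if isinstance(t, Atom):
        return [(t.name, positive)]
    return (polarized_atoms(t.argument, not positive)
            + polarized_atoms(t.result, positive))

def count_invariant(premises: List["Type"], conclusion: "Type") -> bool:
    """Premises are positive, the conclusion negative; every atomic type must
    occur as many times positively as negatively for a matching to exist."""
    occurrences = [a for p in premises for a in polarized_atoms(p, True)]
    occurrences += polarized_atoms(conclusion, False)
    positive = Counter(name for name, sign in occurrences if sign)
    negative = Counter(name for name, sign in occurrences if not sign)
    return positive == negative
```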
Obtaining proof frames is a special case of supertagging, a common problem in the NLP literature (Bangalore and Joshi, 1999). Conventional practice treats supertagging as a discriminative sequence labeling problem, with a neural model contextualizing the tokens of an input sentence before passing them through a linear projection in order to convert them to class weights (Xu et al., 2015; Vaswani et al., 2016). Here, instead, we adopt the generative paradigm (Kogkalidis et al., 2019; Bhargava and Penn, 2020), whereby each type is itself perceived as a sequence of primitive symbols.

Concretely, we perform a depth-first, left-first traversal of formula trees to convert types to prefix (Polish) notation. This converts a type to a linear sequence of symbols s ∈ V, where V = A ∪ D, the union of atomic types and dependency-decorated modal markings. (Dependency decorations occur only within the scope of an implication, so the two are merged into a single symbol for reasons of length economy.) Proof frames can then be represented by joining individual type representations, separated with an extra-logical token [SEP] denoting type breaks and prefixed with a special token [SOS] to denote the sequence start (see the caption of Figure 3 for an example). The resulting sequence becomes the goal of a decoding process conditional on the input sentence, as implemented by a sequence-to-sequence model.
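The serialization itself amounts to one rule per constructor. A sketch, again on the toy datatype above (with a bare arrow token standing in for the merged implication/decoration symbols of the actual vocabulary):

```python
def polish(t: "Type") -> "list[str]":
    """Depth-first, left-first: emit the constructor, then argument, then result."""
    if isinstance(t, Atom):
        return [t.name]
    # in the dependency-enhanced system, the arrow is merged with its modal
    # decoration into a single vocabulary symbol; "->" is a stand-in here
    return ["->"] + polish(t.argument) + polish(t.result)

def frame_sequence(premise_types: "list[Type]") -> "list[str]":
    """Flatten a proof frame: [SOS], then each type in prefix notation + [SEP]."""
    sequence = ["[SOS]"]
    for t in premise_types:
        sequence += polish(t) + ["[SEP]"]
    return sequence

# reusing tv = NP -o NP -o S from the first snippet:
assert polish(tv) == ["->", "NP", "->", "NP", "S"]
```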
Treating supertagging as auto-regressive decoding enables the prediction of any valid type in the grammar, improving generalization and eliminating the need for a strictly defined type lexicon. Further, the decoder's comprehension of the type construction process can yield drastic improvements for beam search, allowing distinct branching paths within individual types. Most importantly, it grants access to the atomic sub-formulas of a sequent, i.e. the primitive entities to be paired within a proof net, a quality that will come into play when considering the axiom linking process later on.
The conversion of a proof frame into a proof structure requires establishing a correct bijection between positive and negative atoms, i.e. linking each positive occurrence of an atom with a single unique negative occurrence of the same atom.

We begin by first noting that each atomic formula occurrence within a proof frame can be assigned an identifying index according to its position in the sequence (refer to the example of Figure 3). For each distinct atomic type, we can then create a table with rows enumerating negative and columns enumerating positive occurrences of that type, ordered by their indexes. We mark cells indexing linked occurrences and leave the rest empty; tables for our running example can be seen in Figure 5. The resulting tables correspond to a permutation matrix Π_A for each atomic type A, i.e. a set of matrices that are square, binary and doubly-stochastic, encoding the permutation over the chain (i.e. ordered set) of negative elements that aligns them with the chain of matching positive elements. This key insight allows us to reframe automated proof search as learning the latent space that dictates the permutations between disjoint and non-contiguous sub-sequences of the primitive symbols constituting a decoded proof frame.

[Figure 5 link tables omitted.]

Figure 5: An alternative view of the axiom links of Figure 3, with tables Π_N, Π_ADJ, Π_Smain, Π_Ssub, Π_PRON, Π_NP depicting the linked indices and corresponding permutations for each atomic type in the sentence.

Permutation matrices are discrete mathematical objects that are not directly attainable by neural models. Their continuous relaxations are, however, valid outputs, approximated by means of the Sinkhorn operator (Sinkhorn, 1964). In essence, the operator and its underlying theorem state that the iterative normalization (alternating between rows and columns) of a square matrix with positive entries yields, in the limit, a doubly-stochastic matrix, the entries of which are almost binary. Put differently, the Sinkhorn operator gives rise to a non-linear activation function that applies on matrices, pushing them towards binarity and bistochasticity, analogous to a 2-dimensional softmax that preserves assignment (Mena et al., 2018). Moving to the logarithmic space eliminates the positive entry constraint and facilitates numeric stability through the log-sum-exp trick. In that setting, the Sinkhorn-normalization of a real-valued square matrix X is defined as:

\[ \mathrm{Sinkhorn}(X) = \lim_{\tau \to \infty} \exp\big(\mathrm{Sinkhorn}^{\tau}(X)\big) \]

where the induction is given by:

\[ \mathrm{Sinkhorn}^{0}(X) = X, \qquad \mathrm{Sinkhorn}^{\tau}(X) = T_r\Big(T_r\big(\mathrm{Sinkhorn}^{\tau-1}(X)\big)^{\top}\Big) \]

with T_r the row normalization in the log-space:

\[ T_r(X)_{i,j} = X_{i,j} - \log \sum_{r=0}^{N-1} e^{X_{r,j} - \max(X_{r,:})} \]
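In code, the operator is only a few lines. Below is a minimal log-space sketch of ours in PyTorch; the iteration count is an illustrative placeholder, torch.logsumexp performs the max-subtraction of T_r internally, and we alternate explicit row and column normalizations rather than transposing:

```python
import torch

def sinkhorn(x: torch.Tensor, n_iters: int = 20) -> torch.Tensor:
    """Log-space Sinkhorn normalization of a square score matrix.

    Alternately subtracts row-wise and column-wise log-sum-exp; exponentiating
    the result yields an (approximately) doubly-stochastic matrix whose
    entries approach binarity as the iterations increase.
    """
    for _ in range(n_iters):
        x = x - torch.logsumexp(x, dim=-1, keepdim=True)  # normalize rows
        x = x - torch.logsumexp(x, dim=-2, keepdim=True)  # normalize columns
    return x.exp()

# a random 4x4 score matrix relaxes towards a permutation matrix
p = sinkhorn(torch.randn(4, 4))
assert torch.allclose(p.sum(dim=-2), torch.ones(4), atol=1e-5)
```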
Bearing the above in mind, our goal reduces to assembling a matrix for each atomic type in a proof frame, with entries containing the unnormalized agreement scores of pairs in the cartesian product of positive and negative occurrences of that type. Given contextualized representations for each primitive symbol within a proof frame, scores can be simply computed as the inter-representation dot-product attention. Assuming, for instance, I⁺_A and I⁻_A the vectors indexing the positions of the a positive and the a negative occurrences of type A in a proof frame sequence, we can arrange the matrices P_A, N_A ∈ ℝ^{a×d} containing their respective contextualized d-dimensional representations (recall that the count invariance property asserts equal shapes). The dot-product attention matrix containing their element-wise agreements will then be given as S̃_A = P_A N_A^⊤ ∈ ℝ^{a×a}. Applying the Sinkhorn operator, we obtain S_A = Sinkhorn(S̃_A), which, in our setup, will be modeled as a continuous approximation of the underlying permutation matrix Π_A.

We first encode sentences using BERTje (de Vries et al., 2019), a pretrained BERT-Base model (Devlin et al., 2019) localized for Dutch. We then decode into proof frame sequences using a Transformer-like decoder (Vaswani et al., 2017).
Symbol Embeddings
In order to best utilize the small, structure-rich vocabulary of the decoder, we opt for lower-dimensional, position-dependent symbol embeddings. We follow insights from Wang et al. (2020) and embed decoder symbols as continuous functions in the complex space, associating each output symbol s ∈ V with a magnitude embedding r_s ∈ ℝᵏ and a frequency embedding ω_s ∈ ℝᵏ. A symbol s occurring in position p in the proof frame is then assigned a vector ṽ_{s,p} = r_s e^{jω_s p} ∈ ℂᵏ. We project to the decoder's vector space by concatenating the real and imaginary parts, obtaining the final representation as v_{s,p} = conc(ℜ(ṽ_{s,p}), ℑ(ṽ_{s,p})) ∈ ℝ^{2k}.

Tying the embedding parameters with those of the pre-softmax transformation reduces the network's memory footprint and improves representation quality (Press and Wolf, 2017). In duality to the input embeddings, we treat output embeddings as functionals parametric to positions. To classify a token occurring in position p, we first compute a matrix V_p consisting of the local embeddings of all vocabulary symbols, V_p = v_{:,p} ∈ ℝ^{|V|×2k}. The transpose of that matrix then acts as a linear map from the decoder's representation to class weights, from which a probability distribution is obtained by application of the softmax function.
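A sketch of such an embedding layer, with the dimensionality k and the module layout as illustrative assumptions rather than the paper's exact configuration:

```python
import torch

class ComplexSymbolEmbedding(torch.nn.Module):
    """Position-dependent symbol embeddings in the style of Wang et al. (2020).

    Each symbol s has a magnitude r_s and a frequency w_s; at position p its
    complex embedding is r_s * exp(j * w_s * p), realized here by concatenating
    the real part r*cos(wp) and the imaginary part r*sin(wp) into R^{2k}.
    """

    def __init__(self, vocab_size: int, k: int):
        super().__init__()
        self.magnitude = torch.nn.Embedding(vocab_size, k)   # r_s
        self.frequency = torch.nn.Embedding(vocab_size, k)   # w_s

    def forward(self, symbols: torch.Tensor) -> torch.Tensor:
        # symbols: (batch, seq) integer ids
        positions = torch.arange(symbols.size(1), device=symbols.device)
        phase = self.frequency(symbols) * positions[None, :, None]   # w_s * p
        r = self.magnitude(symbols)
        return torch.cat((r * torch.cos(phase), r * torch.sin(phase)), dim=-1)
```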
Proof Frame Contextualization

Proof frames may generally give rise to more than one distinct proof, with only a portion of those being linguistically plausible. Frames eligible to more than one potential semantic reading can be disambiguated by accounting for statistical preferences, as exhibited by lexical cues. Consequently, we need our contextualization scheme to incorporate the sentential representation in its processing flow. To that end, we employ another Transformer decoder, now modified to operate with no causal mask, thus allowing all decoded symbols to freely attend over one another regardless of their relative position. This effectively converts it into a bi-modal encoder which operates on two input sequences of different length and dimensionality, namely the BERT output and the sequence of proof frame symbols, and constructs contextualized representations of the latter as informed by the former.
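With off-the-shelf PyTorch modules, the unmasked decoder-as-encoder can be approximated as below; dimensions are assumptions, and the real model additionally bridges the differing widths of the two modalities:

```python
import torch

# A Transformer decoder layer doubles as a bi-modal encoder when no causal
# (tgt) mask is supplied: self-attention over the frame symbols is then
# unrestricted, while cross-attention consumes the sentence representations.
layer = torch.nn.TransformerDecoderLayer(d_model=256, nhead=8, batch_first=True)
bimodal_encoder = torch.nn.TransformerDecoder(layer, num_layers=2)

frame_symbols = torch.randn(1, 40, 256)   # embedded proof frame sequence
sentence_repr = torch.randn(1, 12, 256)   # BERT output, projected to d_model
contextualized = bimodal_encoder(tgt=frame_symbols, memory=sentence_repr)
```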
Axiom Linking
We index the contextualized proof frame to obtain a pair of matrices for each distinct atomic type in a sentence, easing the complexity of the problem by preemptively dismissing the possibility of linking unequal types; this also alleviates performance issues noted when permuting sets of high cardinality (Mena et al., 2018). Post contextualization, positive and negative items are projected to a lower dimensionality via a pair of feed-forward neural functions, applied token-wise. Normalizing the dot-product attention weights between the above with Sinkhorn yields our final output.
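Putting the pieces together, the linking head reduces to two token-wise projections followed by Sinkhorn-normalized dot-product attention per atomic type. A sketch of ours, reusing the sinkhorn function from above; the sizes are assumptions (the GELU activation mirrors Appendix A):

```python
import torch

d_model, d_link = 256, 64
positive_ffn = torch.nn.Sequential(torch.nn.Linear(d_model, d_link), torch.nn.GELU())
negative_ffn = torch.nn.Sequential(torch.nn.Linear(d_model, d_link), torch.nn.GELU())

def link(frame: torch.Tensor, pos_idx: torch.Tensor, neg_idx: torch.Tensor) -> torch.Tensor:
    """Soft axiom links for one atomic type.

    frame:   (seq_len, d_model) contextualized proof frame symbols
    pos_idx: indices of the positive occurrences of the type (length a)
    neg_idx: indices of the negative occurrences of the type (length a)
    returns: (a, a) Sinkhorn-normalized matching scores
    """
    p = positive_ffn(frame[pos_idx])        # (a, d_link)
    n = negative_ffn(frame[neg_idx])        # (a, d_link)
    return sinkhorn(p @ n.transpose(0, 1))  # dot-product attention + Sinkhorn
```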
We train, validate and test our architecture on the corresponding subsets of the Æthel dataset, filtering out samples whose proof frames exceed a fixed maximum number of primitive symbols. Implementation details and hyper-parameter tables, an illustration of the full architecture, dataset statistics and example parses are provided in Appendix A.

We train our architecture end-to-end, including all BERT parameters apart from the embedding layer, using AdamW (Loshchilov and Hutter, 2018). In order to jointly learn representations that accommodate both the proof-frame and the proof-structure outputs, we back-propagate a loss signal derived as the addition of two loss functions. The first is the Kullback-Leibler divergence between the predicted proof frame symbols and the label-smoothed ground-truth distribution (Müller et al., 2019). The second is the negative log-likelihood between the Sinkhorn-activated dot-product weights and the corresponding binary-valued permutation matrices. (The implementing code can be found at github.com/konstantinosKokos/neural-proof-nets.)

Throughout training, we validate by measuring the per-symbol and per-sentence typing accuracy of the greedily decoded proof frame, as well as the linking accuracy under the assumption of an error-free decoding. We perform model selection on the basis of the above metrics and reach convergence after approximately 300 epochs.

We test model performance using beam search. For each input sentence, we consider the β best decode paths, with a path's score being the sum of its symbols' log probabilities, counting all symbols up to the last expected [SEP] token. Neural decoding is followed by a series of filtering steps. We first parse the decoded symbol sequences, discarding beams containing subsequences that do not meet the inductive constructors of the type grammar. The atomic formulas of the passing proof frames are polarized according to the process of §2.3. Frames failing to satisfy the count invariance property are also discarded. The remaining ones constitute potential candidates for a proof structure; their primitive symbols are contextualized by the bimodal encoder, and are then used to compute soft axiom link strengths between atomic formulas of matching types. Discretization of the output yields a graph encoding a proof structure; we follow the net traversal algorithm of Lamarche (2008) to check whether it is a valid proof net, and, if so, produce the λ-term in the process (de Groote and Retoré, 1996). Terms generated this way contain no redundant abstractions, being in β-normal η-long form.

Table 1 presents a breakdown of model performance at different beam widths. To evaluate model performance, we use the first valid beam of each sample, defaulting to the highest scoring beam if none is available. On the token level, we report supertagging accuracy, i.e. the percentage of types correctly assigned. We further measure the percentage of samples satisfying each of the following sentential metrics: 1) invariance property, a condition necessary for being eligible to a proof structure, 2) frame correctness, i.e. whether the decoded frame is identical to the target frame, meaning all types assigned are the correct ones, 3) untyped term accuracy, i.e. whether, regardless of the proof frame, the untyped λ-term coincides with the true one, and 4) typed term accuracy, meaning that both the proof frame and the untyped term are correct.

Metric (%)                         β = 1   β = 2   β = 3   β = 5   β = 7   Baseline (Alpino)
Token Level
  Types Correct                     85.5    91.4    92.4    93.2    93.4    56.2
Sentence Level
  Invariance Correct                87.6    93.4    94.9    96.1    96.6    n/a
  Frame Correct                     57.6    65.3    68.0    69.6    70.2    n/a
  Term Correct (w/o types)
  Term Correct (w/ types & deps)

Table 1: Test set model performance broken down by beam size, and baseline comparison.

Numeric comparison against other works in the literature is neither our prime goal nor an easy task; the dataset utilized is fairly recent, the novelty of our methods renders them non-trivial to adapt to other settings, and ILL-friendly categorial grammars are not particularly common in experimental setups. As a sanity check, however, and in order to obtain some meaningful baselines, we employ the Alpino parser (Bouma et al., 2001). Alpino is a hybrid parser based on a sophisticated hand-written grammar and a maximum entropy disambiguation model; despite its age and the domain difference, Alpino is competitive with the state-of-the-art in UD parsing, remaining within a 2% margin of the last reported benchmark (Bouma and van Noord, 2017; Che et al., 2018). We pair Alpino with the extraction algorithm used to convert its output into ILL⊸,♦,□ derivations (Kogkalidis et al., 2020); together, the two faithfully replicate the data generating process our system has been trained on, modulo the manual correction phase of van Noord et al. (2013). We query Alpino for the globally optimal parse of each sample in the test set (enforcing no time constraints), perform the conversion and log the results in Table 1.

Our model achieves remarkable performance even in the greedy setting, especially considering the rigidity of our metrics. Untyped term accuracy conveys the percentage of sentences for which the function-argument structure has been perfectly captured. Typed term accuracy is even stricter; the added requirement of a correct proof frame practically translates to no erroneous assignments of part-of-speech and syntactic phrase tags or dependency labels. Keeping in mind that dependency information is already incorporated in the proof frame, obtaining the correct proof structure fully subsumes dependency parsing.

The filtering criteria of the previous paragraph yield significant benefits when combined with beam search, allowing us to circumvent logically unsound analyses regardless of their sequence scores. It is worth noting that our metrics place the model's bottleneck at the supertagging rather than the permutation component. Term accuracy closely follows along (and actually surpasses, in the untyped case) frame accuracy. This is further evidenced when providing the ground truth types as input to the parser, in which case term accuracy rises substantially, indicative of the high expressive power of Sinkhorn on top of the bi-modal encoder's contextualization. On the negative side, the strong reliance on correct type assignments means that a single mislabeled word can heavily skew the parse outcome, but also hints at increasing returns from improvements in the decoding architecture.

Our work bears resemblance to other neural methodologies related to syntactic/semantic parsing. Sequence-to-sequence models have been successfully employed in the past to decode directly into flattened representations of parse trees (Wiseman and Rush, 2016; Buys and Blunsom, 2017; Li et al., 2018).
In the dependency parsing literature, head selection involves building word representations that act as classifying functions over other words (Zhang et al., 2017), similar to our dot-product weighting between atoms. Akin to graph-based parsers (Ji et al., 2019; Zhang et al., 2019), our model generates parse structures in the form of graphs. In our case, however, graph nodes correspond to syntactic primitives (atomic types & dependencies) rather than words, while the discovery of the graph structure is subject to hard constraints imposed by the decoder's output.

Transcription to formal expressions (logical forms, λ-terms, database queries and executable program instructions) has also been a prominent theme in the NLP literature, using statistical methods (Zettlemoyer and Collins, 2012) or structurally-constrained decoders (Dong and Lapata, 2016; Xiao et al., 2016; Liu et al., 2018; Cheng et al., 2019). Unlike prior approaches, the decoding we employ here is unhindered by explicit structure; instead, parsing is handled in parallel across the entire sequence by the Sinkhorn operator, which biases the output towards structural correctness while requiring neither backtracking nor iterative processing. More importantly, the λ-terms we generate are not in themselves the product of a neural decoding process, but rather a corollary of the isomorphic relation between ILL⊸ proofs and linear λ-calculus programs.

In the machine learning literature, Sinkhorn-based networks have been gaining popularity as a means of learning latent permutations of visual or synthetic data (Mena et al., 2018) or imposing permutation invariance for set-theoretic learning (Grover et al., 2019), with so far limited adoption in the linguistic setting (Tay et al., 2020; Swanson et al., 2020). In contrast to prior applications of Sinkhorn as a final classification layer, we use it over chain element representations that have been mutually contextualized, rather than set elements vectorized in isolation. Our benchmarks, combined with the assignment-preserving property of the operator, hint towards potential benefits from adopting it in a similar fashion across other parsing tasks.

We have introduced neural proof nets, a data-driven perspective on the proof nets of ILL⊸, and successfully employed them on the demanding task of transcribing raw text to proofs and computational terms of the linear λ-calculus. The terms construed constitute type-safe abstract program skeletons that are free to interpret within arbitrary domains, fulfilling the role of a practical intermediary between text and meaning. Used as-is, they can find direct application in logic-driven models of natural language inference (Abzianidze, 2016).

Our architecture marks a departure from other parsing approaches, owing to the novel use of the Sinkhorn operator, which renders it both fully parallel and backtrack-free, but also logically grounded. It is general enough to apply to a variety of grammar formalisms inheriting from linear logic; if augmented with Gumbel sampling (Mena et al., 2018), it can further provide a probabilistic means to account for derivational ambiguity. Viewed as a means of exposing deep tecto-grammatic structure, it paves the way for graph-theoretic approaches to syntax-aware sentential meaning representations.

Acknowledgements
We would like to thank the anonymous reviewers for their detailed feedback, which helped improve the presentation of the paper. Konstantinos and Michael are supported by the Dutch Research Council (NWO) under the scope of the project "A composition calculus for vector-based semantic modelling with a localization for Dutch" (360-89-070).
References
Lasha Abzianidze. 2016. Natural solution to FraCaS entailment problems. In Proceedings of the Fifth Joint Conference on Lexical and Computational Semantics, pages 64–74, Berlin, Germany. Association for Computational Linguistics.

Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. 2016. Layer normalization. arXiv preprint arXiv:1607.06450v1.

Srinivas Bangalore and Aravind K. Joshi. 1999. Supertagging: An approach to almost parsing. Computational Linguistics, 25(2):237–265.

Aditya Bhargava and Gerald Penn. 2020. Supertagging with CCG primitives. In Proceedings of the 5th Workshop on Representation Learning for NLP, pages 194–204, Online. Association for Computational Linguistics.

Gosse Bouma and Gertjan van Noord. 2017. Increasing return on annotation investment: The automatic construction of a Universal Dependency treebank for Dutch. In Proceedings of the NoDaLiDa 2017 Workshop on Universal Dependencies (UDW 2017), pages 19–26, Gothenburg, Sweden. Association for Computational Linguistics.

Gosse Bouma, Gertjan van Noord, and Robert Malouf. 2001. Alpino: Wide-coverage computational analysis of Dutch. In Computational Linguistics in the Netherlands 2000, pages 45–59. Brill Rodopi.

Nicolaas Govert de Bruijn. 1979. Wiskundigen, let op uw Nederlands. Euclides, 55(juni/juli):429–435.

Jan Buys and Phil Blunsom. 2017. Robust incremental neural semantic graph parsing. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1215–1226.

Wanxiang Che, Yijia Liu, Yuxuan Wang, Bo Zheng, and Ting Liu. 2018. Towards better UD parsing: Deep contextualized word embeddings, ensemble, and treebank concatenation. In Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pages 55–64, Brussels, Belgium. Association for Computational Linguistics.

Jianpeng Cheng, Siva Reddy, Vijay Saraswat, and Mirella Lapata. 2019. Learning an executable neural semantic parser. Computational Linguistics, 45(1):59–94.

Vincent Danos and Laurent Regnier. 1989. The structure of multiplicatives. Archive for Mathematical Logic, 28:181–203.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186.

Li Dong and Mirella Lapata. 2016. Language to logical form with neural attention. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 33–43.

Jean-Yves Girard. 1987. Linear logic. Theoretical Computer Science, 50(1):1–101.

Jean-Yves Girard, Yves Lafont, and P. Taylor. 1988. Proofs and Types. Cambridge Tracts in Theoretical Computer Science 7. Cambridge University Press.

Philippe de Groote. 2001. Towards abstract categorial grammars. In Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics, pages 252–259.

Philippe de Groote and Christian Retoré. 1996. On the semantic readings of proof-nets. In Proceedings Formal Grammar, pages 57–70, Prague, Czech Republic. FoLLI.

Aditya Grover, Eric Wang, Aaron Zweig, and Stefano Ermon. 2019. Stochastic optimization of sorting networks via continuous relaxations. In International Conference on Learning Representations.

Stefano Guerrini. 1999. Correctness of multiplicative proof nets is linear. In Fourteenth Annual IEEE Symposium on Logic in Computer Science, pages 454–463. IEEE Computer Science Society.

Dan Hendrycks and Kevin Gimpel. 2016. Bridging nonlinearities and stochastic regularizers with Gaussian error linear units.

Tao Ji, Yuanbin Wu, and Man Lan. 2019. Graph-based dependency parsing with graph neural networks. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2475–2485.

Max I. Kanovich. 1994. The complexity of Horn fragments of linear logic. Annals of Pure and Applied Logic, 69(2-3):195–241.

Konstantinos Kogkalidis, Michael Moortgat, and Tejaswini Deoskar. 2019. Constructive type-logical supertagging with self-attention networks. In Proceedings of the 4th Workshop on Representation Learning for NLP (RepL4NLP-2019), pages 113–123.

Konstantinos Kogkalidis, Michael Moortgat, and Richard Moot. 2020. ÆTHEL: Automatically extracted typelogical derivations for Dutch. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 5259–5268, Marseille, France. European Language Resources Association.

Yusuke Kubota and Robert Levine. 2020. Type-Logical Syntax. MIT Press.

François Lamarche. 2008. Proof nets for intuitionistic linear logic: Essential nets. Research report, INRIA Nancy.

Joachim Lambek. 1958. The mathematics of sentence structure. The American Mathematical Monthly, 65(3):154–170.

Zuchao Li, Jiaxun Cai, Shexia He, and Hai Zhao. 2018. Seq2seq dependency parsing. In Proceedings of the 27th International Conference on Computational Linguistics, pages 3203–3214.

Patrick Lincoln. 1995. Deciding provability of linear logic formulas. In Jean-Yves Girard, Yves Lafont, and Laurent Regnier, editors, Advances in Linear Logic, pages 109–122. Cambridge University Press.

Jiangming Liu, Shay B. Cohen, and Mirella Lapata. 2018. Discourse representation structure parsing. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 429–439.

Ilya Loshchilov and Frank Hutter. 2018. Fixing weight decay regularization in Adam.

Gonzalo Mena, David Belanger, Scott Linderman, and Jasper Snoek. 2018. Learning latent permutations with Gumbel-Sinkhorn networks. In International Conference on Learning Representations.

Michael Moortgat. 1996. Multimodal linguistic inference. Journal of Logic, Language and Information, 5(3/4):349–385.

Glyn Morrill. 2014. A categorial type logic. In Categories and Types in Logic, Language, and Physics: Essays Dedicated to Jim Lambek on the Occasion of His 90th Birthday, volume 8222 of Lecture Notes in Computer Science, pages 331–352. Springer.

Rafael Müller, Simon Kornblith, and Geoffrey E. Hinton. 2019. When does label smoothing help? In Advances in Neural Information Processing Systems, pages 4696–4705.

Andrzej S. Murawski and C.-H. Luke Ong. 2000. Dominator trees and fast verification of proof nets. In Logic in Computer Science, pages 181–191.

Reinhard Muskens. 2001. Lambda grammars and the syntax-semantics interface. In Proceedings of the 13th Amsterdam Colloquium, pages 150–155.

Reinhard Muskens and Mehrnoosh Sadrzadeh. 2018. Static and dynamic vector semantics for lambda calculus models of natural language. Journal of Language Modelling, 6(2):319–351.

Gertjan van Noord, Gosse Bouma, Frank van Eynde, Daniel de Kok, Jelmer van der Linde, Ineke Schuurman, Erik Tjong Kim Sang, and Vincent Vandeghinste. 2013. Large scale syntactic annotation of written Dutch: Lassy. In Essential Speech and Language Technology for Dutch, pages 147–164. Springer, Berlin, Heidelberg.

Ofir Press and Lior Wolf. 2017. Using the output embedding to improve language models. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 157–163.

Dirk Roorda. 1991. Resource Logics: Proof-theoretical Investigations. Ph.D. thesis, Universiteit van Amsterdam.

Richard Sinkhorn. 1964. A relationship between arbitrary positive matrices and doubly stochastic matrices. The Annals of Mathematical Statistics, 35(2):876–879.

Morten Heine Sørensen and Pawel Urzyczyn. 2006. Lectures on the Curry-Howard Isomorphism. Elsevier.

Kyle Swanson, Lili Yu, and Tao Lei. 2020. Rationalizing text matching: Learning sparse alignments via optimal transport. arXiv preprint arXiv:2005.13111.

Yi Tay, Dara Bahri, Liu Yang, Donald Metzler, and Da-Cheng Juan. 2020. Sparse Sinkhorn attention. arXiv preprint arXiv:2002.11296v1.

Anne Sjerp Troelstra and Helmut Schwichtenberg. 2000. Basic Proof Theory, 2nd edition, volume 43 of Cambridge Tracts in Theoretical Computer Science. Cambridge University Press.

Ashish Vaswani, Yonatan Bisk, Kenji Sagae, and Ryan Musa. 2016. Supertagging with LSTMs. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 232–237.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.

Wietse de Vries, Andreas van Cranenburgh, Arianna Bisazza, Tommaso Caselli, Gertjan van Noord, and Malvina Nissim. 2019. BERTje: A Dutch BERT model. arXiv preprint arXiv:1912.09582v1.

Philip Wadler. 1993. A taste of linear logic. In International Symposium on Mathematical Foundations of Computer Science, pages 185–210. Springer.

Benyou Wang, Donghao Zhao, Christina Lioma, Qiuchi Li, Peng Zhang, and Jakob Grue Simonsen. 2020. Encoding word order in complex embeddings. In International Conference on Learning Representations.

Sam Wiseman and Alexander M. Rush. 2016. Sequence-to-sequence learning as beam-search optimization. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1296–1306.

Chunyang Xiao, Marc Dymetman, and Claire Gardent. 2016. Sequence-based structured prediction for semantic parsing. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1341–1350.

Wenduan Xu, Michael Auli, and Stephen Clark. 2015. CCG supertagging with a recurrent neural network. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 250–255.

Luke S. Zettlemoyer and Michael Collins. 2012. Learning to map sentences to logical form: Structured classification with probabilistic categorial grammars. arXiv preprint arXiv:1207.1420v1.

Sheng Zhang, Xutai Ma, Kevin Duh, and Benjamin Van Durme. 2019. AMR parsing as sequence-to-graph transduction. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 80–94, Florence, Italy. Association for Computational Linguistics.

Xingxing Zhang, Jianpeng Cheng, and Mirella Lapata. 2017. Dependency parsing as head selection. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 665–676.
Appendix
A.1 Model
Table 2 presents model hyper-parameters, as selected by greedy grid search. An illustration of the model can be seen in Figure 6.
Parameter                             Value
BERTje (BERT-Base)
  Feed-forward dimensionality
  Feed-forward activation             GELU
  Input/output dimensionality
  Vocabulary size                     30 000
Decoder
  Feed-forward dimensionality
  Input/output dimensionality
  Vocabulary size                     58
Bi-modal Encoder
  Feed-forward dimensionality
  Feed-forward activation             GELU
  Input/output dimensionality
Pre-Sinkhorn Transformations
  Input/Feed-forward dimensionality
  Feed-forward activation             GELU
  Output dimensionality
  Output activation                   LayerNorm

Table 2: Model hyper-parameters.
A.2 Optimization
We train with an adaptive learning rate following Vaswani et al. (2017), such that the learning rate at optimization step i is given as:

\[ \mathrm{lr}(i) = d_{\mathrm{model}}^{-0.5} \cdot \min\big(i^{-0.5},\; i \cdot \mathrm{warmup\_steps}^{-1.5}\big) \]

For BERT parameters, the learning rate is scaled down by a constant factor. We freeze the oversized word embedding layer to reduce training costs and avoid overfitting. Optimization hyper-parameters are presented in Table 3.

We provide strict teacher guidance when learning axiom links, whereby the network is provided with the original proof frame symbol sequence instead of the predicted one. To speed up computation, positive and negative indexes are arranged per length rather than per type for each batch; this allows us to process symbol transformations, dot-product attentions and Sinkhorn activations in parallel for many types across many sentences. During training, we set the number of Sinkhorn iterations to a fixed value; lower values are more difficult to reach convergence with, hurting performance, whereas higher values can easily lead to vanishing gradients, impeding learning (Grover et al., 2019).
Parameter                   Value
Batch size
Warmup epochs
Weight decay
Weight decay (BERT)
LR scale (BERT)
LR scale (BERT embedding)
Dropout rate
Label smoothing

Table 3: Optimizer hyper-parameters.
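For reference, the learning-rate schedule of A.2 in code; d_model and warmup_steps are illustrative placeholders, since the concrete values did not survive extraction:

```python
def learning_rate(step: int, d_model: int = 512, warmup_steps: int = 4000) -> float:
    """Inverse-square-root schedule with linear warmup (Vaswani et al., 2017)."""
    step = max(step, 1)  # avoid 0 ** -0.5 at the very first optimization step
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

# the rate ramps up linearly during warmup, then decays as 1/sqrt(step)
assert learning_rate(4000) >= learning_rate(8000)
```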
A.3 Data
Figure 7 presents cumulative distributions of dataset statistics. The kept portion of the dataset enumerates 55 683 training, validation and test samples.
A.4 Performance
Table 4 summarizes the model's performance in terms of untyped term accuracy over the test set in the greedy setting, binned according to input sentence lengths. Table 5 presents input-output pairs from sample sentences not included in the dataset.
Sentence Length    Total    Correct    (%)
1 – 5              808      743        92
10 – 15
15 – 20
20 –               592      154        26

Table 4: Test set model performance broken down by sentence length.
[Figure 6 diagram omitted; recoverable components: input sentence → BERT; a supertagger stack of masked multi-head attention and feed-forward layers with a symbol embedder and symbol classifier; atom & polarity indexing; positive and negative FFNs feeding a dot-product & Sinkhorn layer within an unmasked attention stack for the axiom linker.]

Figure 6: Schematic diagram of the full network architecture. The supertagger (orange, left) iteratively generates a proof frame by attending over the currently available part of it plus the full input sentence. The axiom linker (green, right) contextualizes the complete proof frame by attending over it as well as the sentence. Representations of atomic formulas are gathered and transformed according to their polarity, and their Sinkhorn-activated dot-product attention is computed. Discretization of the result yields a permutation matrix denoting axiom links for each unique atomic type in the proof frame. The final output is a proof structure, i.e. the pair of a proof frame and its axiom links.

Figure 7: log2-transformed cumulative distributions of symbol and word lengths, counts of atomic formulas, matrices and matrix sizes from the portion of the dataset trained on.
” – I nh e t N e d e r l and s k unn e n ve l e z i nn e n w a t v o l go r d e b e t r e ft o m g e goo i d w o r d e n . “ I n D u t c h , m a ny s e n t e n ce s ca nb e r e s t r u c t u r e d a s f a r a s o r d e r i s c on ce r n e d . ” ( kunnen :: I N F (cid:40) N P (cid:40) S m a i n ( worden :: PP A R T (cid:40) I N F (( wat :: ( P R ON (cid:40) S s ub ) (cid:40) PP A R T (cid:40) PP A R T λ x s u . (( betreft :: N (cid:40) P R ON (cid:40) S s ub ( volgorde :: N ) ob j ) x ) r e l _body ) m od (( In :: N P (cid:40) PP A R T (cid:40) PP A R T (( het :: N (cid:40) N P ) d e t Nederlands :: N ) ob j ) m od omgegooid :: PP A R T )) v c ) v c )(( vele :: N P (cid:40) N P ) m od zinnen :: N P ) s u I nh e t N e d e r l and s k unn e n v aa k t w ee z i nn e n t o t èè n k o r t e r e w o r d e n s a m e ng e t r o kke n . “ I n D u t c h , t w o s e n t e n ce s ca no f t e nb e m e r g e d i n t o a s ho r t e r on e . ” ( kunnen :: I N F (cid:40) N P (cid:40) S m a i n ( worden :: PP A R T (cid:40) I N F (( vaak :: PP A R T (cid:40) PP A R T ) m od (( In :: N P (cid:40) PP A R T (cid:40) PP A R T (( het :: N (cid:40) N P ) d e t Nederlands :: N ) ob j ) m od ( samengetrokken :: PP (cid:40) PP A R T ( tot :: N P (cid:40) PP (( `e`en :: AD J (cid:40) N P ) d e t kortere :: AD J ) ob j ) l d )) v c ) v c )(( twee :: N (cid:40) N P ) d e t zinnen :: N ) s u P opu l a i r e t aa li s v aa k m i nd e r b eve ili gd t e g e ndubb e l z i nn i gh e i ddann e tt e t aa l , e nh e t m e ng s e l v anb e i d e t a l e n i s nògg ev aa r lij ke r . “ I n f o r m a ll a ngu a g e i s o f t e n l e ss p r o t ec t e d a g a i n s t a m b i gu it y t h a n f o r m a ll a ngu a g e , a nd t h e m i x t u r e o f bo t h l a ngu a g e s i s e v e n m o r e d a ng e r ou s . ” ( en :: S m a i n (cid:40) S m a i n (cid:40) (( is :: PP A R T (cid:40) N P (cid:40) S m a i n (( minder :: C P (cid:40) PP A R T (cid:40) PP A R T ( dan :: N P (cid:40) C P (( nette :: N P (cid:40) N P ) m od taal :: N P ) c m p_body ) ob c o m p ) m od (( vaak :: PP A R T (cid:40) PP A R T ) m od ( beveiligd :: PP (cid:40) PP A R T (( tegen :: N P (cid:40) PP ( dubbelzinnighead : N P ) ob j ) p c ))) v c (( Populaire :: N P (cid:40) N P ) m od taal :: N P ) s u ) c n j )(( is :: A P (cid:40) N P (cid:40) S m a i n (( n`og :: A P (cid:40) A P ) m od gevaarlijker :: A P ) p r e d c (( van :: N P (cid:40) N P (cid:40) N P (( beide :: N (cid:40) N P ) d e t talen :: N ) ob j ) m od ( het :: N (cid:40) N P ) d e t mengsel :: N )) s u ) c n j T a b l e : G r ee dyp a r s e s o f t h e op e n i ng s e n t e n ce s o f t h e fi r s t s e v e np a r a g r a ph s o f d e B r u ij n ( ) , i n t h e f o r m o f t yp e - a ndd e p e nd e n c y - a nno t a t e d λ e xp r e ss i on s . T w oo f t h e m ( & ) y i e l dnov a li dp r oo f n e t;t h e r e m a i n i ng fi v ea r e bo t hv a li d a nd c o rr ec tt