Learning Language from a Large (Unannotated) Corpus
Linas Vepstas and Ben Goertzel
January 16, 2014
Abstract
A novel approach to the fully automated, unsupervised extraction of dependency grammars and associated syntax-to-semantic-relationship mappings from large text corpora is described. The suggested approach builds on the authors' prior work with the Link Grammar, RelEx and OpenCog systems, as well as on a number of prior papers and approaches from the statistical language learning literature. If successful, this approach would enable the mining of all the information needed to power a natural language comprehension and generation system, directly from a large, unannotated corpus.
Currently, the two primary methods of supplying natural language processing systems with "content" regarding specific languages are:

1. Explicit human-coded linguistic rules,
2. Supervised machine learning from human-annotated corpora.

Neither of these approaches is fully satisfactory, because both rely on substantial formalized coding by expert humans. Natural language is sufficiently complex and diverse that it eludes full formalization, either in the form of hand-coded rules, or in the form of annotation of corpora. Even if, in principle, some sufficiently large hand-coded rule-set or annotated corpus were enough to supply an NLP system with linguistic content, it leaves open the question of operating with the higher, more abstract structures that are the outcome of parsing: rule-sets do not address the issue of semantic content. Thus, in practice, these traditional approaches are unlikely to yield full success.

The goal here is to explore the alternative: the induction of grammar and semantics by means of unsupervised learning algorithms. Two approaches that have been discussed and attempted, to some extent, by the scientific community, are:

3. Machine learning from large, unannotated text corpora,
4. Machine learning from (unannotated) data regarding spoken or textual language in non-linguistic contexts (e.g. texts together with pictures, or spoken language together with video and ambient audio).

Interesting ideas have been developed in both of these directions, but, so far, results have fallen far short of those obtained via the first two approaches.

The review of [KM04] provides a summary of the state of the art in automatic grammar induction (the third alternative listed above), as it stood a decade ago: it addresses a number of linguistic issues and difficulties that arise in actual implementations of algorithms.
It is also notable in that it builds a bridge between phrase-structure grammars and dependency grammars, essentially pointing out that these are more or less equivalent, and that, in fact, significant progress can be achieved by taking on both points of view at once. Grammar induction has progressed somewhat since this review was written, and we will mention some of the more recent work below; but it is fair to say that there has been no truly dramatic progress in this direction.

Here, we describe a novel approach to achieving the third alternative: automated grammar induction by machine learning of linguistic content from a large, unannotated text corpus. The methods described may also be useful for the fourth alternative (incorporation of extralinguistic data in the learning system's inputs); and could make use of content created using hand-coded rules or machine learning from annotated corpora. However, our focus will be on learning linguistic content from a large, unannotated text corpus.

While the overall approach presented here is novel, the ideas are extensions and generalizations of the prior work of multiple authors, which will be referenced and in some cases discussed below. We believe the body of ideas needed to enable unsupervised learning of language from large corpora has been gradually emerging during the last decade. The approach given here has unique aspects, but many aspects have also already been validated by the work of others.

For the sake of simplicity, we will deal here only with learning from written text. We believe that conceptually similar methods can be applied to spoken language as well, but that this brings extra complexities that we will avoid for the purposes of the present document. (In short: below, we represent syntactic and semantic learning as separate but similarly structured and closely coupled learning processes.
To handle speech input thoroughly, we would suggest phonological learning as another separate, similarly structured and closely coupled learning process.)

Finally, we stress that the algorithms presented here are intended to be used in conjunction with a large corpus, and a large amount of processing power. Without a very large corpus, some of the feedback required for the learning process described would be unlikely to happen (e.g. the ability of syntactic and semantic learning to guide each other). We have not yet sought to estimate exactly how large a corpus would be required, but our informal estimate is that Wikipedia might or might not be large enough, and the Web is certainly more than enough.

The rest of this paper is devoted to fleshing out, providing detail, and mounting a theoretical defense of a rather simple, basic algorithm. Rather than getting lost in the details, it is important to keep a general notion of the algorithm in mind at all times. Thus, a crude sketch follows.

The algorithm is as follows:

Step A) Define words to be 'things';
Step B) Look for correlations between things;
Step C) Cluster similar things together into classes;
Step D) Define a new set of things as the clusters obtained from the last step;
Step E) Return to step B.

By 'correlation', we will almost always mean 'mutual information' or 'mutual entropy'. This is a number capturing the strength of a relationship between 'things'. The 'relationship' between things will be, in general, a graph or hypergraph. However, very early in the iteration of the algorithm, it will be very simple: if 'things' are words, then the 'relationship' is pairs of words, and one starts by measuring the mutual information of pairs of words.

The classification of step C is to be accomplished primarily using entropy maximization/minimization principles. To illustrate with pairs of words: consider words A, B and W.
Then, word A and word B should be grouped together into a cluster C if, for any word W, the total entropy of pair (word W, class C) + member (A in C) + member (B in C) is less than the total entropy of pair (word W, word A) + pair (word W, word B). If this inequality holds, then the cluster C should be formed; if not, then there is no advantage to having such a class, and it should be dissolved. (This example should not be taken literally, as, even for word pairings, the actual relationships that must be considered are more complex than this. Detailed inequalities for clustering are presented in an appendix.)

The penultimate step D makes considerable demands on technology. In order to use 'things' observed in the environment, one must be able to recognize those things: and so, one needs to have a pattern recognizer. However, patterns do not sit all by themselves; in fact, they interconnect, and so one must have a way of assembling them so that they match the observed input. This is the 'parser' of natural language processing. Here, in the abstract context of 'patterns' of 'things', it is probably best to think of each pattern as a puzzle-piece. A collection of puzzle pieces must then be assembled into a complete, final picture that more or less matches the observed input. This is essentially a "constraint satisfaction problem", which should conjure up the kinds of algorithms required, as well as the difficulties one may have in recognizing patterns in an input. For the general case, the best and most well-known algorithm for solving constraints is the DPLL algorithm[WP-d], on which most SAT solvers[WP-b] and other systems are based. For language learning, though, it does not seem that a jump to SAT is immediately required. Many patterns will be relatively simple, and have simple inter-dependencies, for which more basic algorithms should suffice. These include the backwards-forwards or the Viterbi algorithms, which should be fast, effective and sufficient.
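Stepping back to the clustering of step C for a moment, the entropy comparison described above can be illustrated in miniature. The counts below are hypothetical, and, as the text notes, the real clustering inequalities are more involved; this is only a toy illustration of the bookkeeping:

```python
import math
from collections import Counter

# Toy illustration of the step-C clustering test (hypothetical counts).
# We compare the entropy of keeping the words "dog" and "book" as separate
# pair-targets against the entropy after pooling them into a class C,
# plus a membership cost for picking a member of C.

pair_counts = Counter({
    ("the", "dog"): 50, ("the", "book"): 45,
    ("a", "dog"): 30, ("a", "book"): 35,
})
total = sum(pair_counts.values())

def entropy(counts):
    """Shannon entropy, in bits, of the given observation counts."""
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

# Entropy of the four separate pairs (W, dog) and (W, book):
separate = entropy(pair_counts.values())

# Entropy after merging into class C = {dog, book}: counts are pooled...
merged_counts = Counter()
for (left, right), c in pair_counts.items():
    merged_counts[(left, "C")] += c
merged = entropy(merged_counts.values())

# ...plus the cost of naming a member of C (two equiprobable members = 1 bit).
membership = math.log2(2)

print(f"separate: {separate:.3f} bits")
print(f"merged:   {merged:.3f} bits (+{membership:.0f} bit membership)")
# Form the cluster C only if the merged description is cheaper overall.
```

In a real system, after a few iterations the 'things' being clustered are no longer bare words but the connectors, disjuncts and relations built from them.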
So, for example, in parsing a sentence, as each new word is 'heard', a set of different relationships with the surrounding words may be contemplated. After enough words have been heard in a row (say 5 or 7 or 10), their inter-relationship should become clear, and one can move on. This is essentially the Viterbi algorithm, which discards the unlikely combinations, keeping only the most likely candidate(s). A global, Boolean-SAT-style optimization is not required: what we hear now really should not affect the parse of a sentence we heard several minutes ago. However, this changes once the patterns become complex enough. So, for example, consider a set of large patterns, encoding meaning, that span multiple sentences or even paragraphs. These large patterns may not fit together in any simple way, and may require a good amount of wrestling to assemble together in a coherent, consistent fashion. This is where the full force of a strong constraint-satisfaction solver would be required.

Implicit in step D is also a layering or recursion of pattern complexity, and the application of 'deep learning' principles. The groupings of the previous step have a tendency to reduce the total number of rules or relationships needed to describe the corpus (entropy minimization); however, the complexity of the rules tends to increase. It is the sum of the (logarithm of the) number of rules, and the (logarithmic) complexity, that provides the proper metric: an "Occam's razor" to discover the least-complex and smallest set of rules possible. However, as confidence in a set of rules grows, and uncertainty diminishes, exceptions become more apparent. Such exceptions encode semantic information.
So, for example, commonality between the phrases "the book has ..." and "the dog has ..." suggests that "book" and "dog" can be grouped into a class "noun". The absence of phrases such as "the book barked at the squirrel" suggests that perhaps books and dogs differ, and can be sub-classified as inanimate and animate objects. This re-classification has the odd effect of strengthening the original conception of "noun", by forcing out words that were incorrectly classified as nouns in the first place. This can be viewed as a form of "deep learning" at work: a higher, more abstract layer can serve to refine the correctness of a shallower, more concrete layer. Implicit in the algorithm is not just an inter-dependency of rules, but also a layering or hierarchy.

This, then, is the basic outline of the algorithm that is being proposed here. It should be clear, from this description, that it is capable of levering its way up from a lack of structure to a collection of complex patterns, and doing so without any 'training data', in an unsupervised fashion. The remainder of this proposal is then devoted to justifying why this might be the right approach, as well as fleshing out in greater detail how all this might work.
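Before moving on, the word-pair correlation of step B, the seed for the whole iteration, can be sketched concretely. The three-sentence "corpus" below is of course hypothetical; a real run would use millions of sentences, and would consider non-adjacent word pairs as well:

```python
import math
from collections import Counter

# Hypothetical toy corpus; real input would be vastly larger.
corpus = [
    "the cat chased a snake",
    "the dog chased a cat",
    "a cat saw the dog",
]

unigrams, bigrams = Counter(), Counter()
for sentence in corpus:
    words = sentence.split()
    unigrams.update(words)
    bigrams.update(zip(words, words[1:]))

n_uni = sum(unigrams.values())
n_bi = sum(bigrams.values())

def mutual_information(w1, w2):
    """Pointwise mutual information, in bits, of the ordered pair (w1, w2)."""
    p_pair = bigrams[(w1, w2)] / n_bi
    if p_pair == 0:
        return float("-inf")  # pair never observed together
    p1, p2 = unigrams[w1] / n_uni, unigrams[w2] / n_uni
    return math.log2(p_pair / (p1 * p2))

print(mutual_information("chased", "a"))
print(mutual_information("the", "cat"))
```

Pairs with high mutual information are the first candidates for the 'relationships' that the clustering of step C then organizes into classes.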
While the approach outlined here aims to learn the linguistic content of a language from textual data, it does not aim to learn the idea of language. Implicitly, we assume a model in which a learning system begins with a basic "linguistic infrastructure" indicating the various parts of a natural language and how they generally interrelate; and it then learns the linguistic content characterizing a particular language. In principle, it would also be possible to have an AI system learn the very concept of a language and build its own linguistic infrastructure. However, that is not the problem we address here; and we suspect such an approach would require drastically more computational resources.

The basic linguistic infrastructure assumed here includes:

• A formalism for expressing grammatical (dependency) rules is assumed.

– The ideas given here are not tied to any specific grammatical formalism, but we find it convenient to make use of a formalism in the style of dependency grammar[Tes59]. Taking a mathematical perspective, different grammar formalisms can be translated into one another, using relatively simple rules and algorithms[KM04]. The primary difference between them is more a matter of taste, perceived linguistic 'naturalness', adaptability, and choice of parser algorithm. In particular, categorial grammars[KSPC13] can be converted into link grammars in a straight-forward way, and vice versa, but link grammars provide a more compact dictionary. Link grammars[ST91, ST93] are a type of dependency grammar; these, in turn, can be converted to and from phrase-structure grammars. We believe that dependency grammars provide a simpler and more natural description of linguistic phenomena. (This is anchored on the psycho-linguistic observation that almost all dependencies are short, and rarely extend past the first few nearest neighbors.[Gib98, Tem07, Liu08, iC06])
We also believe that dependency grammars have a more natural fit with maximum-entropy ideas, where a dependency relationship can be literally interpreted as the mutual information between word-pairs[Yur98]. Dependency grammars also work well with Markov models; dependency parsers can be implemented as Viterbi decoders. Figure 1 illustrates two different formalisms.

– The discussion below assumes the use of a formalism similar to that of Link Grammar. In this theory, each word is associated with a set of 'connector disjuncts', each connector disjunct controlling the possible linkages that the word may take part in. A disjunct can be thought of as a jig-saw puzzle-piece; valid syntactic word orders are those for which the puzzle-pieces can be validly connected. A single connector can be thought of as a single tab on a puzzle-piece (shown in figure 2). Connectors are thus 'types' X with a + or - sign indicating that they connect to the left or right. For example, a typical verb disjunct might be S- & O+, indicating that a subject (a noun) is expected on the left, and an object (also a noun) is expected on the right.

– Some of the discussion below assumes aspects of (Dick Hudson's) Word Grammar[Hud84, Hud07]. This theory (implicitly) uses connectors similar to those of Link Grammar, but allows each connector to be marked as the head of a link or not. A link then becomes an arrow from a head word to the dependent word.

– Each word is associated with a "lexical entry"; in Link Grammar, this is the set of connector disjuncts for that word. It is usually the case that many words share a common lexical entry; for example, most common nouns are syntactically similar enough that they can all be grouped under a single lexical entry. Conversely, a single word is allowed to have multiple lexical entries; so, for example, "saw", the noun, will have a different lexical entry from "saw", the past tense of the verb "to see".
That is, lexical entries can loosely correspond to traditional dictionary entries. Whether or not a word has multiple lexical entries is a matter of convenience, rather than a fundamental aspect. Curiously, a single Link Grammar connector disjunct can be viewed as a very fine-grained part-of-speech. For example, S- & O+ is the disjunct used for transitive verbs (verbs that take an object), while S- & O+ & On+ is the disjunct for ditransitive verbs (verbs that take two objects: a direct and indirect object). These fine-grained parts-of-speech correlate reasonably well with word senses (e.g. taken from WordNet) and can thus serve as a rough (and very rapid!) suggestion for word-sense disambiguation. In this way, disjuncts are a stepping stone to the semantic meaning of a word.

Figure 2: Link Grammar Connectors. An illustration of Link Grammar connectors and disjuncts. The connectors are the jigsaw-puzzle-shaped pieces; connectors are allowed to connect only when the tabs fit together. A disjunct is the entire (ordered) set of connectors for a word. As lexical entries appearing in a dictionary, the above would be written as

    a the: D+;
    cat snake: D- & (S+ or O-);
    Mary: O- or S+;
    ran: S-;
    chased: S- & O+;

Note that although the symbols "&" and "or" are used to write down disjuncts, these are not Boolean operators, and do not form a Boolean algebra. They do form a non-symmetric compact closed monoidal algebra. The diagram below illustrates puzzle pieces, assembled to form a parse. The comparable ASCII-graphics parse, from a recent version of the parser, is:

                          +----Os-----+
             +-Ds-+---Ss--+      +-Ds-+
             |    |       |      |    |
            the cat.n chased.v-d a snake.n

The additional lower-case 's' shown here (e.g. Ds) indicates that the link connects to a singular (not plural) noun. The words have also been decorated with parts of speech. (Images taken from [ST91].)

• A parser, for extracting syntactic structure from sentences, is assumed.
What's more, it is assumed that the parser is capable of using dynamic criteria, such as semantic relationships, to guide parsing.

– The statement here is about the type of functionality needed from the parsing component. Traditional parsers presume a static, fixed lexis, and do not provide any mechanism by which parse ranking can be adjusted or steered, on-the-fly, by mechanisms outside of the domain of the parser itself ("deep learning"). Yet, such external influence seems centrally important to a realistic system.

– A paradigmatic example of such a parser is the "Viterbi Link Parser", currently under development for use with the Link Grammar. This parser is currently operational in a simple form. The name refers to its use of the general ideas of the Viterbi algorithm. This algorithm seems biologically plausible, in that it applies only a local analysis of sentence structure, of limited scope, as opposed to a global optimization, thus roughly emulating the process of human listening. The current set of legal parses of a sentence is pruned incrementally and probabilistically, based on flexible criteria. Although the core criteria are meant to be the traditional grammatical dependency rules taken from the lexis, they need not be limited to this. Thus, criteria that can sway parse likelihood potentially include the semantic classes and roles extracted from a partial parse obtained at a given point in time; such semantic relationships are typically not present in a traditional syntactic lexis. Dynamic likelihood criteria also allow for parsing to be guided by inter-sentence relationships, such as pronoun resolution, to disambiguate otherwise ambiguous sentences.

– Inherent in parsing is assigning a likelihood or probability to a given parse.
This probability is assembled from several sources, including an inherent strength of relationships (some disjuncts are inherently more appropriate than others), structural constraints (long range relationships and link-crossings are disfavored), as well as a combinatorial entropy of possible choices ("Sri Lanka" is a set phrase).

• A formalism for expressing semantic relationships is assumed.

– A semantic relationship generalizes the notion of a lexical entry to allow for changes of word order, paraphrasing, tense, number, the presence or absence of modifiers, etc. An example of such a relationship would be eat(X, Y) – indicating the eating of some entity Y by some entity X. This abstracts into common form several different syntactic expressions: "Ben ate a cookie", "A cookie will be eaten by Ben", "Ben sat, eating cookies".

– Nothing particularly special is assumed here regarding semantic relationships, beyond a basic predicate-argument structure[WP-e, WP-a]. It is assumed that predicates can have arguments that are other predicates, and not just atomic terms; this has an explicit impact on how predicates and arguments are represented. A "semantic representation" of a sentence is a network of arrows (defining predicates and arguments), each arrow or a small subset of arrows defining a "semantic relationship". However, the beginning or end of an arrow is not necessarily a single node, but may land on a subgraph. Because arrows may point from or to subgraphs, the resulting structure itself is no longer a graph in the proper sense, but a hypergraph.

– Type constraints seem reasonable, but it is not clear if these must be made explicit, or if they are the implicit result of learning. Thus, eat(X, Y) requires that X and Y both be entities, and not, for example, actions or prepositions. (Again, this is anchored on the observation that almost all dependencies are short, and rarely extend past the first few nearest neighbors.[Gib98, Tem07, Liu08, iC06])

• A formalism for expressing topics, themes and discourse structure is assumed.

– Topics, themes and general discourse structure are concepts that are probed at the more abstract levels of linguistic structure. A particularly appealing form of these, at least from the algorithmic, computer-science perspective, is Mel'čuk's Meaning-Text Theory (MTT).[MP87, Kah03] The reason for this appeal is that the theory is based on a sequence of transformations on graphs (some of these apparently being 'natural transformations' in the category-theoretic sense), and thus amenable to precise formalization and an algorithmic treatment. A very short review of MTT is provided in an appendix.
– The formalism, insofar as it is ultimately graphical in nature, does not really extend much past that required for describing syntactic disjuncts or semantic relationships. Rather, the point here is that one can have not just 2-point relationships, such as eat(X, Y), but more generally, n-point relations which themselves may have some internal structure. Thus, one may write r(x_1, ..., x_n) for an n-point relation (or constraint) between objects, but also, specific relations may themselves have a hierarchical structure r(x_1, ..., r_k(y_1, ..., y_m), ..., x_n) encapsulating a required graphical sub-structure.

The above summarizes the basic software components and theoretical linguistic infrastructure that is presumed, entering into the exercise. To summarize the above from a computational or algorithmic viewpoint, it is presumed that linguistic relationships and dependencies can be captured in the form of relations r(x_1, ..., x_n) between items x_1, ..., x_n, together with a parser (or pattern-matcher or constraint-solver) that, given a lexis of relations and an input, can find a set of relations that, like puzzle-pieces, assemble to reproduce the input.

To be concrete, a Link Grammar connector S- can be more abstractly written as a relation r(w_l, w_c, t=S), denoting that the current word w_c must attach to a word w_l to the left, using a connector of type S. The Link Grammar constraint & specifies that several connectors must be present and connected, and might be written as a relation r(c_1, c_2) where each c_k is a connector. The hierarchical nesting arises from the need to represent a Link Grammar disjunct. So, a transitive verb, normally expressed as S- & O+, might now be written as r(c_1(w_l, w_c, t=S), c_2(w_c, w_r, t=O)), indicating that the verb w_c must have a subject w_l on the left and an object w_r on the right.
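One possible, purely illustrative rendering of this relational notation in code is sketched below. The class and field names are our own, not part of Link Grammar or of the notation above; the point is only that connectors, disjuncts and semantic relations can share one nested data structure:

```python
from dataclasses import dataclass
from typing import Tuple, Union

@dataclass(frozen=True)
class Connector:
    """A typed attachment between two word slots, r(w_left, w_right, t=T)."""
    left: str       # word slot on the left ("*" marks the current word)
    right: str      # word slot on the right
    link_type: str  # e.g. "S" for subject, "O" for object

@dataclass(frozen=True)
class Relation:
    """An n-point relation whose arguments may themselves be relations."""
    name: str
    args: Tuple[Union["Relation", Connector, str], ...]

# A transitive verb, written S- & O+ in Link Grammar notation, becomes
# the nested relation r(c_1(w_l, w_c, t=S), c_2(w_c, w_r, t=O)):
transitive_verb = Relation("disjunct", (
    Connector(left="w_l", right="*", link_type="S"),  # subject expected on the left
    Connector(left="*", right="w_r", link_type="O"),  # object expected on the right
))

# The very same machinery expresses a semantic relation such as eat(X, Y):
eats = Relation("eat", ("X", "Y"))

print(transitive_verb)
print(eats)
```

Because arguments may themselves be relations, the hierarchical structure r(x_1, ..., r_k(y_1, ..., y_m), ..., x_n) falls out for free.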
However, in the abstract, such a relationship is not fundamentally different from one such as r(X, Y, s=eat) for the semantic relation eat(X,Y), or the relation r(f=magnitude, noun=rain, modifier=torrential) for the lexical function that associates magnitude modifiers with the object being modified.

The role of the parser is to take input at one abstraction level, and generate an output at the next abstraction level. At the lowest level, parsing consists of taking an ordered string of words, and generating a set of dependency relationships. At the next level, parsing consists of taking a set of dependency relationships, and extracting semantic relationships. In either case, the only allowed parses are those that fulfill the constraints imposed by the (currently known) relationships. These constraints include structural relations (e.g. being to the left of) as well as type relations (e.g. being a noun).

The point here is that, although we listed the assumed infrastructure in linguistic terminology, the actual required infrastructure is not inherently linguistic: rather, it is a system of constraints, each enforced with some probability or strength, used to analyze an input, and discover the (graphical, typed relationship) structure within it. We believe that this generic infrastructure can be applied to domains outside of linguistics, but will not dwell further on this point here.

Given the above linguistic infrastructure, what remains for a language learning system to learn is the linguistic content that characterizes a particular language. Specifically, given the assumed framework, key items to be learned are listed below.
These are listed roughly in order of sophistication and complexity, with the earlier elements being easier to learn, and being learnt earlier, than the later items. Although the previous section concluded with an abstract view of the infrastructure, as being merely a collection of structural relationships, we revert back to a set of concrete tasks to be achieved. Thus we have the following:
• A list of 'link types' that will be used to form 'disjuncts' must be learned.

– An example of a link type is the 'subject' link S. This link typically connects the subject of a sentence to the head verb. Given the normal English subject-verb word order, nouns will typically have an S+ connector, indicating that an S link may be formed only when the noun appears to the left of a word bearing an S- connector. Likewise, verbs will typically be associated with S- connectors. The current Link Grammar contains roughly one hundred different link-types, with additional optional subtypes that are used to further constrain syntactic structure. This number of different link types seems required simply because there are many relationships between words: there is not just a subject-verb or verb-object relationship, but also rather fine distinctions, such as those needed to form grammatical time, date, money, and measurement expressions, punctuation use, street-addresses, cardinal and ordinal relationships, proper (given) names, titles and suffixes, and other highly constrained grammatical constructions. This is in addition to the usual linguistic territory of needing to indicate dependent clauses, comparatives, subject-verb inversion, and so on. It is expected that a comparable number of link types will need to be learned.

– Some link types are rather strict, such as those that connect verb subjects and objects, while other types are considerably more ambiguous, such as those involving prepositions. This reflects the structure of English, where subject-verb-object order is fairly rigorously enforced, but the ordering and use of prepositions is considerably looser. When considering the looser cases, it becomes clear that there is no single, inherent 'right answer' for the creation and assignment of link types, and that several different, yet linguistically plausible linkage assignments may be made.
– The definition of a good link-type is one that leads the parser – applied across the whole corpus – to allow parsing to be successful for almost all sentences, and yet not to be so broad as to enable parsing of word-salads. Significant pressure must be applied to prevent excess proliferation of link types, yet not so much as to over-simplify things, and provide valid parses for unobserved, ungrammatical sentences.

• Lexical entries for different words must be learned.

– Typically, multiple connectors are needed to define how a word can link syntactically to others. Thus, for example, many verbs have the disjunct S- & O+, indicating that they need a subject noun to the left, and an object to the right. All words have at least a handful of valid disjuncts that they can be used with, and sometimes hundreds or even more. Thus, a "lexical entry" must be learned for each word, the lexical entry being a set of disjuncts that can be used with that word.

– Many words are syntactically similar; most common nouns can share a single lexical entry. Yet, there are many exceptions. Thus, during learning, there is a back-and-forth process of grouping and ungrouping words: clustering them so that they share lexical entries, but also splitting apart clusters when it is realized that some words behave differently. Thus, for example, the words "sing" and "apologize" are both verbs, and thus share some linguistic structure, but one cannot say "I apologized a song to Vicky" because apologize is not a ditransitive verb; if these two verbs were initially grouped together into a common lexical entry, they must later be split apart.

– The definition of a good lexical entry is much the same as that for a good link type: observed sentences must be parsable; random sentences mostly must not be; and excessive proliferation and complexity must be prevented.

We pause here to observe that the distinction between purely syntactic relations, and those with semantic overtones, can be blurry.
Basic semantic content can be derived from exceptions to over-generalized syntactic rules, or from a narrowing of the applicability of such rules to finer classes. Thus, for example, "the book chased a squirrel" is not likely to be observed during a scan of a large corpus. This can be dealt with at the lexical level: it is a mistake to place "book" into a class of "nouns"; rather, it belongs to a class of "inanimate nouns". Likewise, "chase" is not merely some transitive verb, but a verb that can only attach to animate subjects. With sufficient parsimony pressure, it seems reasonable that such finer semantic distinctions could be learned at what naively appears to be a purely syntactic level. The extent to which this might take place depends heavily on the metaphoric content of the input corpus: a literary review might contain a sentence "the book chased an absurd premise", suggesting naively that books perhaps are animate. Again, semantic content can appear as exceptions to generalized rules.

• Semantic relationships must be learned.

– The semantic relationship eat(X,Y) is prototypical. Foundationally, such a semantic relationship may be represented as a set whose elements consist of syntactico-semantic subgraphs. For the relation eat(X,Y), a subgraph may be as simple as a single (syntactic) disjunct S- & O+ for the normal word order "Ben ate a cookie", but it may also be a more complex set needed to represent the inverted word order in "a cookie was eaten by Ben".

– The task here is then to learn synonymous re-phrasings: not just sets of words that are synonyms, but phrases. These need not be centered on a verb, so that "Wyoming borders on Colorado" is synonymous with "Colorado is a neighbor of Wyoming", and both are captured by a prepositional relation "next to(Colorado, Wyoming)". Such re-phrasings are at a different abstraction level from the syntactic parse level.

– The set of all of these different subgraphs defines the semantic relationship. The subgraphs themselves may be syntactic (as in the examples above), or they may be other semantic relationships, or a mixture thereof.

– Not all re-phrasings are semantically equivalent. "Mr. Smith is late" has a rather different meaning from "The late Mr. Smith."

– In general, place-holders like X and Y may be words or category labels. In early stages of learning, it is expected that X and Y are each just sets of words. At some point, though, it should become clear that these sets are not specific to this one relationship, but can appropriately take part in many relationships. In the above example, X and Y must be entities (physical objects), and, as such, can participate in (most) any other relationships where entities are called for. More narrowly, X is presumably a person or animal, while Y is a foodstuff. Furthermore, as entities, it might be inferred when these refer to the same physical object (see the section 'reference resolution' below).

– Categories can be understood as sets of synonyms, including hyponyms (thus, "grub" is a synonym for "food", while "cookie" is a hyponym).

• Idioms and set phrases must be learned.

– English has a large number of idiomatic expressions whose meanings cannot be inferred from the constituent words (such as "to pull one's leg"). In this way, idioms present a challenge: their sometimes complex syntactic constructions belie their often simpler semantic content. On the other hand, idioms have a very rigid word-choice and word order, and are highly invariant. Set phrases take a middle ground: word-choice is not quite as fixed as for idioms, but, none-the-less, there is a conventional word order that is usually employed. Note that the manually-constructed Link Grammar dictionaries contain thousands of lexical entries for idiomatic constructions.
In essence, these are multi-word constructions that are treated as if they were a single word.

Each of the above tasks has already been accomplished and described in the literature; for example, automated learning of synonymous words and phrases has been described by Lin [LP01] and Poon & Domingos [PD09]. However, Lin and Poon & Domingos each assume the pre-existence of a syntactic dependency parser, rather than starting from ground zero. The authors are not aware of any attempts to learn all of these together, in one go, rather than presuming the pre-existence of dependent layers.

Furthermore, no previous work has attempted to attack language learning in a fully abstract structural setting, although in some sense, work within the framework of Markov Logic Networks (MLN), such as that of Poon & Domingos, comes close. Taken at an abstract level, MLN combines several distinct components: the notion of a network or graph (and thus similar to the notion of a Bayesian network; the term “Markovian” referring to a certain independence assumption), the idea that a network can express first-order logic (or, more generally, the internal language of category theory), and finally, that unknowns must be distributed uniformly in probability space (by applying a maximum entropy principle). One difficulty with MLN is common to all maximum-entropy (ME) approaches: solving for the Lagrange variables of the ME equations is an NP-hard problem; the potential function can be riddled with local maxima, and hill-climbing may be slow and converge to an inappropriate solution. Thus, while in principle it is useful to express fealty to the notion of evenly distributing unknowns, in practice the algorithms can be slow to converge. As a result, another commonly successful approach is that of Bayesian networks (such as Hidden Markov Models). Here, probabilities are assigned by a different algorithm (following from a naive application of Bayes’ theorem) that is usually far faster.
However, the application of naive Bayes almost immediately breaks down due to the need for independence assumptions. So, for example, while “Sri Lanka” is, formally, two distinct words, these cannot be treated independently of one another: it is nearly impossible, in English, to use the one word without the other, and so, from the combinatorial viewpoint, these must really count as only one word. Bayesian approaches typically have difficulty with this kind of counting. Thus, neither Bayesian nor MLN approaches are entirely satisfactory; a different algorithmic approach is sought, while keeping in place the fundamental concept of a network of relations.

Thus, to return to the abstract setting outlined at the end of the previous section: we wish to learn a lexis of relations r(x_1, ..., x_n). Treated as relations, these can be viewed as forming a “network”. Insofar as they are constraints with variables, they can be viewed as a form of “logic”, although the constraints are not those of first-order logic (as is amply clear from the Link Grammar operators “&” and “or”: these do not form a Boolean algebra, but rather a certain closed monoid). So here, we keep the general notion of a “network”, and recognize that the problem to be solved is to discover the relations, and to uniformly and fairly assign probabilities to each relation. The algorithm for discovering the relations, and for assigning appropriate probabilities and metrics, is discussed in the next section. But first, we continue expanding the linguistic horizon slightly.

While the learning of syntactic and semantic relations is the primary focus of the discussion here, the search for semantic structure must not end there; more is possible. In particular, natural language generation has a vital need for lexical functions, so that appropriate word choices can be made when vocalizing ideas. In order to truly understand text, one also needs, as a minimum, to discern referential structure, and sophisticated understanding requires discerning topics and themes. Discerning such structure seems to be a bare minimum for what it means to truly “comprehend” language.
The aspects discussed here are taken from Meaning-Text Theory (MTT), briefly summarized in the appendix. Because these more abstract structures can again be viewed as graphical relations, and as transformations on the structure of graphs, they too seem to be amenable to automated discovery.

A list of linguistic aspects that seem approachable is given below. We believe automated, unsupervised learning of these aspects is attainable, building on top of the ‘simpler’ linguistic structures described above. We are not aware of any prior work aimed at automatically learning these, aside from relatively simple, unsophisticated (bag-of-words style) efforts at topic categorization. Learning these may prove to be extremely challenging, as the layered, recursive approach requires that the earlier syntactic and semantic levels be relatively “noise-free” in order for more complex, more abstract structures to be discerned. It is not yet clear that the earlier stages can achieve enough accuracy to allow these later stages to proceed. What’s more, two of these aspects are where linguistics meets the reality of the external world, and are arguably where “understanding” takes over from “semantics”. This is discussed further below.

So:
• Lexical functions should be learned.
– Lexical functions are (named) classes of predicate-argument relationships. Thus, for example, the lexical function Magn() specifies a list of appropriate words for expressing magnitude. One then has Magn(rain) == torrential, hard; Magn(wind) == strong; and Magn(emotion) == hot. Similarly, the subject lexical function S() indicates authorship, so S(crime) == perpetrator, S(book) == author. Lists of synonyms and antonyms are also examples of lexical functions. In theories of semantics, such as Meaning-Text Theory, dozens of lexical functions are known and well-defined [MP87, Kah03, Mil06]; there may be more, but parsimony suggests that there cannot be thousands.
• Referential structure should be learned.
– References may be pronouns, or references to external objects. For example, in “
Patricia went to the store. She bought a dress.”, the word “she” refers to Patricia. In fact, the word “dress” may also refer to a specific object in the observer’s environment; however, such a reference cannot be obtained by purely corpus-linguistic methods. That said, in a prolonged discussion about a dress, it should be possible to infer, without external cues, that the entire discussion is about the same, singular dress, as opposed to a different dress every time the word appears.
– Referential structure requires a model of the external world, a model of “other” and possibly a model of “self”. That is, understanding that the word “dress” always refers to the same object requires that there be a model of the world in which there is one distinctive dress which can be the object of discussion. Likewise, many pronominal constructions require models of other actors, and also a model of self as an actor, in order to be properly understood.
• Topic themes/communicative intent should be learned.
– Semantic-communicative structure captures the communicative intent; it partitions a semantic structure graph into two parts: the ‘theme’ (what is being talked about) and the ‘rheme’ (what is being said about the theme). A semantic structure is just a graph of the semantic relationships extracted from a sentence. Consider the (rather sophisticated) sentence: “The senator harshly criticized the Government for its decision to increase income taxes”. This graph will contain semantic relations such as criticize(X,Y,Z), which is to be understood as ‘X criticizes Y for Z’; this relation has three arrows from ‘criticize’ to X, Y and Z. Other arrows link these in turn, to form a directed graph. One partitioning of this graph is to take “the Government’s decision to increase income taxes” as the theme, and “the senator’s harsh criticism” as the rheme. This partitioning is appropriate when entire paragraphs are devoted to the Government’s decision; the senator’s criticism is but one statement that pops up.
– Note that the above partitioning presumes that a rather sophisticated referential structure has already been extracted: the theme should appear in multiple sentences, and even in multiple paragraphs, and should already have been identified as ‘one and the same thing’ across these various appearances.
– This partitioning is not unique. One can also take “the senator” as the theme, and “criticism of government policy” as the rheme. Such a partitioning would be appropriate when reading the senator’s biography. In other words, topics and themes cannot be discerned from single sentences alone, but only become apparent from relationships across many paragraphs.

The point here is that there are deeper structures in text, and that these seem like they might be discernible in a mechanistic fashion. This belief is built on the observation that these structures also take the form of graphical relations. Whether the lower syntactic and semantic relations are clean and coherent enough for these more abstract structures to be discerned is unclear.

At any rate, the referential structure is where language meets reality. Natural language understanding, at this point, requires that a model of the external world be accessible, so that referents can be attached to the objects to which they might refer.
This is again an act of “parsing”: joining the unattached ends of referential connectors to the parts of the external world to which they might plausibly refer, and then checking the entire structure, by means of transitive reasoning, for consistency. We use the word “transitive” here in the same sense that one defines the “transitive closure” of a relation: so, for example, if John says he was bitten by an animal, and our model of the external world indicates that a dog was present, one may then conclude that the “animal” refers to that particular dog, by the application of a single joining relation: a dog is an animal. At any rate, these observations should make clear that this is where learning stops being about language, and instead turns into something else. The something else is different and harder: it requires models of the external world, and it requires reasoning, so that the model of the external world can be manipulated to align with the topic of conversation.

This also implies that the order of learning proposed here is reversed from the normal direction in humans: children first construct models of the external world based on visual inputs; these models are then adorned with associated sounds, which eventually resolve into words, due to their regularity. The regularity of syntax is the last to be discerned, not the first. Doing the same in a disembodied context is far, far harder: how does one discern that the external world consists of persistent objects that have names (nouns) and are in changing relationships to one another (verbs)? Because of this reversed order, it is quite possible that the learning proposed here will founder shortly after the syntactic stage, falling far short of creating a sensory perception model. At any rate, the proposal here entirely omits any mechanism for constructing a model of the external world, and for correlating language with it.
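The “transitive” joining step described above (the word “animal” resolving to a particular dog) can be sketched in a few lines. This is a toy illustration only: the is-a hierarchy and the world model given as a plain list of present objects are hypothetical stand-ins, not part of the proposal itself.

```python
# Toy sketch of resolving a referent by transitive is-a reasoning.
# The ISA table and the world model are invented illustrative data.

ISA = {  # hypothetical hypernym links: child -> parent
    "dog": "animal",
    "animal": "entity",
    "cookie": "food",
}

def is_a(word, category):
    """Follow is-a links transitively, e.g. dog -> animal -> entity."""
    while word is not None:
        if word == category:
            return True
        word = ISA.get(word)
    return False

def resolve_referent(mention, world_model):
    """Return objects in the world model that the mention could refer to."""
    return [obj for obj in world_model if is_a(obj, mention)]

world = ["dog", "cookie"]                 # objects known to be present
print(resolve_referent("animal", world))  # prints ['dog']
```

The single joining relation “a dog is an animal” is here just one step of the transitive closure of the `ISA` table; a real system would of course need a learned ontology and a full consistency check over the whole referential structure.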
The language learning approach presented here is novel in its overall nature. Each part of it, however, draws on prior experimental and theoretical research by others on particular aspects of language learning, as well as on our own previous work building computational linguistic systems. The goal is to assemble a system out of parts that are already known to work well in isolation.

Prior published research, from a multitude of authors over the last few decades, has already demonstrated how many of the items listed above can be learnt in an unsupervised setting (see e.g. [Yur98, KM04, LP01, CS10, PD09, SM07, KSPC13] for relevant background). All of the previously demonstrated results, however, were obtained in isolation, via research that assumed the pre-existence of surrounding infrastructure far beyond what we assume above. The approach proposed here may be understood as a combination, generalization and refinement of these techniques, to create a system that can learn, more or less ab initio, from a large corpus, with a final result of a working, usable natural language comprehension system.

Thus, in some sense, the approach advocated here can be considered to be a mash-up of techniques. However, a concomitant task is to formalize the underlying mathematics of the undertaking, so that it becomes clear what approximations are being taken, and what avenues remain unexplored. Some fairly specific directions in this regard suggest themselves, not the least of which is the need to write down appropriate formulations for the distribution of probabilities, and the inequalities that must hold in order for a learning event to occur.

Much of the prior research alluded to above makes use of probabilistic arguments, usually with the implicit desire of treating unknowns fairly and evenly. In Bayesian arguments, this amounts to a statement about priors and assumptions about independence. In maximum-entropy methods, the goal is to explicitly distribute unknown probabilities as evenly as possible.
Quite often, an approach is ad hoc, with some arbitrary but plausible metric providing a utility function that can be maximized or minimized. Each approach has its pros and cons: independence assumptions get Bayesian methods into trouble; the complexity of the entropic partition function can sometimes make maximum-entropy methods intractable; and the ad hoc nature of ad hoc approaches stymies a broader theoretical vision for the correct generalization of a phenomenon.

The approach advocated below takes a pragmatic stance. That is, while maximum-entropy principles would seem to provide the correct theoretical framework, they also require that it be clearly understood what is being counted (so that the entropy can be correctly measured). Thus, there is room for applying more traditional probabilistic reasoning, and even ad hoc simplifications and short-cuts. As in most scientific disciplines, progress here is best achieved by coupling experimental exploration to theoretical and mathematical development.
On an abstract conceptual level, the approach proposed here depicts language learning as an instance of a general learning loop such as:

1. Observe structural relationships between linguistic entities (such as words, or other entities described in previous sections). Find frequently occurring and novel relationships: this can be done by means of mutual information (for example), which adjusts for the novelty of a (conditional) relationship by properly weighting it by the occurrence frequency of the conditions in other contexts. That is, mutual information is one good way to pluck novel relationships out of a sea of white noise.

2. Give each structural relationship a (unique) name. The name is required so that the relationship can be counted, held, and worked with.

3. Treat each structural relationship as a constraint: relations that are not seen are assumed to be prohibited. That is, there is no corpus of grammatically incorrect sentences; rather, incorrectness is inferred from silence. New inputs are grammatically “parsable” insofar as they are consistent with the relationships (constraints) that have been observed before.

4. Group together similar structural relationships. This is the application of the law of parsimony, or of “Occam’s razor”: simply making a list of all possible observed relationships, as suggested by steps 1 and 2, can result in very long lists; a shorter description is desired. That is, one valid way of specifying a language is to provide an infinite list of grammatically correct sentences. Such a list is not at all compact; one wishes to group together similar sentences. Steps 1 and 2 suggest how to find the patterns about which one should group; there remains the actual task of grouping, which is this step. Grouping can be done using ad hoc distance metrics, to discover similar things. Grouping can also be done by entropy minimization (not maximization!) methods: cutting a list down to N items from 2N items by grouping reduces the entropy by log 2N − log N = log 2.
5. For each such grouping, make a category label, and add it to the lexis of expected relations.

6. Return to Step 1, and restart observations, but this time doing so in terms of the known, expected relations. So, for example, if the previous round of observations led to the discovery of groupings such as nouns and determiners, and the fact that these occur in immediate proximity to one another, then this should be taken as a “known” aspect of language. Armed with this relationship, perhaps other relationships can now become clear, such as that nouns can sometimes be subjects, and sometimes objects of verbs. With each iteration, new relationships presumably emerge; however, they cannot become visible or clear until all “known” aspects are already accounted for. Learning a set of constraints sets a new baseline; deviations from the new baseline, if they are strong enough, are candidates for new relations. Thus the iterative nature of the algorithm.

It stands to reason that the result of this sort of learning loop, if successful, will be a hierarchically composed collection of linguistic relationships possessing the following Linguistic Coherence Property:
Linguistic entities are reasonably well characterizable in terms of the compactly describable patterns observable in their relationships with other linguistic entities.

This sort of property has been observed to hold for many linguistic entities, an observation dating back at least to Saussure [dS77] and the start of structuralist linguistics. It is basically a fancier way of saying that the meanings of words, and other linguistic constructs, may be found via their relationships to other words and linguistic constructs. We are not committed to structuralism as a theoretical paradigm, and we have considerable respect for the aid that non-linguistic information – such as the sensorimotor data that comes from embodiment – can add to language, as stressed in prior publications [Goe08]. However, the potential utility of non-linguistic information for language learning does not imply the impossibility or infeasibility of learning language from corpus data alone. It is inarguable that non-linguistic relationships comprise a significant portion of the everyday meaning of linguistic entities; but redundancy is prevalent in natural systems, and we believe that purely linguistic relationships may well provide sufficient data for the learning of natural languages. If there are some aspects of natural language that cannot be learned via corpus analysis, it seems difficult to identify what these aspects are via armchair theorizing, and likely that they will only be accurately identified by pushing corpus linguistics as far as it can go.

This generic learning process is a special case of the general process of symbolization, described in [Goe94] and elsewhere as a key aspect of general intelligence. In this process, a system finds patterns in itself and its environment, and then symbolizes these patterns via simple tokens or symbols that become part of the system’s native knowledge representation scheme (and hence parts of its “metalanguage” for describing things to itself).
Having represented a complex pattern as a simple symbolic token, the system can then easily look at other patterns involving this pattern as a component.

Note that in its generic format as stated above, the “language learning loop” is not restricted to corpus-based analysis, but may also include extralinguistic aspects of usage patterns, such as gestures, tones of voice, and the physical and social context of linguistic communication. Linguistic and extra-linguistic factors may come together to comprise “usage patterns.” However, the restriction to corpus data does not necessarily denude the language learning loop of its power; it merely restricts one to particular classes of usage patterns, whose informativeness must be empirically determined.

In principle, one might be able to create a functional language learning system based only on a very generic implementation of the above learning loop. In practice, however, biases toward particular sorts of usage patterns can be very valuable in guiding language learning. In a computational language learning context, it may be worthwhile to break down the language learning process into multiple instances of the basic language learning loop, each focused on different sorts of usage patterns, and coupled with each other in specific ways. This is in fact what we will propose here.

Specifically, the language learning process proposed here involves:
• One language learning loop for learning purely syntactic linguistic relationships (such as link types and lexical entries, described above), which are then used to provide input to a syntax parser.
• One language learning loop for learning higher-level “syntactico-semantic” linguistic relationships (such as semantic relationships, idioms, and lexical functions, described above), which are extracted from the output of the syntax parser.
• One language learning loop for associating the resulting syntactico-semantic relationships to a model of the external world.

These three loops are not independent of one another; the second loop can provide feedback to the first, regarding the correctness of the extracted structures; then, as the first loop produces more correct, confident results, the second loop can in turn become more confident in its output. Likewise, the third loop selects proper interpretations for the output of the second loop. In this sense, the three loops attack the same sort of slow-convergence issues that ‘deep learning’ tackles in neural-net training.

The syntax parser itself, in this context, is used to extract directed acyclic graphs (dags), usually trees, from the graph of syntactic relationships associated with a sentence. These dags represent parses of the sentence. So the overall scope of the learning process proposed here is to learn a system of relationships that displays appropriate coherence and that, when applied by an appropriate parser to the inputs from the previous layer, will yield parse trees that reflect the information content in the input.
The sensory input at the lowest layer is raw text; the output is parse trees, which are then fed as input to the next layer. The process is repeated until a sufficiently abstract form is obtained, such that it can be correlated with a model of the external world.
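The generic learning loop described above (observe, name, constrain, group, iterate) can be caricatured in code. The sketch below is purely schematic: the observation function (adjacent-word co-occurrence) and the grouping function (grouping by shared right-hand word) are toy stand-ins for the MI-based observation and clustering the text actually proposes.

```python
# Schematic sketch of the iterative learning loop: observe relations not yet
# "known", group them, record them, and repeat in terms of the enlarged lexis.
# All data and grouping heuristics here are toy illustrations.

def observe_relations(corpus, known):
    """Step 1 (toy): collect adjacent word pairs not already in `known`."""
    rels = set()
    for sentence in corpus:
        words = sentence.split()
        for a, b in zip(words, words[1:]):
            if (a, b) not in known:
                rels.add((a, b))
    return rels

def group_relations(rels):
    """Step 4 (toy): group relations that share a right-hand word."""
    groups = {}
    for a, b in rels:
        groups.setdefault(b, set()).add((a, b))
    return groups

corpus = ["big car", "fast car", "big bicycle"]
known = set()
groups = {}
for iteration in range(2):            # Step 6: repeat with the enlarged lexis
    new = observe_relations(corpus, known)
    if not new:                       # nothing novel left to explain
        break
    groups = group_relations(new)     # Step 4: cluster the novel relations
    known |= new                      # Steps 2/5: name and record them
print(sorted(groups))                 # prints ['bicycle', 'car']
```

In a real implementation, "novelty" would be measured by mutual information rather than simple set membership, and the grouping step would be a genuine clustering over the observed usage patterns.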
The process of learning syntax from a corpus may be understood fairly directly in terms of entropy maximization. As a simple example, consider the measurement of the entropy of the arrangement of words in a sentence. To a fair degree, this can be approximated by the sum of the mutual entropy between pairs of words. Yuret showed that by searching for and maximizing this sum of entropies, one obtains a tree structure that closely resembles that of a dependency parse [Yur98]. That is, the word pairs with the highest mutual entropy are more or less the same as the arrows in a dependency parse, such as that shown in figure 1. Thus, an initial task is to create a catalog of word pairs with a large mutual entropy (mutual information, or MI) between them. This catalog can then be used to approximate the most likely dependency parse of a sentence.

The link types of such an unlabeled parse tree should be taken as unique to each word pair; that is, there is a unique link type to connect any two (connectable) words. For a typical modern language, this implies many millions of distinct link types; from the syntactic viewpoint, this is intolerable: it is an overly complex description of language. The immediately obvious course of action is to somehow group together different link types, reducing their number. But if different words share a common link type, then perhaps the words should also be grouped together into common classes. Such grouping presumably results in the automated discovery of something similar to part-of-speech groupings. So, for example, the computation of word-pair MI is likely to reveal the following high-MI word pairs: “big car”, “fast car”, “expensive car”, “red car”. (Such word pairs have been previously observed in earlier MI experiments.) It is intuitively obvious that one may group together the words big, expensive, fast and red into a single category, interpreted as modifiers to car.
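The initial task, cataloguing high-MI word pairs, can be sketched as follows. The toy corpus and the restriction to adjacent pairs are simplifying assumptions made for illustration; the pointwise MI of a pair (x, y) is log p(x,y) / (p(x) p(y)).

```python
# Minimal sketch of building a catalog of word pairs ranked by pointwise
# mutual information. Toy corpus; only adjacent pairs are counted.
import math
from collections import Counter

corpus = [
    "big car", "fast car", "expensive car", "red car",
    "big bicycle", "fast bicycle",
]

unigrams, pairs = Counter(), Counter()
for sentence in corpus:
    words = sentence.split()
    unigrams.update(words)
    pairs.update(zip(words, words[1:]))   # adjacent pairs only, for simplicity

n_uni = sum(unigrams.values())
n_pair = sum(pairs.values())

def pmi(x, y):
    """Pointwise MI: log2 of joint probability over independent probabilities."""
    p_xy = pairs[(x, y)] / n_pair
    return math.log2(p_xy / ((unigrams[x] / n_uni) * (unigrams[y] / n_uni)))

# The catalog: observed pairs, highest-MI first.
catalog = sorted(pairs, key=lambda p: pmi(*p), reverse=True)
print(round(pmi("big", "car"), 3))  # about 1.585, i.e. log2(3)
```

On a realistic corpus, one would use a co-occurrence window rather than strict adjacency, and smooth the counts; but the catalog produced this way is exactly the input that an MST-style dependency approximation needs.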
But how might the correctness of such a grouping be automatically verified? The answer is relatively straightforward: the same modifier grouping can be observed acting on other nouns: e.g. “big bicycle”, “fast bicycle”, etc. Two effects are at play here: a reinforcement of the correctness of the original grouping of modifiers, but also the suggestion that perhaps cars and bicycles should be grouped together. Superficially, it appears that one can discover two classes of words from this example: modifiers and nouns; crudely put, parts of speech.

More importantly, the discovery comes about through a reduction in the total number of syntax rules at play: rather than having to “remember” (log, record) millions of unique word pairs, it is sufficient to remember a smaller number of word classes, and a smaller number of link types to connect them. Entropy is defined as the logarithm of the total number of states; complexity may be defined as the entropy in a set of rules; such clustering is a reduction of the total complexity of the syntactic description. This is biologically plausible: the human mind does not maintain millions of syntactic relations, but an apparently much smaller set, possibly in the hundreds, or fewer.

There is an interesting mathematical foundation for understanding the role of link types; it comes from categorial grammar [KSPC13]. The link between two word classes carries a type; the type of that link is defined by these two classes. In this example, a link between a modifier and a noun would be a type denoted as M\N in categorial grammar, M denoting the class of modifiers, and N the class of nouns. In Link Grammar, this type name is replaced by a shorter link name, without a slash, but it is the same thing. (So, for this example, the existing Link Grammar dictionaries use the A link for the M\N type, with A meant to conjure up ‘adjective’ as a mnemonic.)
The short link name is a boon for readability, as categorial grammars usually have very complex-looking link-type names: e.g. (NP\S)/NP for the simplest transitive verbs. The point being made here is that typing and type theory [Pro13] provide a good foundation for dealing with the difficulty of ‘naming things’ discovered through classification and clustering. The contents of a cluster are, in a certain sense, ad hoc, and based on the texts that have been ingested. The relations between clusters, however, are not: not only are they dictated by language, but they also come with an algebra describing how they combine: this is type theory. In that sense, typing seems to be an inherent part of language; type theory appears to provide the correct formalization for discussing it.

The Link Grammar dictionaries contain lists of disjuncts, not lists of word pairs. The last step of learning a workable grammar is then to discover the disjuncts. This may be done by performing a minimum-spanning-tree (MST) parse of input text [MPRH05, MLP06], driven entirely by the mutual information obtained between word pairs. Given that each link is implicitly labeled by the two words it joins, the word connectors are trivially extracted as the link type, together with a direction indicator. A disjunct is then simply the ordered list of all of the connectors that land on a given word. Thus, disjuncts can be extracted on a sentence-by-sentence basis after a pass through an MST parser. This then sets the stage for the next step of pattern recognition.

Given a single word, appearing in many different sentences, one should presumably find that this word only makes use of a relatively small, limited set of disjuncts. It is then a counting exercise to determine which disjuncts occur the most often for this word, and, more, what the disjuncts’ mutual information should be. The set of these disjuncts then forms this word’s lexical entry.
Similar to the discovery of high-MI word pairs, this devolves into another “counting exercise”. Because the structures being discovered are now subgraphs, instead of word pairs, this is sometimes called “frequent subgraph mining”. This term is somewhat misleading: it is not the absolute frequency of occurrence of the subgraph that is important, but its relative (conditional) frequency, conditioned on the frequency of the other parts of the graph that it connects to. The logarithm of this conditional probability is called the relative entropy, or mutual information, and so the counting exercise has again devolved into the computation of the MI of structures extracted from a corpus. An appendix gives formulas to make these words precise; it is given because a proper definition appears to be rare in the linguistic literature.

At this point, a second clustering step may be applied: it is presumably noticeable that many words use more-or-less the same disjuncts in syntactic constructions. These can then be grouped into a common lexical entry. Given that a different set of word groupings (into parts of speech) was previously generated, one may ask: how does that grouping compare to this grouping? Is it close, or can the groupings be refined? If the groupings cannot be harmonized, then perhaps there is a certain level of detail that was previously missed: perhaps one of the groups should be split into several parts. Conversely, perhaps one of the groupings was incomplete, and should be expanded to include more words. Thus, there is a certain back-and-forth feedback between these different learning steps, with later steps reinforcing or refining earlier steps, forcing a new revision of the later steps. The precise refinement of how this is to be done awaits experimental trials.
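The extraction of disjuncts from an MST parse, as described above, can be sketched as follows. The example sentence, its links, and the connector-naming scheme (the joined word plus a “+”/“−” direction mark) are illustrative stand-ins for the implicit word-pair link types; the real link names would be the word pairs themselves.

```python
# Sketch of extracting disjuncts from an (assumed, already-computed) MST parse.
# Each link is implicitly typed by the word pair it joins; a disjunct for a
# word is the ordered list of connectors landing on it, with direction marks.
from collections import Counter

# Hypothetical MST links for a toy sentence, as (left_index, right_index) pairs.
words = ["the", "dog", "chased", "a", "cat"]
links = [(0, 1), (1, 2), (2, 4), (3, 4)]

def disjuncts(words, links):
    per_word = {i: [] for i in range(len(words))}
    for l, r in sorted(links):
        # Connector named by the joined word: '-' points left, '+' points right.
        per_word[r].append(words[l] + "-")
        per_word[l].append(words[r] + "+")
    return {words[i]: tuple(cs) for i, cs in per_word.items()}

# The "counting exercise": tally (word, disjunct) pairs across many sentences.
counts = Counter()
counts[("dog", disjuncts(words, links)["dog"])] += 1
print(disjuncts(words, links)["dog"])  # prints ('the-', 'chased+')
```

Accumulating such counts over a large corpus, and then computing the MI of each (word, disjunct) pair against the rest of the graph, yields the word's candidate lexical entry.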
A recognized difficulty with the direct application of Yuret’s observation (that the high-MI word-pair tree is essentially identical to the dependency parse tree) is the flexibility of the preposition in the English language [KM04]. The preposition is so widely used, in such a large variety of situations and contexts, that the mutual information between it and any other word or word-set is rather low (it is uniformly distributed, and thus carries little information). The two-point, pair-wise mutual entropy provides a poor approximation to what the English language is doing in this particular case. It appears that the situation can be rescued with the use of a three-point mutual information (a special case of interaction information [Bel03]). The discovery and use of such constructs is described in [PD09]. A similar, related issue can be termed “the richness of the MV link type in Link Grammar”. This one link type, describing verb modifiers (which include prepositions), can be applied in a very large class of situations; as a result, discovering this link type, while at the same time limiting its deployment to only grammatical sentences, may prove to be a bit of a challenge. Even in the manually maintained Link Grammar dictionaries, it can present a parsing challenge, because so many narrower cases can often be treated with an MV link. In summary, some constructions in English are so flexible that it can be difficult to discern a uniform set of rules for describing them; certainly, pair-wise mutual information seems insufficient to elucidate these cases.

Curiously, these more challenging situations occur primarily with more complex sentence constructions. Perhaps the flexibility is associated with the difficulty that humans have with composing complex sentences; short sentences are almost ‘set phrases’, while longer sentences can be a semi-grammatical jumble.
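The three-point quantity mentioned above can be sketched in its pointwise form. Sign conventions for interaction information vary across the literature; the version below follows the inclusion-exclusion pattern over log-probabilities, and all of the probabilities are invented toy values, not corpus statistics.

```python
# Sketch of pointwise two-point MI versus pointwise three-point (interaction)
# information. Sign conventions vary; this uses the inclusion-exclusion form.
import math

def pmi2(p_xy, p_x, p_y):
    """Ordinary two-point pointwise mutual information."""
    return math.log2(p_xy / (p_x * p_y))

def pmi3(p_xyz, p_xy, p_xz, p_yz, p_x, p_y, p_z):
    """Pointwise interaction information, by inclusion-exclusion over the
    marginal, pairwise, and triple probabilities."""
    return math.log2((p_xy * p_xz * p_yz) / (p_x * p_y * p_z * p_xyz))

# Invented toy numbers for a triple like (verb, preposition, noun), where the
# preposition alone is nearly uniformly distributed over contexts.
p_x, p_y, p_z = 0.01, 0.05, 0.01          # hypothetical word frequencies
p_xy, p_xz, p_yz = 0.002, 0.0002, 0.002   # hypothetical pair frequencies
p_xyz = 0.0002                            # hypothetical triple frequency

print(pmi2(p_xz, p_x, p_z))               # the weak verb-noun pairwise signal
print(pmi3(p_xyz, p_xy, p_xz, p_yz, p_x, p_y, p_z))
```

The point of the comparison is only that the three-point quantity draws on the pair and triple counts jointly, and so can register a cohesive verb-preposition-noun construction that pairwise MI alone under-reports.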
In any case, some of the trouble might be avoided by limiting the corpus to smaller, easier sentences, perhaps by initially working with children's literature.
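For concreteness, the three-point interaction information mentioned above can be sketched as a direct counting exercise. The formula below is one common pointwise form (sign conventions vary across the literature), and the restriction to adjacent word triples is purely a simplifying assumption:

```python
import math
from collections import Counter

def interaction_information(sentences):
    """Pointwise interaction information for adjacent word triples
    (a, b, c), a three-way generalization of pairwise MI:
      i(a;b;c) = log2[ p(abc) p(a) p(b) p(c) / (p(ab) p(bc) p(ac)) ]
    This is only a sketch of the counting involved, not a fixed
    definition."""
    uni = Counter()
    adj = Counter()   # p(ab), p(bc): adjacent pairs
    skip = Counter()  # p(ac): words two positions apart
    tri = Counter()
    for s in sentences:
        uni.update(s)
        adj.update(zip(s, s[1:]))
        skip.update(zip(s, s[2:]))
        tri.update(zip(s, s[1:], s[2:]))
    n1, n2, ns, n3 = (sum(c.values()) or 1 for c in (uni, adj, skip, tri))
    scores = {}
    for (a, b, c), k in tri.items():
        num = (k / n3) * (uni[a] / n1) * (uni[b] / n1) * (uni[c] / n1)
        den = (adj[(a, b)] / n2) * (adj[(b, c)] / n2) * (skip[(a, c)] / ns)
        scores[(a, b, c)] = math.log2(num / den)
    return scores

corpus = [["ran", "in", "the"], ["sat", "in", "the"], ["ran", "in", "a"]]
triple_scores = interaction_information(corpus)
```

Triples through a preposition (verb, in, noun) get scored as a unit, which is exactly the information that the pairwise score loses when the middle word links to everything.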
We now reiterate the syntactic learning process described above in a more systematic way. By getting more concrete, we also make certain assumptions and restrictions, some of which may end up being changed or lifted in the course of implementation and detailed exploration of the overall approach proposed here. What is discussed in this section is merely one simple, initial approach to concretizing the core language learning loop we envision in a syntactic context.

Syntax, as we consider it here, involves the following basic entities:
• words
• categories of words
• “co-occurrence links”, each one defined as (in the simplest case) an ordered pair or triple of words, labeled with frequency counts and mutual information
• “syntactic link types”, each one defined by the two sets of words that are connected
• “disjuncts”, each one associated with a particular word w, and consisting of an ordered set of link types that connect to the word w. That is, each of these links contains at least one word-pair containing w as first or second argument. (This nomenclature comes from Link Grammar; each disjunct is a conjunction of link types. A word is associated with a set of disjuncts. In the course of parsing, one must choose between the multiple disjuncts associated with a word, to fulfill the constraints required of an appropriate parse structure.)

An elementary version of the basic syntactic language learning loop described above would take the following form:
1. Search for high-MI word pairs. Define an initial set of word-pair link types as the given co-occurrence links.
2. Cluster words into categories based on the similarity of their associated usage links. Note that this will likely be a tricky instance of clustering, and classical clustering algorithms may not perform well. One interesting, less standard approach would be to use OpenCog's MOSES algorithm to learn an array of program trees, each one serving as a recognizer for a single cluster.
3. Define initial syntactic link types from categories that are joined by large bundles of usage links.
• That is, if the words in category C1 have a lot of usage links to the words in category C2, then create a syntactic link type whose elements are (w1, w2), for all w1 ∈ C1, w2 ∈ C2. Remove the word-pair link types associated with (w1, w2), as these are now all subsumed by the new link type.
4. Associate each word with an extended set of usage links, consisting of its existing usage links, plus the syntactic links that one can infer for it based on the categories the word belongs to. These are the “disjuncts” of Link Grammar. Typically, determiners and adjectives have just one link (linking to the modified noun), nouns have one or two links (one to an adjective, one to a verb), while verbs typically have three or four links (one to a subject, one to an object, possibly a link to a particle or preposition or other adverb, and one to the head).
• For example, suppose cat ∈ C1 and C1 has syntactic link L. Suppose (cat, eat) and (dog, run) are both in L. Then if there is a sentence “The cat likes to run”, the link L lets one infer the syntactic link cat −L→ run. The corpus is re-scanned to obtain the frequency of this syntactic link, as well as its mutual information (the logarithm of the conditional probability).
• Given the sentence “The cat likes to run in the park,” a chain of syntactic links such as cat −L→ run −L→ park may be constructed.
5. Look for commonality between disjuncts. This may indicate clusterings of words or link-types that were previously missed; alternately, it may indicate that previous clusterings were excessively aggressive.
6. Return to Step 2, but using the extended set of usage links produced in Step 4, with the goal of refining the clusters, the set of link types, and the set of disjuncts for accuracy. Initially, all categories contain one word each, and there is a unique link type for each pair of categories.
This is an inefficient representation of language, and so the goal of clustering is to have a relatively small set of clusters and link types, with many words/word-pairs assigned to each. This can be done by maximizing the sum of the logarithms of the sizes of the clusters and link types; that is, by maximizing entropy. Since the category assignments depend on the link types, and vice versa, a (very?) large number of iterations of the loop are likely to be required. Based on the current Link Grammar English dictionaries, one expects to discover hundreds of link types (or more, depending on how subtypes are counted), and perhaps a thousand word clusters (most of these corresponding to irregular verbs and idiomatic phrases).

Many variants of this same sort of process are conceivable, and it is currently unclear which variant will work best. But this kind of process is what one obtains when one implements the basic language learning loop described above on a purely syntactic level.

How might one integrate semantic understanding into this syntactic learning loop? Once one has semantic relationships associated with a word, one uses them to generate new “usage links” for the word, and includes these usage links in the algorithm from Step 1 onwards. This may be done in a variety of different ways, and one may give different weightings to syntactic versus semantic usage links, resulting in the learning of different links.

The above process would produce a large set of syntactic links between words. A further series of steps then follows. These may be carried out concurrently with the above steps, as soon as Step 4 has been reached for the first time.
1. Given the set of disjuncts, one carries out parsing using a process such as link parsing or word-grammar parsing, thus arriving at a set of parses for the sentences in one's reference corpus. Alternative parses may be ranked according to the total mutual information, summed over all disjuncts.
Each parse may be viewed as a directed acyclic graph (DAG), usually a tree, with words at the nodes and syntactic-link-type labels on the links.
2. This allows a different set of statistics to be gathered for each disjunct: how often it proves actually useful during (link-typed) parsing. That is, the initial probabilities and entropies for the disjuncts essentially followed from how often they are employed by an MST parser, which generates a spanning tree essentially without regard to the link types. Re-parsing, this time with actual link-type agreement enforced, will presumably give similar parses, but presumably more accurate ones. In addition, the usage probabilities will change as a result.
In particular, forcing link-type agreement might cause some words to be missed in the parse. For an MST parse, this is not an issue: one simply hunts for some high-MI connection, and attaches to that. With link types, there may be no valid linkage at all. This suggests the existence of a problem with the current link-type and disjunct assignments: they are somehow incomplete, if they fail to link all the words in the sentence. In essence, link parsing is stricter than MST parsing; the strictness is a source of feedback for validating the grammar.
3. One can now return to Step 2 using the new probabilities, which should suggest new and refined clusters.

Several subtleties have been ignored in the above, such as the proper discovery and treatment of idiomatic phrases, the discovery of sentence boundaries, the handling of embedded data (price quotes, lists, chapter titles, etc.), as well as the potential speed bump that prepositions present. Fleshing out the details of this loop into a workable, efficient design is the primary engineering challenge. This will take significant time and effort.

Syntactic relationships provide only the shallowest interpretation of language; semantics comes next.
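The MST parsing step referred to above can be sketched in a few lines. This toy version runs Prim's algorithm over pairwise MI scores and omits the projectivity (no-crossing-links) constraint that a real dependency parser would enforce:

```python
def mst_parse(sentence, mi):
    """Maximum-spanning-tree parse of one sentence: link word
    positions so that the summed pairwise MI is maximal (Prim's
    algorithm over the complete graph of positions).  Real MST
    parsers also enforce projectivity; that is omitted here."""
    def score(i, j):
        # look up MI in either word order; unknown pairs score -inf
        return mi.get((sentence[i], sentence[j]),
                      mi.get((sentence[j], sentence[i]), float("-inf")))
    in_tree = {0}
    links = []
    while len(in_tree) < len(sentence):
        best = max(((i, j) for i in in_tree
                    for j in range(len(sentence)) if j not in in_tree),
                   key=lambda e: score(*e))
        links.append(best)
        in_tree.add(best[1])
    return links

# Hypothetical MI scores, for illustration only.
mi = {("the", "cat"): 1.5, ("cat", "sat"): 2.1, ("the", "sat"): 0.2}
links = mst_parse(["the", "cat", "sat"], mi)
```

The link-typed re-parse described in Step 2 would then replace the bare `score` lookup with a check that a valid disjunct exists at each word, which is exactly where the stricter feedback comes from.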
One may view semantic relationships (including semantic relationships close to the syntax level, which we may call “syntactico-semantic” relationships) as ensuing from syntactic relationships, via a similar but separate learning process to the one proposed above. Just as our approach to syntax learning is heavily influenced by our work with Link Grammar, our approach to semantics is heavily influenced by our work on the RelEx system [RVG05, LGE10, GPPG06, GPA+10]. Systems [LGK+12] have also been written mapping the output of RelEx into even more abstract semantic form, consistent with the semantics of the Probabilistic Logic Networks [GIGH08] formalism as implemented in the OpenCog [HG08] framework. These systems are largely based on hand-coded rules, and thus are not in the spirit of language learning pursued in this proposal. However, they display the same structure that we assume here; the difference being that here we specify a mechanism for learning the linguistic content that fills in the structure via unsupervised corpus learning, obviating the need for hand-coding.

Specifically, we suggest that the discovery of semantic relations can proceed in a manner similar to the unsupervised discovery of synonyms, such as that described in [LP01], or its generalization from 2-point relations to 3-point and N-point relations, as described in [PD09]. These mechanisms allow the automatic, unsupervised recognition of synonymous phrases, such as “Texas borders on Mexico” and “Mexico is Texas' neighbor”, to extract the general semantic relation next_to(X,Y), and the fact that this relation can be expressed in one of several different ways.

Simplistically stated, the idea is that semantic learning can proceed by scanning the corpus for sentences that use similar or the same words, yet employ them in a different order, or have point substitutions of single words, or of small phrases. Sentences which are very similar, or identical, save for one word, offer up candidates for synonyms, or sometimes antonyms. Sentences which use the same words, but in seemingly different syntactic constructions, are candidates for synonymous sentences. These may be used to extract semantic relations: the recognition of sets of different syntactic constructions that carry the same meaning. In practice, the comparisons and search for similarity are not made on the raw text strings, but on the parsed forms of the sentences, so as to avoid issues of word alignment during comparison.
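The point-substitution idea can be reduced to a toy computation. The sketch below works on raw token sequences purely for brevity; as just noted, the real comparison would be made on parsed forms:

```python
from collections import defaultdict

def synonym_candidates(sentences):
    """Group words that appear in otherwise-identical sentences.
    Replacing each position in turn with a hole yields a template;
    words sharing a template are synonym (or antonym) candidates."""
    slots = defaultdict(set)
    for sent in sentences:
        for i, w in enumerate(sent):
            template = tuple(sent[:i]) + ("_",) + tuple(sent[i + 1:])
            slots[template].add(w)
    # keep only templates that were filled by more than one word
    return {t: ws for t, ws in slots.items() if len(ws) > 1}

corpus = [["the", "cat", "sat"], ["the", "dog", "sat"], ["a", "dog", "sat"]]
cands = synonym_candidates(corpus)
```

Here the template (the, _, sat) collects {cat, dog}, a plausible synonym (or at least same-category) pair, while (_, dog, sat) collects {the, a}; even this toy version begins to separate content-word substitutions from determiner substitutions.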
Parsing establishes a graph that provides a context for differences in subgraphs. In essence, similar parse structures must be recognized, and then word and parse-tree differences between otherwise-similar parse graphs are compared. There are two primary challenges: how to recognize similar graphs, and how to assign probabilities.

The work of [PD09] articulates solutions to both challenges. For the first, it describes a general framework in which relations such as next_to(X,Y) can be understood as lambda expressions λx.λy. next_to(x, y), so that one can employ first-order logic constructions in place of graphical representations. This is partly a notational trick; it just shows how to split up input syntactic constructions into atoms and terms, forming a term algebra with signature [BN99, Hod97]. For the second challenge, they show how probabilities can be assigned to the atoms and terms, by making explicit use of the notion of conditional random fields (or rather, a certain special case, termed Markov Logic Networks). Conditional random fields, or Markov networks, are a mathematical formalism that describes how entropy can be uniformly distributed across a graphical network where edges and vertices may both be typed, and range over a domain of values. As such, this generalizes a basic, fundamental theorem from information theory: the probability distribution that most evenly distributes unknowns or priors is the same as the probability distribution that maximizes the entropy [Ash65]. Unfortunately, the general theory has several drawbacks: it is quite abstract and dense, and, algorithmically, it falls into the NP-hard class of problems.

The procedures described in [LP01] thus provide a much simpler, easier-to-understand introduction to how semantic information can be extracted.
With that simplicity come two faults: the lack of a proper mathematical grounding means that it is not clear how to generalize the work to subgraphs of arbitrary shape (which is provided by Poon & Domingos), nor is the generalized probabilistic framework articulated. Instead, Lin gets by with a number of ad hoc metrics used to measure semantic similarity. This may be a reasonable approach: the ad hoc similarity metrics have the side effect of taking the NP-hard maximum-entropy algorithm and replacing it with a simpler, more rapidly convergent method.

The above can be used to extract synonymous constructions, and, in this way, semantic relations. However, neither of the above references deals with distinguishing different meanings for a given word. That is, while eats(X,Y) might be a learnable semantic relation, the sentence “He ate it” does not necessarily justify its use. Of course, “He ate it” is an idiomatic expression meaning “he crashed”, which should be associated with the semantic relation crash(X), not eat(X,Y). There are global textual clues that this may be the case: trouble resolving the reference “it”, and a lack of mention of foodstuffs in neighboring sentences. A viable yet simple algorithm for the disambiguation of meaning is offered by the Mihalcea algorithm [MTF04, Mih05, SM07]. This is an application of the (Google) PageRank algorithm to word senses, taken across words appearing in multiple sentences. The premise is that the correct word sense is the one that is most strongly supported by the senses of nearby words; a graph between word senses is drawn, and then solved as a Markov chain. In the original formulation, word senses are defined by appealing to WordNet, and the affinity between word senses is obtained via one of several similarity measures. Neither of these can be applied when learning a language de novo (one has neither a WordNet for the language, nor any similarity measures). Instead, these must both be deduced, again by clustering and splitting. So, for example, it is known that word senses correlate fairly strongly with disjuncts (based on the authors' unpublished experiments), and thus a reasonable first cut is to presume that every different disjunct in a lexical entry conveys a different meaning, until proved otherwise. The above-described discovery of synonymous phrases can then be used to group different disjuncts into a single “word sense”. Disjuncts that remain ungrouped after this process are then considered to have distinct senses, and so can be used as distinct senses in the Mihalcea network. Sense-similarity measures can then be developed by using the above-discovered senses, and measuring how well they correlate across different texts.
That is, if the word “bell” occurs multiple times in a sequence of paragraphs, it is reasonable to assume that each of these occurrences is associated with the same meaning. Thus, each distinct disjunct for the word “bell” can then be presumed to still convey the same sense. One now asks: what words co-occur with the word “bell”? The frequent appearance of “chime” and “ring” can and should be noted. In essence, one is once again computing word-pair mutual information, except that now, instead of limiting word-pairs to words that are near each other, they can instead involve far-away words.

This can be understood in a simple, intuitive fashion. Traditional dictionary entries are grouped according to the part of speech of a word: different parts of speech are associated with different word senses. The Link Grammar disjunct is like an extremely fine-grained part of speech: it distinguishes not only between noun and verb, but also between the contexts in which a word is used (transitive, ditransitive, with or without modifiers, quantifiers, determiners, particles, etc.). That word senses might correlate with this fine-grained part of speech should come as no surprise. Such correlation is not unique to Link Grammar; it should be directly observable in any dependency grammar. The correlation might be harder to detect in phrase-structure grammars, since lexical entries are not words but phrase structures, and thus it is not obvious how to correlate word senses to phrase structures.

One might then extend the lexical entry for “bell” to include a list of co-occurring words (and indeed, this is the slippery slope leading to set phrases and, eventually, idioms). Failures of co-occurrence can also further strengthen distinct meanings. Consider “he chimed in” and “the bell chimed”. In both cases, chime is a verb. In the first sentence, chime carries the disjunct
S- & K+ (here, K+ is the standard Link Grammar connector to particles), while the second has only the simpler disjunct S-. Thus, based on disjunct usage alone, one already suspects that these two have different meanings. This is strengthened by the lack of occurrence of words such as “bell” or “ring” in the first case, together with the frequent observation of words pertaining to talking.

There is one final trick that must be applied in order to get reasonably rapid learning; this can be loosely thought of as “the sigmoid-function trick of neural networks”, though it may also be manifested in other ways not utilizing specific neural-net mathematics. The key point is that semantics intrinsically involves a variety of uncertain, probabilistic and fuzzy relationships; but in order to learn a robust hierarchy of semantic structures, one needs to iteratively crispen these fuzzy relationships into strict ones.

In much of the above, there is a recurring need to categorize, classify and discover similarity. The naivest means of doing so is by counting, and applying basic probability (Bayesian, Markovian) to the resulting counts to deduce likelihoods. Unfortunately, such formulas distribute probabilities in essentially linear ways (i.e. they form a linear algebra), and thus have a rather poor ability to discriminate or distinguish (in the sense of receiver operating characteristics: of discriminating signal from noise). Consider the last example: the list of words co-occurring with chime, over the space of a few paragraphs, is likely to be tremendous. Most of this is surely noise. There is a trick for overcoming this that is deeply embedded in the theory of neural networks, and yet completely ignored in probabilistic (Bayesian, Markovian) networks: the sigmoid function. The sigmoid function serves to focus attention on a single stimulus and elevate its importance, while, at the same time, strongly suppressing all other stimuli.
In essence, the sigmoid function looks at two probabilities, say 0.55 and 0.45, and says “let's pretend the first one is 0.9 and the second one is 0.1, and move forward from there”. It builds a strong discrimination into all inputs. In standard, textbook probability theory, such discrimination is utterly unwarranted; it runs counter to probability theory. However, applying strong discrimination can help speed learning by converting certain vague impressions into certainties. These certainties may be correct or incorrect; it is the task of learning to distinguish the two. The point here is that this non-linear behavior provides a kind of amplification, which allows vague impressions to be converted into certainties that can then be built upon to obtain additional certainties.

Thus, in all of the above efforts to gauge the similarity between different things, it is useful to have a sharp yes/no answer, rather than a vague muddling with likelihoods. In some of the above-described algorithms, this sharpness is already built in: thus, Yuret approximates the mutual information of an entire sentence as the sum of the mutual information between word pairs; the smaller, unlikely corrections are discarded. Clearly, these must also be revived in order to handle prepositions. Something similar must also be done in the extraction of synonymous phrases, semantic relations, and meaning; that domain is much likelier to be noisy, and thus the need to discriminate signal from noise is that much more important.

We now provide a more detailed elaboration of a simple version of the general semantic learning process described above.
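Before turning to that elaboration, the sharpening just described can be illustrated numerically. The logistic gain below is an arbitrary choice, made only so that the pair (0.55, 0.45) comes out near (0.9, 0.1):

```python
import math

def sharpen(probs, gain=10.0):
    """Push a probability distribution toward certainty: apply a
    steep logistic to each log-odds value, then renormalize.  The
    gain controls how aggressively near-ties become near-certain."""
    logits = [gain * math.log(p / (1.0 - p)) for p in probs]
    exps = [math.exp(min(l, 700)) for l in logits]  # clamp overflow
    odds = [e / (1.0 + e) for e in exps]
    total = sum(odds)
    return [o / total for o in odds]

sharpened = sharpen([0.55, 0.45])
```

As the text notes, this is unwarranted as probability theory; its role is purely as an amplifier, letting a weak preference act as a firm decision in subsequent learning steps.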
The same caveat applies here as in our elaborated description of syntactic learning above: the specific algorithmic approach outlined here is a simple instantiation of the general approach we have in mind, which may well require refinement based on lessons learned during experimentation and further theoretical analysis.

One way to do semantic learning, according to the approach outlined above, is as follows:
1. An initial semantic corpus is posited, whose elements are parse graphs produced by the syntactic process described earlier.
2. A semantic relationship set (or rel-set) is computed from the semantic corpus, by calculating the frequent (or otherwise statistically informative, i.e. high-MI) subgraphs occurring in the elements of the corpus. Each node of such a subgraph may contain a word, a category or a variable; each node may be typed by the list of connections it is allowed to make. The links of the subgraph are labeled with (syntactic, or semantic) link types. Each parse graph is annotated with the semantic graphs associated with the words it contains (explicitly: each word in a parse graph may be linked via a ReferenceLink to each variable or literal within a semantic graph that corresponds to that word in the context of the sentence underlying the parse graph).
• For instance, the link combination v1 −S→ v2 −O→ v3 may commonly occur (representing the standard Subject-Verb-Object (SVO) structure).
• For this example, the sentence “the rock broke the window” would result in links of the form rock −ReferenceLink→ v1, connecting the word-instance nodes (the “rock” node) in the parse structure with variable nodes (such as v1) in this associated semantic subgraph.
3. Rel-sets are divided into categories based on the similarities of their associated semantic graphs.
• This division into categories manifests the sigmoid-function-style crispening mentioned above. Each rel-set will have similarities to other rel-sets, to varying fuzzy degrees. Defining specific categories turns a fuzzy web of similarities into crisp categorial boundaries; this involves some loss of information, but also creates a simpler platform for further steps of learning.
• Two semantic graphs may be called “associated” if they have a nonempty intersection. The intersection determines the type of association involved. Similarity assessment between graphs G1 and G2 may involve estimation of which graphs G1 and G2 are associated with, and in which ways. Intersection is done by finding the largest common subgraph.
• For instance, “The cat ate the dog” and “The frog was eaten by the walrus” both represent the semantic structure eat(cat,dog), in two different ways. In link-parser terminology, they do so respectively via the subgraphs G1 = v1 −S→ v2 −O→ v3 and G2 = v1 −S→ v2 −P→ v3 −MV→ v4 −J→ v5. These two semantic graphs will have a lot of the same associations. For instance, in our corpus we may have “The big cat ate the dog in the morning” (including big −A→ cat) and also “The big frog was eaten by the walrus in the morning” (including big −A→ frog), meaning that big −A→ v1 is a subgraph commonly associated with both G1 and G2. Due to having many commonly associated graphs like this, G1 and G2 are likely to be assigned to a common cluster.
• As in syntactic parsing, a reasonable metric for clustering can be obtained by applying pressure to reduce the overall complexity of the system. Overall complexity is obtained by counting: summing the total number of rules, relations, classes and class sizes needed to capture the content of the language. Insofar as the logarithm of a count is an entropy, this is an exercise in entropy minimization. The goal is to prune away relations which never occur (because they are semantic nonsense: “colorless green ideas”), leaving behind only those which can be used to express legitimate facts.
4. Nodes referring to these categories are added to the parse graphs in the semantic corpus. Most simply, a category node C is assigned a link of type L pointing to another node x, if any element of C has a link of type L pointing to x. (More sophisticated methods of assigning links to category nodes may also be worth exploring.)
• If G1 and G2 have been assigned to a common category C, then “I believe the pig ate the horse” and “I believe the law was invalidated by the revolution” will both appear as instantiations of the graph G3 = v1 −S→ believe −CV→ C. This G3 is compact because of the recognition of C as a cluster, leading to its representation as a single symbol. The recognition of G3 will occur in Step 2 the next time around the learning loop.
5. Return to Step 2, with the newly enriched semantic corpus.

As noted earlier, these semantic relationships may be used in the syntactic phase of language understanding in two ways:
• Semantic graphs associated with words may be considered as “usage links” and thus included as part of the data used for syntactic category formation.
• During the parsing process, full or partial parses leading to higher-probability semantic graphs may be favored.
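Step 2 of the loop above, the mining of frequent link patterns with words abstracted to variables, can be reduced to a toy sketch. Real rel-set mining works over subgraphs of arbitrary shape; this version counts only two-link chains, just enough to surface the SVO skeleton:

```python
from collections import Counter
from itertools import permutations

def frequent_chains(parses, min_count=2):
    """Reduced rel-set mining: each parse is a list of labeled edges
    (head, link_type, dependent).  Count two-edge chains that share a
    middle node, abstracting the words away so that only the
    link-type skeleton (e.g. v1 -S-> v2 -O-> v3) remains."""
    chains = Counter()
    for edges in parses:
        for (h1, l1, d1), (h2, l2, d2) in permutations(edges, 2):
            if d1 == h2:  # the two edges share a middle node
                chains[(l1, l2)] += 1
    return Counter({c: n for c, n in chains.items() if n >= min_count})

# Two hypothetical parses, each a subject/object edge list.
parses = [
    [("cat", "S", "ate"), ("ate", "O", "mouse")],
    [("rock", "S", "broke"), ("broke", "O", "window")],
]
relsets = frequent_chains(parses)
```

Both parses instantiate the chain (S, O), so that pattern survives the frequency cutoff; generalizing the match from chains to arbitrary common subgraphs is the step this sketch elides.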
Note that the above sequence of learning steps vaguely resembles the layering of “deep learning”, or of hierarchical modeling. That is, learning must occur at several levels at once, each level reinforcing, and making use of, results from another. Link types cannot be identified until word clusters are found, and word clusters cannot be found until word-pair relationships are discovered. However, once link types are known, these can then be used to refine the clusters and the selected word-pair relations.

The learning process described here builds up complex syntactic and semantic structures from simpler ones. To start it, all one needs are basic before-and-after relationships derived from a corpus. Everything else is built up from there, given the assumption of appropriate syntactic and semantic formalisms and a semantics-guided syntax parser.

However, for this bootstrapping learning to work well, one will likely need to begin with simple language, so that the semantic relationships embodied in the text are not that far removed from the simple before/after relationships. The complexity of the texts may then be ramped up gradually. For instance, the needed effect might be achieved by sorting a very large corpus in order of increasing reading level.
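The corpus-sorting idea can be sketched with a crude reading-level proxy; the score below (sentence length plus mean word length) is an ad hoc assumption, and any standard readability metric, such as Flesch-Kincaid, could be substituted:

```python
def by_reading_level(sentences):
    """Order a corpus from simple to complex, using sentence length
    plus mean word length as a rough reading-level proxy."""
    def level(sent):
        words = sent.split()
        return len(words) + sum(len(w) for w in words) / max(len(words), 1)
    return sorted(sentences, key=level)

corpus = ["The interdependence of constituents complicates parsing.",
          "The cat sat.",
          "Dogs run fast."]
graded = by_reading_level(corpus)
```

Feeding the learner `graded` in order approximates the children's-literature-first strategy suggested earlier.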
We propose a general algorithm through which syntax and semantics can be induced from a large text corpus. The algorithm builds on established ideas and demonstrated algorithms from the linguistics and machine-learning fields; the primary novelty proposed here is that these are mature enough to be hooked together, so that the results of induction at a lower layer can be used to provide input to a higher layer, and, conversely, higher layers can be used to guide learning in the lower layers.

The ordering of the layers proposed here is reversed from the ordering one might expect in embodied cognition, where one might first seek to associate single words with non-verbal sensory inputs, eventually building a model of the external world labeled with nouns for persistent objects, and verbs for actions. Such labeling provides the foundation for semantics in embodied cognition; syntax is learned only later, to encode semantics that single words alone cannot capture. Here, we reverse the process, attempting to learn syntax first, and only later the semantics, and possibly a world-model. Whether this is feasible remains unclear: neither form of learning, in one direction or the other, has ever been demonstrated. There are certainly any number of traps and pitfalls in the descriptions given above. Nonetheless, we believe that the state of the art, in both mathematical theory and in the power of computer systems, is sufficiently advanced that this can be attempted.

Initial experiments to test some of these hypotheses have been performed by the authors over the last several years. Work has begun to implement the ideas proposed here.
The work is slow, as it is under-staffed; the project described here is large, requiring man-years of effort to demonstrate even a rough prototype, and possibly dozens or hundreds to develop a polished system, with a fully developed theory supported by detailed published experiments.

Current work can be found in the OpenCog codebase, at http://github.com/opencog/opencog/, most specifically in the opencog/nlp/learn subdirectory, with supporting code in other directories. The overall OpenCog project is primarily focused on embodied cognition, and so both directions are being explored: learning language via embodied models of the external world, as well as learning language from corpus analysis, as described here.
References

[Ash65] Robert B. Ash. Information Theory. Dover Publications, 1965.
[Bel03] Anthony J. Bell. The co-information lattice. Somewhere or other, 2003.
[BN99] Franz Baader and Tobias Nipkow. Term Rewriting and All That. Cambridge University Press, 1999.
[CS10] Shay B. Cohen and Noah A. Smith. Covariance in unsupervised learning of probabilistic grammars. Journal of Machine Learning Research, 11:3117–3151, 2010.
[dS77] Ferdinand de Saussure. Course in General Linguistics. Fontana/Collins, 1977. Orig. published 1916 as “Cours de linguistique générale”.
[Gib98] Edward Gibson. Linguistic complexity: locality of syntactic dependencies. Cognition, 68:1–76, 1998.
[GIGH08] B. Goertzel, M. Ikle, I. Goertzel, and A. Heljakka. Probabilistic Logic Networks. Springer, 2008.
[Goe94] Ben Goertzel. Chaotic Logic. Plenum, 1994.
[Goe08] Ben Goertzel. A pragmatic path toward endowing virtually-embodied AIs with human-level linguistic capability. IEEE World Congress on Computational Intelligence (WCCI), 2008.
[GPA+10] Ben Goertzel, Cassio Pennachin, Samir Araujo, Ruiting Lian, Fabricio Silva, Murilo Queiroz, Welter Silva, Mike Ross, Linas Vepstas, and Andre Senna. A general intelligence oriented architecture for embodied natural language processing. In Proceedings of the Third Conference on Artificial General Intelligence. Springer, 2010.
[GPPG06] Ben Goertzel, Hugo Pinto, Cassio Pennachin, and Izabela Freire Goertzel. Using dependency parsing and probabilistic inference to extract relationships between genes, proteins and malignancies implicit among multiple biomedical research abstracts. In Proc. of BioNLP 2006, 2006.
[HG08] David Hart and Ben Goertzel. OpenCog: A software framework for integrative artificial general intelligence. In Proceedings of the First Conference on Artificial General Intelligence. IOS Press, 2008.
[Hod97] Wilfrid Hodges. A Shorter Model Theory. Cambridge University Press, 1997.
[Hud84] Richard Hudson. Word Grammar. Oxford: Blackwell, 1984.
[Hud07] Richard Hudson. Language Networks: The New Word Grammar. Oxford Linguistics, 2007.
[iC06] R. Ferrer i Cancho. Why do syntactic links not cross? EPL (Europhysics Letters), 76(6):1228–1234, 2006.
[Kah03] Sylvain Kahane. The meaning-text theory. Dependency and Valency: An International Handbook of Contemporary Research, 1:546–570, 2003.
[KM04] Dan Klein and Christopher D. Manning. Corpus-based induction of syntactic structure: Models of dependency and constituency. In ACL '04: Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics, pages 479–486. Association for Computational Linguistics, 2004.
[KSPC13] Dimitri Kartsaklis, Mehrnoosh Sadrzadeh, Stephen Pulman, and Bob Coecke. Reasoning about meaning in natural language with compact closed categories and Frobenius algebras. 2013.
[LGE10] Ruiting Lian, Ben Goertzel, et al. Language generation via glocal similarity matching. Neurocomputing, 2010.
[LGK+12] Ruiting Lian, Ben Goertzel, Shujing Ke, Jade O'Neill, Keyvan Sadeghi, Simon Shiu, Dingjie Wang, Oliver Watkins, and Gino Yu. Syntax-semantic mapping for general intelligence: Language comprehension as hypergraph homomorphism, language generation as constraint satisfaction. In Artificial General Intelligence: Lecture Notes in Computer Science, Volume 7716. Springer, 2012.
[Liu08] Haitao Liu. Dependency distance as a metric of language comprehension difficulty. Journal of Cognitive Science, 9(2):159–191, 2008.
[LP01] Dekang Lin and Patrick Pantel. DIRT: Discovery of inference rules from text. In Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'01), pages 323–328. ACM Press, 2001.
[Mih05] Rada Mihalcea. Unsupervised large-vocabulary word sense disambiguation with graph-based algorithms for sequence data labeling. In HLT '05: Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing, pages 411–418, Morristown, NJ, USA, 2005. Association for Computational Linguistics.
[Mil06] Jasmina Milićević. A short guide to the meaning-text linguistic theory. Journal of Koralex, (8):187–233, 2006.
[MLP06] Ryan McDonald, Kevin Lerman, and Fernando Pereira. Multilingual dependency analysis with a two-stage discriminative parser. In CoNLL-X '06: Proceedings of the Tenth Conference on Computational Natural Language Learning, pages 216–220, Morristown, NJ, USA, 2006. Association for Computational Linguistics.
[MP87] Igor A. Mel'čuk and Alain Polguere. A formal lexicon in meaning-text theory. Computational Linguistics, 13:261–275, 1987.
[MPRH05] Ryan McDonald, Fernando Pereira, Kiril Ribarov, and Jan Hajič. Non-projective dependency parsing using spanning tree algorithms. In HLT-EMNLP '05: Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing, pages 523–530, Morristown, NJ, USA, 2005. Association for Computational Linguistics.
[MTF04] Rada Mihalcea, Paul Tarau, and Elizabeth Figa. PageRank on semantic networks, with application to word sense disambiguation. In COLING '04: Proceedings of the 20th International Conference on Computational Linguistics, Morristown, NJ, USA, 2004. Association for Computational Linguistics.
[PD09] Hoifung Poon and Pedro Domingos. Unsupervised semantic parsing. In
Proceedings of the2009 Conference on Empirical Methods in Natural Language Processing , pages 1–10, Singapore,August 2009. Association for Computational Linguistics.[Pro13] The Univalent Foundations Program.
Homotopy Type Theory: Univalent Foundations of Math-ematics . Institute for Advanced Study, 2013.[RVG05] Mike Ross, Linas Vepstas, and Ben Goertzel. Relex semantic relationship extractor.http://opencog.org/wiki/RelEx, 2005.[SM07] Ravi Sinha and Rada Mihalcea. Unsupervised graph-basedword sense disambiguation usingmeasures of word semantic similarity. In
ICSC ’07: Proceedings of the International Conferenceon Semantic Computing , pages 363–369, Washington, DC, USA, 2007. IEEE Computer Society.[ST91] Daniel Sleator and Davy Temperley. Parsing english with a link grammar. Technical report,Carnegie Mellon University Computer Science technical report CMU-CS-91-196, 1991.[ST93] Daniel D. Sleator and Davy Temperley. Parsing english with a link grammar. In
Proc. ThirdInternational Workshop on Parsing Technologies , pages 277–292, 1993.[Ste90] James Steele, editor.
Meaning-Text Theory: Linguistics, Lexicography, and Implications . Uni-versity of Ottowa Press, 1990.[Tem07] David Temperley. Minimization of dependency length in written english.
Cognition , 105:300–333,2007.[Tes59] Lucien Tesni`ere. ´El´ements de syntaxe structurale . Klincksieck, Paris, 1959.24WP-a] Argument. http : //en.wikipedia.org/wiki/Arguments ( linguistics ).[WP-b] Boolean satisfiability problem. http : //en.wikipedia.org/wiki/Boolean s atisf iability p roblem .[WP-c] Conditional mutual information. http : //en.wikipedia.org/wiki/Conditional m utual i nf ormation .[WP-d] Dpll algorithm. http : //en.wikipedia.org/wiki/DP LL a lgorithm .[WP-e] Predicate. http : //en.wikipedia.org/wiki/P redicate ( grammar ).[Yur98] Deniz Yuret. Discovery of Linguistic Relations Using Lexical Attraction . PhD thesis, MIT, 1998.
Appendix A: Meaning-Text Theory
The most comprehensive theory of meaning, and of its conversion to text, is Meaning-Text Theory (MTT) [MP87, Kah03, Mil06, Ste90]. Although the theory itself is primarily oriented towards the generation of text from meaning, we believe that its representation of meaning is also well-suited to extracting meaning from text. MTT is not only compatible with dependency grammar; it can be thought of as an extension to it, in that it provides rules for converting meaning into dependency graphs. These rules are quite specific, and thus lend themselves to an algorithmic implementation: this is another strength of MTT. Possibly the most important contribution of MTT to linguistics is the discovery of "lexical functions", which map concepts to words.

At its core, MTT captures meaning with a "semantic representation" (SemR). A semantic representation is a network of predicate-argument relations (in the sense of a linguistic predicate [WP-e] and linguistic argument [WP-a]). An example of such a network is shown in figure 3. Each arrow in the figure is a semantic dependency, that is, a predicate-argument relationship (arguments are also called 'semantic actants' in the theory). The arrows are labeled with numbers corresponding to the valency, or number of arguments, that a node may have. Thus, for example, the verb "criticize" has a valency of three: "X criticizes Y for Z". Nodes in the graph may be primitive or atomic, in which case they are called 'semes'; they may also have structure, in which case they are called 'semantemes'. Note that semantemes are highly lexicalized: that is, the definition of a semanteme specifies the number of actants, and their roles. In essence, semantemes correspond to dictionary entries (a 'lexis' is a dictionary).

More properly, the network described above is a 'SemS' (semantic structure); it is but one part, although a core part, of a SemR. The network captures the propositional/situational meaning at the core. The other three parts of SemR are termed Sem-CommS, RhetS and RefS. The Sem-CommS (semantic-communicative structure) captures the communicative intent: the 'theme' (what is being talked about) and the 'rheme' (what is being said about the theme). For example, in figure 3, the topic is 'media', and the rheme is what the media is talking about (a raise in taxes). There is no unique assignment of themes and rhemes to a network: thus, for example, in figure 4, the theme is the raise in taxes, and the rheme is what is being said about it (the media is criticizing it). The Sem-CommS also has several other parts, the most important of which distinguishes new information from given information; that is, it differentiates what is asserted from background presuppositions. A general axiom of MTT is that meaning is invariant under paraphrasing: thus, for example, "The media harshly criticized the Government for its decision to increase income taxes" and "The Government's decision to increase income taxes was severely criticized by the media" are roughly synonymous, and so the network underlying the two figures 3 and 4 is the same. Tightening down the distinction between new and presupposed information breaks synonymy; it narrows the range of sentences that can be considered synonymous, that can be taken to say the same thing. Thus, MTT can capture the finer points of a speech-writer's art: the careful crafting of sentences to convey a very specific meaning.

The other parts of SemR are RhetS and RefS. RhetS specifies the rhetorical style used in converting a semantic network to text: headline-news style, where sentences are clipped ('Thieves rob Bank'); informal speech, with lots of lulz, typos, ikr, smh and smiley winks ;-) ; or proper newspaper English. The RefS captures the referential structure: the references to concrete objects in the (model of the) external environment ('the ball rolled under the sofa' refers to a specific ball and sofa) or anaphora (the pronoun 'she' in 'she waved goodbye' refers to a specific person).

MTT distinguishes between several different forms or levels of representation; described above is SemR, the semantic representation. There is also a SyntR, the syntactic representation, roughly corresponding to a dependency parse, as well as a MorphR, the morphological representation, in which not only the word-order has been chosen, but so have verb tenses, inflections of nouns and the like. A PhonR representation gives the spoken, phonemic form of a sentence. Each of these levels has additional structures that capture important information (such as verb tense, or the choice of adjective or adverb modifiers). Between each of these levels sits a set of correspondence rules that translate structures at one level into those of another. Roughly speaking, correspondence rules can be thought of as functors that map networks at one level to those of another. MTT attempts to treat these rules as mechanically as possible, with an eye to algorithmic implementation.

Perhaps the most important contribution of MTT to linguistics is the discovery of 'lexical functions' (LFs); these appear in the correspondence rules. Lexical functions bind a meaning to a lexeme. For example, the LF Magn() is a function that specifies a list of appropriate words for expressing magnitude. One then has Magn(rain) = torrential, hard; Magn(wind) = strong; and Magn(emotion) = hot. The point here is that in each case a magnitude is being expressed, yet it is context-specific: one would not normally say 'hot rain', 'hard wind' or 'torrential emotion'; the LF specifies the allowable modifiers for a given noun. The LF Magn() is broad in scope, applying to nouns in general; not all LFs are universal in this way. Some can have a very narrow scope: for example, 'leap' applies only to 'year'. Another example is the subject LF S(), which indicates authorship: S(crime) = perpetrator, S(book) = author. A third example is the quasi-subject function, which gives noun-equivalents for verbs: QS(criticize) = criticism. MTT defines many different lexical functions; a sampling of these for the verb 'criticize', taken from [Mil06], are:

QSyn: attack, disapprove, reproach
QAnti: praise, congratulate
QS: criticism
S: critic
A: critical [of N]
Magn: bitterly, harshly, seriously, severely, strongly
MagnQuant: all the time, relentlessly, without stopping
AntiMagn: half-heartedly, mildly
AntiVer: unjustly, without reason

As should be apparent from the above, lexical functions provide all of the relations that can be found in resources like WordNet (viz., a machine-readable catalog of word-senses, their synonyms, antonyms and hypernyms), while providing a more comprehensive view of the nature of these relations. Similarly, the predicate-argument lexical entries of MTT resemble the frames of FrameNet. Unlike FrameNet, however, MTT describes the mechanisms by which these semantic frames become connected to syntactic representations; it provides a more comprehensive description of how frames interact with other aspects of grammar.

To recap: meaning is captured by referential structures, obtained from semantic structures built of lexical functions. The goal of learning language is to ascend through a hierarchy of structures, from raw text, through dependency grammars, up to referential disambiguation of a world-model. To make sure that learning takes place at a reasonable pace, deep-learning-style reinforcement must happen at each layer, so that the simpler, shallower (syntactic) layers are somewhat developed before the semantic layers are attacked, while the deeper layers can also guide correct learning at the shallower layers. MTT, as opposed to some other theory or framework, seems appropriate for providing a basic framework of what must be learned. This is in part because MTT seems to be the most comprehensive theory describing not only the various layers, but also algorithmic mechanisms for transforming one layer into another.

The language learning system must learn lexical functions on its own; it must learn how to pick out predicate-argument structures; these are not pre-supposed. Rather, MTT provides a viewpoint by which the success of the learning system might be judged: instead of treating learning as a black box, one might expect to be able to examine what is being learned, and one might reasonably expect it to resemble the outlines of MTT.
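Computationally, a lexical function of the kind sampled above amounts to a partial map from (LF name, keyword) pairs to sets of admissible realizations. The sketch below is our own illustration, not part of any MTT implementation: the table entries are the examples from the text, while the function names `realize` and `admissible` are invented for the example.

```python
# A toy lexical-function table: maps (LF name, keyword) -> admissible realizations.
# Entries are the examples quoted in the text; a real MTT lexicon holds thousands.
LEXICAL_FUNCTIONS = {
    ("Magn", "rain"):      {"torrential", "hard"},
    ("Magn", "wind"):      {"strong"},
    ("Magn", "emotion"):   {"hot"},
    ("Magn", "criticize"): {"bitterly", "harshly", "seriously", "severely", "strongly"},
    ("S",    "crime"):     {"perpetrator"},
    ("S",    "book"):      {"author"},
    ("QS",   "criticize"): {"criticism"},
}

def realize(lf: str, keyword: str) -> set:
    """Return the context-specific realizations of lexical function `lf`
    applied to `keyword`; empty set if the lexicon has no entry."""
    return LEXICAL_FUNCTIONS.get((lf, keyword), set())

def admissible(lf: str, keyword: str, candidate: str) -> bool:
    """Is `candidate` an allowable expression of `lf` for `keyword`?
    E.g. 'torrential rain' is fine, but 'torrential emotion' is not."""
    return candidate in realize(lf, keyword)
```

The point the table makes concrete is the context-specificity of LFs: `admissible("Magn", "rain", "torrential")` holds, while `admissible("Magn", "emotion", "torrential")` does not, even though both express magnitude.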
Appendix B: Mutual Information
This appendix provides some gymnastics for working with probabilities associated with structures and relations. It is provided only because such discussions are rare in the literature. Only the very simplest case is worked here: the mutual information between a pair of words. Hopefully, the generalizations then become obvious.

Let $P(R(w_l, w_r))$ represent the probability (frequency) of observing two words, $w_l$ and $w_r$, in some relationship or pattern $R$. Typically, in Link Grammar, it would be a linkage of type $t$ connecting word $w_l$ on the left to word $w_r$ on the right; however, the relation $R$ can be more general than that.

The simplest model has only one type $t$, the ANY type, and assigns equal probabilities to all words. But we know all words are not equi-probable, so let $P(w)$ be the probability of observing word $w$. We know from experience that this is a Zipfian distribution. We are then interested in the conditional probability $P(R(w_l,w_r) \mid w_l, w_r)$ of observing the two words $w_l$ and $w_r$ in a relation $R = R(w_l,w_r)$, given that the two individual words were observed. From the definition of conditional probabilities, one has that
$$P(R) = P(R \mid w_l, w_r)\, P(w_l)\, P(w_r)$$
or, equivalently, that
$$P(R \mid w_l, w_r) = \frac{P(R)}{P(w_l)\, P(w_r)}$$
Here, the relation $R$ encompasses several facts: that one word is to the left of the other, and that they are connected by a certain link-type, as well as capturing other 'ambient' information, such as, perhaps, other nearby words.

It is important here to harmonize this with the notation used by Yuret [Yur98] and commonly employed in the MST parser literature. There, a probability $P(w_l, w_r)$ of seeing the ordered pair is defined; that is, the relation $R$ is implicit. To make it explicit, we should write $P(w_l, w_r) = P(R(w_l, w_r))$ to indicate the relation explicitly, and to note that the order of the positions in the relation matters. Yuret also uses the notation $P(w_l, *)$ and $P(*, w_r)$ for wild-card summations, defined as
$$P(w_l, *) = \sum_{w_r} P(w_l, w_r) \qquad\text{and}\qquad P(*, w_r) = \sum_{w_l} P(w_l, w_r)$$
In practical use, one quickly observes that $P(w_l, *)$ is almost equal to $P(w_l)$, but not quite, since $P(w_l, *)$ is the probability of seeing $w_l$ within the given relationship or pattern, which must be less than the prior probability of observing $w_l$ in general. Thus, one has $P(w_l, *) \le P(w_l)$, which can be viewed as a conditional probability:
$$P(w_l, *) = P(R(w_l, *)) = P(R(w_l, *) \mid w_l)\, P(w_l)$$
In practice, then, for word-pairs, one has that $P(R \mid w_l)$ is almost equal to 1, but not quite. Inserting this into the above gives
$$P(R(w_l, w_r) \mid w_l, w_r) = \frac{P(R(w_l, w_r))\, P(R(w_l, *) \mid w_l)\, P(R(*, w_r) \mid w_r)}{P(w_l, *)\, P(*, w_r)}$$
Re-ordering this gives
$$\frac{P(R(w_l, w_r) \mid w_l, w_r)}{P(R(w_l, *) \mid w_l)\, P(R(*, w_r) \mid w_r)} = \frac{P(w_l, w_r)}{P(w_l, *)\, P(*, w_r)} \qquad (1)$$
The right hand side above is recognizable from Yuret's work; he defines the mutual information as
$$\mathrm{MI}(w_l, w_r) = \log \frac{P(w_l, w_r)}{P(w_l, *)\, P(*, w_r)}$$
so that large positive MI is associated with words that occur together only with each other (e.g. "Northern Ireland", from his examples). So, on the right, we have that $P(w_l, w_r)$ is usually very small, and that $P(w, *) \approx P(*, w) \approx P(w)$, subject to the inequality given before.

The LHS of equation (1) shows how to properly normalize conditional probabilities for general structures when performing "frequent subgraph mining". First, we have observationally seen that $P(R(w_l, *) \mid w_l) \approx P(R(*, w_r) \mid w_r) \approx 1$, and thus must conclude that $P(R(w_l, w_r) \mid w_l, w_r)$ is 'large'; much larger than (unconditional) word-pair frequencies.

The LHS of equation (1) then demonstrates how to obtain conditional entropies in general. Thus, given an $n$-point relation $R(x_1, x_2, \cdots, x_n)$, one first computes the unconditional probability $P(R(x_1, x_2, \cdots, x_n))$. The conditional probability is then obtained as usual:
$$P(R(x_1, x_2, \cdots, x_n) \mid x_1, x_2, \cdots, x_n) = \frac{P(R(x_1, x_2, \cdots, x_n))}{P(x_1)\, P(x_2) \cdots P(x_n)}$$
The mutual information is then built recursively by normalizing by the probabilities of the wild-card relations:
$$\mathrm{MI}(R(x_1, \cdots, x_n) \mid x_1, \cdots, x_n) = \log \frac{P(R(x_1, x_2, \cdots, x_n) \mid x_1, x_2, \cdots, x_n)}{P(R(*, x_2, \cdots, x_n) \mid x_2, \cdots, x_n)\, P(R(x_1, *, \cdots, x_n) \mid x_1, x_3, \cdots, x_n) \cdots}$$
The point of this derivation is to provide a simpler practical formulation for working with structural relations in language. Most presentations of conditional mutual information obscure the structural relationships, by hiding the wild-card summations in a different notation that makes it hard to discern their presence; see, for example, [WP-c] for a demonstration of an equivalent but more opaque notation. Do observe, though, that the above is a special case, where the $x_k$ appearing in the relation are identical to those appearing in the condition. When these are not the same, a summation is required, as usual:
$$I(R(X_1, \cdots, X_n) \mid Z) = \sum_{z \in Z} P(z) \sum_{x_k \in X_k} P(R(x_1, \cdots, x_n) \mid Z) \log \frac{P(R(x_1, \cdots, x_n) \mid Z)}{P(R(*, x_2, \cdots, x_n) \mid Z) \cdots}$$
The difference between this and the previous equation is that, when the $X_k = \{x_k\}$ are all singleton sets, and $Z = \{z = (x_1, x_2, \cdots, x_n)\}$ is likewise a singleton set, the summations disappear. One is left with a mutual information, scaled by the (unconditioned) probability of seeing the particular pattern. Because the pattern may in fact be very rare, this is not as useful in practical experimentation as the renormalized mutual information.

Figure 3: A Semantic Representation
An example of a semantic representation (SemR) from Meaning-Text Theory. This network of predicate-argument arrows captures the meaning of the sentence "The media harshly criticized the Government for its decision to increase income taxes." Here, "media" is the topic or theme, and what the media is saying is the rheme. Compare this to figure 4. Figure taken from [Mil06].

Figure 4: Alternative Semantic Topic
An alternative partitioning of a semantic network into theme and rheme. In this case, the topic is "the Government's decision to raise taxes", and what is being said about this topic is that the media is sharply critical of it. In words, one could say that "
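The pairwise mutual information of Appendix B is straightforward to compute from a table of ordered-pair counts. The sketch below is our own illustration (the function name and the choice of base-2 logarithms are ours, not taken from the paper): it computes MI(w_l, w_r) = log [ P(w_l, w_r) / (P(w_l, *) P(*, w_r)) ], with the wild-card marginals obtained by summation over the observed pairs, exactly as in the appendix.

```python
import math
from collections import Counter

def mutual_information(pair_counts):
    """Given a Counter {(w_left, w_right): count} of observed ordered pairs,
    return {(w_left, w_right): MI} where
        MI = log2( P(wl, wr) / (P(wl, *) P(*, wr)) ).
    The wild-card marginals P(wl, *) and P(*, wr) are sums over the observed
    pairs, as in the text -- NOT the unigram probabilities P(w)."""
    total = sum(pair_counts.values())
    left_marginal = Counter()   # numerators of P(wl, *)
    right_marginal = Counter()  # numerators of P(*, wr)
    for (wl, wr), n in pair_counts.items():
        left_marginal[wl] += n
        right_marginal[wr] += n
    mi = {}
    for (wl, wr), n in pair_counts.items():
        p_pair = n / total
        p_left = left_marginal[wl] / total
        p_right = right_marginal[wr] / total
        mi[(wl, wr)] = math.log2(p_pair / (p_left * p_right))
    return mi
```

As the appendix notes, word pairs that occur only with each other (Yuret's "Northern Ireland") receive large positive MI, while pairs of promiscuous words score near or below zero.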