The Singleton Fallacy: Why Current Critiques of Language Models Miss the Point
Magnus Sahlgren · Fredrik Carlsson
RISE (Research Institutes of Sweden), NLU Research Group, Sweden
E-mail: [email protected], [email protected]
Abstract
This paper discusses the current critique against neural network-based Natural Language Understanding (NLU) solutions known as language models. We argue that much of the current debate rests on an argumentation error that we will refer to as the singleton fallacy: the assumption that language, meaning, and understanding are single and uniform phenomena that are unobtainable by (current) language models. By contrast, we will argue that there are many different types of language use, meaning and understanding, and that (current) language models are built with the explicit purpose of acquiring and representing one type of structural understanding of language. We will argue that such structural understanding may cover several different modalities, and as such can handle several different types of meaning. Our position is that we currently see no theoretical reason why such structural knowledge would be insufficient to count as "real" understanding.
Keywords
Natural Language Understanding · Neural Networks · Language Models · Meaning · Philosophy of Language
1 Introduction

We are at an inspiring stage in research on Natural Language Understanding (NLU), with the development of models that enable unprecedented progress across a wide range of tasks [31]. At the same time, there are critical studies being published that demonstrate limitations of our current solutions [20,18,24], and more recently, voices have been raised calling for us, if not to take a step back, then at least to stop for a moment and recollect our theoretical
bearings [2,1]. Even if these latter theoretical contributions have slightly different perspectives — [2] introduce the notions of
World Scopes as a way to argue for the futility of using only text data to train NLU models, while [1] posit a strict distinction between form and meaning, arguing that models only trained on form cannot grasp meaning — they share what we consider to be a healthy skepticism of the currently somewhat opportunistic and methodologically narrow-minded development.

The main controversy in this recent debate is the question to what extent our current NLU approaches — i.e. predominantly Transformer neural network language models — can be said to really understand language, and whether the currently dominating research direction has any potential at all to lead to models with actual understanding. To put the point succinctly: will it eventually prove to be enough to train a thousand-layer trillion-parameter Transformer language model on the entire world's collected texts, or do we need something more or something else to reach true NLU (and what is "true" NLU anyway)? The recent excitement and hype in news and popular science press surrounding GPT-3 (see e.g. [30] and [17]) of course does nothing to dampen this controversy. While we share the overall assessment that more theoretical considerations would be beneficial for current and future NLU development, we think that both [2] and [1] oversimplify important core discussion points. Our sentiment is hence that some of the presented arguments in the debate are somewhat misplaced and insufficient, as we find them to misrepresent the inherent complexity in discussing controversial topics such as meaning and understanding. This contribution therefore aims to analyze, and hopefully clarify, some of these arguments while also raising some novel discussion points of its own.
It is always precarious to build arguments on inherently vague and general concepts such as "language", "understanding" and "meaning," as the resulting theoretical constructs may become so overly general that they almost become vacuous. In this first section, we discuss how these terms are used in the current debate, and we argue that most of the current critique of the semantic capabilities of language models rests on a misunderstanding we refer to as the singleton fallacy. In short, this argumentation error consists in assuming that a term refers to a single uniform phenomenon, when in practice the term can refer to a large set of phenomena connected by family resemblances. Sections 2.1 and 2.2 discuss the concepts of "language" and "understanding," while Section 2.3 focuses on how current language models understand language.

2.1 Language is Not One Single Thing

Language is normally defined as the system of symbols that we use to communicate, and learning a language entails learning the set of symbols and rules that define the system. Learning the set of symbols equals vocabulary acquisition, while learning the rules entails recognizing and formalizing grammatical, morphological, and syntactic regularities. We measure these competencies in humans — often indirectly — by using various language proficiency tests, such as vocabulary tests, cloze tests, reading comprehension, as well as various forms of production, interaction, and mediation tests (such as translation and interpretation). To evaluate our current NLU solutions, we often use specifically designed test sets, such as GLUE [32], SuperGLUE [31] and Dynabench [19], or more specific probing tasks that attempt to more directly measure a model's capacity to represent a specific linguistic phenomenon [29,15,13]. Even if current language models have been shown to underperform in some specific test settings (such as their ability to handle negation [9]), there is an overwhelming body of empirical results to demonstrate that current language models have a passable capacity to detect the symbols and rules that define human language. This is not what is under dispute in the current debate; what is under dispute is whether such structural knowledge about the workings of the language system suffices.

Of course, the question is: suffices for what? Presumably, we develop NLU systems in order to do the things we humans do with language. And here is the complication; we do not only do one thing with language. We humans do lots of different things with language, ranging from primal vocal expressions, through basic naming of objects and responding to simple questions, to more complicated tasks such as following instructions, arguing, or participating in negotiations. Language behavior is decidedly not one single activity, but a collection of many interrelated competencies and activities that together constitute the totality of human linguistic behavior. [34] refers to the relations between these interrelated linguistic activities as family resemblances, and he explains the situation thus: "Instead of producing something common to all that we call language, I am saying that these phenomena have no one thing in common which makes us use the same word for all — but that they are related to one another in many different ways. And it is because of this relationship, or these relationships, that we call them all 'language'."

All humans have a slightly different set of linguistic abilities, and even if two language users share a linguistic ability — e.g.
arguing — they are typically not equally good at it. Linguistic proficiency is a continuous scale, ranging from more or less complete incompetence to more or less complete mastery. We normally do not think about (human) language learning and linguistic proficiency as a pursuit of one single ultimate goal, whose completion is a binary outcome of either success or failure, but rather as a collection of tasks that can be performed in a number of different ways. When trying to define a criterion for quantifying linguistic proficiency, one must acknowledge that a totality of these linguistic skills is never actually manifested (as far as we know) in one single human language user. All current language learning tests are modelled according to this assumption, and hence deliver scalar results on a number of different test settings.
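The same graded picture applies when we probe models rather than humans: a cloze-style query to a masked language model returns a ranked list of scored candidates, not a binary verdict. The following is a minimal sketch only (assuming the Hugging Face transformers library and the publicly available bert-base-uncased checkpoint, neither of which is prescribed by the works discussed here) of the kind of negation probing mentioned above:

from transformers import pipeline

# Cloze-style probing of a masked language model: the output is a ranked,
# graded list of candidate tokens with scores, not a single pass/fail verdict.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

for prompt in ["A robin is a [MASK].", "A robin is not a [MASK]."]:
    candidates = fill_mask(prompt, top_k=3)
    print(prompt, [(c["token_str"], round(c["score"], 3)) for c in candidates])

# Models of this kind are known to rank "bird" highly in both prompts,
# which is the kind of negation failure referred to above [9].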
We suggest that it may be more productive to explicitly recognize that language behavior covers a plethora of activities, and that linguistic competence can only be measured on a continuous scale. Doing so will also foster more realistic expectations on NLU solutions. Instead of demanding that they be flawless, generic and applicable to every possible use case, we may be better off adopting the same type of expectations as with human language users; they will all be different, and — importantly — good at different things.

2.2 How Should We Understand "Understanding"?

The main controversy in the current debate is not so much whether language models can be trained to perform various types of language games (most commentators seem to agree that they indeed can). The main controversy is instead whether a language model that has been trained to perform some language game actually has any real understanding of language, or if it is confined to purely structural knowledge. In order to discuss this question seriously, we first need to understand what "understanding" means. [1] propose that understanding is pairing an expression with the correct meaning. We argue that this is a clear instance of the singleton fallacy (i.e. assuming that understanding is one single thing), and it is certainly not consistent with how we use the term "understanding" in normal language use. Perhaps a couple of examples can be enlightening.

Let us begin by considering a type of expression that seems to be of major concern for [2]: an instruction such as "perform a backflip and land in a split." We can encounter a number of different types of criteria for understanding such an instruction in normal language use. One criterion might simply be the ability to write an essay about acrobatics and use the expression consistently (and correctly) in text. Another is the ability to point to someone who actually performs a backflip and lands in a split, or to draw an image of someone performing the action. Yet another criterion is the ability to actually perform a backflip and land in a split, or even to just attempt and fail. On the other hand, if someone replies to the instruction with a profanity, or by simply leaving the room, we may be more hesitant to grant them the capacity of understanding. Unless, of course, we actually intended the expression to be an insult.

An instruction is admittedly an example of a relatively complicated language game. Let us for the sake of argument also consider one of the arguably simplest forms of linguistic expressions: a concrete noun such as "door." When do we say that someone understands this simple word? Is it enough if they can use the word correctly in sentences, and exchange it for other similar words in context, or do they need to be able to visualize the object, and to be able to point out a correct object in the world? What about a situation when someone approaches a door carrying some heavy object, upon which the person looks at you and calls out "door!" Would it count as correct understanding in this case to reply "yes, it is" (and, optionally, to point at the door), or do you need to be able to infer that the person needs assistance with opening the door, and that you therefore should open the door for her?
We would probably expect most human language users to open the door in response to such an utterance, but there will definitely be some variation of outcomes, with some people not being able to grasp the performative expectations of the speaker (but still having a perfectly good understanding of the term "door" and what it refers to), and some people perhaps offering to carry the load instead of opening the door (and thereby literally misinterpreting the speaker, while still performing a suitable action).

The point of these somewhat wordy examples is to demonstrate that there can be many different types of (correct) understanding of a given linguistic expression. One type is the intra-linguistic, structural type of understanding that enables the subject to produce coherent linguistic output. Another is the referential understanding that enables the subject to identify (and visualize) corresponding things and situations in the world, a third is the social understanding that enables the subject to interpret other people's intentions, and a possible fourth type is performative understanding where a person can (try to) accomplish the action being mentioned. These different types of understanding map approximately to [2]'s World Scopes 1–3 (intra-linguistic understanding), 4 (referential understanding), and 5 (social understanding). In more traditional linguistic terms, we might use the terms conventional meaning, referential meaning, and pragmatic meaning to refer to these different types of information content. The definition of understanding proposed by [1] primarily applies to pragmatic aspects of language use, where the task of the interlocutor is to identify the intents of the speaker (or writer).

2.3 The Structuralism of Contemporary NLU Approaches

It may be useful at this point to consider how (one breed of) current NLU systems "understand" language. A particularly successful approach to NLU (and NLP in general) at the moment is to use deep Transformer networks that are trained on vast amounts of language data using a language modeling objective, and then specialized (or simply applied) to perform specific tasks [4,21,8,3]. Such models implement a fundamentally structuralist — and even distributionalist [26,10] — view of language, where a model of how symbols are combined and substituted for each other is learned by observing samples of language use. The language modeling component (or, in somewhat older methods, the embeddings) encodes basic knowledge about the structure of the language system, which can be employed for solving specific linguistic tasks. This is eminently well demonstrated in the recent work on zero-shot learning [33,3].

The "understanding" these models have of language is entirely structural, and thus does not extend beyond the language system — or, more accurately, beyond the structural properties of the input modality. Speaking in terms of the different types of understanding we discussed above, this refers to
conventional, intra-linguistic understanding. Note that this is a very intentional restriction, since the learning objective of these models is optimized for learning distributional regularities. It follows that if the input signal consists of several different modalities (e.g. language and vision, sound, touch, and maybe even smell and taste), then the resulting structural knowledge will cover all of these modalities. A distributional model built from multimodal data will thus be able to employ its (structural) knowledge cross-modally, so that, e.g., vision data can affect language knowledge and vice versa [31,28]. Such a multimodal model may be able to form images of input text, so that when given an input such as "door" it can produce, or at least pair the text with, an image of a door [23]. There has also been a fair amount of work on image captioning, where a model produces text based on an input image [12,14].

This cross-modal ability cannot be described as purely intra-linguistic, since it covers several modalities. While we might not want to go as far as to call this referential understanding, it should certainly count as visual understanding (of language). It is an interesting question how we will view distributional models when we start to incorporate more modalities in their training data. What type of understanding would we say a language model has if it can connect linguistic expressions to actions or situational parameters? Multimodality is mentioned by both [2] and [1] as a promising, if not necessary, research direction towards future NLU, and we agree; it seems like an unnecessary restriction to only focus on text data when there is such an abundance of other types of data available. However, there is also a more fundamental question in relation to multimodality: are there things that cannot be learned by merely reading large bodies of text data?
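To make the notion of cross-modal pairing slightly more concrete, a jointly trained text-image model can score how well a set of captions matches an image, using nothing but the structural regularities it has observed in paired multimodal data. The following minimal sketch is purely illustrative and not a description of any specific system cited above; it assumes the Hugging Face transformers library, the openly released CLIP checkpoint, and a hypothetical local image file door.jpg:

from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Load a jointly trained text-image model and its preprocessor.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("door.jpg")  # hypothetical image file
captions = ["a photo of a door", "a photo of a bear",
            "a person performing a backflip"]

# Score the compatibility of the image with each caption and normalize.
inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
scores = model(**inputs).logits_per_image.softmax(dim=1)[0]
print(dict(zip(captions, scores.tolist())))

Whether pairing text with images in this structural, distributional sense should count as referential understanding is precisely the open question raised above.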
The last section problematized the use of philosophically nebulous terms such as "understanding" and "meaning," and we argued that careless use of such terms invites an argumentation error that we labelled the singleton fallacy. In this section, we discuss some of the more concrete arguments against language models, and we argue that they inevitably collapse into dualism, which we consider to be a defeatist position for an applied computational field of study. A consistent theme in the current critique of language models is the assumption that text data is insufficient as learning material for reaching real understanding of language. We have already noted the perils of using such general statements as "understanding of language," but in this section we will take a closer look at some of the specific arguments and thought experiments that motivate this assumption.

3.1 The Octopus Test

One of the most common arguments is that learning language is also learning about our world; we use language to communicate and store experiences, opinions, facts and knowledge. [1] construct a more elaborate thought experiment to make basically the same point. The thought experiment features a hyper-intelligent octopus ("O") that inserts itself in the middle of a two-way human communication channel, and that eventually (due to loneliness, we are told) tries to pose as one of the human interlocutors.

We are not convinced by the "octopus test." A being that inserts itself into a communication channel because of loneliness obviously needs to have some sort of prior concept of, experience with, and desire to take part in, communication. But perhaps we should only assume that O coincidentally happened to connect to the communication channel, and that it, as a reflex or due to programming, starts to detect and process statistical patterns in the symbol streams. In other words, O is a language model. We then agree that if the "O model" only learns from the information being transmitted between A and B, it will not be able to generalize outside the context of A and B. But what if O has also connected to the underwater transatlantic Internet cable, and has learned continuously from all the world's web traffic (let's restrict the learning material to text content for now) since the cable was submerged, some 20 years ago (the latest transatlantic telecommunications cable, TAT-14, was installed in 2000)? Would it then be completely unrealistic to think that O would actually be able to respond with something reasonably creative when asked how to use sticks to defend against a bear?

[1]'s main point with the thought experiment is to argue that symbol manipulation by itself cannot transcend the symbol system. This is certainly true from one perspective; the O language model, trained with the type of objective we currently use, would not be able to actually recognize a bear or a pair of sticks in the real world (since it "lives" in a purely textual world). On the other hand, it has certainly learned things about the world by reading the entire Internet, so the model might be able to produce accurate textual descriptions of both bears and sticks, based on information it has acquired from reading, but it will not be able to act outside the textual modality. This seems to be a main point for both [1] and [2], and we agree that a purely textual model will be constrained to a purely textual reality.

This raises the question (again) of whether there are any a priori constraints on what can and cannot be learned from language.
Imagine, for the sake of argument, that we are a brain in a vat with super-human abilities enabling us to read, understand, and remember everything that ever has been written (and maybe even hear everything that has been spoken) about painting, bears, sticks, and animal attacks. Imagine further that we at some later stage would be implanted into a body; would we then learn anything new when we are actually able to perform the activities in question (e.g. painting and using sticks to defend against bear attacks)? We are at this point getting dangerously close to being sucked into the infamous "Mary the super-scientist argument" [16], with questions about the reality of subjective experience, and nebulous terms such as "qualia."

We suggest staying well clear of this debate by using more precise terminology. If by "understanding" we refer to structural knowledge, then obviously language data will be enough; if we instead refer to the ability to perform actions in the world, then, equally obviously, we need to also acquire motor skills and connect them to linguistic expressions. We completely agree that no language model, no matter how much data it has been trained on and how many parameters it has, by itself will be able to understand instructions in the sense of actually performing actions in the world. But we do believe that, in principle, a language model can acquire a similar understanding and mastery of language as a human language user.

3.2 Communicative Intent and the Cartesian Theater

Even if we disagree that the octopus test disproves that language models actually understand language, we do think it points to one important aspect of language use: namely that of agency. That is, if O is only a language model, we would not expect it to spontaneously reply to some statement from A or B, since it has no incentives to do so. A language model only produces an output when prompted; it has no will or intent of its own. Human language users, by contrast, have plans, ambitions and intents, which drive their linguistic actions. We humans play language games in order to achieve some goal, e.g. to make someone open a door, or to insult someone by giving them an instruction they cannot complete. Language models (and other current NLU techniques) execute language games when prompted to do so, but the intent is typically supplied by the human operator.

Of course, there is one specific NLP application that is specifically concerned with intents: dialogue systems (or chatbots, to use the more popular vernacular). Consider a simple chatbot that operates according to a given plan, for example to call a restaurant and book a table for dinner. Such a chatbot would not only be able to act according to its own intents, but would presumably also be able to recognize its interlocutor's intents (by simply classifying user responses according to a set of given intent categories). If we were to take the position that pragmatics is the necessary requirement for understanding, as [1] seem to do, it would lead to the slightly odd consequence that a simple (perhaps even completely scripted) dialogue system would count as having a fuller understanding of language than a language model that is capable of near-human performance on reading comprehension tasks.
Such a comparison is of course nonsensical, since these systems are designed to perform different types of tasks with different linguistic requirements and thus cannot be compared on a single scale of understanding (there is no such thing).

This reductio ad absurdum example demonstrates the perils of strict definitions, such as the M ⊆ E × I formula for meaning suggested by [1] (where M is meaning, E is expression, and I is intent). The main problem with invoking intents in a strict mathematical definition is that the concept is extremely vague. In operationalizations of intent recognition (e.g. in chatbots), we operate with a limited and predefined taxonomy of intents that are relevant to a specific use case, but in open language use it is less clear how to assign intents to expressions. At what level of granularity do intents reside — is it at the word level, sentence level, or speaker turn (and how does that translate into text, where a turn sometimes is an entire novel)? And is the subject always in a privileged position to identify her intents? This is questioned in particular by postmodern critical theory [22], and there are plenty of examples in the public debate where the speaker's interpretation of her utterance differs from that of commentators ("I did not mean it like that" is not an uncommon expression). We can even sometimes find ourselves in the curious position of being simultaneously correct and incorrect about a speaker's intent, which is the case with our example with "door!" in the previous section (if I don't open the door and instead take the person's load, I would have misinterpreted the literal intention of the utterance, but still have correctly interpreted the person's need). This lends a certain hermeneutic flavor to the concept of intent, which makes it slightly inconvenient to use in mathematical formulas.

We believe that pragmatics is no less, and perhaps even more, a product of conventionalization processes in language use than other types of understanding. This may be an uncontroversial statement, but it points to a question that certainly is not, namely whether pragmatic understanding can be acquired by only observing the linguistic signal. [1] clearly think not, and they argue that grasping intents requires extralinguistic knowledge. For a simple case such as "door!", this may entail being able to recognize the object referred to, and possibly also knowledge about how it operates. For more abstract concepts, [1] claim the existence of "abstract" or "hypothetical world[s] in the speaker's mind." There is an apparent risk that the invocation of minds at this point collapses into Cartesian materialism [5], which constructs the mind as a kind of control room (also referred to as a "Cartesian theater") where the subject — the self or homunculus — observes, interprets, and controls the outside world. We are not sure what [1]'s position would be with respect to such a view, but it is easy to see how it would posit the intents with the homunculus, which would then use the linguistic generator to express its intents in the form of language — which is, in our understanding, more or less exactly what [1] propose.

3.3 How the Current Critique Rekindles Distributionalism

The distinction between form (i.e. the linguistic signal) and meaning (i.e. the intent) is central to [1], and they claim that a model (or, more generally, a subject) "trained purely on form will not learn meaning".
But if meanings can have an effect on the form (which we assume everyone agrees on), then a model should at least be able to observe, and learn, these effects. The point here is that there needs to be an accessible "linguistic correlate" to whatever meaning process we wish to stipulate, since otherwise communication would not be possible. Thus, in the sense that intentions (meanings) have effects on the linguistic signal (form), it will be possible to learn these effects by simply observing the signal. It is precisely this consideration that underlies the distributional approach to semantics. [11] provides the most articulate formulation of this argument: "As Leonard Bloomfield pointed out, it frequently happens that when we do not rest with the explanation that something is due to meaning, we discover that it has a formal regularity or 'explanation.' It may still be 'due to meaning' in one sense, but it accords with a distributional regularity." Somewhat ironically, [1]'s objections to distributional approaches in the form of language models — that meaning is something unobtainable from simply observing the linguistic signal — thus effectively bring us back to the original motivation for using distributional approaches in computational linguistics in the first place: if meanings really are unobtainable from the linguistic signal, then all we can do from the linguistic perspective is to describe the linguistic regularities that are manifestations of the external meanings.

In the last section, we argued that the currently dominating approach in NLU — distributionally-based methods — originates in a reaction against precisely the kind of dualism professed by [1]. But even if the motivation for the current main path in NLU thereby should be clear, the question still remains whether this current path is feasible in the long run, or whether it will eventually lead to a dead-end. This section discusses what the dead-end might look like, and what that means for the hill-climbing question.

4.1 The Chinese Room and Philosophical Zombies

[1]'s main concern seems to be that our current research direction in NLU will lead to something like a Chinese Room. The Chinese Room argument is one of the classical philosophical thought experiments, in which [27] invites us to imagine a container (such as a room) populated with a person who does not speak Chinese, but who has access to a set of (extensive) instructions for manipulating Chinese symbols, such that when given an input sequence of Chinese symbols, the person can consult the instructions and produce an output that for a Chinese speaker outside the room seems like a coherent response. In short, the Chinese Room is much like our current language models. The question is whether any real understanding takes place in the symbol-manipulating process.
We will not attempt to contribute any novel arguments to the vast literature that exists on the Chinese Room argument, but we will point to the counter-argument commonly known as the "system reply" [27]. This response notes that for the observer of the room (whether it is an actual room, a computer, or a human that has internalized all the instructions) it will seem as if there is understanding — or at least language proficiency — going on in the room. Similarly, for the user of a future NLU system (and perhaps even for certain current users of large-scale language models), it may seem as if the system understands language, even if there is "only" a language model on the inside. We can of course always question whether there is any "real" understanding going on, but if the absence or presence of this "real" understanding has no effect on the behavior of the system, it will be a mere epiphenomenon that need not concern us. The attentive reader will notice that this is essentially the same argument as the one concerning Philosophical Zombies (or Zimboes, as Dennett calls them [6]): beings that behave exactly like human beings, except that they have no consciousness. Such a being would not be able to really "want," "believe," and "mean" things, but we would probably still be better off using these terms to explain its behavior.

4.2 Understanding and the Intentional Stance

This is what Dennett refers to as the "intentional stance" [7]: we ascribe intentionality to a system (or, more generally, to entities) in order to explain and predict its behavior. For very basic entities, such as a piece of wood, it is normally sufficient to ascribe physical properties to it in order to explain its behavior (it has a certain size and weight, and will splinter if hit by a sharp object). For slightly more complex entities, such as a chainsaw, we also ascribe functions that explain its expected behavior (if we pull the starter cord, the chain will start revolving along the blade, and if we put it against a piece of wood, it will saw through the wood). For even more complex entities, such as animals and human beings, physical properties and functional features are normally not enough to explain and predict their behavior. We also need to invoke intentionality — i.e. mental capacities — in order to fully describe them. Note that we occasionally do this also with inanimate objects, in particular when their behavior starts to deviate from the expected functions: "the chainsaw doesn't want to start!" We are in such cases not suggesting that the chainsaw suddenly has become conscious in the same way as humans are conscious; it is simply more convenient to adopt an intentional stance in the absence of simpler functional explanations. It would probably be possible to provide purely functional, perhaps even mechanistic explanations at some very basic neurophysiological level for every action that an animal or human makes, but it would be quite cumbersome. The intentional stance is by far the more effective perspective.
Dennett's point is that consciousness is not an extra ingredient in addition to the complexity of a system: consciousness is the complexity of the system. Our point is that understanding is also not an extra ingredient of a symbol manipulation system: "understanding" is a term we use to describe the complexity of such a system. Let us take a well-defined and clearly delimited possible use case for future androids as an example: a robotic doorman. Imagine that our doorman is equipped with a future version of a pre-trained Transformer language model trained on multimodal data (e.g. both text and images) that has been fine-tuned on a relevant intent recognition task, and implemented in a dialogue engine with an interaction agenda (e.g. to greet, help, and make small talk with the building's residents). Our doorman would probably have a good grasp of conventional as well as referential (or at least visual) meaning, and it would be able to both recognize (a limited set of) intents and act according to its own (limited set of) intents. Imagine further that our doorman is capable of performing a limited set of actions, such as opening the door, carrying a package, and calling for the elevator. In its very limited domain, our android would be more or less functionally indistinguishable from a human doorman. If someone were to approach the entrance carrying a stack of heavy boxes and shouting "door!", our doorman would presumably be able to use the visual cues of the boxes to map the utterance to an intent category such as "request for assistance" rather than "ostensive definition," and would thus be able to select a proper set of actions, such as opening the door or offering to carry the boxes. It seems strange to not use the explanation "the android understood the utterance" in such a case, even if it would in theory be possible to provide a complete functional description of the activations in the doorman's internal model. From our perspective, it would be much stranger to claim that the doorman did not actually understand the utterance, since it lacks a mapping from form to meaning.

4.3 Montology

Our point is that when the behavior of an NLU system becomes sufficiently complex, it will be easier to explain its behavior using intentional terms such as "understanding," than to use a purely functional explanation. We posit that we are not far from a future where we habitually will say that machines and computer systems understand and misunderstand us, and that they have intents, wishes, and even feelings. This does not necessarily mean that they have the same type of understanding, intents, opinions, and feelings as humans do, but that their behavior will be best explained by using such terms. The same situation applies to animals (at least for certain people), and maybe even to plants (with the same caveat). We argue that the question of whether there really is understanding going on, i.e. whether there is also some mapping process executed in addition to the language use or behavior, is redundant in most situations.
We can probably look forward to interesting and challenging ethical and philosophical discussions about such matters in the future, but for most practical purposes, it will be of neither interest nor consequence to question whether an NLU system is a mere symbol manipulator or a "true" understander.

It is important to understand which hill we are currently climbing, and why. As we have argued in this paper, the current hill, on which language models and most current NLU approaches live, is based on distributional sediment, which has amassed as a reaction to a dualistic view that posits understanding and meaning in a mental realm outside of language. The purpose of this hill is not to replicate a human in silico, but to devise computational systems that can manipulate linguistic symbols in a manner similar to humans. This is not the only hill in the NLU landscape — there are hills based on logical formalisms, and hills based on knowledge engineering — but based on the empirical evidence we currently have, the distributional hill has so far proven to be by far the most accessible ascent.

It should thus come as no surprise that our answer to the question of whether we are climbing the right hill is a resounding yes. We do, however, agree that we should not only climb the most accessible hills, and that we as a field need to encourage and make space for alternative and complementary approaches that may not have come as far yet. The most likely architecture for future NLU solutions will be a combination of different techniques, originating from different hills of the NLU landscape.

This paper has argued that much of the current debate on language models rests on what we have referred to as the singleton fallacy: the assumption that language, meaning, and understanding are single and uniform phenomena that are unobtainable by (current) language models. By contrast, we have argued that there are many different types of language use, meaning and understanding, and that (current) language models are built with the explicit purpose of acquiring and representing one type of structural understanding of language. We have argued that such structural understanding may cover several different modalities, and as such can handle several different types of meaning. Importantly, we see no theoretical reason why such structural knowledge would be insufficient to count as understanding. On the contrary, we believe that as our language models and NLU systems become simultaneously more proficient and more complex, users will have no choice but to adopt an intentional stance towards these systems, upon which the question of whether there is any "true" understanding in these systems becomes redundant.

We are well aware that the current debate is of mainly philosophical interest, and that the practical relevance of this discussion is small to non-existent. A concrete suggestion to move the discussion forward is to think of ways to verify or falsify the opposing positions. We distinctly feel that the burden of proof lies with the opposition in this case; are there things we cannot do with language unless we have "real" (as opposed to structural) understanding?
Which criteria should we use in order to certify NLU solutions as being "understanders" rather than mere symbol manipulators? Being able to solve our current General Language Understanding Evaluation benchmarks (i.e. GLUE and SuperGLUE) obviously does not seem to be enough for the critics. From the ever-growing literature on "BERTology" [25] we know that there are tasks and linguistic phenomena that current language models handle badly, if at all, and we also know that they sometimes "cheat" when solving certain types of tasks [20]. These are extremely valuable results, which will further the development of language models and other types of NLU solutions. However, these failures do not mean that language models are theoretically incapable of handling these tasks; they only mean that our current models (i.e. current training objectives, architectures, parameter settings, etc.) are incapable of recognizing certain phenomena. Which, we might add, is to be expected, given the comparatively simple training objectives we currently use.
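For concreteness, the "comparatively simple" objectives referred to here are, schematically, the autoregressive and masked language modeling losses (the notation below is ours, added purely for illustration, and is not taken from the works cited above):

\mathcal{L}_{\mathrm{LM}}(\theta) = -\sum_{t} \log p_{\theta}(x_t \mid x_{<t}), \qquad \mathcal{L}_{\mathrm{MLM}}(\theta) = -\sum_{t \in \mathcal{M}} \log p_{\theta}(x_t \mid \mathbf{x}_{\setminus \mathcal{M}})

where x is a token sequence, \mathcal{M} is the set of masked positions, and \theta are the model parameters. Both objectives reward nothing over and above the prediction of distributional regularities in the observed signal.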
References

10. Gastaldi, J.L.: Why can computers understand natural language? The structuralist image of language behind word embeddings. Philosophy & Technology (2020)
11. Harris, Z.: Distributional structure. Word 10(2–3), 146–162 (1954)
26. Sahlgren, M.: The distributional hypothesis. Italian Journal of Linguistics 20(1), 33–54 (2008)
27. Searle, J.R.: Minds, brains, and programs. Behavioral and Brain Sciences 3(3), 417–424 (1980)