The Pragmatics of Indirect Commands in Collaborative Discourse
Matthew Lamm ∗ Stanford Linguistics, Stanford NLP Group [email protected]
Mihail Eric ∗ Stanford Computer Science, Stanford NLP Group [email protected]
Abstract
Today’s artificial assistants are typically prompted to perform tasks through direct, imperative commands such as “Set a timer” or “Pick up the box.” However, to progress toward more natural exchanges between humans and these assistants, it is important to understand the way non-imperative utterances can indirectly elicit action of an addressee. In this paper, we investigate command types in the setting of a grounded, collaborative game. We focus on a less understood family of utterances for eliciting agent action, locatives like “The chair is in the other room,” and demonstrate how these utterances indirectly command in specific game state contexts. Our work shows that models with domain-specific grounding can effectively realize the pragmatic reasoning that is necessary for more robust natural language interaction.
A major goal of computational linguistics research is to enable organic, language-mediated interaction between humans and artificial agents. In a common scenario of such interaction, a human issues a command in the imperative mood, e.g. “Put that there” or “Pick up the box,” and a robot acts in turn (Bolt, 1980; Tellex et al., 2011; Walter et al., 2015). While this utterance-action paradigm presents its own set of challenges (Tellex et al., 2012), it greatly simplifies the diversity of ways in which natural language can be used to elicit action of an agent, be it human or artificial (Clark, 1996; Portner, 2007; Kaufmann and Schwager, 2009; Condoravdi and Lauer, 2012; Kaufmann, 2016). Most clause types, even vanilla declaratives, instantiate as performative requests in certain contexts (Austin, 1975; Searle, 1989; Perrault and Allen, 1980).

In this work, we employ machine learning to study the use of performative commands in the Cards corpus, a set of transcripts from a web-based game that is designed to elicit a high degree of linguistic and strategic collaboration (Djalali et al., 2011, 2012; Potts, 2012). For example, players are tasked with navigating a maze-like gameboard in search of six cards of the same suit, but since a player can hold at most three cards at a time, they must coordinate their efforts to win the game.

We focus on a subclass of performative commands that are ubiquitous in the Cards corpus: non-agentive declaratives about the locations of objects, e.g. “The five of hearts is in the top left corner,” hereafter referred to as locatives. Although their semantics makes no reference to either an agent or an action, which distinguishes them from conventional imperatives (Condoravdi and Lauer, 2012), locatives can be interpreted as commands when embedded in particular discourse contexts. In the Cards game, it is frequently the case that an addressee will respond to such an utterance by fetching the card mentioned.

Following work on the context-driven interpretation of declaratives as questions (Beun, 2000), we hypothesize that the illocutionary effect of a locative utterance is a function of contextual features that variably constrain the actions of discourse participants.
To test this idea, we identify a set of 94 locative utterances in the Cards transcripts that we deem to be truly ambiguous, out of context, between informative and command readings. We then annotate their respective transcripts for a simplified representation of the tabular common ground model of Malamud and Stephenson (2015). Here, we identify the common ground with the state of a game as reflected by the utterances made by both players up to a specific point in time. Finally, we train machine learning classifiers on features of the common ground to predict whether or not the addressee will act in response to the utterance in question. Through these experiments we discover a few very powerful contextual features that predict when a locative utterance will be interpreted as a command.

∗ Authors contributed equally.

The subject of indirect commands, of which the locative utterances we study are an example, has been extensively analyzed in terms of speech act and decision theory (Austin, 1975; Clark, 1979; Perrault and Allen, 1980; Allen and Perrault, 1980; Searle, 1989). In Portner’s (2007) formal model, imperatives are utterances whose conventional effect updates an abstract “to-do list” of an addressee. More recent debate has asked whether this effect is in fact built into the semantics of imperatives, or if their directive force is resolved by pragmatic reasoning in context (Condoravdi and Lauer, 2012; Kaufmann, 2016).

The present work synthesizes these intuitions from the theory of commands with recent computational work on natural language pragmatics (Vogel et al., 2013, 2014; Degen et al., 2013) and collaborative dialogue (Chai et al., 2014). We are particularly influenced by previous work demonstrating the complexity of pragmatic phenomena in the Cards corpus (Djalali et al., 2011, 2012; Potts, 2012).
The Cards corpus is a set of 1,266 transcripts from a two-player, collaborative, web-based game. The Cards corpus is well-suited to studying the pragmatics of commands because it records both the utterances made and the actions taken during the course of a game.

At the start of the game, player locations are randomly initialized on a two-dimensional, grid-style game board. Cards from a conventional 52-card deck are scattered randomly throughout the board. Players are prompted with the following task:

    Gather six consecutive cards of a particular suit (decide which suit together), or determine that this is impossible. Each of you can hold only three cards at a time, so you’ll have to coordinate your efforts. You can talk all you want, but you can make only a limited number of moves.

In addition to the fact that players can only hold three cards at a time, the game is further constrained in ways that stimulate highly collaborative talk. In particular, while players can see their own location, they cannot see the locations of their partners and so must inquire about them. Players can only see cards within a small neighborhood around their respective locations, and so must explore the board to find relevant cards. Moreover, while some walls are visible, others are invisible and so lead to surprise perturbations in the course of exploring the gameboard.
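The win condition in the task prompt can be written down compactly. The following is a minimal sketch and not code from the Cards corpus: the `(rank, suit)` card encoding, the function name `is_winning`, and the assumption that aces are low with no wrap-around from king to ace are all illustrative choices that the prompt itself does not specify.

```python
from typing import Iterable, Tuple

# Hypothetical encoding: (rank, suit), e.g. (5, "H") for the five of hearts.
# Assumption: ranks run from 1 (ace) to 13 (king) with no wrap-around.
Card = Tuple[int, str]

HAND_LIMIT = 3   # each player can hold at most three cards
RUN_LENGTH = 6   # a win needs six consecutive cards of one suit


def is_winning(hand_a: Iterable[Card], hand_b: Iterable[Card]) -> bool:
    """Return True if the two hands together form six consecutive cards of one suit."""
    hand_a, hand_b = list(hand_a), list(hand_b)
    if len(hand_a) > HAND_LIMIT or len(hand_b) > HAND_LIMIT:
        raise ValueError("a player can hold at most three cards")
    cards = set(hand_a) | set(hand_b)
    if len(cards) != RUN_LENGTH:
        return False
    suits = {suit for _, suit in cards}
    ranks = sorted(rank for rank, _ in cards)
    # One suit only, and the sorted ranks must be an unbroken run.
    return len(suits) == 1 and ranks == list(range(ranks[0], ranks[0] + RUN_LENGTH))
```

Because six cards are needed and each hand is capped at three, neither player can satisfy the condition alone, which is what forces the coordination described above.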
Commands in the Cards corpus can be coarsely divided into ones which make reference to an action with a second person agent, and those which do not.

The first of these categories comprises imperatives and a variety of so-called performative commands: utterances which act as commands in context but whose clause type is not conventionally associated with the effect of commanding (Clark, 1979; Searle, 1989; Wierzbicka, 1991). For example, with respect to picking up cards:

    pick it up!
    pick up the 9
    or hell, grab the 234 of D
    ok when you get here pick up the 8H
    i think you should pick up the 3h, 5h, and 8h
    so if you can find 5S,6S,7S that would be great

With respect to dropping (or not dropping) cards:

    drop the 2
    keepp the 3
    no dont drop it.)
    ok drop the 7 i guess
    get rid of 6d. i found 7h
    so if you come and drop the 8h and pick up the 6h we are good

With respect to conversational actions (some of these utterances are shortened for clarity):

    tell me where it is
    talk to me dude [...]
    tell me if you see 5 or 6
    don’t just say “a lot of cards” [...]
    awesome let me know once you have it.

Imperatives and performatives that mention agents contrast with the lesser understood subclass of performative commands that are the focus of this work.
Utterances like “The five of hearts is in the top corner” do not even encode an action with respect to the object mentioned, let alone an agent, but can nevertheless be used to elicit action of an addressee in certain contexts.

As a motivating example, consider the following exchange between two players describing their respective hands:

    P1: 3h, 4h and ks
    P2: i have a queen of diamonds and ace of club
    P2: we have a mess lol

Despite Player 2’s concerns, a strategy emerges shortly thereafter when Player 1 finds an additional hearts card:

    P1: i have 3h,4h,6h
    P2: ok so we need to collect hearts then

At this point in the transcript, all that has been committed to the common ground is that Player 1 has a full hand of three proximal hearts cards that could be relevant to a winning strategy, and Player 2 has two non-hearts cards. This is the very next utterance in the exchange:

    P1: there is a 5h in the very top left corner

Player 2 is seen immediately hereafter to navigate to the top left corner, pick up the five of hearts, and confirm:

    P2: ok i got it :)

In this exchange, Player 2 appears to understand not only that the five of hearts is relevant to the winning strategy of six consecutive hearts, but also that it makes more sense for her to act on information about its location than it does for Player 1 to do so.

(a) pickup (b) drop (c) conversation (d) search
Figure 1: For each action, the distribution of command types.

This discourse encapsulates the collaborative reasoning pattern described by Perrault and Allen (1980). The speaker assumes that the addressee is a cooperative agent. Thus, the addressee shares in the goal of attaining a winning game state and will act in a way so as to realize that goal. Recognizing the fact that the speaker would have to drop cards relevant to the goal to pick up the card at issue, a cooperative addressee will infer that she should act by picking it up instead. In this way, locative utterances can be indirectly used as commands.

The distribution of command types for a subset of actions (pickup, drop, conversation, and search) is displayed in Figure 1. As depicted, for the majority of actions compelled by a speaker and taken by an addressee, the imperative is the predominant command strategy, followed by non-locative performatives. However, locative commands appear to be the dominant strategy for eliciting card pickups in the corpus, constituting nearly half of all such commands observed.

This pattern demonstrates that for certain kinds of actions, it is quite natural to use the least direct, most context-dependent command strategy to elicit action of an addressee.

We seek to understand how the discourse context of a game can influence the interpretation of locative utterances as commands. We therefore construct a binary classification task whereby we test how the role of a locative utterance can be resolved in context, evidenced by the actions that are taken as follow-ups to the utterance. In our task, one label denotes addressee follow-up in the form of acting to pick up the card in question, signaling her intention to act, or asking a clarifying question about its whereabouts. The second label denotes that either the speaker acts on their own utterance or neither agent does.
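The two labels of the classification task can be illustrated with a small labeling function. This is only a sketch of the scheme as described in the text: the function name, the player identifiers, and the action strings `"pickup"`, `"signal_intent"`, and `"clarifying_question"` are hypothetical and do not come from the corpus annotations.

```python
from typing import Optional

# Hypothetical names for the three kinds of addressee follow-up the text lists.
ADDRESSEE_FOLLOW_UPS = {"pickup", "signal_intent", "clarifying_question"}


def follow_up_label(actor: Optional[str], action: Optional[str], addressee: str) -> int:
    """Return 1 for addressee follow-up, 0 if the speaker acts or nobody does.

    `actor` and `action` describe the first relevant follow-up to the locative
    utterance; both are None when neither player follows up.
    """
    if actor == addressee and action in ADDRESSEE_FOLLOW_UPS:
        return 1
    return 0
```

For the motivating exchange, where Player 2 is the addressee and picks up the five of hearts, `follow_up_label("P2", "pickup", "P2")` gives the positive label.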
Using a random sample of 200 transcripts from the corpus, we identify instances where a locative utterance is made and we annotate the common ground up to this utterance. This yields 55 distinct transcripts containing 94 utterances with this particular phenomenon.

Our common ground annotations include the following information about the game state as indicated by players’ utterances: cards in the players’ hands, player location, known information about the existence or location of cards, strategic statements made by players about needed cards, and whether a player is able to act with respect to an at-issue card.
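One way to picture such an annotation is as a small record per utterance. The sketch below is an assumption about how the listed fields might be organized, not the annotation format actually used; every class and field name is hypothetical, and the example values are taken from the motivating exchange above.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set, Tuple

Card = Tuple[int, str]  # hypothetical encoding: (rank, suit), ace = 1, queen = 12


@dataclass
class PlayerState:
    hand: Set[Card] = field(default_factory=set)          # cards the player has said they hold
    location: Optional[str] = None                        # stated location, if any
    needed_cards: Set[Card] = field(default_factory=set)  # cards the player says they need
    can_act: bool = True                                  # able to act on the at-issue card


@dataclass
class CommonGround:
    players: Dict[str, PlayerState]
    known_card_locations: Dict[Card, str] = field(default_factory=dict)
    agreed_suit: Optional[str] = None                     # suit strategy agreed in dialogue


def motivating_example() -> CommonGround:
    """Common ground just after 'there is a 5h in the very top left corner'."""
    return CommonGround(
        players={
            # P1 holds three cards, so picking up another would require a drop first.
            "P1": PlayerState(hand={(3, "H"), (4, "H"), (6, "H")}, can_act=False),
            "P2": PlayerState(hand={(12, "D"), (1, "C")}),
        },
        known_card_locations={(5, "H"): "top left corner"},
        agreed_suit="H",
    )
```

Keeping only what the players have said, rather than the true board state, matches the identification of the common ground with the state of the game as reflected in the utterances.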
Our aim in devising this task is to investigate the connection between common ground knowledge and the illocutionary effects of locative utterances. We therefore train a standard logistic regression classifier and experiment with a few carefully designed features that encode constraints on player action, and which should hypothetically trigger the interpretation of locative utterances as indirect commands. We experiment with the following features:

• Edit Distance: We use the minimal number of edits for an optimal solution as a feature. Given the cards in the players’ hands at a given point in the game, we can define an optimal solution based on the number of edits that must be made to the hands to achieve that optimal solution. An edit is defined as either picking up or dropping a card, and each such action has a cost of 1. An optimal solution is defined as the one that requires the minimal number of edits given the current hands. For example, if player 1 has a 2H, 3H, and 4H and player 2 has a 6H and a 7H, the optimal solution is the 2H, 3H, 4H, 5H, 6H, and 7H. Such a solution requires a single edit because player 2 simply has to pick up a 5H. This feature seeks to capture the intuition that an addressee should tend to act with respect to a card when the edit distance is not particularly high and hence the game is near a winning state.

• Explicit Goal: This binary feature is triggered in two cases: 1) when the suit of the card mentioned matches the agreed-upon suit strategy in the common ground, and 2) when the card mentioned appears in the set of cards the addressee claims to need. This models the prediction that locative utterances are more likely to be indirect commands when they are relevant to a well-defined goal.

• Full Hands: This binary feature is triggered when the speaker has three cards of the same suit as the card mentioned, and which are associated with some winning six-card straight, but the addressee does not. This models the prediction that locative utterances are likely to be indirect commands when they provide information relevant to winning, but only the addressee can act on it.

Model                         F1
Random
Bigram
Explicit Goal + Full Hands    77.7

Table 1: F1 performance as reported on the test set. Note our baselines are italicized.

Single-feature classifiers are compared against a number of baselines to help benchmark our predictive task.
Our first baseline, which is context-agnostic, seeks to capture the intuition that the role of a locative utterance is entirely ambiguous when considered in isolation. This baseline predicts the agent follow-up using a Bernoulli distribution weighted according to the class priors of the training data.

The second baseline incorporates surface-level dialogue context via bigram features of all the utterances exchanged between players up to and including the locative utterance. We also experimented with a unigram baseline but found that its performance was inferior to that of the bigram.
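The three common-ground features described above can be sketched in a few lines. This is one possible reading of the definitions, not the authors' implementation. The `(rank, suit)` encoding, all function names, and the assumption that aces are low with no wrap-around are illustrative; in particular, `full_hands` interprets "associated with some winning six-card straight" as meaning that the three cards and the mentioned card fit inside one six-card run.

```python
from typing import Iterable, Optional, Set, Tuple

Card = Tuple[int, str]  # hypothetical encoding: (rank, suit), ranks 1..13
SUITS = ("H", "D", "C", "S")
RUN_LENGTH = 6
HAND_LIMIT = 3


def edit_distance(hand_a: Iterable[Card], hand_b: Iterable[Card]) -> int:
    """Fewest pickups and drops (cost 1 each) needed to reach some six-card run."""
    held = set(hand_a) | set(hand_b)
    best = None
    for suit in SUITS:
        for start in range(1, 13 - RUN_LENGTH + 2):  # run starts at ranks 1..8
            target = {(rank, suit) for rank in range(start, start + RUN_LENGTH)}
            # Held cards outside the run are dropped; missing run cards are picked up.
            cost = len(held - target) + len(target - held)
            best = cost if best is None else min(best, cost)
    return best


def explicit_goal(mentioned: Card, agreed_suit: Optional[str],
                  addressee_needs: Set[Card]) -> int:
    """1 if the mentioned card matches the agreed suit or a card the addressee needs."""
    suit_match = agreed_suit is not None and mentioned[1] == agreed_suit
    return int(suit_match or mentioned in addressee_needs)


def _has_full_relevant_hand(hand: Iterable[Card], mentioned: Card) -> bool:
    rank, suit = mentioned
    same_suit = [r for r, s in hand if s == suit]
    if len(same_suit) < HAND_LIMIT:
        return False
    ranks = same_suit + [rank]
    # All four ranks fit in one six-card run when they span fewer than six ranks.
    return max(ranks) - min(ranks) < RUN_LENGTH


def full_hands(speaker_hand: Iterable[Card], addressee_hand: Iterable[Card],
               mentioned: Card) -> int:
    """1 if the speaker, but not the addressee, holds three relevant same-suit cards."""
    return int(_has_full_relevant_hand(speaker_hand, mentioned)
               and not _has_full_relevant_hand(addressee_hand, mentioned))
```

On the worked example from the Edit Distance definition, `edit_distance([(2, "H"), (3, "H"), (4, "H")], [(6, "H"), (7, "H")])` evaluates to 1, since the only missing card is the 5H. On the motivating exchange, where the speaker holds 3h, 4h, and 6h and mentions the 5h, both binary features fire.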
We test our common-ground features one at a time with our logistic regression model, as we are interested in seeing how successfully they encode agents’ pragmatic inferences. We also combine the two best-performing common-ground features. We report the results of our experiments using an F1 measure and a 0.8/0.2 train/test split of our data in Table 1.

We see that of our two baselines, the bigram model performs better. This bigram model also uses 2,916 distinct lexical features, which makes it a highly overspecified model for our moderate data size.

We find that our single-feature context-sensitive models both significantly outperform our baselines. Our Explicit Goal feature outperforms the Edit Distance feature by about 14%, which indicates that locative utterances are often interpreted as commands in the presence of an explicit, common goal. The Full Hands feature outperforms the Explicit Goal feature by about 6%. This strongly suggests that constraints on speaker action play a role in determining the illocutionary effect of a locative utterance. An addressee of such an utterance will tend to act accordingly when their partner cannot pick up the card mentioned, and when the card in question brings them closer to winning the game. We find that combining the Explicit Goal and Full Hands features improved performance over only using the Explicit Goal feature but reduced performance relative to the best single feature. This could be because the two features encode some common information about the agents’ pragmatic implicatures during the game, and hence their correlative effects tend to degrade the combined model performance.
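The baselines and evaluation setup described above can be sketched with the standard library alone. This is an illustration under stated assumptions rather than the original experimental code: whitespace tokenization, the fixed random seeds, the rounding used for the split size, and scoring F1 on the positive class only are all choices the text does not specify, and every function name is hypothetical.

```python
import random
from collections import Counter
from typing import List, Sequence, Tuple


def bigram_features(utterances: Sequence[str]) -> Counter:
    """Count word bigrams over the utterances up to and including the locative."""
    counts: Counter = Counter()
    for utterance in utterances:
        tokens = utterance.lower().split()  # assumption: whitespace tokenization
        counts.update(zip(tokens, tokens[1:]))
    return counts


def train_test_split(items: Sequence, test_fraction: float = 0.2,
                     seed: int = 0) -> Tuple[List, List]:
    """Shuffle and split into (train, test), e.g. 0.8/0.2."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def prior_baseline(train_labels: Sequence[int], n_test: int, seed: int = 0) -> List[int]:
    """Context-agnostic baseline: Bernoulli draws weighted by the training class prior."""
    rng = random.Random(seed)
    p_positive = sum(train_labels) / len(train_labels)
    return [1 if rng.random() < p_positive else 0 for _ in range(n_test)]


def f1_score(gold: Sequence[int], predicted: Sequence[int]) -> float:
    """F1 on the positive class (assumption: label 1 is addressee follow-up)."""
    tp = sum(1 for g, p in zip(gold, predicted) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, predicted) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, predicted) if g == 1 and p == 0)
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0
```

With 94 annotated utterances, this split leaves 75 examples for training and 19 for testing, so each test example moves the score noticeably; that is worth keeping in mind when comparing models whose F1 differs by a few points.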
In this work, we have performed an extensive study of command types as present in the Cards corpus. Using the corpus as a test bed for grounded natural language interaction among agents with a shared goal, we describe a variety of utterances that may function as indirect commands when regarded in context. In particular, locative utterances, which are not conventionally associated with command interpretations, are shown to operate as commands when considered in relation to situational constraints in the course of collaborative interaction. We develop a predictive task to show that models with carefully designed features incorporating game state information can help agents effectively perform such pragmatic inferences.
The authors would like to thank Christopher Potts and all of the anonymous reviewers for their valuable insights and feedback.
References
Allen, J. F. and C. R. Perrault (1980). Analyzing intention in utterances. Artificial intelligence 15 (3), 143–178.
Austin, J. L. (1975). How to do things with words. Oxford University Press.
Beun, R.-J. (2000). Context and form: Declarative or interrogative, that is the question. Abduction, Belief, and Context in Dialogue: Studies in Computational Pragmatics 1, 311–325.
Bolt, R. A. (1980). Put-that-there: Voice and gesture at the graphics interface, Volume 14. ACM.
Chai, J. Y., L. She, R. Fang, S. Ottarson, C. Littley, C. Liu, and K. Hanson (2014). Collaborative effort towards common ground in situated human-robot dialogue. In Proceedings of the 2014 ACM/IEEE international conference on Human-robot interaction, pp. 33–40. ACM.
Clark, H. H. (1979). Responding to indirect speech acts. Cognitive psychology 11 (4), 430–477.
Clark, H. H. (1996). Using language. Cambridge: Cambridge University Press.
Condoravdi, C. and S. Lauer (2012). Imperatives: Meaning and illocutionary force. Empirical issues in syntax and semantics 9, 37–58.
Degen, J., M. Franke, and G. Jäger (2013). Cost-based pragmatic inference about referential expressions. In CogSci.
Djalali, A., D. Clausen, S. Lauer, K. Schultz, and C. Potts (2011, November). Modeling expert effects and common ground using Questions Under Discussion. In Proceedings of the AAAI Workshop on Building Representations of Common Ground with Intelligent Agents, Washington, DC. Association for the Advancement of Artificial Intelligence.
Djalali, A., S. Lauer, and C. Potts (2012). Corpus evidence for preference-driven interpretation. In M. Aloni, V. Kimmelman, F. Roelofsen, G. W. Sassoon, K. Schulz, and M. Westera (Eds.), Proceedings of the 18th Amsterdam Colloquium: Revised Selected Papers, Berlin, pp. 150–159. Springer.
Kaufmann, M. (2016). Fine-tuning natural language imperatives. Journal of Logic and Computation, exw009.
Kaufmann, S. and M. Schwager (2009). A unified analysis of conditional imperatives. In Semantics and Linguistic Theory, Volume 19, pp. 239–256.
Malamud, S. A. and T. Stephenson (2015). Three ways to avoid commitments: Declarative force modifiers in the conversational scoreboard. Journal of Semantics 32 (2), 275–311.
Perrault, C. R. and J. F. Allen (1980). A plan-based analysis of indirect speech acts. Computational Linguistics 6 (3-4), 167–182.
Portner, P. (2007). Imperatives and modals. Natural Language Semantics 15 (4), 351–383.
Potts, C. (2012). Goal-driven answers in the Cards dialogue corpus. In N. Arnett and R. Bennett (Eds.), Proceedings of the 30th West Coast Conference on Formal Linguistics, Somerville, MA, pp. 1–20. Cascadilla Press.
Searle, J. R. (1989). How performatives work. Linguistics and philosophy 12 (5), 535–558.
Tellex, S., P. Thaker, R. Deits, T. Kollar, and N. Roy (2012). Toward information theoretic human-robot dialog. In Robotics: Science and Systems, Volume 2, pp. 3.
Tellex, S. A., T. F. Kollar, S. R. Dickerson, M. R. Walter, A. Banerjee, S. Teller, and N. Roy (2011). Understanding natural language commands for robotic navigation and mobile manipulation.
Vogel, A., A. Gómez Emilsson, M. C. Frank, D. Jurafsky, and C. Potts (2014, July). Learning to reason pragmatically with cognitive limitations. In Proceedings of the 36th Annual Meeting of the Cognitive Science Society, Wheat Ridge, CO, pp. 3055–3060. Cognitive Science Society.
Vogel, A., C. Potts, and D. Jurafsky (2013, August). Implicatures and nested beliefs in approximate Decentralized-POMDPs. In Proceedings of the 2013 Annual Conference of the Association for Computational Linguistics, Stroudsburg, PA, pp. 74–80. Association for Computational Linguistics.
Walter, M. R., M. Antone, E. Chuangsuwanich, A. Correa, R. Davis, L. Fletcher, E. Frazzoli, Y. Friedman, J. Glass, and J. P. How (2015). A situationally aware voice-commandable robotic forklift working alongside people in unstructured outdoor environments. Journal of Field Robotics 32 (4), 590–628.
Wierzbicka, A. (1991).