Planning, Inference and Pragmatics in Sequential Language Games
Fereshte Khani
Stanford University [email protected]
Noah D. Goodman
Stanford University [email protected]
Percy Liang
Stanford University [email protected]
Abstract
We study sequential language games in which two players, each with private information, communicate to achieve a common goal. In such games, a successful player must (i) infer the partner's private information from the partner's messages, (ii) generate messages that are most likely to help with the goal, and (iii) reason pragmatically about the partner's strategy. We propose a model that captures all three characteristics and demonstrate their importance in capturing human behavior on a new goal-oriented dataset we collected using crowdsourcing.
Human communication is extraordinarily rich. People routinely choose what to say based on their goals (planning), figure out the state of the world based on what others say (inference), all while taking into account that others are strategizing agents too (pragmatics). All three aspects have been studied in both the linguistics and AI communities. For planning, Markov Decision Processes and their extensions can be used to compute utility-maximizing actions via forward-looking recurrences (e.g., Vogel et al. (2013a)). For inference, model-theoretic semantics (Montague, 1973) provides a mechanism for utterances to constrain possible worlds, and this has been implemented recently in semantic parsing (Matuszek et al., 2012; Krishnamurthy and Kollar, 2013). Finally, for pragmatics, the cooperative principle of Grice (1975) can be realized by models in which a speaker simulates a listener, e.g., Franke (2009) and Frank and Goodman (2012).
[Figure 1 graphic: "Find B2". P_letter's view (letters only: B, B, C) and P_digit's view (digits only: 2, 3, 2) of the same objects, with thought clouds. Planning: "Let me first try square, which is just one possibility." Inference: "The square's letter must be B." Pragmatics: "The square's digit cannot be 2."]
Figure 1: A game of InfoJigsaw played by two human players. One of the players (P_letter) only sees the letters, while the other one (P_digit) only sees the digits. Their goal is to identify the goal object, B2, by exchanging a few words. The clouds show the hypothesized role of planning, inference, and pragmatics in the players' choice of utterances. In this game, the bottom object is the goal (position (1,3)).

There have been a few previous efforts in the language games literature to combine the three aspects. Hawkins et al. (2015) proposed a model of communication between a questioner and an answerer based on only one round of question answering. Vogel et al. (2013b) proposed a model of two agents playing a restricted version of the game from the Cards Corpus (Potts, 2012), where the agents only communicate once. [Footnote: Specifically, two agents must both co-locate with a specific card. The agent which finds the card sooner shares the card location information with the other agent.] In this work, we seek to capture all three aspects in a single, unified framework which allows for multiple rounds of communication.

Specifically, we study human communication in a sequential language game in which two players, each with private knowledge, try to achieve a common goal by talking. We created a particular sequential language game called InfoJigsaw (Figure 1). In InfoJigsaw, there is a set of objects with public properties (shape, color, position) and private properties (digit, letter). One player (P_letter) can only see the letters, while the other player (P_digit) can only see the digits. The two players wish to identify the goal object, which is uniquely defined by a letter and a digit. To do this, the players take turns talking; to encourage strategic language, we allow at most two English words at a time.
At any point, a player can end the game by choosing an object.

Even in this relatively constrained game, we can see the three aspects of communication at work. As Figure 1 shows, in the first turn, since P_letter knows that the game is multi-turn, she simply says square, for if the other player does not click on the square, she can try the bottom circle in the next turn (planning). In the second turn, P_digit infers from square that the square's letter is probably B (inference). As the digit on the square is not a 2, she says circle. Finally, P_letter infers that the digits of the circles are 2, and in addition she infers from circle that the digit on the square is not a 2, as otherwise P_digit would have clicked on it (pragmatics). Therefore, she correctly clicks on (1,3).

In this paper, we propose a model that captures planning, inference, and pragmatics for sequential language games, which we call PIP. Planning recurrences look forward, inference recurrences look back, and pragmatics recurrences look to simpler interlocutors' models. The principal challenge is to integrate all three types in a coherent way; we present a "two-dimensional" system of recurrences to capture this. Our recurrences bottom out in a very simple literal semantics (e.g., the context-independent meaning of circle), and we rely on the structure of the recurrences to endow words with their rich context-dependent meaning. As a result, our model is very parsimonious and has only four (hyper)parameters.

As our interest is in modeling human communication in sequential language games, we evaluate PIP on its ability to predict how humans play InfoJigsaw. We paired up workers on Amazon Mechanical Turk to play InfoJigsaw, and collected a total of 1680 games.
Our findings are as follows: (i) PIP obtains higher log-likelihood than a baseline that chooses actions to convey maximum information in each round; (ii) PIP obtains higher log-likelihood than ablations that remove the pragmatics or the planning components, supporting their importance in communication; (iii) PIP is better than an ablation with a truncated inference component that forgets the distant past only for longer games, but worse for shorter games. The overall conclusion is that by combining a very simple, context-independent literal semantics with an explicit model of planning, inference, and pragmatics, PIP obtains rich context-dependent meanings that correlate with human behavior.
In a sequential language game, there are two players who have a shared world state $w$. In addition, each player $j \in \{+1, -1\}$ has a private state $s_j$. At each time step $t = 1, 2, \ldots$, the active player $j(t) = 2(t \bmod 2) - 1$ (which alternates) chooses an action (including speaking) $a_t$ based on its policy $\pi_{j(t)}(a_t \mid w, s_{j(t)}, a_{1:t-1})$. Importantly, player $j(t)$ can see her own private state $s_{j(t)}$, but not the partner's $s_{-j(t)}$. At the end of the game (defined by a terminating action), both players receive utility $U(w, s_{+1}, s_{-1}, a_{1:t}) \in \mathbb{R}$. The utility consists of a reward if the players reached the goal, a penalty if they did not, along with a penalty for each action. Because the players have a common utility function that depends on private information, they must communicate the part of their private information relevant for maximizing utility. In order to simplify notation, we use $j$ to represent $j(t)$ in the rest of the paper.

InfoJigsaw.
In InfoJigsaw (see Figure 1 for an example), two players try to identify a goal object, but each only has partial information about its identity. Thus, in order to solve the task, they must communicate, piecing their information together like a jigsaw puzzle. [Footnote: One could in principle solve for an optimal communication strategy for InfoJigsaw, but this would likely result in a solution far from human communication.] Figure 2 shows the interface that humans use to play the game.

Figure 2: Chat interface that Amazon Mechanical Turk (AMT) workers use to play InfoJigsaw: (a) P_digit view, (b) P_letter view (for readability, objects with the goal digit/letter are bolded).

More formally, the shared world state $w$ includes the public properties of a set of objects: position on an $m \times n$ grid, color (blue, yellow, green), and shape (square, diamond, circle). In addition, $w$ contains the letter and digit of the goal object (e.g., B2). The private state of player P_digit is a digit (e.g., 1, 2, 3) for each object, and the private state of player P_letter is a letter (e.g., A, B, C) for each object. These states are $s_{+1}$ and $s_{-1}$ depending on which player goes first.

On each turn $t$, player $j(t)$'s action $a_t$ can be either (i) a message containing one or two English words (e.g., circle), or (ii) a click on an object, specified by its position (e.g., (1,3)). A click action terminates the game. If the clicked object is the goal, a green square visible to both players appears around it; if the clicked object is not the goal, a red square appears instead. To discourage random guessing, we prevent players from clicking in the first time step. Players do not see an explicit utility ($U$); however, they are instructed to think strategically and to choose messages that lead to clicking on the correct object while using a minimum number of messages.
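The turn-taking dynamics above can be sketched as a short driver loop. This is a minimal sketch: the policy and utility callables, and the tuple encoding of actions, are illustrative assumptions rather than the paper's implementation.

```python
def play_game(world, private, policies, utility, max_turns=10):
    """Minimal sketch of the sequential language game loop.

    world:    shared world state w, visible to both players
    private:  dict mapping j in {+1, -1} to that player's private state s_j
    policies: hypothetical dict mapping j to (world, s_j, history) -> action
    utility:  hypothetical (world, s_{+1}, s_{-1}, history) -> common payoff
    """
    history = []
    for t in range(1, max_turns + 1):
        j = 2 * (t % 2) - 1              # active player alternates: +1, -1, +1, ...
        action = policies[j](world, private[j], history)
        history.append(action)
        if action[0] == "click":         # a click is a terminating action
            break
    # Common payoff: both players receive the same utility.
    return utility(world, private[+1], private[-1], history)
```

Here actions are encoded as tuples such as `("msg", "square")` or `("click", (1, 3))`; a click ends the game, matching the terminating-action convention above.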
Players can see the number of correct clicks, wrong clicks, and the number of words they have sent to each other so far at the top right of the screen. [Footnote: If the words are not in the English dictionary, the sender receives an error and the message is rejected. This prevents players from circumventing the game rules by connecting multiple words without spaces.]

Table 1: Statistics for all 1680 games and the 1259 games that were kept, in which each message contains at least one of the 12 most frequent words or "yes" or "no".

We would like to study how context-dependent meaning arises out of the interplay between a context-independent literal semantics and context-sensitive planning, inference, and pragmatics. The simplicity of the InfoJigsaw game ensures that this interplay is not obscured by other challenges.
We generated 10 InfoJigsaw scenarios as follows: For each one, we randomly choose the grid dimensions $m \times n$ (which determine the number of possible private states). We randomly choose the properties of all objects and randomly designate one as the goal. We randomly choose either P_letter or P_digit to start the game. Finally, to make the scenarios interesting, we keep a scenario only if it satisfies: (i) only the goal object (and no other object) has the goal combination of letter and digit; (ii) there exist at least two goal-consistent objects for each player, and their total number of goal-consistent objects is at least $m \times n$; and (iii) the goal-consistent objects for each player do not all share the same color, shape, or position (that is, the goal-consistent objects are not all in the left, right, top, bottom, or middle).

We collected a dataset of InfoJigsaw games on Amazon Mechanical Turk using the framework of Hawkins (2015) as follows: 200 pairs of players each played all 10 scenarios in a random order.

Figure 3: Statistics of the collected corpus. (a) Number of exchanged messages per game. (b) Distribution of final game scores. (c) Average score over multiple rounds of game play, which interestingly remains constant. (d) The 12 most frequent words, which make up 73% of all tokens (left, top, square, yellow, bottom, blue, green, right, diamond, circle, middle, not). (e) The 30 most frequent messages, which make up 49% of all messages.
Out of the 200 pairs, 32 pairs left the game prematurely, which results in 168 pairs playing a total of 1680 games. Players performed 4967 actions (messages and clicks) in total and obtained an average score (correct clicks) of 7.5 per game. The average score per scenario varied from 6.4 to 8.2. Interestingly, there is no significant difference in scores across the 10 rounds of game play, suggesting that players do not adapt and become more proficient with more game play (Figure 3c). Figure 3 shows the statistics of the collected corpus. Figure 4 shows one of the games, along with the distribution of messages in the first time step of all games played on this scenario.

To focus on the strategic aspects of InfoJigsaw, we filtered the dataset to reduce the words in the tail. Specifically, we keep a game if all its messages contain at least one of the 12 most frequent words (shown in Figure 3d) or "yes" or "no". For example, in Figure 4, games containing messages such as what color, mid row, and color are filtered out because those messages contain no frequent words. Messages such as middle, either middle, middle maybe, and middle objects are mapped to middle. 1259 of the 1680 games survived. Table 1 compares the statistics of all games and the ones that were kept. Most games that were filtered out contained less frequent synonyms (e.g., round instead of circle). Some questions were filtered out too (e.g., what color). Filtered games are 1.15 times longer on average.

In order to understand the principles behind how humans perform planning, inference, and pragmatics, we aim to develop a parsimonious, interpretable model with few parameters rather than a highly expressive, data-driven model.
Therefore, following the tradition of Rational Speech Acts (RSA) (Frank and Goodman, 2012; Goodman and Frank, 2016), we will define in this section a mapping from each word to its literal semantics, and rely on the PIP recurrences (which we will describe in Section 4) to provide context-dependence. One could also learn the literal semantics by backpropagating through these recurrences, which has been done for simpler RSA models (Monroe and Potts, 2015), or learn the literal semantics from data and then put RSA on top (Andreas et al., 2016); we leave this to future work.

Figure 4: Bottom: one of the games played by Turkers ("Find A1": P_digit: middle; P_letter: yellow circle; P_digit: bottom right; P_letter: click (1,2)). Top: the distribution of utterances on the first message. Players choose to explain their private state in different ways. Some use more general messages (e.g., square diamond), while some use more specific ones (e.g., blue square). The top diagram shows the 20 most frequent messages on the first round (72% of all the messages).

Suppose a player utters the single word circle. There are multiple possible context-dependent interpretations:
• Are any circles goal-consistent?
• All the circles are goal-consistent.
• Some circles but no other objects are goal-consistent.
• Most of the circles are goal-consistent.
• At least one circle is goal-consistent.
We will show that most of these interpretations can arise from a simple fixed semantics: roughly, "some circle is goal-consistent". We will now define a simple literal semantics of message actions such as circle, which forms the base case of PIP. Recall that the shared world state $w$ contains the goal (e.g., B2) and, assuming P_letter goes first, the private state $s_{-1}$ ($s_{+1}$) of player P_letter (P_digit) contains the letter (digit) of each object. For notational simplicity, let us define $s_{-1}$ ($s_{+1}$) to be a matrix corresponding to the spatial locations of the objects, where an entry is 1 if the corresponding object has the goal letter (digit) and 0 otherwise. Thus $s_j$ also represents the set of goal-consistent objects given the private knowledge of that player. Figure 5 shows the private states of the players.

Figure 5: Private states of the players and the meaning of example action sequences, e.g., $\llbracket \text{square} \rrbracket = \{s : s \wedge v_{\text{square}} \neq \mathbf{0}\}$, $\llbracket \text{top bottom} \rrbracket = \{s : s \wedge (v_{\text{top}} \vee v_{\text{bottom}}) \neq \mathbf{0}\}$, and $\llbracket \text{top blue} \rrbracket = \{s : s \wedge (v_{\text{top}} \wedge v_{\text{blue}}) \neq \mathbf{0}\}$.

We define two types of message actions: informative (e.g., blue, top) and verifying (e.g., yes, no). Informative messages have immediate meaning, while verifying messages depend on the previous utterance.

Informative messages.
Informative messages describe constraints on the speaker's private state (which the partner does not know). For a message $a$, define $\llbracket a \rrbracket$ to be the set of consistent private states. For example, $\llbracket \text{bottom} \rrbracket$ is the set of all private states in which there are goal-consistent objects in the bottom row. Formally, for each word $x$ that specifies some object property (e.g., blue, top), define $v_x$ to be an $n \times m$ matrix where an entry is 1 if the corresponding object has property $x$, and 0 otherwise. Then, define the literal semantics of a single-word message $x$ to be $\llbracket x \rrbracket \overset{\text{def}}{=} \{s : s \wedge v_x \neq \mathbf{0}\}$, where $\wedge$ denotes element-wise and and $\mathbf{0}$ denotes the zero matrix. That is, single-property messages can be glossed as "some goal-consistent object has property $x$".

For a two-word message $xy$, we define the literal semantics depending on the relationship between $x$ and $y$. If $x$ and $y$ are mutually exclusive, then we interpret $xy$ as $x$ or $y$ (e.g., square circle); otherwise, we interpret $xy$ as $x$ and $y$ (e.g., blue top). Formally, $\llbracket xy \rrbracket \overset{\text{def}}{=} \{s : s \wedge (v_x \wedge v_y) \neq \mathbf{0}\}$ if $v_x \wedge v_y \neq \mathbf{0}$, and $\{s : s \wedge (v_x \vee v_y) \neq \mathbf{0}\}$ otherwise. See Figure 5 for some examples.

Action sequences.
We now define the literal semantics of an entire action sequence $\llbracket a_{1:t} \rrbracket_j$ with respect to player $j$, which is the set of possible partner private states $s_{-j}$. Intuitively, we want to simply intersect the sets of private states consistent with the informative messages, but we also need to handle verifying messages (yes and no), which are context-dependent. Formally, we say that a private state $s_{-j} \in \llbracket a_{1:t} \rrbracket_j$ if the following hold: for all informative messages $a_i$ uttered by $-j$, $s_{-j} \in \llbracket a_i \rrbracket$; and for all verifying messages $a_i$ uttered by $-j$, if $a_i = \text{yes}$ then $s_{-j} \in \llbracket a_{i-1} \rrbracket$, and if $a_i = \text{no}$ then $s_{-j} \notin \llbracket a_{i-1} \rrbracket$.

Why does P_digit in Figure 1 choose circle rather than top or click(1,2)? Intuitively, when a player chooses an action, she should take into account her previous actions, her partner's actions, and the effect of her action on future turns. She should do all this while reasoning pragmatically about the fact that her partner is also a strategic player.

At a high level, PIP defines a system of recurrences revolving around three concepts, depicted in Figure 6: player $j$'s beliefs over the partner's private state $p^k_j(s_{-j} \mid s_j, a_{1:t})$, her expected utility of the game $V^k_j(s_{+1}, s_{-1}, a_{1:t})$, and her policy $\pi^k_j(a_t \mid s_j, a_{1:t-1})$. Here, $t$ indexes the current time and $k$ indexes the depth of pragmatic recursion, which will be explained in Section 4.3. To simplify notation, we drop $w$ (the shared world state), since everything conditions on it.

Figure 6: PIP is defined via a system of recurrences that simultaneously captures planning, inference, and pragmatics. The arrows show the dependencies between the beliefs $p$, expected utilities $V$, and the policy $\pi$.

From player $j$'s point of view, the purpose of inference is to compute a distribution over the partner's private state $s_{-j}$ given all actions thus far $a_{1:t}$.
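The literal semantics of Section 3 can be made concrete with a short sketch (pure Python; the 0/1 matrices, the message encoding, and the `props` dictionary are illustrative assumptions, not the authors' implementation):

```python
def overlaps(a, b):
    """True iff a ∧ b ≠ 0 for two 0/1 matrices given as lists of lists."""
    return any(x and y for ra, rb in zip(a, b) for x, y in zip(ra, rb))

def elementwise(op, a, b):
    return [[op(x, y) for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def consistent(s, words, props):
    """s ∈ ⟦x⟧ or s ∈ ⟦xy⟧; props maps each word to its property matrix v_x."""
    vs = [props[w] for w in words]
    if len(vs) == 1:
        v = vs[0]
    elif overlaps(vs[0], vs[1]):               # v_x ∧ v_y ≠ 0: interpret as "x and y"
        v = elementwise(lambda x, y: x and y, *vs)
    else:                                      # mutually exclusive: interpret as "x or y"
        v = elementwise(lambda x, y: x or y, *vs)
    return overlaps(s, v)                      # some goal-consistent object matches

def in_sequence(s, actions, partner_idx, props):
    """s ∈ ⟦a_{1:t}⟧_j: check the partner's informative and verifying messages."""
    for i in partner_idx:                      # indices of actions uttered by -j
        kind, words = actions[i]
        if kind == "msg" and not consistent(s, words, props):
            return False
        if kind == "yes" and not consistent(s, actions[i - 1][1], props):
            return False
        if kind == "no" and consistent(s, actions[i - 1][1], props):
            return False
    return True

# Toy 2x2 grid: s marks goal-consistent objects, props holds property matrices.
props = {"square": [[1, 0], [0, 0]], "circle": [[0, 1], [1, 0]], "top": [[1, 1], [0, 0]]}
s = [[1, 0], [0, 1]]
print(consistent(s, ["square"], props))            # True: a square is goal-consistent
print(consistent(s, ["square", "circle"], props))  # True: "square or circle"
print(consistent(s, ["top", "circle"], props))     # False: "top and circle"
```

The same `consistent` predicate serves as the base case that the inference recurrences below filter over.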
We first consider a "level 0" player, who simply assigns a uniform distribution over all states consistent with the literal semantics of $a_{1:t}$, which we defined in Section 3:

$p^0_j(s_{-j} \mid s_j, a_{1:t}) \propto \begin{cases} 1 & s_{-j} \in \llbracket a_{1:t} \rrbracket_j, \\ 0 & \text{otherwise}. \end{cases}$ (1)

For example, Figure 7 shows P_letter's belief about P_digit's private state after observing circle. Remember that we show the private state of a player as a matrix where an entry is 1 if the corresponding object has the goal letter (digit) and 0 otherwise. A player's own private state $s_j$ can also constrain her beliefs about her partner's private state $s_{-j}$. For example, in InfoJigsaw, the active player knows there is a goal, and so we set $p^k_j(s_{-j} \mid s_j, a_{1:t}) = 0$ if $s_{-j} \wedge s_j = \mathbf{0}$.

Figure 7: P_letter's probability distribution over P_digit's private state after P_digit says circle in the game shown in Figure 5.

The purpose of planning is to compute a policy $\pi^k_j$, which specifies a distribution over player $j$'s actions $a_t$ given all past actions $a_{1:t-1}$. To construct the policy, we first define an expected utility $V^k_j$ via the following forward-looking recurrence. When the game is over (e.g., in InfoJigsaw, one player clicks on an object), the expected utility of the dialogue is simply its utility as defined by the game:

$V^k_j(s_{+1}, s_{-1}, a_{1:t}) = U(s_{+1}, s_{-1}, a_{1:t}).$ (2)

Otherwise, we compute the expected utility assuming that in the next turn, player $j$ chooses action $a_{t+1}$ with probability governed by her policy $\pi^k_j(a_{t+1} \mid s_j, a_{1:t})$:

$V^k_j(s_{+1}, s_{-1}, a_{1:t}) = \sum_{a_{t+1}} \pi^k_j(a_{t+1} \mid s_j, a_{1:t}) \, V^k_{-j}(s_{+1}, s_{-1}, a_{1:t+1}).$ (3)

Having defined the expected utility, we now define the policy.
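The policy construction that follows (Eqn. 4) amounts to scoring each candidate action by its expected gain over stopping immediately. A minimal sketch, in which the belief, value, and stop-utility callables are hypothetical placeholders rather than the paper's implementation:

```python
def plan_policy(actions, belief, value, stop_utility, alpha=10):
    """Sketch of Eqn. 4: π^k_j(a_t | s_j, a_{1:t-1}) ∝ max(0, E^k_j)^α.

    actions:      candidate actions a_t
    belief:       dict mapping candidate partner states s_{-j} to
                  p^k_j(s_{-j} | s_j, a_{1:t-1})
    value:        hypothetical (s_{-j}, a_t) -> expected utility V of taking a_t
    stop_utility: hypothetical s_{-j} -> utility U of ending the game now
    """
    scores = {}
    for a in actions:
        # E^k_j: expected gain D^k_j = V - U, marginalized over the belief
        e = sum(p * (value(s, a) - stop_utility(s)) for s, p in belief.items())
        scores[a] = max(0.0, e) ** alpha
    z = sum(scores.values())
    return {a: v / z for a, v in scores.items()}

# Toy example: one hypothesized partner state, two candidate actions.
pi = plan_policy(["circle", "top"], {"s": 1.0},
                 lambda s, a: 10.0 if a == "circle" else 5.0,
                 lambda s: 0.0, alpha=1)
print(pi)
```

With $\alpha = 1$ the toy example divides the probability 2:1 in favor of circle; larger $\alpha$ concentrates the policy on the utility-maximizing action, as described in the text.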
First, let $D^k_j$ be the gain in expected utility $V^k_{-j}(s_{+1}, s_{-1}, a_{1:t})$ over a simple baseline policy that ends the game immediately, yielding utility $U(s_{+1}, s_{-1}, a_{1:t-1})$ (which is simply a penalty for not finding the correct goal and a penalty for each action). Of course, the partner's private state $s_{-j}$ is unknown and must be marginalized out based on player $j$'s beliefs; let $E^k_j$ be the expected gain. Let the probability of an action $a_t$ be proportional to $\max(0, E^k_j)^\alpha$, where $\alpha \in [0, \infty)$ is a hyperparameter that controls the rationality of the agent (a larger $\alpha$ means that the player chooses utility-maximizing actions more aggressively). Formally:

$D^k_j = V^k_{-j}(s_{+1}, s_{-1}, a_{1:t}) - U(s_{+1}, s_{-1}, a_{1:t-1}),$
$E^k_j = \sum_{s_{-j}} p^k_j(s_{-j} \mid s_j, a_{1:t-1}) \, D^k_j,$
$\pi^k_j(a_t \mid s_j, a_{1:t-1}) \propto \max(0, E^k_j)^\alpha.$ (4)

In practice, we use a depth-limited recurrence, where the expected utility is computed assuming that the game will end within $f$ turns and that the last action is a click action (meaning that we only consider action sequences of length at most $f$ whose last action is a click). Figure 8 shows how P_digit computes the expected gain (Eqn. 4) of saying circle.

Figure 8: Planning reasoning for the game in Figure 1 (reproduced here in the bottom right). (a) In order to calculate the expected gain ($E$) of generating circle, for every state $s$, P_digit computes the probability of $s$ being P_letter's private state. (b) She then computes the expected utility ($V$) if she generates circle, assuming P_letter's private state is $s$.

The purpose of pragmatics is to take into account the partner's strategizing. We do this by constructing a level-$k$ player that infers the partner's private state, following the tradition of Rational Speech Acts (RSA) (Frank and Goodman, 2012; Goodman and Frank, 2016).
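The level-$k$ inference defined next (Eqn. 5) is a reweighting of the previous belief by a lower-level speaker model; a minimal sketch, in which the level-$(k-1)$ partner policy is a hypothetical placeholder:

```python
def levelk_belief(prev_belief, last_action, policy_km1):
    """Sketch of Eqn. 5: p^k_j(s_{-j} | s_j, a_{1:t}) ∝
    π^{k-1}_{-j}(a_t | s_{-j}, a_{1:t-1}) * p^k_j(s_{-j} | s_j, a_{1:t-1}).

    prev_belief: dict mapping candidate partner states to probabilities
    policy_km1:  hypothetical level-(k-1) policy, (state, action) -> probability
    """
    unnorm = {s: policy_km1(s, last_action) * p for s, p in prev_belief.items()}
    z = sum(unnorm.values())
    return {s: v / z for s, v in unnorm.items()}

# A listener reweights two equally likely partner states by how likely a
# level-(k-1) speaker in each state would have been to say "circle".
prev = {((1, 0),): 0.5, ((0, 1),): 0.5}
post = levelk_belief(prev, "circle", lambda s, a: 0.8 if s == ((0, 1),) else 0.2)
print(post)
```

States that make the observed action likely under the lower-level speaker gain probability mass, which is exactly the RSA-style reasoning the section describes.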
Recall that a level-0 player $p^0_j$ (Section 4.1) puts a uniform distribution over all the semantically valid private states of the partner. A level-$k$ player assigns probability to the partner's private state proportional to the probability that a level-$(k-1)$ player would have performed the last action $a_t$:

$p^k_j(s_{-j} \mid s_j, a_{1:t}) \propto \pi^{k-1}_{-j}(a_t \mid s_{-j}, a_{1:t-1}) \, p^k_j(s_{-j} \mid s_j, a_{1:t-1}).$ (5)

Figure 9 shows an example of the pragmatic reasoning.

Figure 9: Pragmatic reasoning for the game in Figure 1 (reproduced here in the upper right) at time step 3. Players reason recursively about each other's beliefs: the level-0 player puts a uniform distribution $p^0_j$ over all the states in which at least one circle is goal-consistent, independent of the shared world state and previous actions. The level-1 player assigns probability to the partner's private states $s_{-j}$ proportional to the probability that she would have performed the last action given that state $s_{-j}$. For example, for one candidate state, saying bottom would be more probable (given the shared world state); for another, clicking on the square would be a better option (given the previous actions). But given that P_digit uttered circle, the remaining state is most likely, as reflected by $p^1_j$.

In Section 4.2, we modeled the players as rational agents that choose actions that lead to a higher gain in utility. In the pragmatics section (Section 4.3), we described how a player infers the partner's private state, taking into account that her partner is acting cooperatively. The phenomena that emerge from the combination of the two are the topic of this section.

We first define the belief marginals $B_j$ of a player $j$ to be the marginal probabilities that each object is goal-consistent under the hypothesized partner's private state $s_{-j}$, conditioned on the actions $a_{1:t}$:

$B_j(s_j, a_{1:t}) = \sum_{s_{-j}} p^k_j(s_{-j} \mid s_j, a_{1:t}) \, s_{-j} \in \mathbb{R}^{m \times n}.$ (6)
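The belief marginal in Eqn. 6 is just a probability-weighted sum of the candidate 0/1 state matrices; a minimal sketch (states are encoded as tuples of 0/1 rows, an illustrative assumption):

```python
def belief_marginals(belief):
    """Sketch of Eqn. 6 (B_j): per-cell probability that an object is
    goal-consistent, marginalizing over candidate partner states.
    Keys of `belief` are states as tuples of 0/1 rows; values are probabilities."""
    states = list(belief)
    m, n = len(states[0]), len(states[0][0])
    return [[sum(p * s[u][v] for s, p in belief.items()) for v in range(n)]
            for u in range(m)]

# Two candidate 1x2 partner states with posterior beliefs 0.2 / 0.8.
belief = {((1, 0),): 0.2, ((0, 1),): 0.8}
print(belief_marginals(belief))  # [[0.2, 0.8]]
```

Tracking how this matrix changes after each action gives the effective, context-dependent meaning of that action, as discussed next.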
At time $t = 0$ (before any actions), the belief marginals of both players are $m \times n$ matrices with the same value in every entry. The change in a belief marginal after observing an action $a_t$ gives a sense of the effective (context-dependent) meaning of that action.

We first explain how pragmatics ($k > 0$ in Eqn. 5) leads to rich action meanings. When a player observes her partner's action $a_t$, she assumes this action was chosen because it results in a higher utility than the alternatives. In other words, she infers that her partner's private state cannot be one in which $a_t$ does not lead to high utility. As an example, saying circle instead of top circle or bottom circle implies that there is more than one goal-consistent circle. The pragmatic depth $k$ governs the extent to which this type of reasoning is applied.

Figure 10: Belief marginals of P_digit (Eqn. 6) after observing sequences of actions for different pragmatic depths $k$. (b) Without pragmatics ($k = 0$), P_digit thinks both objects on the right have the same probability of being goal-consistent. With pragmatics ($k = 1$), P_digit thinks that the object in the bottom right is more likely to be goal-consistent.

Recall that in Section 4.2, a player chooses an action conditioned on all previous actions, and the other player takes this context-dependence into account. As an example, Figure 10 shows how right changes its meaning when it follows bottom.

We a priori set the reward of clicking on the goal object to $+100$ and a penalty for clicking on a wrong object. We set the smoothing parameter $\alpha = 10$ and the per-action cost based on the data. The larger the action cost, the fewer messages will be used before selecting an object. Formally, after $k$ actions:

$\text{Utility} = -c \cdot k + \begin{cases} +100 & \text{the goal object is clicked}, \\ -P & \text{otherwise}, \end{cases}$ (7)

where $c$ is the per-action cost and $-P$ is the penalty for a wrong click.

We smoothed all policies by adding a small constant
to the probability of each action and re-normalizing. By default, we set $k = 1$ (pragmatic depth, Eqn. 4). When computing the expected utility (Eqn. 3) of the game, we use a lookahead of $f = 2$. Inference looks back $b$ time steps (i.e., Eqn. 1 and Eqn. 5 are based on $a_{t-b+1:t}$ rather than $a_{1:t}$); we set $b = \infty$ by default.

We implemented two baseline policies:

Random policy: for player $j$, the random policy randomly chooses one of the semantically valid (Section 3) actions with respect to $s_j$ or clicks on a goal-consistent object. Formally, the random policy places a uniform distribution over

$\{a : s_j \in \llbracket a \rrbracket\} \cup \{\text{click}(u, v) : (s_j)_{u,v} = 1\}.$ (8)

Greedy policy: assigns higher probability to the actions that convey more information about the player's private state. We heuristically set the probability of generating an action proportional to how much it shrinks the set of semantically valid states. Formally, for message actions:

$\pi^{\text{msg}}_j(a_t \mid a_{1:t-1}, s_j) \propto |\llbracket a_{1:t-1} \rrbracket_{-j}| - |\llbracket a_{1:t} \rrbracket_{-j}|.$ (9)

For clicking actions, we compute the belief state as explained in Section 4.4. Remember that $B_j(s_j, a_{1:t})_{u,v}$ is the marginal probability of the object in row $u$ and column $v$ being goal-consistent in the partner's private state. Formally, for clicking actions:

$\pi^{\text{click}}_j(\text{click}(u, v) \mid a_{1:t}, s_j) \propto \min((s_j)_{u,v}, B_j(s_j, a_{1:t})_{u,v}).$ (10)

Finally, the greedy policy chooses a click action with probability $\gamma$ and a message action with probability $1 - \gamma$. So that $\gamma$ increases as the player gets more confident about the position of the goal, we set $\gamma$ to be the probability of the most probable position of the goal: $\gamma = \max_{u,v} \pi^{\text{click}}_j(\text{click}(u, v) \mid a_{1:t}, s_j)$.

Figure 11 compares the two baselines with PIP on the task of predicting human behavior as measured by log-likelihood. [Footnote: We bootstrap the data 1000 times and show 90% confidence intervals.]

Figure 11: Average log-likelihood across messages. (a) Performance of PIP and the baselines on all time steps. (b) Performance of PIP and the baselines on only the first time step, along with the ceiling given by the entropy of the human data. The error bars show 90% confidence intervals.

To estimate the best possible (i.e., ceiling) performance, we compute the entropy of the actions on the first time step based on approximately 100 data points per scenario. For each policy, we rank the actions by their probability in decreasing order (actions with the same probability are randomly ordered), and then compute the average ranking across actions according to the different policies; see Figure 13 for the results.

To assess the different components (planning, inference, pragmatics) of PIP, we run PIP, ablating one component at a time from the default setting of $k = 1$, $f = 2$, and $b = \infty$ (see Figure 12).

Pragmatics.
Let PIP-prag be PIP but with a pragmatic depth (Eqn. 4) of $k = 0$ rather than $k = 1$, which means that PIP-prag only draws inferences based on the literal semantics of messages. PIP-prag has a lower average log-likelihood per action than PIP, highlighting the importance of pragmatics in modeling human behavior.

Planning.
Let PIP-plan be PIP, but looking ahead only $f = 1$ step when computing the expected utility (Eqn. 3) rather than $f = 2$. With a shorter future horizon, PIP-plan tries to give as much information as possible at each turn, whereas human players tend to give information about their state incrementally. PIP-plan cannot capture this behavior and allocates low probability to these kinds of dialogues. PIP-plan has a lower average log-likelihood than PIP, highlighting the importance of planning.

Inference.
Let PIP-infer be PIP, but only looking at the last utterance ($b = 1$) rather than the full history ($b = \infty$). The results here are more nuanced. Although PIP-infer actually performs better than PIP over all games, we find that PIP-infer is worse than PIP in predicting messages after time step 3, highlighting the importance of inference, but only in long games. It is likely that the additional noise involved in the inference process leads to decreased performance when backward-looking inference is not actually needed.

Our work touches on ideas in game theory, pragmatic modeling, dialogue modeling, and learning communicative agents, which we highlight below.
Game theory.
In game-theoretic terminology (Shoham and Leyton-Brown, 2008), InfoJigsaw is a non-cooperative (there is no offline optimization of the players' policies before the game starts), common-payoff (the players have the same utility), incomplete-information (the players have private state) game with sequential actions. One concept in game theory related to our model is rationalizability (Bernheim, 1984; Pearce, 1984): a strategy is rationalizable if it is justifiable to play against a completely rational player. Another related concept is epistemic game theory (Dekel and Siniscalchi, 2015; Perea, 2012), which studies the behavioral implications of rationality and mutual beliefs in games. It is important to note that we are not interested in notions of global optima or equilibria; rather, we are interested in modeling human behavior. Communication through a highly restricted language has also been studied in the context of language games (Wittgenstein, 1953; Lewis, 2008; Nowak et al., 1999; Franke, 2009; Huttegger et al., 2010).
Rational speech acts.
The pragmatic component of PIP is based on the Rational Speech Acts framework (Frank and Goodman, 2012; Golland et al., 2010), which defines recurrences capturing how one agent reasons about another. Similar ideas were explored in the precursor work of Golland et al. (2010), and much work has ensued (Smith et al., 2013; Qing and Franke, 2014; Monroe and Potts, 2015; Ullman et al., 2016; Andreas and Klein, 2016). Most of this work is restricted to production and comprehension of a single utterance. Hawkins et al. (2015) extend these ideas to two utterances (a question and an answer). Vogel et al. (2013b) integrate planning with pragmatics using decentralized partially observable Markov decision processes (DEC-POMDPs). In their task, two bots must find and co-locate with a specific card. In contrast to InfoJigsaw, their task can be completed without communication; their agents communicate only once, to share the card location. They also study only artificial agents playing together and were not concerned with modeling human behavior.

Figure 12: Performance on ablations of PIP: average log-likelihood per message, with whiskers showing 90% confidence intervals. (a) Performance over all games and all rounds. (b) Performance over messages after round 3. (c) Top: parameter setup; bottom: expected ranking of human messages according to the different ablations. PIP outperforms the ablations of planning and pragmatics over all rounds; looking only one step backward performs better in the first few rounds but worse after round 3.

Figure 13: Expected ranking of the human messages according to different policies. Error bars show 90% confidence intervals.
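The RSA-style recurrences described above can be sketched in a few lines. The following is our own toy rendering (uniform priors, no rationality parameter, and made-up meaning functions), not the paper's actual PIP model:

```python
def normalize(d):
    """Scale a dict of nonnegative scores into a probability distribution."""
    z = sum(d.values())
    return {k: v / z for k, v in d.items()}

def literal_listener(message, meanings, states):
    """L0: a uniform prior over states, conditioned on literal truth."""
    return normalize({s: float(meanings[message](s)) for s in states})

def pragmatic_speaker(state, messages, meanings, states):
    """S1: prefers messages in proportion to how well L0 recovers the state."""
    return normalize({m: literal_listener(m, meanings, states)[state]
                      for m in messages})

def pragmatic_listener(message, messages, meanings, states):
    """L1 (depth k = 1): inverts S1 by Bayes' rule with a uniform state prior."""
    return normalize({s: pragmatic_speaker(s, messages, meanings, states)[message]
                      for s in states})

# Toy reference game: "blue" is literally true of two objects, but the
# pragmatic listener favors the blue square, since a speaker seeing the
# blue circle would likely have said the unambiguous "circle".
states = ["blue_square", "blue_circle", "green_square"]
messages = ["blue", "green", "square", "circle"]
meanings = {
    "blue": lambda s: s.startswith("blue"),
    "green": lambda s: s.startswith("green"),
    "square": lambda s: s.endswith("square"),
    "circle": lambda s: s.endswith("circle"),
}
result = pragmatic_listener("blue", messages, meanings, states)
print({s: round(p, 3) for s, p in result.items()})
# → {'blue_square': 0.6, 'blue_circle': 0.4, 'green_square': 0.0}
```

The literal listener assigns "blue" equal probability on both blue objects; one level of recursion is enough to break the tie, which is the kind of strengthening the k = 1 pragmatic depth provides.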
Learning to communicate.
There is a rich literature on multi-agent reinforcement learning (Busoniu et al., 2008). Some works assume the world is fully visible to all agents and achieve cooperation without communication (Lauer and Riedmiller, 2000; Littman, 2001); others assume a predefined convention for communication (Zhang and Lesser, 2013; Tan, 1993). There is also work that learns the convention itself (Foerster et al., 2016; Sukhbaatar et al., 2016; Lazaridou et al., 2017; Mordatch and Abbeel, 2018). Lazaridou et al. (2017) put humans in the loop to make the communication more human-interpretable. In comparison to these works, we seek to predict human behavior instead of modeling artificial agents that communicate with each other.
Dialogue.
There is also much work in computational linguistics and NLP on modeling dialogue. Allen and Perrault (1980) provide a model that infers the intention/plan of the other agent and uses this plan to generate a response. Clark and Brennan (1991) explain how two players update their common ground (mutual knowledge, mutual beliefs, and mutual assumptions) in order to coordinate. Recent work in task-oriented dialogue uses POMDPs and end-to-end neural networks (Young, 2000; Young et al., 2013; Wen et al., 2017; He et al., 2017). In this work, instead of learning from a large corpus, we predict human behavior without learning, albeit in a much more strategic, stylized setting (two words per utterance).
In this paper, we started with the observation that humans use language in a very contextual way driven by their goals. We identified three salient aspects—planning, inference, pragmatics—and proposed a unified model, PIP, that captures all three aspects simultaneously. Our main result is that a very simple, context-independent literal semantics can give rise via the recurrences to rich phenomena. We study these phenomena in a new game, InfoJigsaw, and show that PIP is able to capture human behavior.
Reproducibility
All code, data, and experiments for this paper are available on the CodaLab platform at https://worksheets.codalab.org/worksheets/0x052129c7afa9498481185b553d23f0f9/.

Acknowledgments
We would like to thank the anonymous reviewers and the action editor for their helpful comments. We also thank Will Monroe for providing valuable feedback on early drafts.
References

J. F. Allen and C. R. Perrault. 1980. Analyzing intention in utterances. Artificial Intelligence, 15(3):143–178.
J. Andreas and D. Klein. 2016. Reasoning about pragmatics with neural listeners and speakers. In Empirical Methods in Natural Language Processing (EMNLP), pages 1173–1182.
J. Andreas, M. Rohrbach, T. Darrell, and D. Klein. 2016. Learning to compose neural networks for question answering. In Association for Computational Linguistics (ACL), pages 1545–1554.
B. D. Bernheim. 1984. Rationalizable strategic behavior. Econometrica: Journal of the Econometric Society, pages 1007–1028.
L. Busoniu, R. Babuska, and B. D. Schutter. 2008. A comprehensive survey of multiagent reinforcement learning. IEEE Trans. Systems, Man, and Cybernetics, Part C, 38(2):156–172.
H. H. Clark and S. E. Brennan. 1991. Grounding in communication. Perspectives on Socially Shared Cognition.
E. Dekel and M. Siniscalchi. 2015. Epistemic game theory, volume 4. Handbook of Game Theory with Economic Applications.
J. Foerster, Y. M. Assael, N. de Freitas, and S. Whiteson. 2016. Learning to communicate with deep multi-agent reinforcement learning. In Advances in Neural Information Processing Systems (NIPS), pages 2137–2145.
M. Frank and N. D. Goodman. 2012. Predicting pragmatic reasoning in language games. Science, 336:998–998.
M. Franke. 2009. Signal to Act: Game Theory in Pragmatics. Institute for Logic, Language and Computation.
D. Golland, P. Liang, and D. Klein. 2010. A game-theoretic approach to generating spatial descriptions. In Empirical Methods in Natural Language Processing (EMNLP), pages 410–419.
N. D. Goodman and M. C. Frank. 2016. Pragmatic language interpretation as probabilistic inference. Trends in Cognitive Sciences, 20(11):818–829.
H. P. Grice. 1975. Logic and conversation. Syntax and Semantics, 3:41–58.
R. X. D. Hawkins, A. Stuhlmüller, J. Degen, and N. D. Goodman. 2015. Why do you ask? Good questions provoke informative answers. In Proceedings of the Thirty-Seventh Annual Conference of the Cognitive Science Society.
R. X. Hawkins. 2015. Conducting real-time multiplayer experiments on the web. Behavior Research Methods, 47(4):966–976.
H. He, A. Balakrishnan, M. Eric, and P. Liang. 2017. Learning symmetric collaborative dialogue agents with dynamic knowledge graph embeddings. In Association for Computational Linguistics (ACL), pages 1766–1776.
S. M. Huttegger, B. Skyrms, R. Smead, and K. J. Zollman. 2010. Evolutionary dynamics of Lewis signaling games: signaling systems vs. partial pooling. Synthese, 172(1):177–191.
J. Krishnamurthy and T. Kollar. 2013. Jointly learning to parse and perceive: Connecting natural language to the physical world. Transactions of the Association for Computational Linguistics (TACL), 1:193–206.
M. Lauer and M. Riedmiller. 2000. An algorithm for distributed reinforcement learning in cooperative multi-agent systems. In International Conference on Machine Learning (ICML), pages 535–542.
A. Lazaridou, A. Peysakhovich, and M. Baroni. 2017. Multi-agent cooperation and the emergence of (natural) language. In International Conference on Learning Representations (ICLR).
D. Lewis. 2008. Convention: A Philosophical Study. John Wiley & Sons.
M. L. Littman. 2001. Value-function reinforcement learning in Markov games. Cognitive Systems Research, 2(1):55–66.
C. Matuszek, N. FitzGerald, L. Zettlemoyer, L. Bo, and D. Fox. 2012. A joint model of language and perception for grounded attribute learning. In International Conference on Machine Learning (ICML), pages 1671–1678.
W. Monroe and C. Potts. 2015. Learning in the Rational Speech Acts model. In Proceedings of the 20th Amsterdam Colloquium.
R. Montague. 1973. The proper treatment of quantification in ordinary English. In Approaches to Natural Language, pages 221–242.
I. Mordatch and P. Abbeel. 2018. Emergence of grounded compositional language in multi-agent populations. In Association for the Advancement of Artificial Intelligence (AAAI).
M. A. Nowak, J. B. Plotkin, and D. C. Krakauer. 1999. The evolutionary language game. Journal of Theoretical Biology, 200(2):147–162.
D. G. Pearce. 1984. Rationalizable strategic behavior and the problem of perfection. Econometrica: Journal of the Econometric Society, pages 1029–1050.
A. Perea. 2012. Epistemic Game Theory: Reasoning and Choice. Cambridge University Press.
C. Potts. 2012. Goal-driven answers in the Cards dialogue corpus. In Proceedings of the 30th West Coast Conference on Formal Linguistics, pages 1–20.
C. Qing and M. Franke. 2014. Gradable adjectives, vagueness, and optimal language use: A speaker-oriented model. In Semantics and Linguistic Theory, volume 24, pages 23–41.
Y. Shoham and K. Leyton-Brown. 2008. Multiagent Systems: Algorithmic, Game-Theoretic, and Logical Foundations. Cambridge University Press.
N. J. Smith, N. D. Goodman, and M. C. Frank. 2013. Learning and using language via recursive pragmatic reasoning about other agents. In Advances in Neural Information Processing Systems (NIPS), pages 3039–3047.
S. Sukhbaatar, R. Fergus, et al. 2016. Learning multiagent communication with backpropagation. In Advances in Neural Information Processing Systems (NIPS), pages 2244–2252.
M. Tan. 1993. Multi-agent reinforcement learning: Independent vs. cooperative agents. In International Conference on Machine Learning (ICML), pages 330–337.
T. D. Ullman, Y. Xu, and N. D. Goodman. 2016. The pragmatics of spatial language. In Proceedings of the 38th Annual Conference of the Cognitive Science Society.
A. Vogel, M. Bodoia, C. Potts, and D. Jurafsky. 2013a. Emergence of Gricean maxims from multi-agent decision theory. In North American Association for Computational Linguistics (NAACL), pages 1072–1081.
A. Vogel, C. Potts, and D. Jurafsky. 2013b. Implicatures and nested beliefs in approximate decentralized-POMDPs. In Association for Computational Linguistics (ACL), pages 74–80.
T. Wen, M. Gasic, N. Mrksic, L. M. Rojas-Barahona, P. Su, S. Ultes, D. Vandyke, and S. Young. 2017. A network-based end-to-end trainable task-oriented dialogue system. In European Association for Computational Linguistics (EACL), pages 438–449.
L. Wittgenstein. 1953. Philosophical Investigations. Blackwell, Oxford.
S. Young, M. Gašić, B. Thomson, and J. D. Williams. 2013. POMDP-based statistical spoken dialog systems: A review. Proceedings of the IEEE, 101(5):1160–1179.
S. J. Young. 2000. Probabilistic methods in spoken-dialogue systems. Philosophical Transactions of the Royal Society of London A: Mathematical, Physical and Engineering Sciences, 358(1769):1389–1402.
C. Zhang and V. Lesser. 2013. Coordinating multi-agent reinforcement learning with limited communication. In