Converse, Focus and Guess -- Towards Multi-Document Driven Dialogue
Han Liu, Caixia Yuan, Xiaojie Wang, Yushu Yang, Huixing Jiang, Zhongyuan Wang
Beijing University of Posts and Telecommunications, Meituan Group
{liuhan, yuancx, xjwang}@bupt.edu.cn, {yangyushu, jianghuixing, wangzhongyuan02}@meituan.com

Abstract
We propose a novel task, Multi-Document Driven Dialogue (MD3), in which an agent guesses the target document that the user is interested in by leading a dialogue. To benchmark progress, we introduce a new dataset, GuessMovie, which contains 16,881 documents, each describing a movie, and 13,434 associated dialogues. Further, we propose the MD3 model. Keeping the goal of guessing the target document in mind, it converses with the user conditioned on both document engagement and user feedback. In order to incorporate large-scale external documents into the dialogue, it pretrains a document representation that is sensitive to the attributes a document uses to describe an object. It then tracks the dialogue state by monitoring the evolution of document belief and attribute belief, and finally optimizes the dialogue policy under the principle of decreasing entropy and increasing reward, which is expected to guess the user's target in a minimum number of turns. Experiments show that our method significantly outperforms several strong baseline methods and comes very close to human performance.

Introduction
Recent progress in human-machine dialogue techniques enables conversational agents to be extensively applied in customer service, information retrieval, personal assistance, and so on. In order to assist the user in accomplishing specific tasks, the agent must query external knowledge. Several works have focused on incorporating a structured knowledge base (KB) into dialogues (Dhingra et al. 2017; Madotto, Wu, and Fung 2018; Wu, Socher, and Xiong 2019) through KB lookup. Although these efforts scale nicely to huge knowledge bases, many real-world task-oriented dialogues involve referring to a great number of documents (such as manuals, instruction booklets, and other informational documents). Given the complexity of document understanding, developing dialogues grounded in many articles is far from a trivial task.

In this paper, we consider in particular the problem of multi-document driven dialogue (MD3), where the agent leads a dialogue with the user with a particular conversation goal and the engagement of multiple documents.* To this end, we propose an MD3 game, GuessMovie. In the game, the user selects a movie at the beginning of the dialogue, which is unknown to the agent. The agent is provided with a set of candidate documents, each describing a movie, and tries to guess which movie the user selected by asking a series of questions (e.g., "Who is the director of the movie?" or "When is it released?"). The user informs the agent of the answer or says "unknown". The agent's goal is to guess the target movie in the fewest dialogue turns. It is assumed that only the agent has access to the documents, and that at least one document describes the target movie. Figure 1 shows an example.

* https://github.com/laddie132/MD3

Two key challenges arise from the MD3 task: (1) how to efficiently encode and incorporate a large set of long unstructured documents in a dialogue, and (2) how to learn an optimal dialogue policy that efficiently fulfills the dialogue task with smooth engagement of external documents.

To address the first challenge, we propose an attribute-aware document embedding, which assumes that a text is typically composed of several key factors it wants to narrate. For example, a text describing a movie mainly consists of attributes like "directed by", "release year", "genre", and so on. For each attribute, an attribute-aware document embedding is trained with the goal of correctly recognizing the document containing the given attribute and attribute value from a set of documents. The document is finally represented as the concatenation of all attribute-aware document embeddings.

In order to steer the dialogue towards efficiently guessing the correct target document, the proposed MD3 model traces the dialogue state by monitoring a document belief distribution and an attribute belief distribution, at both the global dialogue level and the temporal turn level, which is viewed as an explicit representation of the dialogue dynamics leading to the target movie. Besides, the model also calculates the uncertainty of each attribute through a differentiated document representation. In order to fulfill the dialogue goal within a minimum number of turns, at each turn the policy model tends to ask the user about the attribute with the highest belief and highest uncertainty. When the belief of a document exceeds a threshold, the agent executes a guess action.
A dialogue ends with a guess action. Remarkably, the above dialogue state tracking and dialogue policy optimization are jointly trained using reinforcement learning.

Our contributions are summarized as follows:
Figure 1: Sample dialogue from the GuessMovie dataset. The user's target is "Before the Rain" and the user only knows that its director is Milcho and its release year is 1994. The goal of the agent is to correctly guess the target movie by asking a minimum number of questions. Here we only show four candidate movie documents due to space constraints (any number of documents could be provided in practice). Since the candidate movies have the highest uncertainty (i.e., the largest information entropy) on "release year", the agent asks "When is it released?" to narrow the range of candidates. After the first turn, the last two movies, "Shadows" and "Dust", can be excluded. Similarly, the agent asks about "in language" and "directed by" in the second and third turns.

1. We propose a novel MD3 task, and release a new publicly available benchmark dialogue corpus, GuessMovie, that we hope will support further work on document-driven task-oriented dialogue agents.

2. We introduce the MD3 model, a highly performant neural dialogue agent that smoothly incorporates multiple documents by entangling document representation, document belief, and attribute belief with the dynamics of the dialogue. As far as we know, this is the first study of task-oriented dialogue grounded in a large collection of documents.

3. The proposed document-aware dialogue policy achieves the maximum dialogue success rate in the minimum number of dialogue turns compared with several baseline models.
Related Work
The closest work to ours lies in the area of dialogue systems incorporating unstructured knowledge. Ghazvininejad et al. (2018) and Parthasarathi and Pineau (2018) use an Encoder-Decoder architecture where the decoder receives as input an encoding of both the context utterance and the external text knowledge. Dinan et al. (2018) and Li et al. (2019) investigate extended Transformer architectures to handle document knowledge in multi-turn dialogue. Reddy, Chen, and Manning (2019) use documents as knowledge sources for conversational question answering. This line of work aims either at producing more contentful and diverse responses, or at extracting answers to user questions, and mainly focuses on chatting about the content of a given document without a specific dialogue goal. The dialogue agent we build has a specific goal throughout the dialogue; from this point of view, the dialogue system in this paper is a task-oriented one. Besides, our task differs from theirs in that our agent interacts with a large-scale collection of external documents, which poses new challenges for grounding dialogues.

Another line of related work is on "Guess"-style dialogues, among which the Q20 Game (Burgener 2006; Zhao and Eskenazi 2016; Hu et al. 2018) is a typical object-guessing game. In Q20, the agent guesses the target object within 20 turns of questions and answers. Each object is tied to a structured KB and the user is restricted to passively answering "Yes, No or Unknown". Dhingra et al. (2017) propose a new task and method named KB-InfoBot, which can be regarded as an extension of the Q20 Game: it aims to retrieve from a structured KB through a dialogue via a soft query operation. These works mainly focus on integrating structured knowledge into dialogue systems, yet structured KBs require a lot of work to build and are limited to expressing precise facts. Documents as knowledge are much easier to obtain and provide a wide spectrum of knowledge, including facts, event logic, subjective opinions, etc.

In the field of computer vision, both ImageGuessing (Das et al. 2017) and GuessWhat (De Vries et al. 2017) try to guess a picture or an object through multi-turn dialogue, which greatly expands the range of dialogue applications (Pang and Wang 2020b,a). We use a similar approach to extend dialogue games to document-driven ones. Different from the vision field, how to encode large-scale text information is a vital challenge.
Dataset: GuessMovie
We build a benchmark GuessMovie dataset for the MD3 task on the basis of the WikiMovies dataset (Miller et al. 2016). WikiMovies contains large-scale document knowledge: each document is a movie introduction text derived from the first paragraph of Wikipedia. In addition, it contains a structured KB for the same movie collection, derived from MovieLens, OMDB, and other datasets. There are 10 different attributes in the movie KBs in total, but we select the 6 common ones, since the others are only sporadically discussed in the corresponding documents.

For GuessMovie, we first align each document with a piece of structured knowledge. For a specific movie, some attributes might be missing from the text and the KB, and some attributes might hold more than one value (e.g., a movie may have more than one starring actor). Such cases are preserved in GuessMovie, which is expected to make the dialogue more diverse and realistic.

We then create multi-turn dialogues for some documents. We design a dialogue simulator that interacts with the structured KB to generate dialogues. It consists of two agents playing the roles of the user and the system. Both agents interact with each other using a finite set of dialogue acts directing the dialogue. The user simulator is constructed in a handcrafted way, introduced in the following section. The system agent is provided with a candidate KB, including the target knowledge and a set of other randomly selected knowledge. At the beginning, the system agent generates a dialogue act "asking" about an attribute (e.g., "when is the movie released?"). The probability of an attribute being chosen as the "asked attribute" is proportional to the information entropy computed from its distribution within the candidate KB. After a turn of question answering, the KBs inconsistent with the facts so far are filtered out. The dialogue continues until the system agent is confident about the target KB and executes a "guess" action.
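The simulator's entropy-based question selection can be sketched as follows. This is an illustrative re-implementation with toy KB entries, not the released simulator (which samples attributes with probability proportional to entropy; here we take the argmax for simplicity):

```python
import math
from collections import Counter

def attribute_entropy(candidates, attribute):
    """Shannon entropy of an attribute's value distribution over candidate KBs."""
    values = [kb[attribute] for kb in candidates if attribute in kb]
    if not values:
        return 0.0
    total = len(values)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(values).values())

def pick_question(candidates, attributes):
    """Greedy variant: ask about the highest-entropy attribute,
    i.e. the most discriminative question for the current candidate set."""
    return max(attributes, key=lambda a: attribute_entropy(candidates, a))

# toy candidate KBs: "release year" splits the candidates better than "directed by"
kbs = [
    {"directed by": "Milcho", "release year": "1994"},
    {"directed by": "Milcho", "release year": "1992"},
    {"directed by": "Noyce",  "release year": "1994"},
    {"directed by": "Milcho", "release year": "1993"},
]
best = pick_question(kbs, ["directed by", "release year"])  # -> "release year"
```

After each answer, the candidate KBs inconsistent with the collected facts are filtered out and the entropies are recomputed on the reduced set.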
For natural language generation, we use several diverse natural language patterns that take an attribute or an attribute-value pair as an argument.

In total, GuessMovie comprises 13,434 dialogues over 16,881 documents. The average length of documents is 107.66 words. More statistics are given in Table 1. It is worth noting that we do not include any domain-specific constraints in either simulated agent. Although our examples use Wikipedia articles about movies, we expect the same techniques to be valid for other external documents such as manuals, instruction booklets, and other informational documents, as far as they can be loosely structured as several facets about an object.
Method
Figure 2 illustrates the overall architecture of the Multi-Document Driven Dialogue (MD3) model, which includes five parts: document representation, Natural Language Understanding (NLU), Dialogue State Tracking (DST), the Policy Model (PM), and Natural Language Generation (NLG). This section introduces each part in detail.

Task Definition

Given a set of M documents {D_i}_{i=1}^M as candidates, and a target document known only to the user (each document corresponds to a unique object, i.e., a movie in our case), the agent outputs a sequence of responses that either ask about an attribute (e.g., "when is it released?", "Is it released in 1990?") or guess a target document and end the dialogue. The user can provide an answer to each question, or answer "I don't know".

AaDR for Document Representation
Since encoding large-scale documents while optimizing the dialogue agent is computationally burdensome and infeasible, we pretrain the document representations in advance. We argue that the pretrained representation should not only be faithful to the original meaning but also be helpful for constructing a document-driven task-oriented dialogue agent.

To this end, we propose Attribute-aware Document Representation (AaDR). We assume that each document talks about several attributes with indices {1, ..., j, ..., L}, such as "directed by", "release year", "in language", and so on in the movie scenario.

Inspired by Hierarchical Attention Networks (HAN) (Yang et al. 2016) and Hierarchical Label-Wise LSTM (Liu, Yuan, and Wang 2020), we introduce an Attribute-aware HAN (AaHAN) to encode each document, which treats each attribute as a label. Specifically, each attribute is used to index the corresponding parameters in the hierarchical attention. This mechanism can capture different information for different attributes. Combined with the contrastive loss (Hadsell, Chopra, and LeCun 2006), an attribute-aware document representation is learned.
Pre-training
We randomly sample a target document T. For an attribute-value pair (a_j, v_jk), we sample a positive document C+ which has the same value v_jk for attribute a_j as T, and several negatives C− which have different attribute values. These two parts are combined into a training sample {C_i}_{i=1}^R for a_j. The training goal is to distinguish the positive document from all of {C_i}_{i=1}^R.

For the target T and a candidate C_i, we compute the corresponding document representations for a given attribute a_j, H_j^t ∈ R^d and H_j^{c_i} ∈ R^d, and then calculate the similarity of the two documents directly:

H_j^t = AaHAN_t(T, a_j)   (1)
H_j^{c_i} = AaHAN_c(C_i, a_j)   (2)

The model is trained using a negative log-likelihood loss function:

E_{t,j,c+,c−} [ −log ( exp((H_j^t)^T H_j^{c+}) / ( exp((H_j^t)^T H_j^{c+}) + Σ_{c−} exp((H_j^t)^T H_j^{c−}) ) ) ]   (3)
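For a single training sample, Eq. (3) reduces to a softmax cross-entropy over dot-product similarities. The following is a minimal illustrative sketch with toy 2-dimensional embeddings, not the authors' released code:

```python
import math

def attribute_contrastive_loss(h_target, h_candidates, positive_index=0):
    """NLL of the positive document under a softmax over dot-product
    similarities between the target and each candidate (cf. Eq. 3)."""
    logits = [sum(t * c for t, c in zip(h_target, cand)) for cand in h_candidates]
    # log-sum-exp over all candidate similarities (the partition function)
    log_z = math.log(sum(math.exp(l) for l in logits))
    return log_z - logits[positive_index]

# toy example: the positive candidate points in the same direction as the target
h_t = [1.0, 0.0]
candidates = [[1.0, 0.0],   # positive (index 0)
              [-1.0, 0.0],  # negative
              [0.0, 1.0]]   # negative
loss = attribute_contrastive_loss(h_t, candidates)
```

The loss shrinks as the positive candidate's similarity to the target grows relative to the negatives, which is what drives the attribute-aware embeddings apart.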
Representation

After pre-training, for each document D_i we obtain a representation Q_ij ∈ R^d for attribute a_j, which is the concatenation of the outputs of the target encoder and the candidate encoder:

H_{ij}^t = AaHAN_t(D_i, a_j)   (4)
H_{ij}^c = AaHAN_c(D_i, a_j)   (5)
Q_ij = [H_{ij}^t ; H_{ij}^c]   (6)

The final document representation Q_i ∈ R^{L×d} is then obtained by concatenating the L attribute-aware representations.

DaLU for NLU
We propose Document-aware Language Understanding (DaLU) for NLU. In the previous turn, the agent's question x_t and the user's response u_t are concatenated into a long sentence and encoded using a BiGRU. The last hidden state is taken as the output G_t ∈ R^d.

Attr | directed by | release year | written by | starred actors | has genre | in language
Num  | 14853       | 16299        | 12712      | 13204          | 12118     | 3071
Ent  | 6187        | 103          | 10404      | 10180          | 23        | 96
Ave  | 1.05        | 1.00         | 1.48       | 2.48           | 1.27      | 1.10

Table 1: GuessMovie dataset statistics. Num denotes the number of documents containing the attribute (out of 16,881 documents in total). Ent denotes the number of distinct values. Ave denotes the average number of values that an attribute has.
Figure 2: Overall architecture of the MD3 model, with AaDR for document representation, DaLU for NLU, DaST for DST, DaPO for PM, and rules for NLG. (Doc abbreviates document; Attr abbreviates attribute.)

The similarity Ŝ_t ∈ R^{M×L} between the previous turn encoding G_t ∈ R^d and the candidate documents Q ∈ R^{M×L×d} (the concatenation of the candidates Q_i) can be calculated directly with a bilinear product; it reflects the matching degree for each candidate:

Ŝ_t = G_t W_s Q^T   (7)

where W_s ∈ R^{d×d} is a trainable parameter.

In addition, we calculate two further distributions: (1) the attribute type π̃_t ∈ R^L, and (2) a flag α_t ∈ R indicating whether the response is "unknown" (the closer its value is to 1, the less valid information this turn contains):

π̃_t = softmax(W_attr G_t)   (8)
α_t = sigmoid(W_unk G_t)   (9)

where W_attr ∈ R^{L×d} and W_unk ∈ R^{1×d} are trainable parameters.

When the previous response is "unknown", the selection probability of each candidate should be equal at the turn level. We therefore expand the similarity Ŝ_t along the attribute dimension by concatenating a zero column, giving S_t ∈ R^{M×(L+1)}. For the attribute type π̃_t, we also expand the attribute dimension by fusing π̃_t and α_t, giving β_t ∈ R^{L+1}:

S_t = [Ŝ_t ; 0]   (10)
β_t = [π̃_t(1 − α_t) ; α_t]   (11)

Finally, the Turn-Level Doc Belief p̂_t ∈ R^M, the probability of each candidate document being selected at the current turn, and the Turn-Level Attr Belief π̂_t ∈ R^L are obtained as:

p̂_t = softmax(S_t β_t)   (12)
π̂_t = α_t π̃_t   (13)
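The turn-level belief computation of Eqs. (10)-(13) can be sketched in plain Python. This is an illustrative, non-vectorized re-implementation with toy numbers, not the authors' code:

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def turn_level_beliefs(sim, attr_probs, unk):
    """Turn-level document and attribute beliefs (cf. Eqs. 10-13).

    sim:        M x L similarity matrix between the turn encoding and documents
    attr_probs: length-L attribute-type distribution (tilde pi_t)
    unk:        scalar in [0, 1]; probability the user answered "unknown"
    """
    # Expand similarities with a zero column so an "unknown" turn
    # contributes equally to every document (Eq. 10).
    s = [row + [0.0] for row in sim]
    # Fuse the attribute distribution with the unknown flag (Eq. 11).
    beta = [p * (1.0 - unk) for p in attr_probs] + [unk]
    # Turn-level document belief (Eq. 12).
    scores = [sum(a * b for a, b in zip(row, beta)) for row in s]
    p_hat = softmax(scores)
    # Turn-level attribute belief (Eq. 13).
    pi_hat = [unk * p for p in attr_probs]
    return p_hat, pi_hat

sim = [[2.0, 0.0], [0.0, 2.0]]  # 2 documents x 2 attributes (toy values)
p_known, pi_known = turn_level_beliefs(sim, [1.0, 0.0], unk=0.0)
p_unk, pi_unk = turn_level_beliefs(sim, [1.0, 0.0], unk=1.0)
```

When `unk` is 1, the appended zero column dominates and the document belief is uniform, so an uninformative turn leaves the ranking unchanged after the dialog-level update.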
DaST for DST

We propose Document-aware State Tracking (DaST) for DST. The dialogue state is defined by two parts: (1) the Dialog-Level Doc Belief p_t ∈ R^M, the probability of each document being selected; and (2) the Dialog-Level Attr Belief π_t ∈ R^L, the probability of an attribute being unknown (the lower the value, the higher the attribute belief).

When a document is excluded, it will rarely be selected again, while the attribute belief is accumulated for each attribute, indicating whether the attribute should still be asked about. The two belief distributions are updated as follows:

p_t = norm(p_{t−1} ◦ p̂_t)   (14)
π_t = min(π_{t−1} + π̂_t, 1)   (15)

where norm is L1 normalization. The initial value p_0 is a uniform distribution, and π_0 is initialized to the zero vector. At the beginning of each dialogue, the agent directly feeds p_0 and π_0 into the PM module.
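The dialog-level update of Eqs. (14)-(15) is a pointwise product plus a clipped accumulator, sketched below with toy numbers (an illustrative re-implementation):

```python
def update_dialogue_state(p_prev, pi_prev, p_hat, pi_hat):
    """Dialog-level belief update (cf. Eqs. 14-15).

    Document belief: element-wise product with the turn belief,
    then L1-normalized. Attribute belief: accumulated and clipped to 1.
    """
    p = [a * b for a, b in zip(p_prev, p_hat)]
    total = sum(p)
    p = [x / total for x in p]                               # Eq. 14
    pi = [min(a + b, 1.0) for a, b in zip(pi_prev, pi_hat)]  # Eq. 15
    return p, pi

# initial state: uniform document belief, zero attribute belief
p0 = [0.25] * 4
pi0 = [0.0, 0.0]
# a turn that strongly favours document 0 and reports attribute 1 half-unknown
p1, pi1 = update_dialogue_state(p0, pi0, [0.7, 0.1, 0.1, 0.1], [0.0, 0.5])
```

Because the update is multiplicative, a document that a turn rules out keeps a near-zero belief for the rest of the dialogue, matching the filtering behaviour described above.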
DaPO for PM

We propose Document-aware Policy Optimizing (DaPO) for PM, to minimize the number of dialogue turns while guessing the true target.

To describe the degree of discreteness of the data, we introduce a variance-like measure to calculate a differentiated representation Q_i^diff ∈ R^{L×d} for each document Q_i, which describes the degree of attribute discreteness:

Q̄ = (1/N) Σ_{i=1}^N Q_i   (16)
Q_i^diff = (Q_i − Q̄)^2   (17)

where N is the size of the whole dataset.

Note that the Dialog-Level Doc Belief p_t is the confidence in each document. Therefore, a weighted sum over the differentiated representation Q^diff ∈ R^{M×(L×d)} (the concatenation of the candidates Q_i^diff) yields v_t ∈ R^{L×d}, which describes the attribute uncertainty γ_t ∈ R^L over all document candidates:

v_t = (Q^diff)^T p_t   (18)
γ_t = v_t W_diff   (19)

where W_diff ∈ R^{d×1} is a trainable parameter.

Normally, the agent could directly ask about the attribute with the largest uncertainty γ_t, which is expected to end the dialogue successfully in a minimum number of turns. However, some highly uncertain attributes may have low belief; therefore, the "ask" action at time step t+1 is calculated as:

a_{t+1} = softmax(γ_t (1 − π_t))   (20)

Note that a_{t+1} is equivalent to the probability of each attribute being asked at time step t+1.
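Eq. (20) can be sketched as follows; the numbers are toy values meant to show how the (1 − π_t) weighting suppresses attributes the user has already reported as unknown:

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def ask_distribution(gamma, pi):
    """Probability of asking about each attribute next (cf. Eq. 20).

    gamma: length-L attribute uncertainty over the candidates
    pi:    dialog-level attribute belief (1 = attribute known to be unknown)
    """
    return softmax([g * (1.0 - p) for g, p in zip(gamma, pi)])

# attribute 0 is highly uncertain but the user already said "unknown" (pi = 1),
# so attribute 1 ends up with the higher asking probability
a_next = ask_distribution([3.0, 1.0], [1.0, 0.0])
```

This is the trade-off stated above in code form: uncertainty alone would pick attribute 0, but weighting by (1 − π_t) redirects the question to an attribute the user can actually answer.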
Rule for NLG

In the NLG module, the agent generates a natural language response for the "ask" and "guess" actions respectively. For an "ask" action, it produces a response asking about the most probable attribute, greedily according to a_{t+1}. For a "guess" action, the agent guesses the movie corresponding to the most probable document, greedily according to p_t. We use predefined natural language templates to converse with the user. Termination covers two cases:

1. Positive termination: the maximum value of the Dialog-Level Doc Belief p_t exceeds a certain threshold K.
2. Passive termination: the preset maximum number of dialogue turns is reached.

Experiments
Experimental Setting
We divide GuessMovie into two disjoint parts: 70% is used for pre-training the document representation and the NLU module, and the remaining 30% is used for training MD3 with 50k simulations using the REINFORCE algorithm (Williams 1992). The discount rate is 0.9. After training, we run a further 5k simulations to test the performance.

We construct a user simulator (Schatzmann et al. 2007; Li et al. 2016) with handcrafted rules, because the user only needs to answer the agent's questions passively. It randomly selects the target at the beginning. During the dialogue, the relevant value is indexed from the current structured KB and filled into a natural language template as the response; if the user does not have the knowledge, the answer is "unknown". To increase dialogue diversity, we randomly mask some values of the 6 attributes with a proportion of 0.1 at the beginning of each dialogue; the documents given to the agent remain unchanged.

The reward function is similar to that of Dhingra et al. (2017). If the rank r of the target document is within the top R = 3 results, the agent receives a positive reward that decreases with r; otherwise, if the dialogue fails, the reward is negative. In addition, a small negative reward is given at each turn, pushing the dialogue to finish in a smaller number of turns.

We use the Adam optimizer (Kingma and Ba 2014) with learning rate 0.001 and GloVe (Pennington, Socher, and Manning 2014) word embeddings. The number of candidate documents for document representation and dialogue is 32 by default. The maximum number of turns is 5. The probability threshold K for performing a Guess action is 0.5.
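The REINFORCE update weights each turn's log-probability by the discounted return. A minimal sketch of the return computation, using the paper's discount of 0.9 (the reward values in the example are illustrative placeholders, not the paper's exact reward constants):

```python
def discounted_returns(rewards, gamma=0.9):
    """Compute per-turn discounted returns G_t = r_t + gamma * G_{t+1}.

    With an autograd framework, the REINFORCE loss for an episode would
    then be -sum(G_t * log pi(a_t | s_t)) over the turns.
    """
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    returns.reverse()
    return returns

# three-turn episode: small per-turn penalties, then a final success reward
# (illustrative values only)
g = discounted_returns([-0.1, -0.1, 1.0], gamma=0.9)
```

Because earlier actions accumulate the discounted per-turn penalties, shorter successful dialogues receive larger returns, which is what pushes the policy toward fewer turns.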
Baselines

We compare our method against several alternative modules, including machine reading comprehension (MRC) for NLU, and random and fixed policies for PM.
NLU
Here we introduce two methods.

• MRC: directly use the attributes and values of a document, extracted by BERT (Devlin et al. 2018) with support for no-answer prediction. The part of the GuessMovie dataset used for document pre-training is converted into a standard extractive MRC dataset and divided into training, development, and test parts. The test EM score is 82.99 and the F1 score is 87.44. After training, we extract a structured MRC-KB from the other part of the GuessMovie dataset for dialogue. The similarity between each movie and the current user response is measured by the overlap ratio and normalized to give the Turn-Level Doc Belief p̂_t. The Turn-Level Attr Belief π̂_t can be obtained directly.

• DaLU: the method proposed in this paper.

PM

Here we introduce three methods.
• Rand: randomly selects an attribute to ask about.
• Fixed: asks about attributes in a fixed order.
• DaPO: the method proposed in this paper.
Human Performance
Human performance is the ceiling for this task, obtained by assuming that a human can understand the candidate documents accurately and calculate the true document distribution, from which the attribute with the most discriminating information can be generated. In this scenario, we directly adopt the structured KB and attribute-value pairs instead of natural language, in order to simulate accurate interaction.

• Human-NLU: we use a handcrafted module to directly match the current turn with each structured KB. If they match, the selection probability in the Turn-Level Doc Belief p̂_t is set to 1; otherwise the probability is set to 0. The Turn-Level Attr Belief π̂_t is handled in the same way.

• Human-PM: the Dialog-Level Doc Belief p_t is thresholded to obtain a filtered set of documents. Based on the structured KB, the entropy of each attribute is calculated and normalized as the distribution γ_t, which is then fused with the Dialog-Level Attr Belief π_t as before.

Candidates:        32                       64                       128
NLU   PM      S1  S3  M   T    R      S1  S3  M   T    R      S1  S3  M   T    R
MRC   Rand    .51 .72 .63 5.00 0.87   .40 .62 .56 5.00 0.57   .28 .53 .42 5.00 0.27
MRC   Fixed   .67 .89 .78 5.00 1.29   .57 .79 .69 5.00 1.06   .49 .67 .61 5.00 0.76
MRC   DaPO    .79 .94 .87 5.00 1.49   .71 .93 .82 5.00 1.41   .52 .86 .69 5.00 1.23
DaLU  Rand    .56 .81 .70 4.34 1.21   .47 .71 .62 4.64 0.91   .38 .61 .53 4.84 0.57
DaLU  Fixed   .71 .93 .83 3.78 1.57   .65 .87 .77 4.28 1.37   .56 .80 .70 4.66 1.14
DaLU  DaPO    .83 .97 .90 3.42 1.73   .74 .93 .84 3.86 1.56   .67 .88 .78 4.29 1.38
DaLU  w/o AU  .65 .91 .79 3.88 1.51   .62 .86 .75 4.42 1.33   .55 .81 .70 4.76 1.14
DaLU  w/o AB  .47 .82 .65 4.64 1.14   .45 .74 .61 4.48 0.98   .36 .63 .52 4.70 0.66
Human Human   .99 .99 .99 3.28 1.87   .98 .99 .98 3.52 1.83   .96 .99 .97 3.80 1.79

Table 2: Dialogue test results for various combinations of NLU and PM on the GuessMovie dataset, with 32, 64, and 128 candidate documents. S1 denotes the top-1 dialogue success rate; S3 the top-3 dialogue success rate; M the target document ranking metric MRR; T the average number of dialogue turns; R the average reward. AU denotes attribute uncertainty; AB denotes attribute belief.

Results
We randomly simulated 5k dialogues to test performance; the complete results are shown in Table 2. Various combinations of NLU and PM modules are evaluated with different candidate set sizes (32, 64, and 128). We report the dialogue success rate with the target in the top-1 or top-3 candidates. In addition, as in retrieval tasks, the Mean Reciprocal Rank (MRR) is calculated from the position of the target in the ranking result. The average number of dialogue turns and the average reward are also reported.

As shown in Table 2, the DaLU-NLU module achieves a higher dialogue success rate and a smaller number of turns than the MRC-NLU module under the same PM, because it is difficult for the MRC model to accurately extract the attributes, and attribute values may only be implied in the document; more interaction is therefore needed to raise the selection probability of the target.

Among the DaLU-NLU combinations, DaPO-PM is significantly superior to the random or fixed policies, which cannot generate actions suited to specific candidates, whereas DaPO-PM can ask about attributes with higher uncertainty and belief.

Although MD3 has significant advantages over the other methods and comes very close to human performance, its performance degrades quickly when the candidate set size grows to 64 or 128. There is thus still considerable room for improvement in the top-1 dialogue success rate and with larger candidate sets.
Ablation Study
Document Representation
We use an Attribute-aware HAN (AaHAN) encoder for document representation, which accurately captures the key information for different attributes. Removing the attribute-aware mechanism leaves a plain HAN encoder, in which all attributes share the same parameters. As shown in Table 3, AaHAN brings a significant improvement, which demonstrates the importance of sufficient document knowledge representation for this kind of dialogue.

Encoder | S1   | S3   | M    | T    | R
AaHAN   | 0.83 | 0.97 | 0.90 | 3.42 | 1.73
HAN     | 0.33 | 0.63 | 0.52 | 4.89 | 0.67

Table 3: Dialogue test results for different encoders for document representation, with a candidate size of 32.
Dialog Policy
To demonstrate the necessity of the attribute uncertainty γ_t and the attribute belief π_t for the dialogue policy, we introduce two ablation tests on PM:

• w/o AU: DaPO without the attribute uncertainty γ_t.
• w/o AB: DaPO without the attribute belief π_t.

As shown in Table 2, performance is significantly degraded in both cases across several metrics and candidate sizes. To make an accurate guess in the fewest turns, the policy should consider not only high attribute uncertainty but also the attribute belief.

Guess Threshold
By adjusting the threshold K that decides whether to make a guess, different policies can be obtained, as shown in Table 4. When the threshold K is large, the agent tends to make accurate guesses at the cost of longer dialogues; when it is small, the top-1 document may be inaccurate, but the dialogue is shorter. This is a trade-off, and we finally select a threshold value of 0.5 to obtain a sub-optimal policy.

Analysis
At the end of each dialogue, a ranking of the candidates can be obtained from the Dialog-Level Doc Belief p_t. For all 5k simulated dialogues, we visualize the dynamic change of the target document rank (TDR), as shown in Figure 3-a. From top to bottom, as the dialogue goes on, the color of the blocks becomes gradually lighter, which means the TDR gradually rises. In addition, the candidate documents information entropy (CDIE), calculated from the Dialog-Level Doc Belief p_t, can also be visualized, indicating the change in uncertainty about the selected document, as shown in Figure 3-d. From top to bottom, as the dialogue goes on, the CDIE gradually decreases, which means the uncertainty about the guessed document becomes smaller. This illustrates that our model is interpretable. Comparing the several combinations, we find that MD3 has significant advantages in terms of both TDR and CDIE.

Figure 3: Visualization of the dynamic changes of the target document rank (TDR) and candidate documents information entropy (CDIE): (a) TDR (MD3), (b) TDR (MD3-Rand), (c) TDR (MRC-Rand), (d) CDIE (MD3), (e) CDIE (MD3-Rand), (f) CDIE (MRC-Rand). The ordinate represents the end of each dialogue turn. MD3-Rand and MRC-Rand denote DaLU or MRC with the Rand policy.

Table 4: Dialogue test results for different document guess thresholds K, with a candidate size of 32.

Figure 4: Examples of different dialogue models with the same candidate documents. R and P denote the target document rank and probability after each turn. The darker the color, the higher the rank and the lower the selection probability.

Case Study
Figure 4 shows three dialogue samples between different models given the same candidates, together with the dynamic changes of the target's rank and probability. At the beginning of a dialogue, every document has the same probability of being guessed as the target. In the MD3 sample, the "release year" and "directed by" attributes are asked about, so that the rank of the target document quickly rises to first place and its probability increases to 0.6. In the MRC-Rand sample, the agent does not ask about the director, the attribute that best differentiates the candidates, and the probability of selecting the target stays low. This is due to the inaccuracy of the MRC model in extracting attributes and to attribute information that is only implied in the text. Overall, MD3 is better than the other methods and more robust.
Conclusions
In this paper, we introduced a new multi-document driven dialogue task and released a public benchmark dataset, GuessMovie. We further investigated a multi-document driven dialogue model which can converse with the user and achieve the dialogue goal conditioned on both document engagement and user feedback. Although our model shows significant advantages over several strong baselines, it assumes a set of predefined attributes for the documents, and the agent can only ask questions about them. How to extend the dialogue to more scenarios with fewer restrictions is left for future research.

Acknowledgments
We thank the anonymous reviewers for their insightful comments. This research is supported in part by the Natural Science Foundation of China (Grant No. 62076032) and the National Key Research and Development Program of China (Grant No. 2020YFF0305302).
References
Burgener, R. 2006. Artificial neural network guessing method and game. US Patent App. 11/102,105.
Das, A.; Kottur, S.; Moura, J. M.; Lee, S.; and Batra, D. 2017. Learning Cooperative Visual Dialog Agents with Deep Reinforcement Learning. In Proceedings of the IEEE International Conference on Computer Vision, 2951–2960.
De Vries, H.; Strub, F.; Chandar, S.; Pietquin, O.; Larochelle, H.; and Courville, A. 2017. GuessWhat?! Visual Object Discovery through Multi-modal Dialogue. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 5503–5512.
Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K. 2018. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv preprint arXiv:1810.04805.
Dhingra, B.; Li, L.; Li, X.; Gao, J.; Chen, Y.-N.; Ahmad, F.; and Deng, L. 2017. Towards End-to-End Reinforcement Learning of Dialogue Agents for Information Access. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 484–495.
Dinan, E.; Roller, S.; Shuster, K.; Fan, A.; Auli, M.; and Weston, J. 2018. Wizard of Wikipedia: Knowledge-Powered Conversational Agents. arXiv preprint arXiv:1811.01241.
Ghazvininejad, M.; Brockett, C.; Chang, M.-W.; Dolan, B.; Gao, J.; Yih, W.-t.; and Galley, M. 2018. A Knowledge-Grounded Neural Conversation Model. In Thirty-Second AAAI Conference on Artificial Intelligence.
Hadsell, R.; Chopra, S.; and LeCun, Y. 2006. Dimensionality Reduction by Learning an Invariant Mapping. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), volume 2, 1735–1742. IEEE.
Hu, H.; Wu, X.; Luo, B.; Tao, C.; Xu, C.; Wu, W.; and Chen, Z. 2018. Playing 20 Question Game with Policy-Based Reinforcement Learning. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 3233–3242.
Kingma, D. P.; and Ba, J. 2014. Adam: A Method for Stochastic Optimization. arXiv preprint arXiv:1412.6980.
Li, X.; Lipton, Z. C.; Dhingra, B.; Li, L.; Gao, J.; and Chen, Y.-N. 2016. A User Simulator for Task-Completion Dialogues. arXiv preprint arXiv:1612.05688.
Li, Z.; Niu, C.; Meng, F.; Feng, Y.; Li, Q.; and Zhou, J. 2019. Incremental Transformer with Deliberation Decoder for Document Grounded Conversations. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 12–21.
Liu, H.; Yuan, C.; and Wang, X. 2020. Label-Wise Document Pre-training for Multi-label Text Classification. In CCF International Conference on Natural Language Processing and Chinese Computing, 641–653. Springer.
Madotto, A.; Wu, C.-S.; and Fung, P. 2018. Mem2Seq: Effectively Incorporating Knowledge Bases into End-to-End Task-Oriented Dialog Systems. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 1468–1478.
Miller, A.; Fisch, A.; Dodge, J.; Karimi, A.-H.; Bordes, A.; and Weston, J. 2016. Key-Value Memory Networks for Directly Reading Documents. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, 1400–1409.
Pang, W.; and Wang, X. 2020a. Guessing State Tracking for Visual Dialogue. In ECCV.
Pang, W.; and Wang, X. 2020b. Visual Dialogue State Tracking for Question Generation. In AAAI.
Parthasarathi, P.; and Pineau, J. 2018. Extending Neural Generative Conversational Model using External Knowledge Sources. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 690–695.
Pennington, J.; Socher, R.; and Manning, C. D. 2014. GloVe: Global Vectors for Word Representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1532–1543.
Reddy, S.; Chen, D.; and Manning, C. D. 2019. CoQA: A Conversational Question Answering Challenge. Transactions of the Association for Computational Linguistics 7: 249–266.
Schatzmann, J.; Thomson, B.; Weilhammer, K.; Ye, H.; and Young, S. 2007. Agenda-based User Simulation for Bootstrapping a POMDP Dialogue System. In Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Companion Volume, Short Papers, 149–152. Association for Computational Linguistics.
Williams, R. J. 1992. Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning. Machine Learning 8(3–4): 229–256.
Wu, C.-S.; Socher, R.; and Xiong, C. 2019. Global-to-local Memory Pointer Networks for Task-Oriented Dialogue. In Proceedings of the International Conference on Learning Representations (ICLR).
Yang, Z.; Yang, D.; Dyer, C.; He, X.; Smola, A.; and Hovy, E. 2016. Hierarchical Attention Networks for Document Classification. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 1480–1489.
Zhao, T.; and Eskenazi, M. 2016. Towards End-to-End Learning for Dialog State Tracking and Management using Deep Reinforcement Learning. In 17th Annual Meeting of the Special Interest Group on Discourse and Dialogue.