Abstractive Dialog Summarization with Semantic Scaffolds
Lin Yuan
Zhejiang University
[email protected]
Zhou Yu
University of California, Davis
[email protected]

ABSTRACT
The demand for abstractive dialog summaries is growing in real-world applications. For example, customer service centers and hospitals would like to summarize customer service interactions and doctor-patient interactions. However, few researchers have explored abstractive summarization on dialogs due to the lack of suitable datasets. We propose an abstractive dialog summarization dataset based on MultiWOZ (Budzianowski et al., 2018). If we directly apply previous state-of-the-art document summarization methods to dialogs, there are two significant drawbacks: informative entities such as restaurant names are difficult to preserve, and contents from different dialog domains are sometimes mismatched. To address these two drawbacks, we propose the Scaffold Pointer Network (SPNet), which utilizes the existing annotations on speaker role, semantic slot and dialog domain. SPNet incorporates these semantic scaffolds for dialog summarization. Since ROUGE cannot capture the two drawbacks mentioned, we also propose a new evaluation metric that considers critical informative entities in the text. On MultiWOZ, our proposed SPNet outperforms state-of-the-art abstractive summarization methods on all the automatic and human evaluation metrics.
1 INTRODUCTION
Summarization aims to condense a piece of text into a shorter version, retaining the critical information. On dialogs, summarization has various promising applications in the real world. For instance, automatic doctor-patient interaction summaries can save doctors a massive amount of time spent filling in medical records. There is also a general demand in industry for summarizing meetings in order to track project progress. Generally, multi-party conversations with interactive communication are more difficult to summarize than single-speaker documents. Hence, dialog summarization is a promising direction in summarization research.

There are two types of summarization: extractive and abstractive. Extractive summarization selects sentences or phrases directly from the source text and merges them into a summary, while abstractive summarization attempts to generate novel expressions to condense information. Previous dialog summarization research mostly studied extractive summarization (Murray et al., 2005; Maskey & Hirschberg, 2005). Extractive methods merge selected important utterances from a dialog to form a summary. Because dialogs are highly dependent on their histories, it is difficult to produce coherent discourses from a set of non-consecutive conversation turns. Therefore, extractive summarization is not the best approach to summarizing dialogs. However, most modern abstractive methods focus on single-speaker documents rather than dialogs due to the lack of dialog summarization corpora. Popular abstractive summarization datasets such as CNN/Daily Mail (Hermann et al., 2015) consist of news documents. The AMI meeting corpus (McCowan et al., 2005) is the common benchmark, but it only has extractive summaries.

In this work, we introduce a dataset for abstractive dialog summarization based on MultiWOZ (Budzianowski et al., 2018). Seq2Seq models such as Pointer-Generator (See et al., 2017) have achieved high-quality summaries of news documents.
However, directly applying a news summarizer to dialogs results in two drawbacks: informative entities such as place names are difficult to capture precisely, and contents in different domains are summarized unequally. To address these problems, we propose Scaffold Pointer Network (SPNet). SPNet incorporates three types of semantic scaffolds in dialog: speaker role, semantic slot and dialog domain. Firstly, SPNet adds separate encoders to the attentional Seq2Seq framework, producing distinct semantic representations for different speaker roles. Then, our method takes delexicalized utterances as input to produce a delexicalized summary, and fills in slot values to generate the complete summary. Finally, we incorporate the dialog domain scaffold by jointly optimizing a dialog domain classification task along with the summarization task. We evaluate SPNet with both automatic and human evaluation metrics on MultiWOZ. SPNet outperforms Pointer-Generator (See et al., 2017) and Transformer (Vaswani et al., 2017) on all the metrics.

2 RELATED WORK
Rush et al. (2015) first applied modern neural models to abstractive summarization. Their approach is based on the Seq2Seq framework (Sutskever et al., 2014) and the attention mechanism (Bahdanau et al., 2015), achieving state-of-the-art results on the Gigaword and DUC-2004 datasets. Gu et al. (2016) proposed a copy mechanism for summarization, demonstrating its effectiveness by combining the advantages of extractive and abstractive approaches. See et al. (2017) applied pointing (Vinyals et al., 2015) as the copy mechanism and used the coverage mechanism (Tu et al., 2016) to discourage repetition. Most recently, reinforcement learning (RL) has been employed in abstractive summarization. RL-based approaches directly optimize the objectives of summarization (Ranzato et al., 2016; Celikyilmaz et al., 2018). However, deep reinforcement learning approaches are difficult to train and more prone to exposure bias (Bahdanau et al., 2017).

Recently, pre-training methods have become popular in NLP applications. BERT (Devlin et al., 2018) and GPT (Radford et al., 2018) have achieved state-of-the-art performance on many tasks, including summarization. For instance, Zhang et al. (2019) proposed a method to pre-train a hierarchical document encoder for extractive summarization. Hoang et al. (2019) proposed two strategies to incorporate a pre-trained model (GPT) into an abstractive summarizer, achieving better performance. However, there has not been much research on adapting pre-trained models to dialog summarization.

Dialog summarization, specifically meeting summarization, has been studied extensively. Previous work generally focused on statistical machine learning methods for extractive dialog summarization: Galley (2006) used skip-chain conditional random fields (CRFs) (Lafferty et al., 2001) as a ranking method in extractive meeting summarization. Wang & Cardie (2013) compared support vector machines (SVMs) (Cortes & Vapnik, 1995) with LDA-based topic models (Blei et al., 2003) for producing decision summaries.
However, abstractive dialog summarization has been less explored due to the lack of a suitable benchmark. Recent work (Wang & Cardie, 2016; Goo & Chen, 2018; Pan et al., 2018) created abstractive dialog summary benchmarks from existing dialog corpora. Goo & Chen (2018) annotated topic descriptions in the AMI meeting corpus as summaries. However, the topics they defined are coarse, such as "industrial designer presentation". They also proposed a model with a sentence-gated mechanism incorporating dialog acts to perform abstractive summarization. Moreover, Li et al. (2019) first built a model to summarize audio-visual meeting data with an abstractive method. However, previous work has not investigated the utilization of semantic patterns in dialog, so we explore it in depth in our work.
3 PROPOSED METHOD
As discussed above, state-of-the-art document summarizers are not applicable in conversation settings. We propose Scaffold Pointer Network (SPNet) based on Pointer-Generator (See et al., 2017). SPNet incorporates three types of semantic scaffolds to improve abstractive dialog summarization: speaker role, semantic slot and dialog domain.

3.1 BACKGROUND
We first introduce Pointer-Generator (See et al., 2017). It is a hybrid of the typical Seq2Seq attention model (Nallapati et al., 2016) and the pointer network (Vinyals et al., 2015). The Seq2Seq framework encodes the source sequence and generates the target sequence with the decoder. The input sequence is fed into the encoder token by token, producing the encoder hidden states $h_i$ in each encoding step. The decoder receives the word embedding of the previous word and generates a distribution to decide the target element in this step, retaining decoder hidden states $s_t$. In Pointer-Generator, the attention distribution $a^t$ is computed as in Bahdanau et al. (2015):

$$e_i^t = v^T \tanh(W_h h_i + W_s s_t + b_{attn}), \qquad a^t = \mathrm{softmax}(e^t) \quad (1)$$

where $W_h$, $W_s$, $v$ and $b_{attn}$ are all learnable parameters.

With the attention distribution $a^t$, the context vector $h_t^*$ is computed as the weighted sum of the encoder's hidden states. The context vector is regarded as the attentional information in the source text:

$$h_t^* = \sum_i a_i^t h_i \quad (2)$$

Pointer-Generator differs from the typical Seq2Seq attention model in the generation process. The pointing mechanism combines copying words directly from the source text with generating words from a fixed vocabulary. The generation probability $p_{gen}$ is calculated as a "soft switch" to choose between copy and generation:

$$p_{gen} = \sigma(w_{h^*}^T h_t^* + w_s^T s_t + w_x^T x_t + b_{ptr}) \quad (3)$$

where $x_t$ is the decoder input, and $w_{h^*}$, $w_s$, $w_x$ and $b_{ptr}$ are all learnable parameters. $\sigma$ is the sigmoid function, so the generation probability $p_{gen}$ has a range of $[0, 1]$.

The ability to select between copy and generation corresponds to a dynamic vocabulary. The pointer network forms an extended vocabulary for the copied tokens, including all the out-of-vocabulary (OOV) words that appear in the source text.
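A minimal NumPy sketch of one Pointer-Generator decoding step may make these computations concrete: it implements the attention of Equation 1, the context vector of Equation 2, the soft switch of Equation 3, and the copy-generate mixture described next. All names, shapes and the `params` dictionary are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def attention(h, s_t, W_h, W_s, v, b_attn):
    # Eq. (1): e_i^t = v^T tanh(W_h h_i + W_s s_t + b_attn); a^t = softmax(e^t)
    scores = np.array([v @ np.tanh(W_h @ h_i + W_s @ s_t + b_attn) for h_i in h])
    return softmax(scores)

def decode_step(h, s_t, x_t, params):
    """One pointer-generator decoding step over encoder states h."""
    a_t = attention(h, s_t, params["W_h"], params["W_s"], params["v"], params["b_attn"])
    h_star = a_t @ h  # Eq. (2): context vector, weighted sum of encoder states
    # Eq. (3): soft switch between generating and copying
    p_gen = sigmoid(params["w_h"] @ h_star + params["w_s"] @ s_t
                    + params["w_x"] @ x_t + params["b_ptr"])
    # Vocabulary distribution from [s_t, h*_t], then mix with the copy
    # distribution over source-token ids in the extended vocabulary
    p_vocab = softmax(params["V2"] @ (params["V1"] @ np.concatenate([s_t, h_star])
                                      + params["b"]) + params["b2"])
    p_final = np.zeros(params["ext_vocab"])
    p_final[:len(p_vocab)] = p_gen * p_vocab
    for i, wid in enumerate(params["src_ids"]):
        p_final[wid] += (1.0 - p_gen) * a_t[i]  # copy mass for source token i
    return a_t, p_final
```

Because the copy mass sums to $1 - p_{gen}$ and the generation mass to $p_{gen}$, the mixture is a valid probability distribution over the extended vocabulary.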
The final probability distribution $P(w)$ over the extended vocabulary is computed as follows:

$$P_{vocab} = \mathrm{softmax}(V'(V[s_t, h_t^*] + b) + b')$$
$$P(w) = p_{gen} P_{vocab}(w) + (1 - p_{gen}) \sum_{i:\, w_i = w} a_i^t \quad (4)$$

where $P_{vocab}$ is the distribution over the original vocabulary, and $V'$, $V$, $b$ and $b'$ are learnable parameters used to calculate this distribution.

3.2 SCAFFOLD POINTER NETWORK (SPNET)

Our Scaffold Pointer Network (depicted in Figure 1) is based on Pointer-Generator (See et al., 2017). The contribution of SPNet is three-fold: separate encoding for different roles, incorporating the semantic slot scaffold, and incorporating the dialog domain scaffold.

3.2.1 SPEAKER ROLE SCAFFOLD
Our encoder-decoder framework employs separate encoding for the different speakers in the dialog. User utterances $x_t^{usr}$ and system utterances $x_t^{sys}$ are fed into a user encoder and a system encoder separately to obtain encoder hidden states $h_i^{usr}$ and $h_i^{sys}$. The attention distributions and context vectors are calculated as described in Section 3.1. In order to merge these two encoders in our framework, the decoder's hidden state $s_0$ is initialized as:

$$s_0 = \mathrm{concat}(h_T^{usr}, h_T^{sys}) \quad (5)$$

The pointing mechanism in our model follows Equation 3, and we obtain the context vector $h_t^*$:

$$h_t^* = \mathrm{concat}\Big(\sum_i a_i^{usr,t} h_i^{usr},\ \sum_i a_i^{sys,t} h_i^{sys}\Big) \quad (6)$$

3.2.2 SEMANTIC SLOT SCAFFOLD
We integrate the semantic slot scaffold by performing delexicalization on the original dialogs. Delexicalization is a common pre-processing step in dialog modeling. Specifically, delexicalization replaces slot values with their semantic slot names (e.g. replacing 18:00 with [time]). It is easier for language modeling to process delexicalized texts, as they have a reduced vocabulary size. But the generated sentences lack semantic information due to the delexicalization. Some previous dialog system research ignored this issue (Wen et al., 2015) or returned a single delexicalized utterance (Sharma et al., 2017) as the generated response. We propose to perform delexicalization in dialog summarization, since delexicalized utterances can simplify dialog modeling. We then fill the generated templates with slot values using the copy and pointing mechanism.

Figure 1: SPNet overview. The blue and yellow boxes are the user and system encoders respectively. The encoders take the delexicalized conversation as input. The slot values are aligned with their slot positions. The pointing mechanism merges the attention distribution and vocabulary distribution to obtain the final distribution. We then fill the slot values into the slot tokens to convert the template into a complete summary. SPNet also performs domain classification to improve the encoder representation.

We first train the model with the delexicalized utterances. The attention distribution $a^t$ over the source tokens instructs the decoder to fill in the slots with lexicalized values:

$$\mathrm{value}(w_{slot}) = \mathrm{value}(w_{i^*}), \qquad i^* = \operatorname*{arg\,max}_{i:\, \mathrm{slot}(w_i) = w_{slot}} a_i^t \quad (7)$$

Note that $w_{slot}$ specifies the tokens that represent slot names (e.g. [hotel place], [time]). The decoder directly copies the lexicalized value $\mathrm{value}(w_i)$ conditioned on the attention distribution $a_i^t$. If $w$ is not a slot token, the probability $P(w)$ is calculated as in Equation 4.

3.2.3 DIALOG DOMAIN SCAFFOLD
We integrate the dialog domain scaffold through a multi-task framework. Dialog domain indicates the content of the conversation task, for example booking a hotel, restaurant or taxi in the MultiWOZ dataset. Generally, the content varies across domains, so multi-domain summarization is more difficult than single-domain summarization. We include domain classification as an auxiliary task to incorporate the prior that different domains have different content. Feedback from the domain classification task provides domain-specific information for the encoder to learn better representations. For domain classification, we feed the concatenated encoder hidden state through a binary classifier with two linear layers, producing the domain probability $d$. The $i$-th element $d_i$ in $d$ represents the probability of the $i$-th domain:

$$d = \sigma\big(U'(\mathrm{ReLU}(U[h_T^{usr}, h_T^{sys}] + b_d)) + b'_d\big) \quad (8)$$

where $U$, $U'$, $b_d$ and $b'_d$ are all trainable parameters in the classifier. We denote the loss function of summarization as $loss_1$ and that of domain classification as $loss_2$. Assuming the target word at timestep $t$ is $w_t^*$, $loss_1$ is the arithmetic mean of the negative log-likelihood of $w_t^*$ over the generated sequence:

$$loss_1 = \frac{1}{T} \sum_{t=0}^{T} -\log P(w_t^*) \quad (9)$$

The domain classification task is a multi-label binary classification problem. We use the binary cross-entropy loss between the $i$-th domain label $\hat{d}_i$ and the predicted probability $d_i$ for this task:

$$loss_2 = -\frac{1}{|D|} \sum_{i=1}^{|D|} \big[\hat{d}_i \log d_i + (1 - \hat{d}_i) \log(1 - d_i)\big] \quad (10)$$

where $|D|$ is the number of domains. Finally, we reweight the classification loss with hyperparameter $\lambda$, and the objective function is:

$$loss = loss_1 + \lambda\, loss_2 \quad (11)$$

4 EXPERIMENTAL SETTINGS
4.1 DATASET
We validate SPNet on the MultiWOZ-2.0 dataset (Budzianowski et al., 2018). MultiWOZ consists of multi-domain conversations between a tourist and an information center clerk on various booking tasks or domains, such as booking restaurants, hotels, taxis, etc. There are 10,438 dialogs spanning seven domains. 3,406 of them are single-domain (8.93 turns on average) and 7,032 are multi-domain (15.39 turns on average). During MultiWOZ data collection, instructions were provided for crowd workers to perform the tasks. We use these instructions as the dialog summaries; an example is shown in Table 2. Dialog domain labels are extracted from the existing MultiWOZ annotation. In the experiments, we split the dataset into 8,438 training, 1,000 validation, and 1,000 testing dialogs.

4.2 EVALUATION METRICS
ROUGE (Lin, 2004) is a standard metric for summarization, designed to measure the surface word alignment between a generated summary and a human-written summary. We evaluate our model with ROUGE-1, ROUGE-2 and ROUGE-L. They measure the word overlap, bigram overlap, and longest common subsequence between the reference summary and the generated summary, respectively. We obtain ROUGE scores using the files2rouge package (https://github.com/pltrdy/files2rouge). However, ROUGE is insufficient to measure summarization performance. The following example shows its limitations:

Reference: You are going to [restaurant name] at [time].
Summary: You are going to [restaurant name] at.

In this case, the summary has a high ROUGE score, as it has a considerable proportion of word overlap with the reference summary. However, it still has poor relevance and readability, leaving out one of the most critical pieces of information: [time]. ROUGE treats each word equally when computing n-gram overlap, while informativeness actually varies: common words or phrases (e.g. "You are going to") contribute significantly to the ROUGE score and readability, but they are almost irrelevant to the essential contents. The semantic slot values (e.g. [restaurant name], [time]) are more essential than the other words in the summary. However, ROUGE does not take this into consideration.

To address this drawback in ROUGE, we propose a new evaluation metric: Critical Information Completeness (CIC). Formally, CIC is a recall of semantic slot information between a candidate summary and a reference summary. CIC is defined as follows:

$$CIC = \frac{\sum_{v \in V} Count_{match}(v)}{m} \quad (12)$$

where $V$ stands for the set of delexicalized values in the reference summary, $Count_{match}(v)$ is the number of values co-occurring in the candidate summary and reference summary, and $m$ is the number of values in set $V$. In our experiments, CIC is computed as the arithmetic mean over all the dialog domains to reflect the overall performance.

CIC is a suitable complementary metric to ROUGE because it accounts for the most important information within each dialog domain. CIC can be applied to any summarization task with predefined essential entities. For example, in news summarization the proper nouns are the critical information to retain.

Models | ROUGE-1 | ROUGE-2 | ROUGE-L | CIC
base (Pointer-Gen) (See et al., 2017) | 62.89 | 48.61 | 59.30 | 42.47
Transformer (Vaswani et al., 2017) | 63.12 | 50.63 | 61.04 | 42.84
base + speaker role | 72.01 | 60.55 | 68.40 | 53.08
base + speaker role + semantic slot | 90.68 | 83.54 | 84.36 | 70.25
SPNet (base + speaker role + semantic slot + dialog domain) | | | |

Table 1: Automatic evaluation results on MultiWOZ. We use Pointer-Generator as the base model and gradually add different semantic scaffolds.

4.3 IMPLEMENTATION DETAILS
We implemented our baselines with the OpenNMT framework (Klein et al., 2017). We delexicalize utterances according to the belief span annotation. To maintain the generalizability of SPNet, we combine slots that refer to the same information across different dialog domains into one slot (e.g. time). Instead of using pre-trained word embeddings like GloVe (Pennington et al., 2014), we train word embeddings from scratch with a 128-dimensional embedding layer. We set the hidden states of the bidirectional LSTM encoders to 256 dimensions, and those of the unidirectional LSTM decoder to 512 dimensions. Our model is optimized using Adam (Kingma & Ba, 2014) with a learning rate of 0.001. We halve the learning rate when the validation loss increases, to avoid overfitting. We set the hyperparameter $\lambda$ in the objective function to 0.5 and the batch size to eight. We use beam search with a beam size of three during decoding. We use the validation set to select the model parameters. Our model with and without multi-task learning takes about 15 epochs and seven epochs to converge, respectively.

5 RESULTS AND DISCUSSIONS
5.1 AUTOMATIC EVALUATION RESULTS
To demonstrate SPNet's effectiveness, we compare it with two state-of-the-art methods, Pointer-Generator (See et al., 2017) and Transformer (Vaswani et al., 2017). Pointer-Generator is the state-of-the-art method in abstractive document summarization. At inference, we use the length penalty and coverage penalty mentioned in Gehrmann et al. (2018). The hyperparameters from the original implementation (See et al., 2017) were used. Transformer uses attention mechanisms to replace recurrence for sequence transduction. Transformer generalizes well to many sequence-to-sequence problems, so we adapt it to our task, following the implementation in the official OpenNMT-py documentation.

We show all the models' results in Table 1. We observe that SPNet reaches the highest scores on both ROUGE and CIC. Both Pointer-Generator and Transformer achieve high ROUGE scores but relatively low CIC scores, which suggests that the baselines have room for improvement in preserving critical slot information. All the scaffolds we propose can be applied to different neural network models. In this work we select Pointer-Generator as the base model of SPNet because we observe that Transformer only gives a small improvement over Pointer-Generator while incurring a higher cost in training time and computing resources. We observe that SPNet outperforms the other methods on all the automatic evaluation metrics by a large margin, as it incorporates all three semantic scaffolds. The semantic slot scaffold contributes the most to SPNet's performance, bringing the largest increase on all automatic evaluation metrics.

Conversation
Client:
I’m looking for a place to stay. It needs to be a guesthouse and include free wifi.
Help desk:
There are 23 hotels that meet your needs. Would you like to narrow your search by area and/or price range?
Client:
I would like for it to be cheap and include free parking.
Help desk:
There are nine guesthouse hotels in various areas. What part of town are you hoping for?
Client:
Nothing in particular. I just need it booked for 6 people for a total of 4 nights starting from sunday. I’d also like the reference number, please.
Help desk:
Sure! We will book you in a hotel with 4 stars. The reservation will be for 6 people for 4 nights. How many rooms would you like to reserve?
Client:
Well, for six people I suppose I will be needing three rooms, for four nights, starting Sunday.
Help desk:
You’re booked at the Alexander Bed and Breakfast, 517a coldham lane, for 6 people for four nights starting Sunday. Your reference number is JXVKZ7KV.
Client:
Thank you! I’m also looking for a restaurant. Ideally an Italian place in the same price range in the centre.
Help desk:
I found Ask restaurant would you like to book that?
Client:
Is the name of the restaurant Ask restaurant?
Help desk:
Yes, the restaurant is called Ask restaurant. Would you like to book a table there?
Client:
Yes. Book it for the same group of people on the same day at 18:45. Can I get the reference number please?
Help desk:
Great! Your booking for 6 people at the restaurant Ask on Sunday at 18:45 was successful! Your reference number is Y2Y8QYBY.
Client:
I am all set. Thank you.
Help desk:
Thank you for using the Cambridge TowneInfo Help Desk!
Summaries
Pointer-Generator: You are traveling to Cambridge and looking forward to try local restaurants. You are looking for a place to stay. The hotel should include free wifi and should be in the type of guesthouse. The hotel does n’t need to include internet. If there is no such hotel, how about one that has free wifi. Once you find the hotel you want to book it for 6 people and 4 nights starting from Sunday. If the booking fails how about 1 nights. Make sure you get the reference number. You are also looking for information particular restaurant. Its name is called Ask restaurant restaurant. Make you you the price range and postcode.
SPNet: You are looking for information in Cambridge. You are looking for a place to stay. The hotel should be in the type of guest house and should include free wifi. The hotel should be in the cheap price range. Once you find the hotel you want to book it for 6 people and 4 nights starting from Sunday. Make sure you get the reference number. You are also looking for a restaurant. The restaurant should be in the same price range as the hotel and should serve Italian food. The restaurant should be in the same area as the hotel. Once you find the restaurant you want to book a table for the same group of people at 18:45 on the same day. Make sure you get the reference number.
Ground truth: You are planning your trip in Cambridge. You are looking for a place to stay. The hotel should include free wifi and should be in the type of guest house. The hotel should be in the cheap price range and should include free parking. Once you find the hotel you want to book it for 6 people and 4 nights starting from Sunday. Make sure you get the reference number. You are also looking for a restaurant. The restaurant should be in the same price range as the hotel and should be in the centre. The restaurant should serve italian food. Once you find the restaurant you want to book a table for the same group of people at 18:45 on the same day. Make sure you get the reference number.
Table 2: An example dialog with the Pointer-Generator, SPNet and ground truth summaries. We underline semantic slots in the conversation. Red denotes incorrect slot values and green denotes correct ones.

5.2 HUMAN EVALUATION RESULTS
We also perform a human evaluation to verify whether our method's improvements on automatic evaluation metrics entail better human-perceived quality. We randomly select 100 test samples from the MultiWOZ test set for evaluation. We recruit 150 crowd workers from Amazon Mechanical Turk. For each sample, we show the conversation, the reference summary, and the summaries generated by Pointer-Generator and SPNet to three different participants. The participants are asked to score each summary on three indicators (relevance, conciseness and readability) on a 1-to-5 scale, and to rank the summary pair (ties allowed).

We present the human evaluation results in Table 3. In the scoring part, our model outperforms Pointer-Generator on all three evaluation metrics. SPNet scored better than Pointer-Generator on relevance and readability. All generated summaries are relatively concise; therefore, they score very similarly on conciseness. The ground truth is still perceived as more relevant and readable than SPNet's results. However, the ground truth does not get a high absolute score. From the evaluators' feedback, we found that they think the ground truth does not cover all the necessary information in the conversation, and that its description is not natural. This motivates us to collect a dialog summarization dataset with high-quality human-written summaries in the future. The ranking evaluation shows larger differences between the summaries. SPNet outperforms Pointer-Generator by a large margin, and its performance is relatively close to that of the ground truth summary.

Summary | Relevance | Conciseness | Readability
Ground truth | 3.83 | 3.67 | 3.87
Pointer-Gen (See et al., 2017) | 3.56 | 3.58 | 3.64
SPNet | | |

 | Win | Lose | Tie
SPNet vs. Pointer-Gen | 61.3 | 30.3 | 8.3
SPNet vs. Ground truth | 32.3 | 48.0 | 19.7

Table 3: The upper part shows the scoring results and the lower part the ranking results. SPNet outperforms Pointer-Generator on all three human evaluation metrics, and the differences are significant, with confidence over 99.5% in a Student's t-test. In the ranking part, the percentage of each choice is shown; win, lose and tie refer to the outcome for the former summary in each pair.

5.3 CASE STUDY
Table 2 shows example summaries from all models along with the ground truth summary. We observe that Pointer-Generator ignores some essential fragments, such as the restaurant booking information (6 people, Sunday, 18:45). The missing information always belongs to the last several domains (restaurant in this case) in a multi-domain dialog. We also observe that separately encoding the two speakers reduces repetition and inconsistency. For instance, Pointer-Generator's summary mentions "free wifi" several times and has conflicting requirements on wifi. This is because dialogs have information redundancy, while a single-speaker model ignores this dialog property.

Our method has limitations. In the example shown in Table 2, our summary does not mention the hotel name (Alexander Bed and Breakfast) or its address (517a Coldham Lane) mentioned in the source. This occurs because the ground truth summary does not cover them in the training data. As a supervised method, SPNet can hardly generate a summary containing additional information beyond the ground truth. However, in some cases, SPNet can also correctly summarize content not covered in the reference summary (see Table 6 in the Appendix).

Furthermore, although SPNet achieves much-improved performance, applying it still requires extra annotations for the semantic scaffolds. For a dialog dataset, the speaker role scaffold is a natural pattern to model. Most multi-domain dialog corpora have domain annotations, and for texts such as news, topic categorizations such as sports or entertainment can be used as domain annotations. We find that the semantic slot scaffold brings the most significant improvement, but it is seldom explicitly annotated. However, the semantic slot scaffold can be relaxed to any critical entities in the corpus, such as team names in sports news or professional terminology in a technical meeting.
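The delexicalize-then-refill idea behind the semantic slot scaffold (Section 3.2.2) generalizes to any such critical-entity annotation. A minimal sketch, with hypothetical function names and a token-level value map as the assumed annotation format (not the paper's belief-span pipeline):

```python
def delexicalize(tokens, value_to_slot):
    """Replace known slot-value tokens with slot-name tokens, e.g. '18:00' -> '[time]'."""
    return [value_to_slot.get(t, t) for t in tokens]

def refill_slot(slot_token, src_tokens, src_delex, a_t):
    """In the spirit of Eq. (7): copy the source value at the position with the
    highest attention among positions whose delexicalized token matches the slot."""
    candidates = [i for i, t in enumerate(src_delex) if t == slot_token]
    if not candidates:
        return slot_token  # nothing to fill; keep the placeholder
    best = max(candidates, key=lambda i: a_t[i])
    return src_tokens[best]
```

The same two functions would apply unchanged if `value_to_slot` mapped team names or technical terms instead of booking slots.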
6 CONCLUSION AND FUTURE WORK
We adapt a dialog generation dataset, MultiWOZ, into an abstractive dialog summarization dataset. We propose SPNet, an end-to-end model that incorporates speaker role, semantic slot and dialog domain as semantic scaffolds to improve abstractive summary quality. We also propose an automatic evaluation metric, CIC, that considers semantic slot relevance and serves as a complementary metric to ROUGE. SPNet outperforms baseline methods on both automatic and human evaluation metrics, suggesting that semantic scaffolds efficiently improve abstractive summarization quality in the dialog scene.

Moreover, SPNet can easily be extended to other summarization tasks. We plan to apply the semantic slot scaffold to news summarization. Specifically, we can annotate critical entities such as person names or location names to ensure that they are captured correctly in the generated summary. We also plan to collect a human-human dialog dataset with more diverse human-written summaries.
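For a single dialog, the proposed CIC metric (Eq. 12) reduces to a recall over the reference's slot values; a minimal sketch (the per-domain averaging used in the experiments is omitted, and the handling of an empty reference set is an assumption):

```python
def cic(candidate_values, reference_values):
    """Critical Information Completeness for one summary pair: the fraction of
    distinct slot values in the reference that also appear in the candidate."""
    ref = set(reference_values)
    if not ref:
        return 1.0  # assumption: no critical values means nothing was missed
    matched = sum(1 for v in ref if v in candidate_values)
    return matched / len(ref)
```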
REFERENCES
Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. In ICLR 2015: International Conference on Learning Representations 2015, 2015.

Dzmitry Bahdanau, Philemon Brakel, Kelvin Xu, Anirudh Goyal, Ryan Lowe, Joelle Pineau, Aaron C. Courville, and Yoshua Bengio. An actor-critic algorithm for sequence prediction. In ICLR 2017: International Conference on Learning Representations 2017, 2017.

David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, 2003.

Paweł Budzianowski, Tsung-Hsien Wen, Bo-Hsiang Tseng, Iñigo Casanueva, Stefan Ultes, Osman Ramadan, and Milica Gašić. MultiWOZ - a large-scale multi-domain wizard-of-oz dataset for task-oriented dialogue modelling. arXiv preprint arXiv:1810.00278, 2018.

Asli Celikyilmaz, Antoine Bosselut, Xiaodong He, and Yejin Choi. Deep communicating agents for abstractive summarization. In NAACL HLT 2018: 16th Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, volume 1, pp. 1662–1675, 2018.

Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine Learning, 20(3):273–297, 1995.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.

Michel Galley. A skip-chain conditional random field for ranking meeting utterances by importance. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, pp. 364–372, 2006.

Sebastian Gehrmann, Yuntian Deng, and Alexander M. Rush. Bottom-up abstractive summarization. In EMNLP 2018: 2018 Conference on Empirical Methods in Natural Language Processing, pp. 4098–4109, 2018.

Chih-Wen Goo and Yun-Nung Chen. Abstractive dialogue summarization with sentence-gated modeling optimized by dialogue acts. arXiv preprint arXiv:1809.05715, 2018.

Jiatao Gu, Zhengdong Lu, Hang Li, and Victor O.K. Li. Incorporating copying mechanism in sequence-to-sequence learning. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pp. 1631–1640, 2016.

Karl Moritz Hermann, Tomáš Kočiský, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. Teaching machines to read and comprehend. In NIPS'15 Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 1, pp. 1693–1701, 2015.

Andrew Hoang, Antoine Bosselut, Asli Celikyilmaz, and Yejin Choi. Efficient adaptation of pre-trained transformers for abstractive summarization. arXiv preprint arXiv:1906.00138, 2019.

Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Guillaume Klein, Yoon Kim, Yuntian Deng, Jean Senellart, and Alexander M. Rush. OpenNMT: Open-source toolkit for neural machine translation. In Proceedings of ACL 2017, System Demonstrations, pp. 67–72, 2017.

John D. Lafferty, Andrew McCallum, and Fernando C. N. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In ICML '01 Proceedings of the Eighteenth International Conference on Machine Learning, pp. 282–289, 2001.

Manling Li, Lingyu Zhang, Heng Ji, and Richard J. Radke. Keep meeting summaries on topic: Abstractive multi-modal meeting summarization. In ACL 2019: The 57th Annual Meeting of the Association for Computational Linguistics, pp. 2190–2196, 2019.

Chin-Yew Lin. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out: Proceedings of the ACL-04 Workshop, pp. 74–81, 2004.

Sameer Maskey and Julia Hirschberg. Comparing lexical, acoustic/prosodic, structural and discourse features for speech summarization. In INTERSPEECH, pp. 621–624, 2005.

I. McCowan, J. Carletta, W. Kraaij, S. Ashby, S. Bourban, M. Flynn, M. Guillemot, T. Hain, J. Kadlec, V. Karaiskos, M. Kronenthal, G. Lathoud, M. Lincoln, A. Lisowska, W. Post, Dennis Reidsma, P. Wellner, L.P.J.J. Noldus, F. Grieco, L.W.S. Loijens, and P.H. Zimmerman. The AMI meeting corpus. Symposium on Annotating and Measuring Meeting Behavior, pp. 137–140, 2005.

Gabriel Murray, Steve Renals, and Jean Carletta. Extractive summarization of meeting recordings. In INTERSPEECH, pp. 593–596, 2005.

Ramesh Nallapati, Bowen Zhou, Cícero Nogueira dos Santos, Çağlar Gülçehre, and Bing Xiang. Abstractive text summarization using sequence-to-sequence rnns and beyond. In
Proceedings of The20th SIGNLL Conference on Computational Natural Language Learning , pp. 280–290, 2016.Haojie Pan, Junpei Zhou, Zhou Zhao, Yan Liu, Deng Cai, and Min Yang. Dial2desc: End-to-enddialogue description generation. arXiv preprint arXiv:1811.00185 , 2018.Jeffrey Pennington, Richard Socher, and Christopher D. Manning. Glove: Global vectors for wordrepresentation. In
Proceedings of the 2014 Conference on Empirical Methods in Natural Lan-guage Processing (EMNLP) , pp. 1532–1543, 2014.Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language un-derstanding by generative pre-training.
URL https://s3-us-west-2. amazonaws. com/openai-assets/researchcovers/languageunsupervised/language understanding paper. pdf , 2018.Marc’Aurelio Ranzato, Sumit Chopra, Michael Auli, and Wojciech Zaremba. Sequence level train-ing with recurrent neural networks. In
ICLR 2016 : International Conference on Learning Rep-resentations 2016 , 2016.Alexander M. Rush, Sumit Chopra, and Jason Weston. A neural attention model for abstractivesentence summarization. arXiv preprint arXiv:1509.00685 , 2015.Abigail See, Peter J. Liu, and Christopher D. Manning. Get to the point: Summarization withpointer-generator networks. In
Proceedings of the 55th Annual Meeting of the Association forComputational Linguistics (Volume 1: Long Papers) , volume 1, pp. 1073–1083, 2017.Shikhar Sharma, Jing He, Kaheer Suleman, Hannes Schulz, and Philip Bachman. Natural languagegeneration in dialogue using lexicalized and delexicalized data. In
International Conference onLearning Representations (ICLR) Workshop , 2017.Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to sequence learning with neural networks.In
Advances in Neural Information Processing Systems 27 , pp. 3104–3112, 2014.Zhaopeng Tu, Zhengdong Lu, Yang Liu, Xiaohua Liu, and Hang Li. Modeling coverage for neuralmachine translation. arXiv preprint arXiv:1601.04811 , 2016.Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez,Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In
Advances in Neural InformationProcessing Systems , pp. 5998–6008, 2017.Oriol Vinyals, Meire Fortunato, and Navdeep Jaitly. Pointer networks. In
NIPS’15 Proceedingsof the 28th International Conference on Neural Information Processing Systems - Volume 2 , pp.2692–2700, 2015.Lu Wang and Claire Cardie. Domain-independent abstract generation for focused meeting sum-marization. In
Proceedings of the 51st Annual Meeting of the Association for ComputationalLinguistics (Volume 1: Long Papers) , pp. 1395–1405, 2013.10u Wang and Claire Cardie. Summarizing decisions in spoken meetings. arXiv preprintarXiv:1606.07965 , 2016.Tsung-Hsien Wen, Milica Gasic, Dongho Kim, Nikola Mrksic, Pei hao Su, David Vandyke, andSteve J. Young. Stochastic language generation in dialogue using recurrent neural networks withconvolutional sentence reranking. In
Proceedings of the 16th Annual Meeting of the SpecialInterest Group on Discourse and Dialogue , pp. 275–284, 2015.Xingxing Zhang, Furu Wei, and Ming Zhou. Hibert: Document level pre-training of hierarchicalbidirectional transformers for document summarization. In
ACL 2019 : The 57th Annual Meetingof the Association for Computational Linguistics , pp. 5059–5069, 2019.11 S UPPLEMENT TO C ASE S TUDY
Supplement Summary
Transformer : You are planning your trip in Cambridge. You are looking for a place to stay. The hotel doesn’t need to include internet and should include free parking. The hotel should be in the type of guesthouse. If there is no such hotel, how about one that is in the moderate price range? Once you find the hotel, you want to book it for 6 people and 4 nights starting from Sunday. Make sure you get the reference number. You are also looking forward to dine. The restaurant should be in the centre. Make sure you get the reference number.
Human Evaluation
Which one is better in your opinion, summary 1 (Pointer-Generator) or summary 2 (SPNet) or tie?
Evaluator : Summary 2 (SPNet)
What is the reason for your rank? We recommend you to talk about it from relevance, conciseness and readability.
Evaluator : Summary 2 is more coherent and relevant. It is just as concise but more readable.
Which one is better in your opinion, summary 2 (SPNet) or summary 3 (Ground Truth) or tie?
Evaluator : Tie
What is the reason for your rank? We recommend you to talk about it from relevance, conciseness and readability.
Evaluator : Summary 2 and 3 are almost identical. They are equally coherent, relevant, and readable.
Table 4: Supplement to the case in Table 2. The summary generated by Transformer is shown in the supplement summary. Red denotes incorrect slot values and green denotes the correct ones. The Human Evaluation part provides the evaluator’s choice and feedback in ranking summary pairs. Content in the brackets is not shown to the evaluators.

DIALOG SUMMARIZATION CASES
Conversation
Client : I’m looking for a moderately priced place to eat that’s in the centre of town.
Help desk : What type of cuisine are you looking for? There are 21 restaurants in that area.
Client : I don’t have a preference.
Help desk : Out of the 21 restaurant choices, one is the Yippee Noodle Bar which is moderately priced in the centre of town. Would you like to make a reservation?
Client : That sounds great, what is the postcode?
Help desk : Their postcode is cb11lh and if you want a reservation just let me know the time, party size, and day and I’ll make it for you.
Client : Not now, thanks. Here’s a weird one, though - are there any expensive 0-star hotels in the center of town?
Help desk : I’m sorry, but there are no expensive 0-star hotels in the area. Can I help you find something else?
Client : Do you have a hotel in the moderate price range?
Help desk : The Cityroomz Hotel is moderately priced, and has free internet and parking. Would you like to make a reservation now?
Client : That sounds nice, yes please book a room for 3 guests staying for 5 nights starting on Sunday.
Help desk : I’m sorry, those particular dates are filled, if you’d like I could try to make the trip shorter or maybe change days?
Client : How about just for the one night?
Help desk : Alright, I was able to book your hotel room at the Cityroomz, for 3 people, for one night. Your reference number is E62IXZGB. Is there anything else I can do?
Client : I will also need a taxi to go from the restaurant to the hotel, I want to leave the restaurant at 02:15.
Help desk : I have booked your taxi. It will be a blue volvo and the contact number is 07729446320.
Client : Great. Thanks. That’s all I need today.
Help desk : You’re welcome. Have a wonderful visit.
Summaries
Pointer-Generator :
You are traveling to Cambridge. You are looking for a restaurant. The restaurant should serve 1 food and should be in the moderate price range. Make sure you get postcode and address. You are also looking for information in the centre. The hotel should have a star of 0 and should include free wifi. The hotel doesn’t need to include internet. If there is no such hotel, how about one that doesn’t have free parking? Once you find the hotel, you want to book it for 3 people and 5 nights starting from Sunday. If the booking fails, how about 1 night? Make sure you get the reference number.
SPNet :
You are traveling to Cambridge and looking forward to try local restaurants. You are looking for a place to dine. The restaurant should be in the moderate price range and should be in the centre. Make sure you get postcode. You are also looking for a place to stay. The hotel should be in the type of hotel and should have 0-star. The hotel should be in the same area as the restaurant and should be in the same price range as the restaurant. Once you find the hotel, you want to book it for 3 people and 5 nights starting from Sunday. If the booking fails, how about 1 night? Make sure you get the reference number. You also want to book a taxi to commute between the 2 places. You want to leave the restaurant by 02:15. Make sure you get contact number and car type.
Ground Truth :
You are traveling to Cambridge and looking forward to try local restaurants. The restaurant should be in the centre and should be in the moderate price range. Make sure you get postcode. You are also looking for a place to stay. The hotel should be in the expensive price range and should have a star of 0. The hotel should be in the same area as the restaurant. If there is no such hotel, how about one that is in the moderate price range? Once you find the hotel you want to book it for 3 people and 5 nights starting from Sunday. If the booking fails how about 1 night. Make sure you get the reference number. You also want to book a taxi to commute between the 2 places. You want to leave the restaurant by 02:15. Make sure you get contact number and car type.
Table 5: An example dialog and Pointer-Generator, SPNet and ground truth summaries. The dialog spans over three domains: restaurant, hotel and taxi. We underline semantic slots in the conversation. Red denotes incorrect slot values and green denotes the correct ones.

Conversation
Client : I would like a British food restaurant in the centre.
Help desk : Sure, there are 7 restaurants that meet your needs. 4 are moderate and 3 are expensive. Do you have a preference?
Client : Only the best for my family .. We’ll take the expensive one. Book us a table for 5 at 14:00 on Thursday.
Help desk : I’m sorry I am having difficulty making a reservation for you. Shall we try another time or restaurant type?
Client : Let’s try Italian instead.
Help desk : Caffe Uno is a very nice, expensive Italian restaurant in the center of town. Would you like a table there?
Client : Actually, I change my mind. I think I want to stick with British food after all. Can you suggest any one that’s in the centre of town?
Help desk : Fitzbillies Restaurant, the Cambridge Chop House, and Midsummer House are all centrally located, British, and very expensive, but none have a table for 5 available on Thursday at 14:00.
Client : Can you try the Cambridge Chop House for Thursday for 5 people at 13:00 instead?
Help desk : Your reservation at the Cambridge Chop House for 5 people at 13:00 on Thursday was successful. Your reference number is U6GV5ZZV. Is there anything else I can help you with today?
Client : No, that’s all I need. Thanks for your help!
Help desk : You’re welcome, thank you for using our system.
Client : You’re welcome! It was great!
Help desk : Have a good day!
Summaries
Pointer-Generator :
You are looking for a restaurant. The restaurant should serve British food and should be in the centre. Once you find the restaurant you want to book a table for 5 people at 14:00 on Thursday. Make sure you get the reference number.
SPNet :
You are looking for a place to dine. The restaurant should serve British food and should be in the centre. The restaurant should be in the expensive price range. If there is no such restaurant, how about one that serves British food? Once you find the restaurant you want to book a table for 5 people at 13:00 on Thursday. If the booking fails, how about different restaurant in the same area and price range? Make sure you get the reference number.
Ground Truth :
You are looking for a restaurant. The restaurant should serve British food and should be in the centre. Once you find the restaurant, you want to book a table for 5 people at 14:00 on Thursday. If the booking fails, how about 13:00? Make sure you get the reference number.