Automatic Recognition of the General-Purpose Communicative Functions defined by the ISO 24617-2 Standard for Dialog Act Annotation
Eugénio Ribeiro [email protected]
INESC-ID Lisboa, Portugal
Instituto Superior Técnico, Universidade de Lisboa, Portugal
Ricardo Ribeiro [email protected]
INESC-ID Lisboa, Portugal
Instituto Universitário de Lisboa (ISCTE-IUL), Portugal
David Martins de Matos [email protected]
INESC-ID Lisboa, Portugal
Instituto Superior Técnico, Universidade de Lisboa, Portugal
Abstract
ISO 24617-2, the standard for dialog act annotation, defines a hierarchically organized set of general-purpose communicative functions. The automatic recognition of these functions, although practically unexplored, is relevant for a dialog system, since they provide cues regarding the intention behind the segments and how they should be interpreted. We explore the recognition of general-purpose communicative functions in the DialogBank, which is a reference set of dialogs annotated according to this standard. To do so, we propose adaptations of existing approaches to flat dialog act recognition that allow them to deal with the hierarchical classification problem. More specifically, we propose the use of a hierarchical network with cascading outputs and maximum a posteriori path estimation to predict the communicative function at each level of the hierarchy, preserve the dependencies between the functions in the path, and decide at which level to stop. Furthermore, since the amount of dialogs in the DialogBank is reduced, we rely on transfer learning processes to reduce overfitting and improve performance. The results of our experiments show that the hierarchical approach outperforms a flat one and that each of its components plays an important role towards the recognition of general-purpose communicative functions.
1. Introduction
From the perspective of a dialog system, it is important to identify the intention behind the segments in a dialog, since it provides an important cue regarding the information that is present in the segments and how they should be interpreted. According to Searle (1969), the intention behind the uttered words is revealed by the corresponding dialog acts, which he defines as the minimal units of linguistic communication. Consequently, automatic dialog act recognition is an important task in the context of Natural Language Processing (NLP), which has been widely explored over the years. In an attempt to set the ground for more comparable research in the area, Bunt, Alexandersson, Choe, Fang, Hasida, Petukhova, Popescu-Belis, and Traum (2012) defined the ISO 24617-2 standard for dialog act annotation. However, annotating dialogs according to this standard is an exhaustive process, especially since the annotation of each segment does not consist of a single dialog act label, which in the standard nomenclature is called a communicative function, but rather of a complex structure which includes information regarding the semantic dimension of the dialog acts and relations with other segments, among other aspects. Consequently, the amount of data annotated according to the standard is still small and the automatic recognition of its communicative functions remains practically unexplored.

In this article, we explore the automatic recognition of communicative functions in the English dialogs available in the DialogBank (Bunt, Petukhova, Malchanau, Wijnhoven, & Fang, 2016; Bunt, Petukhova, Malchanau, Fang, & Wijnhoven, 2019), which, to the best of our knowledge, is the only publicly available source of dialogs fully annotated according to the standard.
We focus on general-purpose communicative functions, since they are predominant and, contrarily to the dialog act labels of widely explored corpora in dialog act recognition research, they pose a hierarchical classification problem, with paths that may not end on a leaf communicative function.

To approach the problem, we propose modifications to existing approaches on dialog act recognition that allow them to deal with the hierarchical classification problem posed by the general-purpose communicative functions defined by the ISO 24617-2 standard. These modifications focus on the ability to predict communicative functions at the multiple levels of the hierarchy, identify when the available information is not enough to predict more specific functions, and preserve the dependencies between the functions in the path. Furthermore, given the reduced amount of annotated dialogs provided by the DialogBank, we explore the use of additional dialogs annotated using mapping processes. These processes do not lead to a complete annotation according to the standard, but the dialogs can still provide relevant information for training automatic approaches to communicative function recognition. Additionally, we rely on pre-trained dialog act recognition models by using them in transfer learning processes. This way, we can take advantage of their ability to capture generic intention information and focus on identifying that which is most relevant for recognizing the general-purpose communicative functions defined by the standard.

In the remainder of the article, we start by providing an overview on the ISO 24617-2 standard and on dialog act recognition approaches in Section 2. Then, in Section 3, we describe our approach for predicting the general-purpose communicative functions of the standard. Section 4 describes our experimental setup, including the datasets, evaluation methodology, and implementation details. The results of our experiments are presented and discussed in Section 5.
Finally, Section 6 summarizes the contributions of the article and provides pointers for future work.
2. Related Work
Since we are exploring the automatic recognition of the communicative functions defined by the standard for dialog act annotation, in this section we start by providing an overview on that standard. Then, we discuss state-of-the-art approaches on dialog act recognition and the few which have been applied to communicative function recognition.
2.1 The ISO 24617-2 Standard

ISO 24617-2, the ISO standard for dialog act annotation (Bunt et al., 2012; Bunt, Petukhova, Traum, & Alexandersson, 2017), aims at setting the ground for more comparable research in the area.
Figure 1: Hierarchy of general-purpose communicative functions defined by the ISO 24617-2 standard for dialog act annotation.

According to it, dialog act annotations should not be performed on turns or utterances, but rather on functional segments (Carroll & Tanenhaus, 1978). Furthermore, the annotation of each segment does not consist of a single label or set of labels, but rather of a complex structure containing information about the participants, relations with other functional segments, the semantic dimension of the dialog act, its communicative function, and optional qualifiers concerning certainty, conditionality, partiality, and sentiment.

The standard defines nine semantic dimensions –
Task, Auto-Feedback, Allo-Feedback, Turn Management, Time Management, Discourse Structuring, Own Communication Management, Partner Communication Management, and Social Obligations Management – in which different communicative functions may occur. These communicative functions are the standard equivalent of the dialog act labels used to annotate dialogs before the introduction of the standard. They are divided into general-purpose and dimension-specific functions. The former can occur in any semantic dimension and are organized hierarchically as shown in Figure 1. The latter can only occur in the corresponding dimension and are distributed as shown in Table 1. Of the nine semantic dimensions, only the Task dimension does not have specific functions. This means that only general-purpose communicative functions occur in that dimension. Furthermore, with the exception of the functions specific to the Social Obligations Management dimension, which can be split into their initial and return counterparts, dimension-specific functions are all at the same level.

Annotating all of the aspects defined by the standard is an exhaustive process and, consequently, the amount of available data is still reduced and, in many cases, not all of the aspects are considered (e.g. Petukhova, Gropp, Klakow, Schmidt, Eigner, Topf, Srb, Motlicek, Potard, Dines, Deroo, Egeler, Meinz, & Liersch, 2014; Bunt et al., 2016, 2019; Anikina & Kruijff-Korbayová, 2019). To the best of our knowledge, the DialogBank (Bunt et al., 2016, 2019) is the only publicly available source of dialogs fully annotated according to the standard. It features (re-)annotated dialogs from several corpora, but, currently, there are only 15 dialogs in English and 9 in Dutch, which amount to less than 3,000 segments.
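For illustration, the hierarchy of Figure 1 can be encoded as a simple parent-to-children map from which the full path of any general-purpose function can be recovered. The encoding below is our own sketch (the names `HIERARCHY` and `path_to` are not part of the standard), and only a fragment of the hierarchy is included:

```python
# Fragment of the ISO 24617-2 general-purpose function hierarchy (Figure 1),
# encoded as a parent -> children map. This encoding is an illustration,
# not part of the standard itself.
HIERARCHY = {
    "ROOT": ["Information-Transfer Functions", "Action-Discussion Functions"],
    "Information-Transfer Functions": ["Information-Seeking Functions",
                                       "Information-Providing Functions"],
    "Information-Seeking Functions": ["Question"],
    "Question": ["Propositional Question", "Set Question", "Choice Question"],
    "Propositional Question": ["Check Question"],
    "Check Question": ["Test Question"],
    "Information-Providing Functions": ["Inform"],
    "Inform": ["Answer", "Agreement", "Disagreement"],
    "Answer": ["Confirm", "Disconfirm"],
    "Action-Discussion Functions": ["Commissives", "Directives"],
}

def path_to(function, hierarchy=HIERARCHY, node="ROOT"):
    """Return the path from the top of the hierarchy down to `function`,
    or None if the function is not in the encoded fragment."""
    if node == function:
        return [node]
    for child in hierarchy.get(node, []):
        sub = path_to(function, hierarchy, child)
        if sub is not None:
            return sub if node == "ROOT" else [node] + sub
    return None

path_to("Answer")
# ['Information-Transfer Functions', 'Information-Providing Functions',
#  'Inform', 'Answer']
```

Such a map also makes the variable path lengths explicit: the path to Test Question has six levels, while the path to Inform has only three.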
2.2 Automatic Dialog Act Recognition

Given a turn, utterance, or functional segment in a dialog, to which we will refer generically as segment in the remainder of the article, automatic dialog act recognition aims at identifying the intention behind that segment. It is a task that has been widely explored over the years, using both classical machine learning and deep learning approaches. In both cases, the approaches differ mainly on how the representation of a segment is generated from the representations of its tokens and how they are able to weigh context information in the decision process. The article by Král and Cerisara (2010) provides a comprehensive overview on classical machine learning approaches on the task, except for the more recent Support Vector Machine (SVM)-based approaches (Gambäck, Olsson, & Täckström, 2011; Ribeiro, Ribeiro, & Martins de Matos, 2015).

Regarding deep learning approaches, both Recurrent Neural Networks (RNNs) (e.g. Lee & Dernoncourt, 2016; Ji, Haffari, & Eisenstein, 2016; Khanpour, Guntakandla, & Nielsen, 2016; Tran, Zukerman, & Haffari, 2017b) and Convolutional Neural Networks (CNNs) (e.g. Kalchbrenner & Blunsom, 2013; Lee & Dernoncourt, 2016; Liu, Han, Tan, & Lei, 2017) have been used to generate segment representations by combining the embedding representations of their words. While the first focus on capturing information from relevant sequences of tokens, the latter focus on the context surrounding each token and, thus, on capturing relevant token patterns independently of where they occur in the segment. We have compared different RNN- and CNN-based segment representation approaches and achieved higher performance on the task using a set of parallel CNNs with different window sizes, while also
Semantic Dimension                   Communicative Functions
Auto-Feedback                        Auto Positive, Auto Negative
Allo-Feedback                        Allo Positive, Allo Negative, Feedback Elicitation
Own Communication Management         Retraction, Self Correction, Self Error
Partner Communication Management     Correct Misspeaking, Completion
Turn Management                      Turn Accept, Turn Assign, Turn Grab, Turn Keep, Turn Release, Turn Take
Time Management                      Stalling, Pausing
Discourse Structuring                Interaction Structuring, Opening
Social Obligations Management        Greeting, Self Introduction, Apology, Thanking, Goodbye

Table 1: Dimension-specific communicative functions defined by the ISO 24617-2 standard for dialog act annotation.

consuming less resources than when using RNNs (Ribeiro, Ribeiro, & Martins de Matos, 2019b). Still, most of the more recent studies on the task rely on bidirectional RNNs for segment representation (e.g. Kumar, Agarwal, Dasgupta, & Joshi, 2018; Chen, Yang, Zhao, Cai, & He, 2018; Li, Lin, Collinson, Li, & Chen, 2019; Raheja & Tetreault, 2019).

Regarding the representation of the segment's tokens, most approaches on dialog act recognition using deep learning have relied on pre-trained word embedding representations generated by Word2Vec (Mikolov, Sutskever, Chen, Corrado, & Dean, 2013) or GloVe (Pennington, Socher, & Manning, 2014). However, similarly to what happens on most NLP tasks, using contextualized word representations, especially those generated by BERT (Devlin, Chang, Kenton, & Toutanova, 2019), leads to higher performance (Ribeiro et al., 2019b). Additionally, there are studies which also rely on character-level information (e.g. Bothe, Weber, Magg, & Wermter, 2018; Chen et al., 2018; Li et al., 2019; Raheja & Tetreault, 2019). In our studies on that matter, we observed that there are complementary cues for intention at a sub-word level which cannot be captured when relying solely on word-level tokenization (Ribeiro, Ribeiro, & Martins de Matos, 2018a, 2019a).
Finally, functional-level tokenization, mainly in the form of Part-of-Speech (POS) tags, has also been explored for the task (Chen et al., 2018; Ribeiro et al., 2019b). However, it seems to be less relevant.

In terms of context information, previous studies have observed performance improvements when using information regarding turn-taking by the speakers (Liu et al., 2017; Ribeiro et al., 2019b) and the topics covered by the dialog (Li et al., 2019). Still, the most important source of context information is the surrounding segments. Studies dedicated to that matter have shown that the influence of preceding segments decreases with the distance and that their dialog act classification is more informative than their words, even when the classifications are obtained automatically (Ribeiro et al., 2015; Liu et al., 2017). Furthermore, we have shown that sequentiality information and long distance dependencies between the preceding segments can be captured by using an RNN to generate a summary of their classifications (Ribeiro et al., 2019b).

In fact, considering the dependencies between the multiple segments in a dialog, several studies attempted to predict the sequence of dialog acts in a complete dialog by approaching the task as a sequence labeling problem (e.g. Bothe et al., 2018; Kumar et al., 2018). These cases rely on a hierarchical approach in which the representations of the segments are provided to a conversation-level RNN that models the whole dialog. Furthermore, the performance can be improved by including attention mechanisms that identify the information in the surrounding segments that is most relevant for predicting the dialog acts (e.g. Tran et al., 2017b; Chen et al., 2018; Li et al., 2019; Raheja & Tetreault, 2019). Tran, Zukerman, and Haffari (2017c) also observed that propagating uncertainty information concerning the previous predictions can lead to the prediction of better dialog act sequences.
Finally, most of the studies that approach the task as a sequence labeling problem also rely on Conditional Random Fields (CRFs) (e.g. Kumar et al., 2018; Chen et al., 2018; Li et al., 2019; Raheja & Tetreault, 2019) or generative models (Tran, Zukerman, & Haffari, 2017a) as the final layer to predict the sequence of dialog acts. This way, the prediction of the dialog act for a segment is further conditioned on the previous predictions. Overall, this kind of approach achieves the highest performance on the task. However, the conversation-level RNN is typically bidirectional. This means that the prediction of the dialog act for a segment relies not only on the information from previous segments, but also from future ones, which are not available to a dialog system.

Additional studies on dialog act recognition explore alternative approaches or focus on specific applications. For instance, Wan, Yan, Gao, Zhao, Wu, and Yu (2018) approached the task as a Question Answering (QA) problem and employed adversarial training. On the other hand, Ravi and Kozareva (2018) focused on developing dialog act recognition models that can be used in mobile applications.

Regarding the recognition of the communicative functions defined by the ISO 24617-2 standard, to the best of our knowledge, besides us, only Anikina and Kruijff-Korbayová (2019) have addressed that problem. More specifically, they have explored the recognition of a compressed set of eight communicative functions on the TRADR corpus (Kruijff-Korbayová, Colas, Gianni, Pirri, de Greeff, Hindriks, Neerincx, Ögren, Svoboda, & Worst, 2015). This compressed set merges functions in the Task, Turn Management, and Feedback dimensions and does not consider the hierarchical nature of general-purpose functions. Thus, the task was approached as a flat classification problem. The authors compared the performance of several Deep Neural Network (DNN) architectures and uncontextualized embedding approaches and observed the highest performance when the representation of the segment was generated by passing GloVe embeddings through a Long Short-Term Memory unit (LSTM). However, the use of parallel CNNs was not explored and only large window sizes were considered. Furthermore, similarly to what was observed in previous studies on dialog act recognition, using context information regarding the dialog history led to improved performance. However, in this case, it was summarized as the average of the embedding representations of all the words in the dialog history.

Given the reduced amount of dialogs in the DialogBank, we have mapped the dialog act labels of the LEGO corpus (Schmitt, Ultes, & Minker, 2012) into the communicative functions defined by the ISO 24617-2 standard. To assess the utility of these partially annotated dialogs, we have performed preliminary experiments on the automatic recognition of general-purpose communicative functions in the DialogBank (Ribeiro, Ribeiro, & Martins de Matos, 2020). In those experiments, we flattened the hierarchy and addressed the task as a flat dialog act recognition problem using the same approach we used to predict the original dialog act labels of the LEGO corpus (Ribeiro et al., 2019a). Then, we observed performance improvements when the LEGO dialogs were included in the training phase, in comparison to when the classifier was trained solely on DialogBank dialogs.
3. General-Purpose Communicative Function Recognition
The main difference between general-purpose communicative function recognition and traditional dialog act recognition is that the former poses a hierarchical classification problem, with paths that may not end on a leaf. However, both are intention recognition problems at their core. Thus, we approach the problem by adapting existing dialog act recognition approaches to deal with hierarchical problems. This way, we build on the ability of those approaches to capture generic information regarding intention. Furthermore, this allows us to explore the use of existing pre-trained models in transfer learning processes, in an attempt to minimize the impact of the scarcity of annotated data.
Figure 2: The generic dialog act recognition process.

Although the studies on dialog act recognition covered in Section 2.2 explored different aspects that are relevant for the task, most approaches can be summarized as the generic four-step process shown in Figure 2. Given a segment in a dialog, the first step towards the identification of the main dialog act that it communicates is to split it into its constituent tokens and generate adequate representations for each of them. As discussed in Section 2.2, the tokenization is typically performed at the word level. However, other tokenizations, such as at the character or functional levels, can be used to provide complementary information. In such cases, all the different tokenizations of the segment are considered in the subsequent steps of the process. The next step is to generate a representation of the segment by combining the representations of its tokens. Ideally, this representation should focus on capturing the characteristics of the segment that are relevant to identify the intention that it transmits. Then, the representation is decorated with context information regarding the dialog history and speaker information. This context information can be provided in the form of external features or by the recognition process itself in a recurrent manner. Finally, the information provided by the decorated segment representation is used to identify the dialog act communicated by the segment.

Since the segment representation decorated with context information focuses on providing information that allows the identification of the intention behind the segment, all the steps towards its generation are relevant for the recognition of both traditional dialog acts and general-purpose communicative functions.
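As a rough sketch, the four steps can be wired together as follows. All bodies are simplified stand-ins (random embeddings, mean pooling, a linear softmax layer), not the actual models used in our experiments:

```python
import numpy as np

rng = np.random.default_rng(0)
EMBEDDINGS = {}  # token -> vector; stand-in for pre-trained embeddings

def token_representations(segment, dim=8):
    """Step 1: tokenize (word level) and embed each token."""
    tokens = segment.lower().split()
    for t in tokens:
        EMBEDDINGS.setdefault(t, rng.normal(size=dim))
    return np.stack([EMBEDDINGS[t] for t in tokens])

def segment_representation(token_reps):
    """Step 2: combine token representations into one vector
    (mean pooling as a stand-in for the CNN/RNN encoders of Section 2.2)."""
    return token_reps.mean(axis=0)

def decorate(segment_rep, context_rep, speaker_changed):
    """Step 3: append dialog-history and speaker information."""
    return np.concatenate([segment_rep, context_rep, [float(speaker_changed)]])

def classify(decorated, weights, labels):
    """Step 4: a linear layer plus softmax over the dialog act labels."""
    logits = weights @ decorated
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return labels[int(np.argmax(probs))], probs

# Example wiring for one segment (random weights stand in for training):
labels = ["Inform", "Question"]
reps = token_representations("could you repeat that")
decorated = decorate(segment_representation(reps), np.zeros(8), speaker_changed=True)
weights = rng.normal(size=(len(labels), decorated.shape[0]))
predicted, probs = classify(decorated, weights, labels)
```

In a real system, the context vector would summarize the preceding segments and their classifications, and the classifier would be trained end to end.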
Consequently, the adaptation of existing dialog act recognition approaches to the recognition of general-purpose communicative functions refers to how that decorated segment representation can be specialized to allow the identification of the hierarchically structured functions. A possible approach, which does not require modifications to the architecture of existing dialog act recognition approaches, is to simply flatten the hierarchy. This way, the recognition of general-purpose communicative functions can be approached as a flat single-label classification problem. However, this means that the relations between each communicative function and its ancestors and descendants are not considered. Thus, we use it as a baseline for assessing the ability of our approach to capture information regarding hierarchical dependencies and use it to improve the ability to recognize general-purpose communicative functions.
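As a toy illustration of this baseline, flattening amounts to treating every path in the hierarchy as an atomic label (the " / " path encoding is our own, not part of the standard):

```python
# Flattening the hierarchy: each communicative function path becomes a
# single atomic label, discarding all ancestor/descendant relations.
paths = [
    ("Information-Transfer Functions", "Information-Providing Functions",
     "Inform"),
    ("Information-Transfer Functions", "Information-Providing Functions",
     "Inform", "Answer"),
    ("Action-Discussion Functions", "Directives", "Request"),
]

flat_labels = sorted({" / ".join(p) for p in paths})
label_index = {label: i for i, label in enumerate(flat_labels)}
```

Note that the labels for the Inform and Inform → Answer paths become two unrelated classes, which is precisely the hierarchical information that the flat baseline discards.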
Figure 3: Our adaptation of a generic dialog act recognition approach to deal with the hierarchical problem posed by the ISO 24617-2 general-purpose communicative functions. The input is a segment representation decorated with context information. The circles represent concatenation operations.

The top performing dialog act recognition approaches are based on deep neural networks. In this context, when dealing with the multi-class single-label classification problems posed by most dialog act annotations, the output layer applies the softmax activation function to obtain a probability distribution over the classes. The dialog act of a given segment is then predicted by selecting the class with the highest probability. As shown in Figure 3, in order to consider the hierarchical structure of the general-purpose communicative functions defined by the ISO 24617-2 standard, we propose to use an output layer per level of the hierarchy instead of a single output layer. This way, each output layer focuses on distinguishing between communicative functions at the corresponding level without having to deal with the ambiguity caused by functions that are ancestors or descendants of each other.

Additionally, we introduce a specialization layer per level, which is a fully connected layer that, as the name suggests, specializes the decorated segment representation by capturing the information that is most relevant for distinguishing the communicative functions at that level of the hierarchy. Furthermore, these layers are used to reduce the probability of overfitting by applying dropout (Srivastava, Hinton, Krizhevsky, Sutskever, & Salakhutdinov, 2014) during the training phase.
The use of specialization layers has already proven important in our studies on automatic dialog act recognition (Ribeiro, Ribeiro, & Martins de Matos, 2018b; Ribeiro et al., 2019b), including those on the DIHANA corpus (Benedí, Lleida, Varona, Castro, Galiano, Justo, de Letona, & Miguel, 2006), which is annotated for dialog acts at three different levels.

The final modification to existing network architectures refers to the use of cascading outputs, that is, the probability distribution predicted by the network at a given level is appended to the decorated segment representation before it is passed to the specialization layer of the next level. This way, the network can capture information regarding the hierarchical dependencies between communicative functions at different levels.

The general-purpose communicative functions defined by the ISO 24617-2 standard follow a strict hierarchy, the classification of a segment does not necessarily end on a leaf, and the leaves are not all at the same level. Thus, the network must be able to predict paths with variable length. To approach this problem, we add an additional label, None, to each level of the hierarchy, to represent that there is no communicative function attributed to the segment at that level. This way, we are able to simulate paths with fixed length, while introducing minimal impact on the network during the training phase. The drawback is that the None label may become the predominant one in the deeper levels, biasing the output layers towards its prediction. These additional labels are also considered when providing context information to the network, in order to have fixed dimensionality.

During the inference phase, the parent-child relations between the general-purpose communicative functions must be considered in order to avoid predicting an invalid path. This means that, when selecting the label at a given level, only the children of the label selected for the level above it can be considered. This restriction can be enforced using an iterative approach that starts by selecting the communicative function with the highest probability at the top level and then applies a mask on the predictions of the level below it, in order to discard the communicative functions that are not children of the selected one. This process is then repeated for each level of the hierarchy. However, the performance of this approach is highly impaired when misclassifications occur in the upper levels of the hierarchy.

To attenuate the impact of misclassifications in the upper levels of the hierarchy, we explore a prediction approach based on Maximum a Posteriori (MAP) estimation, in which the predicted communicative function for a given segment, s, is given by

    CommunicativeFunction(s) = argmax_{f ∈ F} ∏_{d=1}^{D} P(L_d = Path(f)_d | s),    (1)

where F is the set of general-purpose communicative functions, D is the depth of the hierarchy, and P(L_d = c | s) is given by the softmax output corresponding to label c of the output layer corresponding to the level at depth d.
That is, instead of iteratively selecting a communicative function at each level, we compute the posterior probability of all valid paths in the hierarchy and select the one with the highest probability.

Since the classification of a segment does not necessarily end on a leaf and the leaves are not all at the same level, we also rely on the additional None labels during the inference phase. For instance, given a segment s, the probability of selecting the path that ends in the Answer communicative function is given by

    P(Answer | s) = P(L_1 = Information-Transfer Functions | s)
                  × P(L_2 = Information-Providing Functions | s)
                  × P(L_3 = Inform | s)
                  × P(L_4 = Answer | s)
                  × P(L_5 = None | s)
                  × P(L_6 = None | s).    (2)
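The difference between the two inference strategies can be illustrated on a toy depth-two fragment of the hierarchy, with invented per-level softmax outputs (all numbers below are hypothetical):

```python
# Toy depth-two fragment: level labels, parent-child relations, and
# hypothetical per-level softmax outputs for one segment.
# "None" means "no function at this level".
LEVEL_LABELS = [["Question", "Inform"],
                ["Set Question", "Propositional Question", "Answer", "None"]]
CHILDREN = {"Question": {"Set Question", "Propositional Question", "None"},
            "Inform": {"Answer", "None"}}
PROBS = [{"Question": 0.45, "Inform": 0.55},
         {"Set Question": 0.40, "Propositional Question": 0.35,
          "Answer": 0.05, "None": 0.20}]

def iterative_decode():
    """Top-down decoding: pick the best label per level, masking out
    labels that are not children of the previous choice."""
    path, allowed = [], set(LEVEL_LABELS[0])
    for level in PROBS:
        choice = max((l for l in level if l in allowed), key=level.get)
        if choice == "None":
            break
        path.append(choice)
        allowed = CHILDREN[choice]
    return path

def map_decode():
    """Score every valid (None-padded) path by the product of its
    per-level probabilities, as in Equation (1), and return the best."""
    best, best_score = None, -1.0
    for first in LEVEL_LABELS[0]:
        for second in CHILDREN[first]:
            score = PROBS[0][first] * PROBS[1][second]
            if score > best_score:
                best = [l for l in (first, second) if l != "None"]
                best_score = score
    return best

print(iterative_decode(), map_decode())
# ['Inform'] ['Question', 'Set Question']
```

Here the iterative approach commits to Inform at the top level (0.55) and then stops, while MAP estimation recovers the path Question → Set Question, whose joint probability (0.45 × 0.40 = 0.18) is higher than that of any path through Inform, which is exactly the behavior that motivates Equation (1).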
4. Experimental Setup
This section describes our experimental setup, including the datasets, the evaluation methodology, and the complete network architecture used in our experiments, including implementation details that allow future reproduction of those experiments.
4.1 Datasets

As discussed in Section 2.1, to the best of our knowledge, the DialogBank (Bunt et al., 2016, 2019) is the only publicly available source of dialogs fully annotated according to the ISO 24617-2 standard guidelines. Thus, in our experiments, we use it as the gold standard for evaluating the performance of the different approaches. However, given the scarcity of annotated dialogs, we also rely on the LEGO-ISO dataset (Ribeiro et al., 2020), which features dialogs that were partially annotated with the communicative functions defined by the ISO 24617-2 standard through label mapping processes. Finally, we also rely on the Switchboard Dialog Act Corpus (Jurafsky, Shriberg, & Biasca, 1997), which is the most explored in dialog act recognition studies, to train models for transfer learning purposes. These datasets are described in further detail below.
Function                 MapTask  Switchboard  TRAINS  DBOX  Total
Inform                        56          338      44    37    475
Instruct                     143            0       1    11    155
Answer                        35           30      16    31    112
Propositional Question        26           11       2    25     64
Set Question                   7           12      13    28     60
Accept Request                52            0       1     1     54
Agreement                      8           42       2     1     53
Check Question                28            9       7     6     50
Confirm                       11            9       6    14     40
Suggest                        3            3       2     5     13
Disconfirm                     1            1       0    10     12
Request                        2            2       1     4      9
Choice Question                3            0       1     4      8
Correction                     2            0       0     1      3
Address Request                3            0       0     0      3
Offer                          0            0       0     2      2
Decline Offer                  0            0       0     1      1
Disagreement                   1            0       0     0      1
Accept Offer                   0            0       0     1      1
Accept Suggest                 0            0       0     1      1
Promise                        0            0       0     1      1
General-Purpose CFs          381          457      96   184  1,118
None                         281          555     140   266  1,242
Total                        662        1,012     236   450  2,360

Table 2: Distribution of the general-purpose communicative functions defined by the ISO 24617-2 standard in the DialogBank.
The DialogBank (Bunt et al., 2016, 2019) aims at collecting and providing dialogs annotated fully according to the ISO 24617-2 standard guidelines. At the time of this study, it features (re-)annotated dialogs from four English corpora and four Dutch corpora. There is a total of 15 annotated dialogs in English and 9 in Dutch. To avoid the issues regarding multilinguality, we focus on the English dialogs in this study. Three of those dialogs are originally from MapTask (Anderson, Bader, Bard, Boyle, Doherty, Garrod, Isard, Kowtko, McAllister, Miller, et al., 1991), four are from Switchboard (Godfrey, Holliman, & McDaniel, 1992), three are from TRAINS (Allen & Schubert, 1991), and five are from DBOX (Petukhova et al., 2014). In total, the dialogs contain 2,360 annotated segments, out of which 1,118 have general-purpose communicative functions in the Task dimension.

The general-purpose communicative functions are distributed in the DialogBank according to Table 2. We can see that, overall, the distribution is highly unbalanced, with the most common, Inform, covering 42% of the segments that have a general-purpose communicative function, while 10 of the functions that occur in the DialogBank occur in less than 10 segments. The predominance of the Inform communicative function becomes even more apparent if we consider the paths in the hierarchy. Answer, Agreement, Confirm, Disconfirm, and Disagreement are descendants of Inform. This means that, of the segments that have a general-purpose communicative function, 62% have the Inform function.

Another important aspect revealed in Table 2 is that the distribution of general-purpose communicative functions is also highly unbalanced across the dialogs of the different corpora that are included in the DialogBank, even in terms of the most common communicative functions. For instance, not considering paths, 92% of the segments with the Instruct and 96% with the Accept Request communicative functions belong to MapTask dialogs. On the other hand, 71% of the segments with the Inform communicative function belong to Switchboard dialogs. This reveals the heterogeneity of the DialogBank, which is representative of the different kinds of dialog that occur in human-human and human-machine interaction.

Overall, although it includes dialogs from multiple corpora, the amount of data provided by the DialogBank is not enough for drawing solid conclusions from the results of DNN-based approaches trained solely on it, especially considering the hierarchical nature of the general-purpose communicative functions that we intend to recognize automatically. However, since these dialogs are the closest we have to a gold standard annotation, the evaluation of our approaches on general-purpose communicative function recognition must be based on the performance on the DialogBank.
LEGO-ISO (Ribeiro et al., 2020) consists of 347 dialogs from the Let’s Go Bus Information System (Raux, Bohus, Langner, Black, & Eskenazi, 2006), containing 14,186 utterances annotated with the communicative functions defined by the ISO 24617-2 standard. Each dialog features the system and a human user. Since the 9,803 system utterances are generated through slot filling of fixed templates, they have no errors and contain casing and punctuation information. In contrast, the transcriptions of the 5,103 user utterances were obtained using an Automatic Speech Recognition (ASR) system and, consequently, are subject to recognition errors and contain no casing nor punctuation information. However, a concrete value for the transcription Word Error Rate (WER) was not revealed.

The annotation with the standard's communicative functions was obtained through the mapping of the original dialog act annotations of the LEGO corpus (Schmitt et al., 2012). The mapping was based solely on the original labels and the transcriptions of the turns. This means that the annotation is performed on turns rather than on functional segments and that it does not cover every semantic dimension. Table 3 shows the distribution of general-purpose communicative functions in the corpus after the mapping. We can see that, although the number of segments is larger, the set of communicative functions covered by the corpus is only a subset of that covered by the DialogBank. This is partially due to the label mapping process, which did not consider the specificities of certain segments, but, most importantly, it is due to the characteristics of the dialogs, which are highly focused on the task. Thus, system segments typically have a communicative function that is a descendant of
Question, so that it can obtain all the information required to fulfill the task. On the other hand, user segments typically have a communicative function that is a descendant of Inform, since they aim at providing that information to the system.

Function               System Segments   User Segments   Total
Check Question         2,256             1               2,257
Set Question           1,987             210             2,197
Instruct               1,812             106             1,918
Answer                 0                 1,462           1,462
Inform                 656               600             1,256
Confirm                0                 1,162           1,162
Disconfirm             0                 1,105           1,105
Promise                277               0               277
Request                54                85              139
Suggest                40                0               40
General-Purpose CFs
None
Total

Table 3: Distribution of the general-purpose communicative functions in the LEGO-ISO corpus after the label mapping, split into system and user segments.

Furthermore, the most common communicative function is
Check Question, because the system tries to confirm that it understood every piece of information provided by the user correctly. Also given the focus on the task, only 17% of the segments do not have a general-purpose communicative function, which contrasts with the 53% of the DialogBank.

Since its communicative-function annotations were obtained through label mapping processes, this dataset cannot be used as a gold standard. Still, it is 20 times larger than the DialogBank in number of English dialogs and 6 times larger in number of annotated segments. Thus, it provides a significant amount of data that, according to the results of our preliminary studies (Ribeiro et al., 2020), can be used during the training phase to improve the performance on general-purpose communicative function recognition. However, given its size in comparison to the DialogBank, classifiers may overfit to its characteristics.
The Switchboard Dialog Act Corpus (Jurafsky et al., 1997) is an annotated subset of the Switchboard (Godfrey et al., 1992) corpus. It is the largest and most explored corpus annotated with dialog act information, consisting of 1,155 manually transcribed conversations containing 223,606 segments. The conversations are between pairs of humans and cover multiple domains. The corpus is annotated for dialog acts using the domain-independent SWBD-DAMSL tag set, which features over 200 unique labels. However, most studies use a reduced set of 42 to 44 labels to obtain a higher inter-annotator agreement and higher example frequencies per class. Table 4 shows this set of labels and its distribution in the corpus. The total number of segments labeled with a dialog act is lower than the previously referred 223,606, since some segments are considered continuations of the previous one by the same speaker and aggregated to it.
Label                         Count    Label                        Count
Statement-Non-Opinion         72,824   Collaborative Completion     699
Acknowledgement               37,096   Repeat-Phrase                660
Statement-Opinion             25,197   Open-Question                632
Agreement                     10,820   Rhetorical-Question          557
Abandoned                     10,569   Hold                         540
Appreciation                   4,663   Reject                       338
Yes-No-Question                4,624   Negative Non-No Answer       292
Non-Verbal                     3,548   Non-understanding            288
Yes Answer                     2,934   Other Answer                 279
Conventional Closing           2,486   Conventional Opening         220
Uninterpretable                2,158   Or-Clause                    207
Wh-Question                    1,911   Dispreferred Answers         205
No Answer                      1,340   3rd-Party-Talk               115
Response Acknowledgement       1,277   Offers / Options             109
Hedge                          1,182   Self-talk                    102
Declarative Yes-No-Question    1,174   Downplayer                   100
Other                          1,074   Maybe                         98
Backchannel-Question           1,019   Tag-Question                  93
Quotation                        934   Declarative Wh-Question       80
Summarization                    919   Apology                       76
Affirmative Non-Yes Answer       836   Thanking                      67
Action Directive                 719
Total

Table 4: The reduced set of dialog act labels and its distribution in the Switchboard Dialog Act Corpus.
Some of these labels, such as Yes-No-Question, Yes Answer, and No Answer, can be directly mapped into the Propositional Question, Confirm, and Disconfirm communicative functions, respectively. Mappings of this kind are possible for several other labels, not only to general-purpose communicative functions, but also to dimension-specific functions. Fang, Cao, Bunt, and Liu (2012) provide further insight into the possible mapping between the dialog act labels of the Switchboard Dialog Act Corpus and the communicative functions defined by the ISO 24617-2 standard.

Given the size of the corpus and the similarity of the intentions revealed by its dialog act label set, we use it in our experiments to train a flat dialog act recognition model, so that its weights can be used in transfer learning processes. This way, the probability of overfitting the representations to the characteristics of the training dialogs is reduced, which may improve the generalization ability of the classifiers.

In this study, we focus on the recognition of ISO 24617-2 general-purpose communicative functions in the DialogBank. Below, we describe the evaluation scenarios, the evaluation approach, and the metrics.
Although general-purpose communicative functions may occur in any of the semantic dimensions defined by the ISO 24617-2 standard, we focus on the Task dimension, since the number of occurrences of general-purpose communicative functions in the remaining dimensions is not representative in the DialogBank. In this context, we defined two evaluation scenarios. The first focuses on the recognition of the different general-purpose functions in the segments that have communicative functions in the Task dimension. Thus, the remaining segments are discarded. On the other hand, the second scenario also considers the identification of segments which have communicative functions in the Task dimension. Thus, all segments are considered and a new label, None, is given to those which do not have communicative functions in that dimension.
Given the reduced amount of dialogs in the DialogBank, it is not feasible to split it into partitions for training, development, and testing. Thus, we evaluate performance using two cross-validation approaches. The first is leave-one-dialog-out cross-validation, that is, the predictions for the segments in each dialog are made by classifiers trained on all the remaining dialogs. We use this as our main evaluation approach because it maximizes the amount of gold standard data available for training.

The second evaluation approach, leave-one-corpus-out cross-validation, takes advantage of the fact that the DialogBank features dialogs from multiple corpora. In this case, the predictions for the segments in each dialog do not rely on training information from other dialogs in the same corpus. Thus, to an extent, this approach can be used to assess cross-corpora generalization capabilities.

To keep the evaluation as fair as possible, contrarily to what we did when assessing the ability of the LEGO-ISO dialogs to provide relevant information for training a communicative function recognizer (Ribeiro et al., 2020), we do not perform any fine-tuning to maximize the performance on the left-out dialog(s) in each fold. Instead, in each fold, we train an ensemble of classifiers on the corresponding training dialogs. Each of the classifiers in the ensemble is fine-tuned to maximize the performance on one of those training dialogs, while being trained on the remainder and the LEGO-ISO dialogs. This way, we remove the impact of selecting a single dialog for fine-tuning. The predicted classification of each segment in the left-out dialog(s) is then given by a weighted majority vote of the classifiers in the ensemble. The weights are given by the estimated probability for the predicted path, and ties are broken randomly.
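The ensemble decision described above can be sketched in a few lines. This is a minimal illustration under our own naming, not code from the actual system: each classifier is assumed to report its predicted path together with the probability it estimated for that path.

```python
import random
from collections import defaultdict

def weighted_majority_vote(predictions, rng=None):
    """Combine the ensemble predictions for one segment.

    `predictions` holds one (path, weight) pair per classifier, where
    `path` is a tuple of communicative functions and `weight` is the
    probability the classifier estimated for that path.
    """
    totals = defaultdict(float)
    for path, weight in predictions:
        totals[path] += weight
    best = max(totals.values())
    # Ties between paths with equal accumulated weight are broken randomly.
    tied = [path for path, total in totals.items() if total == best]
    return (rng or random).choice(tied)

# Two classifiers agree on a shallow path, one prefers a deeper one:
# the accumulated weight decides in favor of the shallow path.
vote = weighted_majority_vote([
    (("Inform",), 0.9),
    (("Inform", "Answer"), 0.6),
    (("Inform",), 0.4),
])
```

Since a single confident classifier cannot outvote two agreeing ones unless its estimated path probability exceeds their combined weight, this scheme favors predictions the ensemble is collectively sure about.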
The most common metric used to evaluate dialog act recognition approaches is accuracy. Its counterpart in the context of hierarchical classification problems is the exact match ratio (MR), which is defined as

MR = \frac{1}{n} \sum_{i=1}^{n} I(Y_i = Z_i),    (3)

where n is the number of evaluation examples, Y_i is the set of labels in the gold standard path of example i, Z_i is the set of labels in the path predicted by the classifier for the same example, and I is the indicator function.

In addition to the exact match ratio (MR), we also report results in terms of the hierarchical versions of precision (hP), recall (hR), and F-measure (hF) proposed by Kiritchenko, Matwin, and Famili (2005), defined as

hP = \frac{\sum_{i=1}^{n} |Y_i \cap Z_i|}{\sum_{i=1}^{n} |Z_i|},    (4)

hR = \frac{\sum_{i=1}^{n} |Y_i \cap Z_i|}{\sum_{i=1}^{n} |Y_i|},    (5)

hF = \frac{2 \cdot hP \cdot hR}{hP + hR}.    (6)

These hierarchical metrics are relevant, since they consider partial path matches and, thus, capture the difference between predicting a label that shares part of its path with the correct label and one that follows a completely different path.

In some of our experiments, we use a set of additional metrics to assess the performance at each level of the hierarchy. Still, these are based on the exact match ratio and on the traditional metrics of precision and recall. To improve readability, we report the values of every metric in percentage form.

As discussed in Section 3, our approach to deal with the hierarchical problem posed by the general-purpose communicative functions defined by the ISO 24617-2 standard can be applied on top of any approach to generate segment representations. In our experiments, we relied on the same approach used in our study that explored the multiple aspects that contribute to dialog act recognition (Ribeiro et al., 2019b).
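As a concrete illustration, the evaluation metrics of Equations 3 to 6 can be implemented directly, with each path represented as the set of labels it contains. The function names are ours; this is a sketch, not the evaluation code used in the experiments.

```python
def exact_match_ratio(gold, pred):
    """Equation 3: fraction of examples whose predicted label set
    matches the gold standard path exactly."""
    return sum(y == z for y, z in zip(gold, pred)) / len(gold)

def hierarchical_prf(gold, pred):
    """Equations 4-6: hierarchical precision, recall, and F-measure,
    computed over the label sets of the gold and predicted paths."""
    overlap = sum(len(y & z) for y, z in zip(gold, pred))
    h_p = overlap / sum(len(z) for z in pred)
    h_r = overlap / sum(len(y) for y in gold)
    h_f = 2 * h_p * h_r / (h_p + h_r)
    return h_p, h_r, h_f

# Predicting only the upper part of a path is penalized less than
# predicting a completely different one: here the first prediction
# misses the deeper label but keeps perfect hierarchical precision.
gold = [{"Inform", "Answer"}, {"Question", "Set Question"}]
pred = [{"Inform"}, {"Question", "Set Question"}]
```

On this toy pair, MR is 0.5 while hP is 1.0 and hR is 0.75, which shows how the hierarchical metrics reward partial path matches that the exact match ratio ignores.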
This way, we know that the segment representation approach is able to capture information regarding intention, and we can use the models trained during that study in transfer learning processes.

Figure 4 shows the complete architecture of the network used in our experiments. Two representations of the segment are generated in parallel, one based on its characters and another on contextualized embedding representations of its words, generated by BERT (Devlin et al., 2019). In both cases, the representation of the segment is generated by concatenating the outputs of three parallel CNNs with different window sizes followed by a max-pooling operation. At the character level, we use windows of size three, five, and seven, in order to focus on affixes, lemmas, and inter-word relations. At the word level, we use windows of size one, two, and three, in order to focus on independent words and short word patterns. The two representations are then concatenated and decorated with context information.

Figure 4: The full architecture of the automatic communicative function recognition approach used in our experiments. w_n and c_n refer to the embedding representation of the n-th word and character, respectively. The representations of words are generated by BERT (Devlin et al., 2019) and, thus, are contextualized. w in the CNNs refers to the width of the context window. D refers to the depth of the hierarchy. The circles represent concatenation operations.

Regarding context information, considering that the number of dialogs in the DialogBank is small, we do not rely on a summary of the whole dialog history as in the original approach, because it is prone to overfitting. Instead, we use a flattened sequence of classifications and turn-taking information of the three preceding segments, which have proved the most important in previous studies (Ribeiro et al., 2015; Liu et al., 2017). The classification of each preceding segment is represented as a concatenation of the one-hot representations of the communicative functions at each level of the hierarchy. Turn-taking information is provided as flags stating whether the speaker changed.

The specialization and output layers are as described in Section 3. That is, there are dedicated specialization and output layers for each level in the hierarchy, the specialization layer of each level also considers the output at the upper levels, and each output layer considers an additional class representing the lack of a communicative function at that level.

In order to take advantage of the intention information captured by a dialog act recognition model trained on a large corpus, we apply a transfer learning process.
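The context decoration step can be sketched as follows. The function name and the two-level label inventories are ours, chosen only for illustration; they do not correspond to the actual inventories of the hierarchy.

```python
def context_features(preceding, speaker_changed, labels_per_level):
    """Flatten the per-level classifications and turn-taking flags of
    the (up to three) preceding segments into a single feature vector.

    `preceding` holds one path per segment, with one label (or None)
    per hierarchy level; `speaker_changed` holds one flag per segment.
    """
    features = []
    for path, changed in zip(preceding, speaker_changed):
        for labels, label in zip(labels_per_level, path):
            # One-hot representation of the function at this level;
            # all zeros when the segment has no function at it.
            one_hot = [0] * len(labels)
            if label in labels:
                one_hot[labels.index(label)] = 1
            features.extend(one_hot)
        # Flag stating whether the speaker changed before this segment.
        features.append(int(changed))
    return features

# Hypothetical two-level inventory and a single preceding segment.
levels = [["Inform", "Question"], ["Answer", "Set Question"]]
vector = context_features([("Inform", "Answer")], [True], levels)  # [1, 0, 1, 0, 1]
```

Representing the absence of a function at a level as an all-zeros block keeps the vector length fixed regardless of how deep the preceding segments' paths go.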
More specifically, we preset the weights of the segment representation layers of our hierarchical model using the corresponding weights of the flat dialog act recognition model. Consequently, only the specialization and output layers are trained on the dialogs annotated according to the ISO 24617-2 standard. This way, the segment representations provide generic information regarding intention that is then specialized for the distinction among general-purpose communicative functions at each level.

To implement our classifiers, we used Keras (Chollet et al., 2015) with the TensorFlow (Abadi et al., 2015) backend. To update the weights during the training phase, we used the Adam optimizer (Kingma & Ba, 2015) with the default parameterization and mini-batches of size 512. To decide when to stop training, we used early stopping with 10 epochs of patience. That is, the training phase of each classifier stopped after ten epochs without improvement on the validation set composed of the corresponding fine-tuning dialog(s).

To obtain contextualized word representations, we used the output of the last layer of the large uncased BERT model (Devlin et al., 2019). When using character-level tokenization, embedding representations of the characters were trained together with the network to capture relations between them. To generate the segment representations, we used 100 filters in each CNN and aggregated the results using the max-pooling operation. Finally, the specialization layers were implemented as Rectified Linear Units (ReLUs) (Nair & Hinton, 2010) with 200 neurons and 50% dropout probability (Srivastava et al., 2014).

For transfer learning purposes, we relied on the model that achieved the top performance on the Switchboard Dialog Act Corpus in our study on dialog act recognition (Ribeiro et al., 2019b). Thus, the weights of the CNNs and the character-level embedding layer were set to those of that model and fixed during the training phase.
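The output processing stage, filtering followed by maximum a posteriori path estimation, can be illustrated with the sketch below. The names and toy probabilities are ours; we assume the probability of a path factorizes as the product of the per-level softmax outputs, and that the filtering step has already produced the hierarchy-consistent candidate paths, each padded with "None" down to the deepest level.

```python
import math

def map_path(level_probs, valid_paths):
    """Maximum a posteriori path estimation over cascading outputs.

    `level_probs` holds one mapping per level from labels (including
    "None") to probabilities; `valid_paths` holds the candidate paths
    that are consistent with the hierarchy.
    """
    def log_prob(path):
        # Log of the product of the per-level probabilities of the path.
        return sum(math.log(probs[label])
                   for probs, label in zip(level_probs, path))
    return max(valid_paths, key=log_prob)

# A joint decision can recover from a top-level misclassification: the
# top level slightly prefers Question, but the confident second level
# pulls the joint prediction towards Inform -> Answer.
probs = [
    {"Inform": 0.4, "Question": 0.6},
    {"Answer": 0.8, "Set Question": 0.1, "None": 0.1},
]
paths = [
    ("Inform", "Answer"), ("Inform", "None"),
    ("Question", "Set Question"), ("Question", "None"),
]
best = map_path(probs, paths)  # ("Inform", "Answer")
```

Because the joint score of ("Inform", "Answer") is 0.4 * 0.8 = 0.32, it beats every path starting with the top level's preferred Question label, whose best completion scores only 0.6 * 0.1 = 0.06.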
Evaluation      Approach       MR           hP           hR           hF
Dialog Folds    Flat           62.69 ± .17  82.40 ± .12  80.91 ± .15  81.65 ± .13
                Hierarchical   66.18 ± .47  85.59 ± .16  83.20 ± .24  84.39 ± .19
Corpus Folds    Flat           46.30 ± .04  72.22 ± .06  70.14 ± .67  71.17 ± .38
                Hierarchical   47.20 ± .49  74.26 ± .25  71.53 ± .34  72.87 ± .27

Table 5: Results achieved while predicting general-purpose communicative functions in the DialogBank segments that have a communicative function in the Task dimension. The top block refers to the results achieved when evaluating using leave-one-dialog-out cross-validation, while the bottom block refers to those achieved when evaluating using leave-one-corpus-out cross-validation.
5. Results & Discussion
In this section, we present and discuss the results of our experiments. We start by looking into the results achieved on segments that have communicative functions in the Task dimension. Then, we look into the results achieved in the scenario in which the identification of the segments that have communicative functions in that dimension is also considered. Finally, we discuss the importance of the multiple components of the architecture and of transfer learning processes while looking into the results of our ablation studies.
Table 5 shows the results of our experiments on the automatic recognition of general-purpose communicative functions in segments that have a communicative function in the Task dimension. When evaluating using leave-one-dialog-out cross-validation, we can see that, on average, the hierarchical classifier outperformed the flat one by 3.49 percentage points in terms of exact match ratio and 2.74 in terms of hierarchical F-measure. This suggests that the per-level specialization layers are able to capture the information that is most important for distinguishing the communicative functions at a given level and that the output cascade is able to capture information regarding the hierarchical dependencies between the communicative functions at different levels.

When evaluating using leave-one-corpus-out cross-validation, we can observe similar patterns. However, the average exact match ratio and F-measure of the best approach decrease by 18.98 and 11.52 percentage points, respectively. In addition to the lower number of segments used to train the classifiers, part of this drop in performance is explained by the fact that, as discussed in Section 4.1, the corpora featured in the DialogBank have different characteristics and, thus, some communicative functions mostly or only occur in one of the corpora. Consequently, the drop is slightly higher in terms of recall (11.67 percentage points) than precision (11.33 percentage points). This reveals the importance of having a representative amount of training dialogs, which covers all the communicative functions.

Overall, by looking at the different metrics, we can see that there is a gap of at least 18.21 percentage points between the results in terms of exact match ratio and hierarchical F-measure. This shows that even when the classifiers fail to predict the correct communicative function of a segment, they still predict part of the path correctly. Furthermore, the fact that precision is higher than recall suggests that the classifiers avoid predicting communicative functions that are deeper in the hierarchy. To confirm this, we randomly selected one of the runs of the hierarchical approach and looked into the full communicative function paths predicted for the segments. This way, we were able to assess the performance of the classifier on each level of the hierarchy independently, as shown in Table 6.

Level   MR      None%   MR\None   FNone    LC      NoneP    NoneR
L1      87.30    0.00    87.30      0.00   12.70   -        -
L2      80.95    0.00    80.95      0.00   19.05   -        -
L3      80.32    0.00    80.32      0.00   19.68   -        -
L4      70.75   44.90    58.12     27.76   14.12   71.69    86.25
L5      90.16   90.43    26.17     65.42    8.41   93.33    96.93
L6      99.82   99.82     0.00    100.00    0.00   99.82   100.00

Table 6: Per-level results of one of the runs of the hierarchical approach to predict general-purpose communicative functions in the DialogBank segments that have a communicative function in the Task dimension. None% refers to the percentage of segments that have no communicative function at the corresponding level. MR\None refers to the exact match ratio on the segments that have a communicative function at that level. FNone refers to the percentage of those segments which were falsely classified as not having a communicative function, while LC refers to confusion among labels. NoneP and NoneR refer to the precision and recall of segments that have no communicative function.

We can see that, when all segments are considered, the lowest exact match ratio is on Level 4, which is the one with the highest number of possible communicative functions. However, if we do not include segments that should be classified as None in the computation of the exact match ratio, we can see that it decreases as the depth increases. Furthermore, in the levels in which there are segments misclassified as None, this kind of misclassification is predominant in relation to confusion among the actual communicative functions of the level. On the other hand, when only considering the segments that should be classified as None, the recall is higher than the precision and increases with depth. This confirms the bias of the classifier towards the prediction of shallower functions. Overall, this can be justified by the reduction in the number of labeled examples with depth, which biases the deeper output layers towards the prediction of the None label. Still, as previously discussed, the impact of this issue can be attenuated by training the approach on a set of dialogs with more extensive coverage of the communicative functions that are deeper in the hierarchy.
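The per-level figures of the kind shown in Table 6 can be derived from predicted paths with a breakdown along the following lines. The function name and the path representation (padded with "None" below the deepest predicted function) are ours.

```python
def level_breakdown(gold, pred, level):
    """Per-level error analysis: overall exact match ratio, the share
    of segments with no function at the level (None%), the match ratio
    on segments that do have one (MR excluding None), and the split of
    their errors into false None predictions (FNone) and confusion
    among actual labels (LC)."""
    g = [path[level] for path in gold]
    p = [path[level] for path in pred]
    mr = sum(y == z for y, z in zip(g, p)) / len(g)
    none_pct = g.count("None") / len(g)
    # Restrict to segments that have a communicative function here.
    has_cf = [(y, z) for y, z in zip(g, p) if y != "None"]
    mr_no_none = sum(y == z for y, z in has_cf) / len(has_cf)
    f_none = sum(z == "None" for _, z in has_cf) / len(has_cf)
    label_confusion = sum(y != z and z != "None" for y, z in has_cf) / len(has_cf)
    return mr, none_pct, mr_no_none, f_none, label_confusion
```

Note that, on the segments that have a function at the level, each prediction is either correct, falsely None, or a confused label, so the last three values always sum to one.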
Table 7 shows the results of our experiments which also considered the automatic recognition of the segments that have a communicative function in the Task dimension.

Evaluation      Approach       MR           hP           hR           hF
Dialog Folds    Flat           73.16 ± .10  77.57 ± .03  69.14 ± .58  73.11 ± .33
                Hierarchical   73.13 ± .21      ± .32        ± .31    75.35 ± .31
                Two-Step       73.35 ± .17      ± .25        ± .15    75.20 ± .17
Corpus Folds    Flat           66.14 ± .16  68.83 ± .10  62.13 ± .52  65.30 ± .30
                Hierarchical   66.12 ± .19      ± .23    64.33 ± .17  67.29 ± .14
                Two-Step       66.82 ± .14      ± .20    63.44 ± .15  66.79 ± .14

Table 7: Results achieved while predicting general-purpose communicative functions in the DialogBank segments, attributing the None label to those which do not have a communicative function in the Task dimension. The top block refers to the results achieved when evaluating using leave-one-dialog-out cross-validation, while the bottom block refers to those achieved when evaluating using leave-one-corpus-out cross-validation. The two-step approach applies a binary approach to decide whether a segment has a general-purpose communicative function before applying the hierarchical approach.

We can see that, in this scenario, the hierarchical approach still outperforms the flat one in terms of hierarchical F-measure, by 2.24 and 1.99 percentage points when evaluating using leave-one-dialog-out and leave-one-corpus-out cross-validation, respectively. However, the performance of both approaches is similar in terms of exact match ratio. Furthermore, the wide gap between exact match ratio and hierarchical F-measure no longer exists, with a maximum difference of 2.22 percentage points. In fact, when using the flat approach, the exact match ratio surpasses the hierarchical F-measure.

Overall, the exact match ratio is higher than when the segments that do not have a communicative function in the
Task dimension are not considered because, in most cases, they are easy to identify. On the other hand, the F-measure is lower, since misclassifying a segment as not having communicative functions in the Task dimension means that all the functions in the correct path are missed. Additionally, since the segments without communicative functions in the Task dimension have the None label at every level of the hierarchy, it is expected that the classifiers become even more biased towards the prediction of shallower communicative functions.

Looking at the results, we can see that the performance drop is higher in terms of recall than precision, which is suggestive of the bias towards the prediction of shallower communicative functions. To confirm it, we also looked into the per-level results of one of the runs of the hierarchical approach. These results are shown in Table 8. In comparison with the results in Table 6, we can see that the exact match ratio on segments that have a communicative function in the Task dimension is lower and the percentage of error due to misclassifications with the None label is higher. On the other hand, the performance on the detection of segments that do not have general-purpose communicative functions at each of the levels is higher not only in terms of recall, but also in terms of precision. While the former is explained by the bias of the classifiers, the latter is explained simply by the addition of the segments with no communicative function in the Task dimension, which account for over half of the segments in the DialogBank.
Level   MR      None%   MR\None   FNone    LC      NoneP    NoneR
L0      85.85   52.63    83.45     16.55    0.00   85.52    88.00
L1      80.97   52.63    72.36     17.44   10.20   84.97    88.73
L2      78.86   52.63    67.62     17.80   14.58   84.74    88.97
L3      78.77   52.63    67.35     17.89   14.76   84.69    89.05
L4      80.93   73.90    41.72     51.30    6.98   83.95    94.78
L5      94.87   95.47    18.69     74.77    6.54   96.52    98.49
L6      99.92   99.92     0.00    100.00    0.00   99.92   100.00

Table 8: Per-level results of one of the runs of the hierarchical approach to predict general-purpose communicative functions in all DialogBank segments. None% refers to the percentage of segments that have no communicative function at the corresponding level. MR\None refers to the exact match ratio on the segments that have a communicative function at that level. FNone refers to the percentage of those segments which were falsely classified as not having a communicative function, while LC refers to confusion among labels. NoneP and NoneR refer to the precision and recall of segments that have no communicative function.

As discussed in Section 5.1, we believe that the bias towards the prediction of shallower communicative functions can be attenuated by training the hierarchical approach on a sufficiently large amount of dialogs with representative coverage of the communicative functions that are deeper in the hierarchy. Still, in an attempt to attenuate the bias without relying on additional annotated data, we explored the use of a two-step approach, which uses a binary classifier to identify the segments which have a communicative function in the
Task dimension before applying the hierarchical approach to those segments.

In Table 7, we can see that the two-step approach achieves the highest performance in terms of exact match ratio. However, it is outperformed by the hierarchical approach in terms of hierarchical F-measure. The average performance of the binary classifier in terms of exact match ratio is of 85.52 percentage points when evaluating using leave-one-dialog-out cross-validation and of 84.05 percentage points when evaluating using leave-one-corpus-out cross-validation. These values are in line with those achieved by the hierarchical approach on the top level. Thus, the differences between the two approaches can be explained by two factors. On the one hand, the hierarchical part of the two-step approach is trained solely on the segments that have communicative functions in the Task dimension. Thus, it is less biased towards the prediction of the None label at every level, which improves the performance in terms of exact match ratio on those segments. On the other hand, an incorrect decision of the binary classifier cannot be corrected by the predictions on the lower levels using MAP prediction. Thus, these misclassifications at the top level have a more prominent impact on the performance of the approach in terms of hierarchical F-measure.

When evaluating using leave-one-dialog-out cross-validation, the difference between the hierarchical and two-step approaches is of just 0.22 and 0.15 percentage points in terms of average exact match ratio and hierarchical F-measure, respectively. This suggests that the overall performance of both approaches is actually similar. Still, when evaluating using
leave-one-corpus-out cross-validation, the differences between the two approaches are more noticeable, with a 0.70 percentage-point difference in favor of the two-step approach in terms of exact match ratio and a 0.50 percentage-point difference in favor of the hierarchical approach in terms of hierarchical F-measure. Thus, the selection of the best approach for the automatic recognition of general-purpose communicative functions depends on the metric that is most relevant for subsequent tasks.

Approach                  MR           hP           hR           hF
Full                      66.18 ± .47  85.59 ± .16  83.20 ± .24  84.39 ± .19
- Cascading Outputs       62.85 ± .08  83.79 ± .06  80.80 ± .07  82.27 ± .06
- Specialization Layers   60.58 ± .17  82.65 ± .10  79.82 ± .05  81.21 ± .07
- MAP Prediction          63.83 ± .18  84.63 ± .19  81.82 ± .14  83.20 ± .17
- Transfer Learning       59.48 ± .19  83.75 ± .09  80.35 ± .11  82.00 ± .09
- LEGO Dialogs            70.21 ± .32  88.29 ± .02  85.24 ± .02  86.73 ± .01

Table 9: Results of the ablation studies. The first block shows the performance of the full approach for comparison. The second block shows the results achieved when one component of the architecture is removed. The last block shows the results achieved without including information from external data.
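Structurally, the two-step approach amounts to gating the hierarchical classifier behind a binary decision. The sketch below, with both classifiers stubbed as plain callables, is our own simplification of that composition.

```python
def two_step_predict(segment, has_task_function, predict_path, depth):
    """Two-step recognition: a binary classifier first decides whether
    the segment has a general-purpose communicative function in the
    Task dimension; the hierarchical classifier is applied only when it
    does. A negative gate decision yields None at every level and
    cannot be corrected by the lower levels, which is why top-level
    mistakes weigh more heavily on the hierarchical F-measure."""
    if not has_task_function(segment):
        return ("None",) * depth
    return predict_path(segment)

# Stub classifiers standing in for the trained models.
gate = lambda segment: segment != "uh-huh"
hierarchy = lambda segment: ("Inform", "Answer")
path = two_step_predict("the bus leaves at ten", gate, hierarchy, depth=2)
```

Because the hierarchical part only ever sees segments that passed the gate, it can be trained without the None-heavy examples that bias the deeper output layers.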
In order to assess the importance of the multiple components of our hierarchical approach to the automatic recognition of ISO 24617-2 general-purpose communicative functions, we performed a set of ablation studies, in which one of the components was removed and the performance was compared with that of the full approach. These studies can be split into two categories: those regarding the architecture of the approach and those regarding data and transfer learning aspects. The experiments were performed in the scenario that targets the recognition of general-purpose communicative functions in segments that have a communicative function in the Task dimension and were evaluated using leave-one-dialog-out cross-validation. Table 9 shows the results of these experiments, with the first block showing the performance of the full approach for comparison.

The second block of Table 9 shows the results achieved when one of the components of the architecture is removed. We can see that the performance is negatively impacted in every case. By removing the connections between the output at each level and those below it, the average performance decreased by 3.33 percentage points in terms of exact match ratio and 2.12 percentage points in terms of hierarchical F-measure. Furthermore, while the performance drops in terms of both precision and recall, the drop is higher in terms of the latter. This suggests that the cascading outputs help attenuate the bias towards the prediction of shallower communicative functions.

By removing the specialization layers, the impact on performance is higher than that of removing the cascading outputs. More specifically, the average performance drops 5.60, 2.94, 3.38, and 3.18 percentage points in terms of exact match ratio and hierarchical precision, recall, and F-measure, respectively. This means that the specialization layers are able to capture the information present in the representation of the segment decorated with context information that is most relevant to distinguish among the communicative functions at the corresponding level of the hierarchy.

Finally, by replacing the MAP prediction approach with a top-down iterative one, which limits the choices at each level of the hierarchy by only considering the communicative functions that are children of the one selected for the level above it, the performance decreased 2.35 percentage points in terms of exact match ratio and 1.19 in terms of F-measure.
This confirms that the MAP prediction approach can attenuate the impact of misclassifications in the top levels by relying on the distributional outputs obtained using softmax to obtain a joint prediction for all levels of the hierarchy.

The last block in Table 9 shows the results achieved without pre-training the segment representation layers for dialog act recognition on the Switchboard Dialog Act Corpus and without considering the dialogs of the LEGO corpus during the training phase. First of all, we can see that using transfer learning improves the performance in terms of every metric, especially exact match ratio, which improves by 6.70 percentage points. This shows that, given the reduced amount of data, training the layers that generate the segment representations solely on the dialogs annotated according to the ISO 24617-2 standard leads to overfitting. On the other hand, since the Switchboard Dialog Act Corpus is sufficiently large, a model trained on its dialogs generates representations that capture information regarding generic intention that is not specific to a single set of labels. The specificities of different sets are then captured by the specialization and output layers, leading to the generation of models that have higher generalization potential.

In contrast, the performance increased when the LEGO dialogs were not considered during the training phase, leading to an average exact match ratio of 70.21% and an average hierarchical F-measure of 86.73%. This is contrary to what we first observed when studying the mapping of the original dialog act labels of the LEGO corpus into the communicative functions defined by the ISO 24617-2 standard and the use of those dialogs to help in the recognition of the communicative functions in the DialogBank (Ribeiro et al., 2020). This can be explained by several factors. First, in that study, we did not rely on BERT word embeddings nor on transfer learning processes to obtain segment representations.
Thus, in that case, the LEGO dialogs were important for generating more generic segment representations and avoiding overfitting. On the other hand, when those dialogs only contribute to the training of the specialization and output layers, as in this study, they have a negative impact, because the LEGO dialogs have different characteristics from those in the DialogBank, especially in terms of the user utterances. Second, considering that the hierarchical approach has one pair of specialization and output layers per level, it has more trainable parameters than the flat one. Consequently, it is more prone to overfitting. That is relevant in this context, since the LEGO dialogs are greater in number, highly repetitive, and mostly covered by a small set of general-purpose communicative functions. Last, we were not focusing on the generalization ability of the classifiers in that study. Thus, the cross-validation process fine-tuned the classifiers to achieve the highest performance on the test dialog of the corresponding fold. Given this fine-tuning, the classifiers were less prone to overfit to the specificities of the LEGO dialogs.
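To make the role of the MAP prediction approach concrete, the following toy sketch contrasts it with the top-down iterative alternative evaluated in the ablation. It uses a small illustrative two-level fragment of the hierarchy with made-up probabilities, not the full ISO 24617-2 taxonomy or the actual model outputs: a top-down decoder commits to the most probable top-level function before looking further down, while the MAP approach scores every valid path jointly.

```python
import numpy as np

# Toy two-level fragment; "None" marks the absence of a function at that
# level, allowing paths of variable length. Probabilities are made up.
LEVEL0 = ["Inform", "Question"]
LEVEL1 = ["None", "Answer", "Agreement", "PropositionalQ", "SetQ"]
CHILDREN = {"Inform": [0, 1, 2], "Question": [0, 3, 4]}  # indices into LEVEL1

def top_down(p0, p1):
    """Commit to the best top-level function, then choose among its children."""
    f0 = LEVEL0[int(np.argmax(p0))]
    j = max(CHILDREN[f0], key=lambda k: p1[k])
    return f0, LEVEL1[j]

def map_path(p0, p1):
    """Choose the valid path that maximizes the joint probability."""
    paths = [((f0, LEVEL1[j]), p0[i] * p1[j])
             for i, f0 in enumerate(LEVEL0) for j in CHILDREN[f0]]
    return max(paths, key=lambda t: t[1])[0]

# Per-level softmax outputs where the top level is ambiguous: top-down
# commits to "Question", but the jointly most probable path is
# Inform -> Answer (0.45 * 0.60 = 0.27 vs. at most 0.55 * 0.15 = 0.0825).
p0 = np.array([0.45, 0.55])
p1 = np.array([0.05, 0.60, 0.05, 0.15, 0.15])
```

Because the joint score multiplies the per-level distributions, a confident lower-level prediction can override a marginal top-level one, which is the robustness to top-level misclassifications observed in the ablation.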
6. Conclusions
In this article, we have explored the automatic recognition of the general-purpose communicative functions defined by the ISO 24617-2 standard for dialog act annotation. To do so, we proposed modifications to existing approaches to flat dialog act recognition that allow them to deal with the hierarchical classification problem posed by these communicative functions. Experiments on the DialogBank, which is a reference set of dialogs annotated according to the standard, have shown that our hierarchical approach outperforms a flat approach similar to those used on most dialog act recognition tasks, both in terms of exact match ratio and hierarchical F-measure.

Addressing the modifications more specifically, instead of a single output layer, our approach includes one output layer per level of the hierarchy. This allows it to focus on distinguishing among communicative functions that are at the same level, without having to deal with the ambiguities caused by communicative functions that are ancestors or descendants of each other. Furthermore, it also includes one specialization layer per level, which captures the information provided by the generic segment representation decorated with context information that is most relevant for the corresponding level.

Since the segments in a dialog may have a communicative function that is not a leaf of the hierarchy, we included an additional label in each level of the hierarchy, which refers to the absence of a label at that level and those under it, allowing the prediction of paths with variable length.
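This architecture can be sketched in Keras, the framework cited in this article. The snippet below is an illustrative reconstruction, not the authors' exact code: dimensions, depth, and label counts are made up, and each per-level label count is assumed to include the additional absence label.

```python
# Sketch of a hierarchical head with one specialization layer and one
# softmax output per level, cascading each level's output into the
# levels below it. All sizes are illustrative.
import tensorflow as tf
from tensorflow.keras import layers

SEGMENT_DIM = 128        # assumed size of the context-decorated segment representation
LEVEL_SIZES = [4, 6, 8]  # assumed label counts per level, each including a "none" label

segment = layers.Input(shape=(SEGMENT_DIM,), name="segment_representation")
level_outputs = []
for i, n_labels in enumerate(LEVEL_SIZES):
    # Specialization layer: extracts the information relevant to this level.
    spec = layers.Dense(64, activation="relu", name=f"specialization_{i}")(segment)
    # Cascading outputs: the predictions of all shallower levels are
    # concatenated with the specialized features before classification.
    if level_outputs:
        spec = layers.Concatenate(name=f"cascade_{i}")([spec] + level_outputs)
    out = layers.Dense(n_labels, activation="softmax", name=f"level_{i}")(spec)
    level_outputs.append(out)

model = tf.keras.Model(inputs=segment, outputs=level_outputs)
```

Each output layer only discriminates among the functions of its own level, while the concatenation makes the distributions predicted for shallower levels available to the deeper ones.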
Furthermore, in order to avoid predicting invalid communicative function paths, our approach relies on a prediction approach based on MAP estimation, which considers the parent-child relations between the communicative functions and improves robustness to misclassifications in individual levels.

Finally, since the DialogBank only features a small set of dialogs, we relied on transfer learning processes to generate more generic segment representations. More specifically, the layers that generate the representations were pre-trained on the largest corpus annotated for dialog acts. The generic representations are then fine-tuned for the distinction among communicative functions by the specialization layers. This way, the classifier is less prone to overfit to the characteristics of specific DialogBank dialogs, leading to improved performance. We also explored the use of additional training dialogs, annotated using label mapping processes. However, they harmed performance when paired with the pre-trained segment representation layers. This happened because the specialization layers overfitted to the characteristics of the additional dialogs, which outnumber those in the DialogBank and deal with a specific domain.

To the best of our knowledge, this was the first study to focus on the automatic recognition of the complete hierarchy of general-purpose communicative functions defined by the ISO 24617-2 standard. Still, it focused on devising an approach that is appropriate to deal with the hierarchical classification problem posed by the communicative functions. Thus, as future work, it would be interesting to compare the different segment and context information representation approaches used in dialog act recognition studies to identify the most appropriate for this task.
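The transfer learning process described earlier in these conclusions can be sketched as follows. This is a minimal sketch under assumed layer names and sizes: the bidirectional LSTM encoder stands in for the segment representation layers, which are first trained for flat dialog act recognition on the large source corpus and then shared with the target model, where only the new specialization and output layers start from scratch.

```python
# Sketch of pre-training segment representation layers on a source task
# and reusing them on the target task. Names and sizes are illustrative.
import tensorflow as tf
from tensorflow.keras import layers

word_seq = layers.Input(shape=(None, 300), name="word_embeddings")
# Segment representation layers, shared between the two tasks.
encoder = layers.Bidirectional(layers.LSTM(64), name="segment_encoder")
representation = encoder(word_seq)

# Source task: flat dialog act recognition on the large corpus.
source_out = layers.Dense(42, activation="softmax", name="source_labels")(representation)
source_model = tf.keras.Model(word_seq, source_out)
# source_model.fit(...)  # pre-training on the source dialogs

# Target task: the pre-trained encoder is reused; only the new
# specialization and output layers are trained from scratch.
spec = layers.Dense(64, activation="relu", name="specialization")(representation)
target_out = layers.Dense(11, activation="softmax", name="target_functions")(spec)
target_model = tf.keras.Model(word_seq, target_out)
```

Because both models reference the same encoder object, the representations learned during pre-training carry over to the target model, where the specialization layer adapts them to the target label set.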
Furthermore, in addition to the general-purpose communicative functions, the ISO 24617-2 standard also defines dimension-specific communicative functions, and a complete dialog act annotation includes additional information. Thus, the automatic recognition of all the relevant aspects should also be addressed as future work. However, that requires a representative amount of annotated data, which the DialogBank does not possess. Consequently, additional efforts should be made to increase the number of publicly available dialogs fully annotated according to the standard.
Acknowledgements
Eugénio Ribeiro is supported by a PhD scholarship granted by Fundação para a Ciência e a Tecnologia (FCT), with reference SFRH/BD/148142/2019. Additionally, this work was supported by Portuguese national funds through FCT, with reference UIDB/50021/2020.
References
Language and Speech, (4), 351–366.

Anikina, T., & Kruijff-Korbayová, I. (2019). Dialogue Act Classification in Team Communication for Robot Assisted Disaster Response. In SIGDIAL, pp. 399–410.

Benedí, J.-M., Lleida, E., Varona, A., Castro, M.-J., Galiano, I., Justo, R., de Letona, I. L., & Miguel, A. (2006). Design and Acquisition of a Telephone Spontaneous Speech Dialogue Corpus in Spanish: DIHANA. In LREC, pp. 1636–1639.

Bothe, C., Weber, C., Magg, S., & Wermter, S. (2018). A Context-based Approach for Dialogue Act Recognition using Simple Recurrent Neural Networks. In LREC, pp. 1952–1957.

Bunt, H., Alexandersson, J., Choe, J.-W., Fang, A. C., Hasida, K., Petukhova, V., Popescu-Belis, A., & Traum, D. (2012). ISO 24617-2: A Semantically-Based Standard for Dialogue Annotation. In LREC, pp. 430–437.

Bunt, H., Petukhova, V., Malchanau, A., Fang, A., & Wijnhoven, K. (2019). The DialogBank: Dialogues with Interoperable Annotations. Language Resources and Evaluation, (2), 213–249.

Bunt, H., Petukhova, V., Malchanau, A., Wijnhoven, K., & Fang, A. (2016). The DialogBank. In LREC, pp. 3151–3158.

Bunt, H., Petukhova, V., Traum, D., & Alexandersson, J. (2017). Dialogue Act Annotation with the ISO 24617-2 Standard. In Multimodal Interaction with W3C Standards, pp. 109–135. Springer.

Carroll, J. M., & Tanenhaus, M. K. (1978). Functional Clauses and Sentence Segmentation. Journal of Speech, Language, and Hearing Research, (4), 793–808.

Chen, Z., Yang, R., Zhao, Z., Cai, D., & He, X. (2018). Dialogue Act Recognition via CRF-Attentive Structured Network. In SIGIR, pp. 225–234.

Chollet, F., et al. (2015). Keras: The Python Deep Learning Library. https://keras.io/.

Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In NAACL-HLT, Vol. 1, pp. 4171–4186.

Fang, A., Cao, J., Bunt, H., & Liu, X. (2012). The Annotation of the Switchboard Corpus with the New ISO Standard for Dialogue Act Analysis. In Workshop on Interoperable Semantic Annotation, pp. 13–18.

Gambäck, B., Olsson, F., & Täckström, O. (2011). Active Learning for Dialogue Act Classification. In INTERSPEECH, pp. 1329–1332.

Godfrey, J. J., Holliman, E. C., & McDaniel, J. (1992). SWITCHBOARD: Telephone Speech Corpus for Research and Development. In ICASSP, Vol. 1, pp. 517–520.

Ji, Y., Haffari, G., & Eisenstein, J. (2016). A Latent Variable Recurrent Neural Network for Discourse Relation Language Models. In NAACL-HLT, Vol. 1, pp. 332–342.

Jurafsky, D., Shriberg, E., & Biasca, D. (1997). Switchboard SWBD-DAMSL Shallow-Discourse-Function Annotation Coder Manual. Tech. rep. Draft 13, Institute of Cognitive Science, University of Colorado.

Kalchbrenner, N., & Blunsom, P. (2013). Recurrent Convolutional Neural Networks for Discourse Compositionality. In Workshop on Continuous Vector Space Models and their Compositionality, pp. 119–126.

Khanpour, H., Guntakandla, N., & Nielsen, R. (2016). Dialogue Act Classification in Domain-Independent Conversations Using a Deep Recurrent Neural Network. In COLING, pp. 2012–2021.

Kingma, D. P., & Ba, J. (2015). Adam: A Method for Stochastic Optimization. In ICLR.

Kiritchenko, S., Matwin, S., & Famili, F. (2005). Functional Annotation of Genes using Hierarchical Text Categorization. In BioLINK SIG.

Král, P., & Cerisara, C. (2010). Dialogue Act Recognition Approaches. Computing and Informatics, (2), 227–250.

Kruijff-Korbayová, I., Colas, F., Gianni, M., Pirri, F., de Greeff, J., Hindriks, K., Neerincx, M., Ögren, P., Svoboda, T., & Worst, R. (2015). TRADR Project: Long-Term Human-Robot Teaming for Robot Assisted Disaster Response. KI - Künstliche Intelligenz, (2), 193–201.

Kumar, H., Agarwal, A., Dasgupta, R., & Joshi, S. (2018). Dialogue Act Sequence Labeling Using Hierarchical Encoder With CRF. In AAAI, pp. 3440–3447.

Lee, J. Y., & Dernoncourt, F. (2016). Sequential Short-Text Classification with Recurrent and Convolutional Neural Networks. In NAACL-HLT, Vol. 2, pp. 515–520.

Li, R., Lin, C., Collinson, M., Li, X., & Chen, G. (2019). A Dual-Attention Hierarchical Recurrent Neural Network for Dialogue Act Classification. In CoNLL, pp. 383–392.

Liu, Y., Han, K., Tan, Z., & Lei, Y. (2017). Using Context Information for Dialog Act Classification in DNN Framework. In EMNLP, pp. 2160–2168.

Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). Distributed Representations of Words and Phrases and their Compositionality. In NIPS, pp. 3111–3119.

Nair, V., & Hinton, G. E. (2010). Rectified Linear Units Improve Restricted Boltzmann Machines. In ICML, pp. 807–814.

Pennington, J., Socher, R., & Manning, C. D. (2014). GloVe: Global Vectors for Word Representation. In EMNLP, pp. 1532–1543.

Petukhova, V., Gropp, M., Klakow, D., Schmidt, A., Eigner, G., Topf, M., Srb, S., Motlicek, P., Potard, B., Dines, J., Deroo, O., Egeler, R., Meinz, U., & Liersch, S. (2014). The DBOX Corpus Collection of Spoken Human-Human and Human-Machine Dialogues. In LREC, pp. 252–258.

Raheja, V., & Tetreault, J. (2019). Dialogue Act Classification with Context-Aware Self-Attention. In NAACL, pp. 3727–3733.

Raux, A., Bohus, D., Langner, B., Black, A. W., & Eskenazi, M. (2006). Doing Research on a Deployed Spoken Dialogue System: One Year of Let's Go! Experience. In INTERSPEECH, pp. 65–68.

Ravi, S., & Kozareva, Z. (2018). Self-Governing Neural Networks for On-Device Short Text Classification. In EMNLP, pp. 804–810.

Ribeiro, E., Ribeiro, R., & Martins de Matos, D. (2015). The Influence of Context on Dialogue Act Recognition. Computing Research Repository, arXiv:1506.00839.

Ribeiro, E., Ribeiro, R., & Martins de Matos, D. (2018a). A Study on Dialog Act Recognition using Character-Level Tokenization. In AIMSA, pp. 93–103.

Ribeiro, E., Ribeiro, R., & Martins de Matos, D. (2018b). End-to-End Multi-Level Dialog Act Recognition. In IberSPEECH, pp. 301–305.

Ribeiro, E., Ribeiro, R., & Martins de Matos, D. (2019a). A Multilingual and Multidomain Study on Dialog Act Recognition using Character-Level Tokenization. Information, (3), 94.

Ribeiro, E., Ribeiro, R., & Martins de Matos, D. (2019b). Deep Dialog Act Recognition using Multiple Token, Segment, and Context Information Representations. Journal of Artificial Intelligence Research, 861–899.

Ribeiro, E., Ribeiro, R., & Martins de Matos, D. (2020). Mapping the Dialog Act Annotations of the LEGO Corpus into ISO 24617-2 Communicative Functions. In LREC, pp. 531–539.

Schmitt, A., Ultes, S., & Minker, W. (2012). A Parameterized and Annotated Spoken Dialog Corpus of the CMU Let's Go Bus Information System. In LREC, pp. 3369–3373.

Searle, J. R. (1969). Speech Acts: An Essay in the Philosophy of Language. Cambridge University Press.

Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2014). Dropout: a Simple Way to Prevent Neural Networks from Overfitting. Journal of Machine Learning Research, (1), 1929–1958.

Tran, Q. H., Zukerman, I., & Haffari, G. (2017a). A Generative Attentional Neural Network Model for Dialogue Act Classification. In ACL, Vol. 2, pp. 524–529.

Tran, Q. H., Zukerman, I., & Haffari, G. (2017b). A Hierarchical Neural Model for Learning Sequences of Dialogue Acts. In EACL, Vol. 1, pp. 428–437.

Tran, Q. H., Zukerman, I., & Haffari, G. (2017c). Preserving Distributional Information in Dialogue Act Classification. In EMNLP, pp. 2151–2156.

Wan, Y., Yan, W., Gao, J., Zhao, Z., Wu, J., & Yu, P. S. (2018). Improved Dynamic Memory Network for Dialogue Act Classification with Adversarial Training. In IEEE International Conference on Big Data, pp. 841–850.