Dialogue Response Selection with Hierarchical Curriculum Learning

Yixuan Su♦∗, Deng Cai♥, Qingyu Zhou♠, Zibo Lin♣, Simon Baker♦, Yunbo Cao♠, Shuming Shi♠, Nigel Collier♦, Yan Wang♠

♦University of Cambridge  ♥The Chinese University of Hong Kong  ♣Tsinghua University  ♠Tencent Inc.
[email protected]
Abstract
We study the learning of a matching model for dialogue response selection. Motivated by the recent finding that random negatives are often too trivial to train a reliable model, we propose a hierarchical curriculum learning (HCL) framework that consists of two complementary curricula: (1) corpus-level curriculum (CC); and (2) instance-level curriculum (IC). In CC, the model gradually increases its ability to find the matching clues between the dialogue context and a response. In IC, the model progressively strengthens its ability to identify the mismatched information between the dialogue context and a response. Empirical studies on two benchmark datasets with three state-of-the-art matching models demonstrate that the proposed HCL significantly improves model performance across various evaluation metrics.

∗ Work was done during an internship at Tencent Cloud Xiaowei and Tencent AI Lab. All data, code and models are publicly available at https://github.com/yxuansu/HCL/.

1 Introduction

Building intelligent conversation systems is a long-standing goal of artificial intelligence and has attracted much attention in recent years (Shum et al., 2018; Kollar et al., 2018). A central challenge in building such systems is the response selection problem, that is, selecting the best response to a given dialogue context from a pool of candidate responses (Ritter et al., 2011).

To tackle the response selection problem, different matching models have been developed to measure the matching degree between a conversation context and a response candidate (Wu et al., 2017; Zhou et al., 2018; Lu et al., 2019; Gu et al., 2019). Despite their differences, most prior works train the matching models on data constructed by a simple heuristic.

Dialogue Context Between Two Speakers A and B
A: Would you please recommend me a good TV series to watch during my spare time?
B: Absolutely! Which kind of TV series are you most interested in?
A: My favorite type is fantasy drama.
B: I think both Game of Thrones and The Vampire Diaries are good choices.

Positive Responses
P1 (Easy): Awesome, I believe both of them are great TV series! I will first watch Game of Thrones.
P2 (Difficult): Cool! I think I find the perfect things to kill my weekends.

Negative Responses
N1 (Easy): This restaurant is very expensive.
N2 (Difficult): Iain Glen played Ser Jorah Mormont in the HBO fantasy series Game of Thrones.

Table 1: An example dialogue context between speakers A and B, where P1 and P2 are easy and difficult positives; N1 and N2 are easy and difficult negatives.

For each dialogue context, the human-written response is considered positive (i.e., an adequate response) and the responses from other dialogue contexts are considered negative (i.e., inappropriate responses). In practice, the negative responses are often randomly sampled, and the training objective is to ensure that the positive responses score higher than the negative ones.

Recently, some researchers (Li et al., 2019; Lin et al., 2020) have raised the concern that randomly sampled negative responses are often too trivial (i.e., totally irrelevant to the dialogue context). Models trained with such negative data lack the ability to handle strong distractors during testing. In general, the problem stems from ignoring the diversity in context-response matching: all random responses are treated as equally negative regardless of their distracting strength. For example, in Table 1, two negative responses (N1, N2) are presented. For N1, one can easily rule it out, as it does not follow the topic discussed in the dialogue context. On the other hand, judging a strong distractor like N2 can be difficult, as its content overlaps significantly with the context (e.g., both mention fantasy series and Game of Thrones). Only with close observation do we find that N2 does not maintain the coherence of the discussion: it starts a parallel discussion about an actor in Game of Thrones rather than elaborating on the enjoyable properties of the TV series. A similar phenomenon exists on the positive side. For the positive response P1, one can easily confirm its adequacy, as it naturally replies to the context. As for P2, while it elaborates on the enjoyable properties of the TV series, it does not exhibit any obvious matching clues, such as lexical overlap with the context.
Thus, to correctly identify P2, the relationship between P2 and the context has to be carefully reasoned about by the model. In conclusion, the above observations suggest that, to accurately recognize different positive and negative responses, the model must possess different levels of discriminative capability.

Inspired by these observations, we propose to employ the idea of curriculum learning (CL) (Bengio et al., 2009) for better learning of response selection models. CL is reminiscent of the human cognitive process: the core idea is to first learn easier concepts and then gradually transition to more complex ones according to a pre-defined learning scheme. In various NLP tasks, e.g., dependency parsing (Spitkovsky et al., 2010), natural answer generation (Liu et al., 2018), and machine translation (Platanios et al., 2019), CL has demonstrated its benefit both for model performance and for learning convergence.

The key to applying CL is to specify an appropriate learning scheme under which all training examples are gradually learned (Saxena et al., 2019). In this work, we tailor a hierarchical curriculum learning (HCL) framework to the characteristics of the response selection task. Our HCL framework consists of two complementary curriculum strategies, namely corpus-level curriculum (CC) and instance-level curriculum (IC), covering two distinct aspects of response selection. In CC, the model gradually increases its ability to find the matching clues between the context and the positive response. As for IC, it progressively strengthens the model's ability to identify the mismatched information between the context and negative responses. To order all positive and negative examples, we need to assess millions of possible context-response combinations in the training data.
To overcome this computational challenge, we propose to use a fast neural ranking model that assigns learning priorities to all training examples based on their pairwise context-response similarity scores.

Notably, our proposed learning framework is independent of the choice of matching model. For a comprehensive evaluation, we therefore test our approach with three representative matching models, including the latest advances brought by pre-trained language models. Results on two benchmark datasets demonstrate that the proposed learning framework leads to remarkable performance improvements across all evaluation metrics.

In summary, our contributions are: (1) we propose a new hierarchical curriculum learning framework to tackle the task of response selection; and (2) experimental results on two benchmark datasets demonstrate that our approach significantly improves the performance of strong matching models, including the state-of-the-art one.
2 Background

Given a dataset D = {(c_i, r_i^+)}_{i=1}^{|D|}, the task of response selection is to learn a matching model s(·,·) that correctly identifies the positive response r_i^+, conditioned on the dialogue context c_i, from a set of negative responses R_i^-. Typically, the learning of s(·,·) optimizes the following objective:

    L_s = Σ_{j=1}^{m} max{0, 1 − s(c_i, r_i^+) + s(c_i, R_{i,j}^−)},    (1)

where m is the number of negative responses for each training instance (c_i, r_i^+).

In most existing studies (Wu et al., 2017; Zhou et al., 2018; Lu et al., 2019; Gu et al., 2019), the training negatives R^- are acquired by random selection. However, distinguishing the positive response from such randomly sampled negatives often leads to sub-optimal model performance (Wu et al., 2018). To alleviate this problem, Li et al. (2019) and Lin et al. (2020) proposed different approaches to strengthen the training negatives and achieved better results.

Different from previous works, we argue that the learning of a matching model should involve two aspects. Specifically, given a dialogue context, the model should learn to (1) find the matching clues contained in the positive response; and (2) identify the mismatched information contained in the negative responses.

Figure 1: An illustration of the proposed approach. On the left, two training context-response pairs with different difficulty levels are presented (the upper one is more difficult than the lower one; P denotes the positive response). For each training instance, we show three associated negative responses (N1, N2 and N3) whose difficulty increases from bottom to top. In the negative examples, the words that also appear in the dialogue context are marked in italics.
In addition, the learning of these two aspects should follow an "easy-to-difficult" process. To this end, we employ the idea of curriculum learning and introduce a new learning framework that gradually strengthens the model's ability in both aspects.

3 Hierarchical Curriculum Learning

We propose hierarchical curriculum learning (HCL), a new framework for training neural matching models. It consists of two complementary curriculum strategies: (1) corpus-level curriculum (CC); and (2) instance-level curriculum (IC). Figure 1 illustrates the relationship between these strategies. In CC, easier context-response pairs are presented to the model before harder ones; in this way, the model gradually increases its ability to find the matching clues, such as lexical overlap, that exist between the dialogue context and the positive response. As for IC, it controls the difficulty of the negative responses associated with each training context-response pair. Starting from easier negatives, the model progressively strengthens its ability to identify the mismatched information (e.g., semantic incoherence) between the context and negative responses. In the rest of this section, we give detailed descriptions of the proposed approach.
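As a concrete reference point before detailing the curricula, the objective in Eq. (1) for a single training instance can be sketched in plain Python. This is a minimal illustration; the margin value of 1 and the function name are our assumptions, not fixed by the paper:

```python
# Hedged sketch of the max-margin objective in Eq. (1): for one training
# instance (c_i, r_i^+) with m sampled negatives, each negative contributes
# max(0, margin - s_pos + s_neg). A margin of 1 is assumed here.

def response_selection_loss(pos_score, neg_scores, margin=1.0):
    """Hinge loss over one (context, positive, negatives) instance.

    pos_score:  s(c_i, r_i^+), the matching score of the positive response.
    neg_scores: scores s(c_i, R_ij^-) of the m negative responses.
    """
    return sum(max(0.0, margin - pos_score + neg) for neg in neg_scores)

# The loss vanishes once the positive outscores every negative by the margin.
loss_easy = response_selection_loss(2.3, [0.1, 0.9])   # 0.0
loss_hard = response_selection_loss(0.9, [0.2, 1.0])   # ~1.4: both negatives violate the margin
```

In practice s(·,·) is the neural matching model's score; plain floats stand in here so the shape of the objective is visible.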
3.1 Corpus-Level Curriculum

Given the dataset D = {(c_i, r_i^+)}_{i=1}^{|D|}, the corpus-level curriculum arranges the order of the training context-response pairs. The model first learns to find easier matching clues from the context-response pairs with lower difficulty. As training evolves, harder cases are presented and the model learns to find less obvious matching signals. Two examples are shown in the left part of Figure 1. For the easier pair, the context and the positive response are lexically overlapped (e.g., TV series and
Game of Thrones) with each other, and such a matching clue is simple for the model to learn. As for the harder case, the positive response can only be identified via numerical reasoning, which makes it harder to learn.
Difficulty Function
To measure the difficulty of each training context-response pair (c, r), we adopt a pre-trained ranking model G(·,·) (details are presented in §3.4) to calculate its similarity score G(c, r). A higher G(c, r) corresponds to a higher similarity between c and r, and vice versa. Then, for each pair (c_i, r_i^+) ∈ D, its corpus-level difficulty is defined as

    f_d(c_i, r_i^+) = 1 − G(c_i, r_i^+) / max_{(c_k, r_k^+) ∈ D} G(c_k, r_k^+),    (2)

where f_d(c_i, r_i^+) ∈ [0, 1]. A lower difficulty score indicates that c_i and r_i^+ are more similar to each other and thus easier for the model to learn.

Figure 2: (a) An illustration of the corpus-level curriculum. At each training step: (1) f_p(t) is computed based on the current step t; and (2) a batch of context-response pairs is uniformly sampled from the training instances whose corpus-level difficulty is lower than f_p(t) (shaded area in the example). In this example, T_0 = 2000 and T = 8000. (b) An illustration of the instance-level pacing function; in this case, k_0 = 6, k_T = 3 and T = 8000.

Pacing Function
During training, to select training instances of the desired difficulty, we resort to a pre-defined corpus-level pacing function, f_p(t), defined as a function of the training step. At each step t, the model is only allowed to use the training instances (c, r^+) whose corpus-level difficulty score f_d(c, r^+) is lower than f_p(t). Starting from easier instances, the model gradually learns harder cases as training evolves. In this work, we adopt a simple functional form for f_p(t):

    f_p(t) = r_0,                                          if t ≤ T_0;
             ((1.0 − r_0) / (T − T_0)) · (t − T_0) + r_0,  if T_0 < t ≤ T;
             1.0,                                          otherwise.

(More sophisticated designs for f_p(t) are possible, but we do not consider them in this work.) During the warm-up stage of training (the first T_0 steps), we learn a basic matching model with the easiest part of the training set. Then the model is allowed to gradually use harder instances. After f_p(t) reaches 1.0 (at step T), the corpus-level curriculum is complete and the model can freely access the entire dataset. Figure 2(a) depicts an illustration of the proposed corpus-level curriculum.

3.2 Instance-Level Curriculum

The instance-level curriculum (IC) controls the difficulty of the negative examples associated with each training context-response pair. At the start of training, the model learns to contrast the positive response with easy negatives. As training evolves, IC gradually increases the difficulty of the negative examples to progressively strengthen the model's ability to find mismatched information. A concrete example is shown in the right part of Figure 1: easy negatives are simple to spot, as they are often obviously off topic, whereas harder negatives may share lexical overlap with the context (italicized words in Figure 1), so the model is required to identify the fine-grained semantic incoherence between the context and the negative examples.
In the following, we show how to measure the difficulty of negative examples for different training instances, and how to dynamically select them based on the learning state.
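The corpus-level curriculum described above, the difficulty score of Eq. (2) together with the pacing schedule f_p, can be sketched as follows. This is a minimal illustration; the names and the default r_0/T_0/T values (borrowed from the Figure 2 example, not from the experimental setup) are our assumptions:

```python
# Minimal sketch of the corpus-level curriculum. `scores[i]` stands for the
# precomputed ranker similarity G(c_i, r_i^+); names and defaults are
# illustrative assumptions, not the paper's settings.

def corpus_difficulty(scores):
    """Eq. (2): f_d = 1 - G(c_i, r_i^+) / max_k G(c_k, r_k^+), in [0, 1]."""
    top = max(scores)
    return [1.0 - s / top for s in scores]

def corpus_pacing(t, r0=0.2, t0=2000, t_end=8000):
    """Piecewise-linear pacing f_p(t): r0 during warm-up, then linear to 1.0."""
    if t <= t0:
        return r0
    if t <= t_end:
        return (1.0 - r0) / (t_end - t0) * (t - t0) + r0
    return 1.0

# At step t, a batch is drawn uniformly from the pairs with f_d <= f_p(t).
scores = [4.0, 3.0, 1.0, 2.0]
fd = corpus_difficulty(scores)          # [0.0, 0.25, 0.75, 0.5]
eligible = [i for i, d in enumerate(fd) if d <= corpus_pacing(5000)]
```

At step 5000 the threshold sits at 0.6, so the hardest pair (difficulty 0.75) is still excluded; after step `t_end` every pair becomes eligible.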
Difficulty Function
Given a specific training instance (c, r^+), the instance-level difficulty of an arbitrary response r̄ ∈ D is defined as

    h_d(c, r̄) = rank(G(c, r̄), D).    (3)

To compute h_d(c, ·), we first sort all responses r ∈ D by the similarity score G(c, r) computed by the neural ranking model (§3.4). Then, for each response r̄, h_d(c, r̄) returns its rank in this sorted list (e.g., among all responses contained in D, the one most similar to c has a rank of 1 and the most dissimilar one has a rank of |D|).

Pacing Function
To dynamically adjust the difficulty of negative examples, we resort to a pre-defined instance-level pacing function, h_p(t). Specifically, h_p(t) controls the size of the sampling space (in log-scale) from which the negative examples are selected:

    h_p(t) = −((k_0 − k_T) / T) · (t − T) + k_T,  if t ≤ T;
             k_T,                                 if t > T,

where k_0 = log_10 |D|. For each training instance (c, r^+), when selecting the negative examples, we first compute the sampling space size k = 10^{h_p(t)}. Next, we uniformly sample a set of negative examples from the top-k responses most similar to c, i.e., those satisfying h_d(c, r̄) ≤ k. For a better illustration, we depict an example of h_p(t) in Figure 2(b). In this case, at the start of training, the negative examples are randomly sampled from the entire dataset D (|D| = 10^6). Then we gradually increase the difficulty of the negative examples by shrinking the sampling size k, which is fixed at 10^3 after 8,000 steps. We provide more discussion in the results section.

Algorithm 1: Hierarchical Curriculum Learning

Input: dataset D = {(c_i, r_i^+)}_{i=1}^{|D|}; a model trainer T that takes batches of training data as input to optimize the model; corpus-level difficulty and pacing functions f_d and f_p; instance-level difficulty and pacing functions h_d and h_p; number of negative responses m.

for training step t = 1, ... do
    Uniformly sample one batch of context-response pairs, B_t, from all (c_i, r_i^+) ∈ D such that f_d(c_i, r_i^+) ≤ f_p(t), as shown in Figure 2(a).
    for (c_j, r_j^+) in B_t do
        Uniformly sample m negative responses, R_j^-, from all responses r̄ that satisfy h_d(c_j, r̄) ≤ 10^{h_p(t)}.
    end
    Invoke the trainer T with {(c_k, r_k^+, R_k^-)}_{k=1}^{|B_t|} as input to optimize the model using Eq. (1).
end

Output: trained model.

3.3 Overall Framework

The proposed learning framework simultaneously employs the corpus-level (CC) and instance-level (IC) curriculum strategies. To apply the approach efficiently, we first use a fast ranking model to pre-compute the similarity score G(c_i, r_j) between every context c_i and response r_j. During the learning of the matching model, in each batch, we first select the positive samples according to the pacing function f_p(t) in CC.
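The instance-level machinery, the rank-based difficulty of Eq. (3) and the log-scale pacing h_p, can be sketched as follows. Again a minimal illustration; the names and default values are our assumptions:

```python
import math
import random

# Sketch of the instance-level curriculum: h_d is the rank of a response
# under the ranker scores G(c, .), and h_p linearly shrinks the (log10-scale)
# sampling space from log10|D| down to k_T. Names/defaults are assumptions.

def instance_difficulty(scores_for_context):
    """Eq. (3): rank of each response when sorted by G(c, r), most similar
    response first (rank 1). Returns the list of ranks."""
    order = sorted(range(len(scores_for_context)),
                   key=lambda i: -scores_for_context[i])
    ranks = [0] * len(scores_for_context)
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank
    return ranks

def instance_pacing(t, n_responses, k_T=3, t_end=8000):
    """h_p(t): linear decay from k0 = log10|D| down to k_T, then constant."""
    k0 = math.log10(n_responses)
    if t > t_end:
        return k_T
    return -(k0 - k_T) / t_end * (t - t_end) + k_T

def sample_negatives(t, ranks, m, n_responses):
    """Uniformly draw m negatives among the top-10^{h_p(t)} ranked responses."""
    k = int(10 ** instance_pacing(t, n_responses))
    pool = [i for i, r in enumerate(ranks) if r <= k]
    return random.sample(pool, min(m, len(pool)))
```

Early in training the pool covers the whole dataset, so negatives are effectively random; by step `t_end` only the hardest (most similar) responses remain candidates.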
Then, for each positive sample in the selected batch, we select its associated negative samples according to the pacing function h_p(t) in IC. A detailed description of how HCL works is given in Algorithm 1.

3.4 Fast Ranking Model
As described in Eqs. (2) and (3), our framework requires a ranking model G(·,·) that efficiently measures the pairwise similarity of millions of possible context-response combinations. To this end, we construct the ranking model with a bi-encoder structure. Specifically, for an arbitrary pair of context c and response r, their pairwise similarity G(c, r) is defined as

    G(c, r) = E_c(c)^T E_r(r),    (4)

where E_c(c) and E_r(r) are dense context and response representations produced by a context encoder E_c(·) and a response encoder E_r(·). In this paper, we use Transformers (Vaswani et al., 2017) to build the encoders E_c(·) and E_r(·).

We first train the ranking model G(·,·) on the same response selection dataset D using the in-batch negative objective (Karpukhin et al., 2020). Next, we compute the dense representations of all contexts and responses contained in D. Then we calculate the similarity scores of all possible combinations of contexts and responses in D by taking the dot product of their representations, as described in Eq. (4). After this preprocessing stage, we start training the matching model with the HCL framework, as described in Algorithm 1.

4 Related Work

With the rapid development of natural language processing, building intelligent dialogue systems with retrieval-based models has recently attracted much attention (Wu et al., 2017; Lu et al., 2019; Gu et al., 2019; Zhou et al., 2018; Gu et al., 2020). Early studies in this area were devoted to response selection for single-turn conversations (Wang et al., 2013; Tan et al., 2016). More recently, researchers have turned to the scenario of multi-turn conversations. For instance, Wu et al. (2017) proposed to separately match the response against every utterance using a convolutional neural network. Tao et al.
(2019) fused word and n-gram representations of utterances to capture dependencies at different levels.

Another line of research studies how to improve the performance of existing matching models with better learning algorithms. Wu et al. (2018) proposed to adopt a Seq2seq model as a weak teacher to guide the training process. Feng et al. (2019) designed a co-teaching framework that attempts to eliminate training noise. Li et al. (2019) proposed to alleviate the problem of trivial negatives by applying four different sampling strategies. More recently, Lin et al. (2020) attempted to diversify the training negatives with an offline retrieval system and a pre-trained Seq2seq model. Different from those previous studies, our approach makes use of the concept of curriculum learning to progressively strengthen the model's ability via corpus-level and instance-level curricula.

† In practice, there are many other possible options for the encoder structure, such as LSTMs and RNNs.
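Returning to the fast ranking model of the previous section: because Eq. (4) is a plain dot product over independently encoded contexts and responses, all pairwise scores can be precomputed as a single matrix of dot products. A toy sketch, in which fixed toy vectors stand in for the trained Transformer encoders (names are ours):

```python
# Sketch of bi-encoder scoring: G(c_i, r_j) = E_c(c_i)^T E_r(r_j), with the
# full score matrix computed once before curriculum training begins.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def score_matrix(context_vecs, response_vecs):
    """All-pairs similarities: G[i][j] = dot(E_c(c_i), E_r(r_j))."""
    return [[dot(c, r) for r in response_vecs] for c in context_vecs]

# Toy 2-d "dense representations" for three contexts and three responses.
C = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
R = [[0.5, 0.5], [1.0, 0.0], [0.0, 2.0]]
G = score_matrix(C, R)   # G[i][j] = similarity of context i and response j
```

In the actual system the representations come from the two trained Transformer encoders, and the precomputed matrix is what makes the difficulty functions f_d and h_d cheap to evaluate during training.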
5 Experiments

We test our approach on two benchmark multi-turn response selection datasets.
Douban Conversation Corpus
The Douban Conversation Corpus (Douban) (Wu et al., 2017) consists of multi-turn Chinese conversation data crawled from the Douban group. The training, validation and test sets contain 500k, 25k and 1k instances, respectively. In the test set, each dialogue context is paired with 10 candidate responses. Following previous works, we report mean average precision (MAP), mean reciprocal rank (MRR) and precision at position 1 (P@1). In addition, we also report R10@1, R10@2 and R10@5, where Rn@k denotes recall at position k among n candidates.

Ubuntu Corpus
The Ubuntu Corpus (Lowe et al., 2015) contains multi-turn dialogues collected from chat logs of the Ubuntu Forum. The training, validation and test sets contain 500k, 50k and 50k instances, respectively. Each dialogue context is paired with 10 response candidates. Following previous works, we use R2@1, R10@1, R10@2 and R10@5 as evaluation metrics.

The following models are selected for comparison.
Single-turn Matching Models
These models treat the dialogue context as a single long utterance and measure the relevance score between the context and each response candidate. They include RNN (Lowe et al., 2015), CNN (Lowe et al., 2015), LSTM (Lowe et al., 2015), BiLSTM (Kadlec et al., 2015), MV-LSTM (Wan et al., 2016) and Match-LSTM (Wang and Jiang, 2016).
Multi-turn Matching Models
Instead of treating the dialogue context as one single utterance, these models aggregate information from different utterances in more sophisticated ways. They include DL2R (Yan et al., 2016), Multi-View (Zhou et al., 2016), DUA (Zhang et al., 2018), DAM (Zhou et al., 2018), IOI (Tao et al., 2019), SMN (Wu et al., 2017) and MSN (Yuan et al., 2019).

Pre-trained Language Models
Given the recent advances in pre-trained language models (Devlin et al., 2019), Gu et al. (2020) proposed the SA-BERT model, which adapts BERT to the task of response selection; it is the current state-of-the-art model on the Douban and Ubuntu datasets.
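Since the result tables rely heavily on the Rn@k metric, a minimal sketch of how it is computed for one test context may be useful (the function name and the toy scores are ours, not from the paper):

```python
# Sketch of R_n@k for one test context with n candidates, exactly one of
# which (positive_idx) is the ground-truth response.

def recall_at_k(scores, positive_idx, k):
    """1.0 if the positive response ranks within the top k of the n
    candidates when sorted by model score, else 0.0."""
    ranked = sorted(range(len(scores)), key=lambda i: -scores[i])
    return 1.0 if positive_idx in ranked[:k] else 0.0

# Ten candidates; the positive (index 0) has the second-highest score,
# so it is missed by R10@1 but captured by R10@2.
scores = [0.8, 0.9, 0.1, 0.3, 0.2, 0.4, 0.0, 0.5, 0.6, 0.7]
r10_at_1 = recall_at_k(scores, 0, 1)   # 0.0
r10_at_2 = recall_at_k(scores, 0, 2)   # 1.0
```

The dataset-level number reported in the tables is this value averaged over all test contexts.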
For all experiments, the values of r_0, T_0 and T in the corpus-level pacing function f_p(t) are set such that all models start training with T_0 warm-up steps on the data whose corpus-level difficulty is lower than r_0, and the corpus-level curriculum is completed after T steps. For the instance-level pacing function h_p(t), the values of T and k_T are set such that, after T training steps, the negative responses of each training instance are sampled from the top-10^{k_T} most similar responses. To build the ranking model G(·,·), we use Transformer encoders. Among the compared baselines, we select two representative models (SMN and MSN) along with the state-of-the-art model (SA-BERT) to test the proposed approach. Each model is trained with a batch size of 128. To simulate the true testing environment, the number of negative responses (m in Eq. (1)) is set to 10.

Table 2 shows the results on the Douban and Ubuntu datasets, where X+HCL means training model X with the proposed learning framework. Our approach significantly improves the performance of all three matching models on all evaluation metrics, showing the robustness and universality of the approach. We also observe that, when trained with the proposed framework, a model without any pre-trained knowledge (MSN) can surpass the state-of-the-art model SA-BERT on both datasets. These results suggest that, while the training strategy has been under-explored in previous studies, it can be decisive for building a competent response selection model.

Model        |            Douban                   |         Ubuntu
             | MAP   MRR   P@1   R10@1 R10@2 R10@5 | R2@1  R10@1 R10@2 R10@5
RNN          | 0.390 0.422 0.208 0.118 0.223 0.589 | 0.768 0.403 0.547 0.819
CNN          | 0.417 0.440 0.226 0.121 0.252 0.647 | 0.848 0.549 0.684 0.896
LSTM         | 0.485 0.527 0.320 0.187 0.343 0.720 | 0.901 0.638 0.784 0.949
BiLSTM       | 0.479 0.514 0.313 0.184 0.330 0.716 | 0.895 0.630 0.780 0.944
MV-LSTM      | 0.498 0.538 0.348 0.202 0.351 0.710 | 0.906 0.653 0.804 0.946
Match-LSTM   | 0.500 0.537 0.345 0.202 0.348 0.720 | 0.904 0.653 0.799 0.944
DL2R         | 0.488 0.527 0.330 0.193 0.342 0.705 | 0.899 0.626 0.783 0.944
Multi-View   | 0.505 0.543 0.342 0.202 0.350 0.729 | 0.908 0.662 0.801 0.951
DUA          | 0.551 0.599 0.421 0.243 0.421 0.780 | -     0.752 0.868 0.962
DAM          | 0.550 0.601 0.427 0.254 0.410 0.757 | 0.938 0.767 0.874 0.969
IOI          | 0.573 0.621 0.444 0.269 0.451 0.786 | 0.947 0.796 0.894 0.974
SMN          | 0.529 0.569 0.397 0.233 0.396 0.724 | 0.926 0.726 0.847 0.961
MSN          | 0.587 0.632 0.470 0.295 0.452 0.788 | -     0.800 0.899 0.978
SA-BERT      | 0.619 0.659 0.496 0.313 0.481 0.847 | 0.965 0.855 0.928 0.983
SMN+HCL      | 0.575 0.620 0.446 0.281 0.452 0.807 | 0.947 0.777 0.885 0.981
MSN+HCL      | 0.620 0.668 0.507 0.321 0.508 0.841 | 0.969 0.826 0.924 0.989
SA-BERT+HCL  |

Table 2: Experimental results of different models trained with our approach on the Douban and Ubuntu datasets. All results acquired with HCL significantly outperform the corresponding original results.
Table 3: Ablation study on the Douban dataset using different combinations of the proposed curriculum strategies (the four CC/IC on-off combinations, evaluated with SMN, MSN and SA-BERT).
To investigate the effects of CC and IC, we train different models on the Douban dataset while interchangeably enabling CC and IC. When CC is disabled, we randomly select the training context-response pairs; when IC is disabled, we randomly select the negative examples associated with each training instance.
Ablation Study
The experimental results are shown in Table 3, from which we can see that both CC and IC make positive contributions to the overall performance. Combining them achieves the best performance, indicating that CC and IC are complementary to each other. We also find that incorporating IC alone leads to larger improvements than using CC alone, which suggests that the ability to identify mismatched information is the more important factor for the model to reach its optimal performance.
Learning Efficiency
In Figure 3, we compare the learning curves of different models (SMN and MSN) on the Douban dataset under different curriculum setups. We observe that both models consistently benefit from the proposed approach. To reach the same performance as the best base-model result, training time is reduced by 72% for SMN (8k vs. 28k steps) and by 65% for MSN (12k vs. 34k steps) when using the full HCL framework. We therefore conclude that our approach is beneficial in terms of both model performance and learning efficiency.
Next, we examine the effect of different choices of the ranking model architecture. To this end, we build two variants by replacing the Transformer modules E_c(·) and E_r(·) in Eq. (4) with two other modules. For the first variant, we use BiLSTMs with a hidden size of 256; for the second, we use BERT-base (Devlin et al., 2019) models. For comparison, we then train the matching models using the proposed HCL but with each ranking model as the scoring basis.

The results on the Douban dataset are shown in Table 5. We first compare the performance of the different ranking models by directly using them to select the best responses; these results are shown in the "Ranker" row of Table 5. Among all three variants, BERT performs the best, but it is still less accurate than the sophisticated matching models. Next, we examine the effects of the different ranking models on matching model performance. We observe that, for the different matching models, Transformers and BERT perform comparably, while the results with BiLSTM are much worse. This further leads to the conclusion that, while the choice of ranker does have an impact on the overall results, improving the ranking model does not necessarily improve the matching model once the ranker reaches a certain accuracy.

Model     Strategy |            Douban             |         Ubuntu
                   | MAP   MRR   P@1   R10@1 R10@2 | R2@1  R10@1 R10@2 R10@5
SMN       Semi     | 0.554 0.605 0.425 0.253 0.412 | 0.934 0.762 0.865 0.967
SMN       Gray     | 0.564 0.615 0.443 0.271 0.439 | 0.938 0.765 0.873 0.969
SMN       HCL      |
MSN       Semi⋆    |
SA-BERT   Semi⋆    |

Table 4: Comparisons on the Douban and Ubuntu datasets using different training strategies on various models. Results marked with ⋆ are from our runs with the released code.

Figure 3: Plots illustrating the performance (P@1) of the SMN and MSN models on the Douban dataset as training progresses. The red line represents the base model without any curriculum; the others represent the same model with different curriculum setups.
Model     | Transformer ranker: P@1  R10@1  R10@2
Ranker    | 0.400  0.253  0.416
SMN       | 0.446
MSN       |
SA-BERT   |

Table 5: Comparisons of different ranker architectures. Best results for each matching model are bold-faced.

As described in §4, Li et al. (2019) and Lin et al. (2020) also investigated better strategies for training the matching model, which makes their work comparable to ours. Table 4 shows the results of various matching models trained with different strategies, where Semi and Gray refer to the approaches of Li et al. (2019) and Lin et al. (2020), respectively. Our approach consistently outperforms the other methods across all dataset and matching model settings. The performance gains are even more remarkable given the simplicity of our approach: it does not require running additional generation models (Lin et al., 2020) or re-scoring negative samples at different epochs (Li et al., 2019).
6 Conclusion

In this work, we propose a novel hierarchical curriculum learning framework for training response selection models for multi-turn conversations. During training, the proposed framework simultaneously employs a corpus-level and an instance-level curriculum to dynamically select suitable training data based on the state of the learning process. Extensive experiments and analyses on two benchmark datasets show that our approach significantly improves the performance of various strong matching models.
References
Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. 2009. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning (ICML 2009), pages 41–48.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT 2019, pages 4171–4186. Association for Computational Linguistics.

Jiazhan Feng, Chongyang Tao, Wei Wu, Yansong Feng, Dongyan Zhao, and Rui Yan. 2019. Learning a matching model with co-teaching for multi-turn response selection in retrieval-based dialogue systems. In Proceedings of ACL 2019, pages 3805–3815. Association for Computational Linguistics.

Jia-Chen Gu, Tianda Li, Quan Liu, Zhen-Hua Ling, Zhiming Su, Si Wei, and Xiaodan Zhu. 2020. Speaker-aware BERT for multi-turn response selection in retrieval-based chatbots. In Proceedings of CIKM 2020, pages 2041–2044. ACM.

Jia-Chen Gu, Zhen-Hua Ling, and Quan Liu. 2019. Interactive matching network for multi-turn response selection in retrieval-based chatbots. In Proceedings of CIKM 2019, pages 2321–2324.

Rudolf Kadlec, Martin Schmid, and Jan Kleindienst. 2015. Improved deep learning baselines for Ubuntu corpus dialogs. CoRR, abs/1510.03753.

Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick S. H. Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. 2020. Dense passage retrieval for open-domain question answering. In Proceedings of EMNLP 2020, pages 6769–6781. Association for Computational Linguistics.

Thomas Kollar, Danielle Berry, Lauren Stuart, Karolina Owczarzak, Tagyoung Chung, Lambert Mathias, Michael Kayser, Bradford Snow, and Spyros Matsoukas. 2018. The Alexa meaning representation language. In Proceedings of NAACL-HLT 2018 (Industry Papers), pages 177–184.

Jia Li, Chongyang Tao, Wei Wu, Yansong Feng, Dongyan Zhao, and Rui Yan. 2019. Sampling matters! An empirical study of negative sampling strategies for learning of matching models in retrieval-based dialogue systems. In Proceedings of EMNLP-IJCNLP 2019, pages 1291–1296. Association for Computational Linguistics.

Zibo Lin, Deng Cai, Yan Wang, Xiaojiang Liu, Hai-Tao Zheng, and Shuming Shi. 2020. The world is not binary: Learning to rank with grayscale data for dialogue response selection.

Cao Liu, Shizhu He, Kang Liu, and Jun Zhao. 2018. Curriculum learning for natural answer generation. In Proceedings of IJCAI 2018, pages 4223–4229.

Ryan Lowe, Nissan Pow, Iulian Serban, and Joelle Pineau. 2015. The Ubuntu Dialogue Corpus: A large dataset for research in unstructured multi-turn dialogue systems. In Proceedings of SIGDIAL 2015.
Proceedings of the SIGDIAL 2015Conference, The 16th Annual Meeting of the Spe-cial Interest Group on Discourse and Dialogue, 2-4 September 2015, Prague, Czech Republic , pages285–294. The Association for Computer Linguis-tics.Junyu Lu, Chenbin Zhang, Zeying Xie, Guang Ling,Tom Chao Zhou, and Zenglin Xu. 2019. Construct-ing interpretive spatio-temporal features for multi-turn responses selection. In
Proceedings of the 57thConference of the Association for ComputationalLinguistics, ACL 2019, Florence, Italy, July 28- Au-gust 2, 2019, Volume 1: Long Papers , pages 44–50.Emmanouil Antonios Platanios, Otilia Stretcu, GrahamNeubig, Barnab´as P´oczos, and Tom M. Mitchell.2019. Competence-based curriculum learning forneural machine translation. In
Proceedings of the2019 Conference of the North American Chapterof the Association for Computational Linguistics:Human Language Technologies, NAACL-HLT 2019,Minneapolis, MN, USA, June 2-7, 2019, Volume 1(Long and Short Papers) , pages 1162–1172. Associ-ation for Computational Linguistics.Alan Ritter, Colin Cherry, and William B. Dolan. 2011.Data-driven response generation in social media. In roceedings of the 2011 Conference on EmpiricalMethods in Natural Language Processing, EMNLP2011, 27-31 July 2011, John McIntyre ConferenceCentre, Edinburgh, UK, A meeting of SIGDAT, a Spe-cial Interest Group of the ACL , pages 583–593.Shreyas Saxena, Oncel Tuzel, and Dennis DeCoste.2019. Data parameters: A new family of param-eters for learning a differentiable curriculum. In
Advances in Neural Information Processing Systems32: Annual Conference on Neural Information Pro-cessing Systems 2019, NeurIPS 2019, 8-14 Decem-ber 2019, Vancouver, BC, Canada , pages 11093–11103.Heung-Yeung Shum, Xiaodong He, and Di Li. 2018.From eliza to xiaoice: Challenges and opportunitieswith social chatbots.
CoRR , abs/1801.01957.Valentin I. Spitkovsky, Hiyan Alshawi, and Daniel Ju-rafsky. 2010. From baby steps to leapfrog: How”less is more” in unsupervised dependency parsing.In
Human Language Technologies: Conference ofthe North American Chapter of the Association ofComputational Linguistics, Proceedings, June 2-4,2010, Los Angeles, California, USA , pages 751–759.The Association for Computational Linguistics.Ming Tan, Cicero dos Santos, Bing Xiang, and BowenZhou. 2016. Lstm-based deep learning models fornon-factoid answer selection.Chongyang Tao, Wei Wu, Can Xu, Wenpeng Hu,Dongyan Zhao, and Rui Yan. 2019. One time ofinteraction may not be enough: Go deep with aninteraction-over-interaction network for response se-lection in dialogues. In
Proceedings of the 57th Con-ference of the Association for Computational Lin-guistics, ACL 2019, Florence, Italy, July 28- August2, 2019, Volume 1: Long Papers , pages 1–11.Ashish Vaswani, Noam Shazeer, Niki Parmar, JakobUszkoreit, Llion Jones, Aidan N. Gomez, LukaszKaiser, and Illia Polosukhin. 2017. Attention is allyou need. In
Advances in Neural Information Pro-cessing Systems 30: Annual Conference on NeuralInformation Processing Systems 2017, 4-9 Decem-ber 2017, Long Beach, CA, USA , pages 5998–6008.Shengxian Wan, Yanyan Lan, Jun Xu, Jiafeng Guo,Liang Pang, and Xueqi Cheng. 2016. Match-srnn:Modeling the recursive matching structure with spa-tial RNN. In
Proceedings of the Twenty-Fifth Inter-national Joint Conference on Artificial Intelligence,IJCAI 2016, New York, NY, USA, 9-15 July 2016 ,pages 2922–2928. IJCAI/AAAI Press.Hao Wang, Zhengdong Lu, Hang Li, and Enhong Chen.2013. A dataset for research on short-text conversa-tions. In
Proceedings of the 2013 Conference onEmpirical Methods in Natural Language Process-ing, EMNLP 2013, 18-21 October 2013, Grand Hy-att Seattle, Seattle, Washington, USA, A meeting ofSIGDAT, a Special Interest Group of the ACL , pages935–945. ACL. Shuohang Wang and Jing Jiang. 2016. Learning natu-ral language inference with LSTM. In
NAACL HLT2016, The 2016 Conference of the North AmericanChapter of the Association for Computational Lin-guistics: Human Language Technologies, San DiegoCalifornia, USA, June 12-17, 2016 , pages 1442–1451. The Association for Computational Linguis-tics.Yu Wu, Wei Wu, Zhoujun Li, and Ming Zhou. 2018.Learning matching models with weak supervisionfor response selection in retrieval-based chatbots. In
Proceedings of the 56th Annual Meeting of the As-sociation for Computational Linguistics, ACL 2018,Melbourne, Australia, July 15-20, 2018, Volume 2:Short Papers , pages 420–425.Yu Wu, Wei Wu, Chen Xing, Ming Zhou, and Zhou-jun Li. 2017. Sequential matching network: Anew architecture for multi-turn response selectionin retrieval-based chatbots. In
Proceedings of the55th Annual Meeting of the Association for Compu-tational Linguistics, ACL 2017, Vancouver, Canada,July 30 - August 4, Volume 1: Long Papers , pages496–505.Rui Yan, Yiping Song, and Hua Wu. 2016. Learningto respond with deep neural networks for retrieval-based human-computer conversation system. In
Pro-ceedings of the 39th International ACM SIGIR con-ference on Research and Development in Informa-tion Retrieval, SIGIR 2016, Pisa, Italy, July 17-21,2016 , pages 55–64. ACM.Chunyuan Yuan, Wei Zhou, Mingming Li, ShangwenLv, Fuqing Zhu, Jizhong Han, and Songlin Hu. 2019.Multi-hop selector network for multi-turn responseselection in retrieval-based chatbots. In
Proceedingsof the 2019 Conference on Empirical Methods inNatural Language Processing and the 9th Interna-tional Joint Conference on Natural Language Pro-cessing, EMNLP-IJCNLP 2019, Hong Kong, China,November 3-7, 2019 , pages 111–120. Associationfor Computational Linguistics.Zhuosheng Zhang, Jiangtong Li, Pengfei Zhu, HaiZhao, and Gongshen Liu. 2018. Modeling multi-turn conversation with deep utterance aggregation.In
Proceedings of the 27th International Conferenceon Computational Linguistics, COLING 2018, SantaFe, New Mexico, USA, August 20-26, 2018 , pages3740–3752. Association for Computational Linguis-tics.Xiangyang Zhou, Daxiang Dong, Hua Wu, Shiqi Zhao,Dianhai Yu, Hao Tian, Xuan Liu, and Rui Yan.2016. Multi-view response selection for human-computer conversation. In
Proceedings of the 2016Conference on Empirical Methods in Natural Lan-guage Processing, EMNLP 2016, Austin, Texas,USA, November 1-4, 2016 , pages 372–381. The As-sociation for Computational Linguistics.Xiangyang Zhou, Lu Li, Daxiang Dong, Yi Liu, YingChen, Wayne Xin Zhao, Dianhai Yu, and Hua Wu.018. Multi-turn response selection for chatbotswith deep attention matching network. In