Ultra-Fast, Low-Storage, Highly Effective Coarse-grained Selection in Retrieval-based Chatbot by Using Deep Semantic Hashing
Tian Lan, Xian-Ling Mao, Xiaoyan Gao, Wei Wei, and Heyan Huang
Beijing Institute of Technology
[email protected], {maoxl,xygao,hhy63}@bit.edu.cn
Huazhong University of Science and Technology
Abstract
We study the coarse-grained selection module in retrieval-based chatbots. Coarse-grained selection is a basic module in a retrieval-based chatbot, which constructs a rough candidate set from the whole database to speed up the interaction with customers. So far, there are two kinds of approaches for the coarse-grained selection module: (1) sparse representation; (2) dense representation. To the best of our knowledge, there is no systematic comparison between these two approaches in retrieval-based chatbots, and which kind of method is better in real scenarios is still an open question. In this paper, we first systematically compare these two methods from four aspects: (1) effectiveness; (2) index storage; (3) search time cost; (4) human evaluation. Extensive experiment results demonstrate that the dense representation method significantly outperforms the sparse representation, but costs more time and storage. In order to overcome these fatal weaknesses of the dense representation method, we propose an ultra-fast, low-storage, and highly effective Deep Semantic Hashing Coarse-grained selection method, called DSHC. Specifically, in our proposed DSHC model, a hashing optimizing module that consists of two autoencoder models is stacked on a trained dense representation model, and three loss functions are designed to optimize it. The hash codes provided by the hashing optimizing module effectively preserve the rich semantic and similarity information in the dense vectors. Extensive experiment results prove that our proposed DSHC model can achieve much faster speed and lower storage than sparse representation, with limited performance loss compared with dense representation. Besides, our source codes have been publicly released for future research: https://github.com/gmftbyGMFTBY/HashRetrieval.

Retrieval technique, or response selection, is a very popular and elegant approach to framing a chatbot, i.e., an open-domain dialog system. Given the conversation context, a retrieval-based chatbot aims to select the most appropriate utterance as a response from a pre-constructed database. In order to balance effectiveness and efficiency, most retrieval-based chatbots (Fu et al., 2020) employ a coarse-grained selection module to recall a set of candidates that are semantically coherent with the conversation context, so as to speed up processing.

To the best of our knowledge, there are two kinds of approaches to build a coarse-grained selection module in retrieval-based chatbots: (1) sparse representation: TF-IDF or BM25 (Robertson and Zaragoza, 2009) is a widely used method. It matches keywords with an inverted index and can be seen as representing utterances as high-dimensional sparse vectors (Karpukhin et al., 2020); (2) dense representation: large-scale pre-trained language models (PLMs), e.g. BERT (Devlin et al., 2019), are commonly used to obtain the semantic representation of utterances, which can be used to recall semantically coherent candidates by using cosine similarity (Karpukhin et al., 2020).

So far, there is no systematic comparison between these two kinds of approaches in retrieval-based chatbots, and which kind of method is most appropriate in real scenarios is still an open question that confuses researchers in the dialog system community. Thus, in this paper, we first conduct extensive experiments to compare these two approaches from four important aspects: (1) effectiveness; (2) search time cost; (3) index storage occupation; (4) human evaluation.
Extensive experiment results on four popular response selection datasets demonstrate that the dense representation significantly outperforms the sparse representation, at the expense of lower speed and bigger storage than the sparse representation, which is intolerable in real scenarios. Then, in order to overcome the fatal weaknesses of dense representation methods, we propose an ultra-fast, low-storage, and highly effective Deep Semantic Hashing Coarse-grained selection module (DSHC), which effectively balances effectiveness and efficiency. Specifically, we first stack a novel hashing optimizing module that consists of two autoencoders on a given dense representation method. Then, three well designed loss functions are used to optimize the two autoencoders in the hashing optimizing module: (1) preserved loss; (2) hash loss; (3) quantization loss. After training, the autoencoders can effectively preserve the rich semantic and similarity information of the dense vectors in the hash codes, which are very computation- and storage-efficient (Wang et al., 2018). Extensive experiment results on four popular response selection datasets demonstrate that our proposed DSHC model achieves much faster search speed and lower storage occupation than the sparse representation method, with very limited performance loss compared with the given dense representation method.

In this paper, our contributions are three-fold:

• We systematically compare the current two kinds of coarse-grained selection methods in open-domain retrieval-based dialog systems from four important aspects: (1) effectiveness; (2) search time cost; (3) storage occupation; (4) human evaluation.

• We propose an ultra-fast, low-storage, and highly effective deep semantic hashing coarse-grained selection method, called DSHC, which overcomes the fatal weaknesses of the dense representation method.

• We have publicly released our source codes for future research.

The rest of this paper is organized as follows: we introduce the important concepts and background covered in our paper in Section 2. The experiment settings are presented in Section 3. In Section 4, we systematically compare the current two kinds of methods for the coarse-grained selection module: (1) sparse representation; (2) dense representation. In Section 5, we introduce our proposed DSHC model and elaborate on detailed experiment results. In Section 6, we conduct the case study. Finally, we conclude our work in Section 7. Due to the page limitation, more details and extra analysis can be found in the Appendix.

Retrieval-based chatbots, or retrieval-based open-domain dialog systems, which are widely used in real scenarios, have made great progress over the past few years. So far, most retrieval-based chatbots contain two modules (Fu et al., 2020; Luan et al., 2020): a coarse-grained selection module and a fine-grained selection module.
The coarse-grained selection module recalls a set of candidate responses that are semantically coherent with the conversation context from the pre-constructed database. As described before, there are two kinds of approaches to construct a coarse-grained selection module: sparse and dense representation.
Sparse representation:
Due to their simple implementation and effective performance, sparse representation methods, represented by TF-IDF and BM25 (Robertson and Zaragoza, 2009), have been widely used in lots of real applications. Because an utterance that has keyword overlap with the conversation context is likely to be an appropriate candidate response, sparse representation can effectively recall appropriate candidates for the fine-grained selection module. The advantage of this method is that it runs very quickly. As shown in Table 6, with the help of well designed data structures, such as the inverted index and the skip list, it achieves the best computational complexity, O(log n). However, there are still lots of appropriate candidate responses that have no word overlap with the conversation context but have a very high semantic correlation with it. Sparse representation cannot effectively find these cases in the pre-constructed database, which may lead to bad performance. For example, as shown in Table 1, the ratio of ground-truths that can be retrieved by considering word overlap is low.

Dense representation: Recently, dense representation methods, represented by the dual-encoder architecture (Lowe et al., 2015; Tahami et al., 2020; Humeau et al., 2020; Karpukhin et al., 2020), have attracted increasing attention from researchers, because the rich semantic information can be effectively leveraged. Besides, large-scale pre-trained language models (PLMs) significantly boost the performance of dense representation methods. As shown in Figure 1, a dense representation method that leverages the dual-encoder architecture contains two modules: (1) semantic encoders (Humeau et al., 2020; Tahami et al., 2020; Karpukhin et al., 2020) are used to obtain the semantic representations of the context and the candidate responses. It should be noted that the context semantic encoder and the candidate semantic encoder do not share parameters, and are optimized separately during training; (2) the matching degree is calculated by using the dot product or cosine similarity, and the utterances with the Top-K matching degrees are selected as candidates. However, due to the high computational burden of similarity calculation, the dense representation method runs very slowly. As shown in Table 6, its computational complexity is much bigger than that of sparse representation methods.
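To make the dual-encoder pipeline concrete, the following is a minimal sketch of dense coarse-grained selection; it is not the released implementation of this paper, and the backbone name, [CLS] pooling, and the toy texts are assumptions for illustration only.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Two independent semantic encoders, one for the context and one for the candidates,
# mirroring the dual-encoder architecture in Figure 1 (parameters are not shared).
name = "bert-base-chinese"                      # assumed PLM backbone
tok = AutoTokenizer.from_pretrained(name)
ctx_encoder = AutoModel.from_pretrained(name)
can_encoder = AutoModel.from_pretrained(name)

@torch.no_grad()
def embed(encoder, texts):
    batch = tok(texts, padding=True, truncation=True, return_tensors="pt")
    # Use the [CLS] hidden state as the utterance representation (a common choice).
    return encoder(**batch).last_hidden_state[:, 0]

context = ["Hello, has my order been shipped?"]          # toy conversation context
candidates = ["It has been shipped, please wait.", "The weather is nice today."]

e_ctx = embed(ctx_encoder, context)             # shape (1, 768)
e_can = embed(can_encoder, candidates)          # shape (2, 768)

# Matching degree by dot product; the Top-K utterances are recalled as candidates.
scores = e_ctx @ e_can.T
topk = scores.topk(k=2, dim=-1).indices
print(scores, topk)
```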
Figure 1: The dual-encoder architecture for the coarse-grained selection module. x_i and y_i are the i-th tokens in the conversation context and the candidate. The matching degree can be obtained by using the dot product (Humeau et al., 2020).

Based on the candidate responses provided by the coarse-grained selection module, the fine-grained selection module selects the most appropriate one as the final response to the given conversation context. Over the past few years, numerous works have been proposed to improve the performance of the fine-grained selection module in retrieval-based chatbots (Wu et al., 2017; Zhang et al., 2018; Zhou et al., 2018; Tao et al., 2019b; Gu et al., 2019; Tao et al., 2019a; Yuan et al., 2019). In particular, recent works (Whang et al., 2019; Gu et al., 2020) achieve state-of-the-art results for fine-grained selection by using large-scale pre-trained language models (PLMs), e.g. BERT (Devlin et al., 2019). However, because of diminishing returns (Bisk et al., 2020), it becomes more and more difficult to improve open-domain dialog systems by updating the fine-grained selection module. Compared with the fine-grained selection module, there are very few works studying the coarse-grained selection module, which is a potential breakthrough for further improving retrieval-based open-domain dialog systems and is ignored by most works. In this paper, a fine-grained selection module serves two purposes: (1) constructing a reliable metric that measures the average correlation between the conversation context and the candidates; (2) building retrieval-based chatbots with different coarse-grained selection modules to measure their overall performance.
Due to the computational and storage efficiency of compact binary hash codes, hashing methods have been widely used for large-scale similarity search (Xu et al., 2015). The main methodology of deep hashing is similarity preserving, i.e., minimizing the gap between the similarities computed in the original space and the similarities in the hash code space (Wang et al., 2018). After optimizing, the hash codes preserve the rich semantic and similarity information of the original dense vectors.
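As a small illustration of why hash codes are computation- and storage-efficient (a sketch with made-up sizes, not the DSHC codes themselves): a 768-dimensional float32 vector costs 3,072 bytes, while a 128-bit code costs 16 bytes, and the Hamming distance reduces to an XOR plus a popcount.

```python
import numpy as np

rng = np.random.default_rng(0)
h = 128                                                      # hash code length in bits

dense = rng.standard_normal((2, 768)).astype(np.float32)    # two dense vectors
codes = np.sign(rng.standard_normal((2, h)))                 # stand-in {-1, +1} codes
packed = np.packbits((codes > 0).astype(np.uint8), axis=1)   # 16 bytes per code

# Hamming distance: XOR the packed bytes, then count the set bits.
hamming = int(np.unpackbits(packed[0] ^ packed[1]).sum())

print(dense[0].nbytes, packed[0].nbytes, hamming)            # 3072 bytes vs. 16 bytes
```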
In this paper, we select four popular Chinese open-domain dialog datasets:
• E-Commerce Corpus (Zhang et al., 2018) is collected from real-world conversations between customers and service staff on the largest e-commerce platform, Taobao. It is commonly used to test multi-turn response selection models (Zhang et al., 2018; Yuan et al., 2019).

• Douban Corpus (Wu et al., 2017) is another popular response selection dataset, which contains dyadic dialogs crawled from the Douban Group. It should be noted that, in the original test dataset, each conversation context may have multiple ground-truths, and we ignore these cases in this paper.

• Zh50w Corpus is a Chinese open-domain dialog corpus crawled from the Weibo social network platform (https://github.com/yangjianxin1/GPT2-chitchat), which has more casual conversations than the Douban Corpus and the E-Commerce Corpus.

• LCCC Corpus (Wang et al., 2020) is a large-scale cleaned Chinese open-domain conversation dataset. The quality of the LCCC Corpus is ensured by a rigorous data cleaning pipeline, which is built based on a set of rules and a classifier. The original LCCC Corpus is very large, and we randomly sample 2 million conversations in this paper.

For each corpus, we save all of the responses in the train and test datasets into the corresponding pre-constructed database, which is used by the coarse-grained selection module. The details of these datasets are shown in Table 1.
Datasets     Train  Test    Retrieval Ratio  Database Size
E-Commerce   1M     1,000   46.81%           109,105
Douban       1M     667     54.57%           442,280
Zh50w        1M     3,000   28.5%            388,614
LCCC         4M     10,000  33.59%           1,651,899

Table 1: Data statistics of the four datasets. Retrieval Ratio is the proportion of samples that can be retrieved by the sparse representation method. Database Size is the number of utterances saved in the pre-constructed database.
In this paper, three coarse-grained selection methods are measured: (1) BM25 (Robertson and Zaragoza, 2009): following previous works (Karpukhin et al., 2020; Xiong et al., 2020; Luan et al., 2020), we select the BM25 sparse representation method, which is widely used in real scenarios; (2) Dense (Karpukhin et al., 2020): we select Dense as the dense representation method, which uses the PLM-based dual-encoder architecture to construct the coarse-grained selection module (Karpukhin et al., 2020; Luan et al., 2020); (3) DSHC: our proposed deep semantic hashing based coarse-grained selection method. More details are given in Section 5.

To implement the BM25 method, Elasticsearch, a very powerful search engine based on the Lucene library, is used in this paper. For the Dense and DSHC methods, following previous works (Xiong et al., 2020; Karpukhin et al., 2020), the FAISS toolkit (Johnson et al., 2017; Karpukhin et al., 2020; https://github.com/facebookresearch/faiss) is used. Besides, GPU devices (GeForce GTX 1080 Ti) are used to accelerate the searching process.
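As a rough illustration of the Dense setup described above, the sketch below builds an exact inner-product FAISS index and searches it in batches; the random matrices stand in for the encoder outputs, and the sizes are arbitrary.

```python
import numpy as np
import faiss

d, n, bsz = 768, 100_000, 16                          # vector dim, database size, query batch
database = np.random.rand(n, d).astype("float32")     # stand-in for candidate embeddings
queries = np.random.rand(bsz, d).astype("float32")    # stand-in for context embeddings

index = faiss.IndexFlatIP(d)                          # exact dot-product (inner-product) search
index.add(database)

# Optional GPU acceleration (requires faiss-gpu):
# res = faiss.StandardGpuResources()
# index = faiss.index_cpu_to_gpu(res, 0, index)

scores, ids = index.search(queries, 100)              # Top-100 candidates per context
print(ids.shape)                                      # (16, 100)
```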
To measure the performance of these coarse-grained selection modules in real scenarios, we select four important evaluation metrics:

Effectiveness: Following previous works (Xiong et al., 2020; Karpukhin et al., 2020), the Coverage@20/100 (Top-20/100) metric is used to evaluate whether the Top-20/100 retrieved candidates include the ground-truth response. However, during testing, we find that this metric is not appropriate for measuring the effectiveness of the coarse-grained selection module, for the following reason: the Top-20/100 metric only shows whether a single ground-truth response can be retrieved; it cannot reflect the quality of all of the retrieved candidates. A good coarse-grained selection module should recall candidates that are all semantically coherent with the given conversation context, not just one. Thus, in this paper, we propose Correlation@20/100 (Correlation-20/100) as a more reliable metric to measure effectiveness. Specifically, we leverage a state-of-the-art fine-grained selection module (Whang et al., 2019), a fine-tuned BERT model, to provide the average correlation scores of the retrieved candidates.
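The Correlation@K metric can be computed as sketched below, where fine_grained_score is assumed to be the fine-tuned BERT fine-grained selection model of Whang et al. (2019), treated here as a black-box scorer in [0, 1].

```python
def correlation_at_k(contexts, retrieved, fine_grained_score, k=20):
    """Average correlation score over the Top-k retrieved candidates of every context.

    contexts:           list of conversation contexts
    retrieved:          one ranked candidate list per context
    fine_grained_score: black-box scorer, e.g. a fine-tuned BERT fine-grained
                        selection model (assumed interface: (context, candidate) -> float)
    """
    total, count = 0.0, 0
    for ctx, cands in zip(contexts, retrieved):
        for cand in cands[:k]:
            total += fine_grained_score(ctx, cand)
            count += 1
    return total / max(count, 1)
```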
Search Time Cost: Search time cost is a core metric in a real application, which directly influences the interaction speed between chatbots and customers. In this paper, we record the average time cost (in milliseconds) for the coarse-grained selection module to search candidates for a batch of bsz conversation contexts from the test dataset, where bsz = 16.

Index Storage: Every coarse-grained selection module constructs an index off-line to search the candidates. For the sparse representation method, the index is an inverted index storing a mapping from keywords to their locations in the candidate responses. For the dense representation method, the index is a huge matrix M ∈ R^{n×d} that saves the dense vectors of all candidate responses, where n is the number of utterances in the pre-constructed database and d is the length of the vectors.

Human Evaluation: In the dialog system research community, human evaluation is the most reliable metric to measure the performance of dialog systems (Liu et al., 2016; Tao et al., 2018). In this paper, for each corpus, three crowd-sourced annotators are employed to evaluate the quality of the generated responses for 200 randomly sampled conversation contexts. It should be noted that the responses are generated by a whole retrieval-based chatbot, which consists of one coarse-grained selection module (BM25, Dense, or DSHC) and a state-of-the-art fine-grained selection module (Whang et al., 2019). During the evaluation, the annotators are asked to select a preferred response, or vote a tie, between two responses generated by two retrieval-based chatbots. Besides, Cohen's kappa scores (Cohen, 1960) are used to measure inter-annotator agreement.
In this section, we measure the performance of the two kinds of coarse-grained selection modules. The experiment results are shown in Table 2 and Table 3, and we can draw the following conclusions:
Effectiveness: As shown in Table 3, the dense representation method shows worse performance than the BM25 method on the Top-20/100 metrics. As described before, the Top-20/100 metrics are questionable for measuring the quality of the retrieved candidates, because they cannot consider the average coherence between the candidates and the given conversation context. As for the Correlation-20/100 metrics, the dense representation significantly outperforms the BM25 method. For example, compared with the BM25 method, the dense representation method achieves an average 19.89% absolute improvement on the Correlation-20 metric, which demonstrates that the candidates retrieved by the dense representation method are more semantically coherent with the conversation context.
Index Storage: Referring to the results in the sixth column of Table 3, the dense representation method requires more than 200 times the average index storage of the BM25 method. As shown in Figure 3 (b), the index storage is even much bigger than the pre-constructed database itself. The index storage becomes too big to use in real scenarios as more and more utterances are saved in the pre-constructed database.
Search Time Cost: Referring to the results in the seventh column of Table 3, although the computational complexity of the Dense method is much bigger than that of the BM25 method, Dense achieves a smaller search time cost than BM25 on the E-Commerce Corpus and the Douban Corpus, with the help of the parallel computing provided by GPU devices. However, if the size of the pre-constructed database becomes huge, the dense representation method still costs more time than the BM25 method, for example, on the LCCC Corpus.
Human Evaluation: As shown in Table 2, the dense representation method provides more preferable responses than the BM25 method on these four datasets, which indicates that the rich semantic information captured by the dense representation does improve the response quality.
Dense vs. BM25   Win      Loss   Tie   Kappa
E-Commerce       0.5917
Douban           0.4783
Zh50w            0.5017
LCCC             0.5233

Table 2: Human evaluation of Dense vs. BM25 on four datasets. Very high Cohen's kappa scores prove the high consistency among the annotators.
Compared with the BM25 method, the dense representation method achieves better performance but costs more time and index storage, which is unsatisfactory in real scenarios. In order to overcome these fatal weaknesses, in the next section we propose a novel deep semantic hashing based coarse-grained selection module, called DSHC.
The overview of our proposed DSHC model is shown in Figure 2, which contains two parts: conversation embedding and hashing optimizing. For the conversation embedding part, we leverage a trained dense representation coarse-grained selection module.
(a) Experiment results on E-Commerce Corpus.
Methods       Top-20  Top-100  Correlation-20  Correlation-100  Index Storage  Search Time Cost (20/100)
BM25
Dense (gpu)   0.204   0.413    0.9537          0.9203           320 Mb

(b) Experiment results on Douban Corpus.
Methods       Top-20  Top-100  Correlation-20  Correlation-100  Index Storage  Search Time Cost (20/100)
BM25          0.063
Dense (gpu)

(c) Experiment results on Zh50w Corpus.
Methods       Top-20  Top-100  Correlation-20  Correlation-100  Index Storage  Search Time Cost (20/100)
BM25          0.0627  0.1031
Dense (gpu)

(d) Experiment results on LCCC Corpus.
Methods       Top-20  Top-100  Correlation-20  Correlation-100  Index Storage  Search Time Cost (20/100)
BM25          0.0376                                            44 Mb          190.1ms/247ms
Dense (gpu)

Table 3: The comparison between the BM25 method and the Dense method. The Dense method significantly outperforms the BM25 method, but costs more time and index storage.
Given a conversation context {x_i}^n_{i=1} and a candidate response {y_i}^m_{i=1}, where n and m are the numbers of tokens, the conversation embedding part separately encodes them into the dense embeddings e_ctx and e_can. The trained dense representation method ensures that the dense vector e_can of an appropriate response is very similar to the context embedding e_ctx; otherwise it is not.

For the hashing optimizing part, the DSHC model optimizes two deep autoencoders to generate the hash codes h_ctx and h_can for e_ctx and e_can by minimizing an objective function that consists of three loss functions: quantization loss, hash loss, and preserved loss. Specifically, the hashing optimizing part first encodes the dense embeddings into the output vectors o_ctx and o_can:

  o_ctx = Encoder_ctx(e_ctx),  o_ctx ∈ R^h
  o_can = Encoder_can(e_can),  o_can ∈ R^h
  h_ctx = sign(o_ctx),  h_ctx ∈ {−1, +1}^h
  h_can = sign(o_can),  h_can ∈ {−1, +1}^h    (1)

where h is the hash code size. During inference, the sign(·) function is used to convert o_ctx and o_can into the hash codes h_ctx and h_can. Then, the hashing optimizing part reconstructs the dense embeddings from o_ctx and o_can:

  E_ctx = Decoder_ctx(o_ctx),  E_ctx ∈ R^768
  E_can = Decoder_can(o_can),  E_can ∈ R^768    (2)

where E_ctx and E_can are the reconstructed dense embeddings, which assist in optimizing the hash codes.
Figure 2: The overview of our proposed DSHC model for retrieval-based chatbots. The DSHC model contains two parts: conversation embedding and hashing optimizing.

Our proposed DSHC model aims to compress the dense vectors e_ctx and e_can into semantic- and similarity-preserving hash codes h_ctx and h_can that can be efficiently computed in real scenarios. Besides, the hash code h_can of an appropriate response should be very similar to the hash code h_ctx of the conversation context; otherwise it is not. In order to achieve this goal, we design three loss functions to optimize the hashing optimizing part: (1) preserved loss; (2) hash loss; (3) quantization loss.

Preserved loss: To preserve the rich semantic information of the dense vectors in the hash codes, the reconstructed dense embeddings E_ctx and E_can should be similar to e_ctx and e_can. Thus, we design the preserved loss to measure the difference between e_ctx and E_ctx, and between e_can and E_can, using L2 (Euclidean) norms:

  L_p = ‖e_ctx − E_ctx‖_2 + ‖e_can − E_can‖_2    (3)

Hash loss: Although the preserved loss ensures that o_ctx and o_can contain the rich semantic information of e_ctx and e_can, there is still no way to measure the similarity between the conversation context hash codes and the candidate hash codes. In order to ensure that the hash codes preserve the semantic similarity between the conversation context and the candidate response, the hash loss is designed. For hash codes in Hamming space, if the similarity S(o_ctx, o_can) = 1, i.e., the candidate is appropriate to the context, the Hamming distance ‖o_ctx − o_can‖_H = (h − o_ctx^T o_can)/2 between o_ctx and o_can should be equal to 0, which indicates that o_ctx^T o_can should be equal to h, where h is the dimension of the hash codes; if the similarity S(o_ctx, o_can) = 0, i.e., the candidate is inappropriate to the context, the Hamming distance ‖o_ctx − o_can‖_H should be equal to h/2, which indicates that o_ctx^T o_can should be equal to 0. Therefore, the hash loss is designed as follows:

  L_h = ‖o_ctx^T o_can − h · S(o_ctx, o_can)‖_2,  s.t. S(o_ctx, o_can) ∈ {0, 1}    (4)

Quantization loss: So far, the preserved loss and the hash loss ensure that o_ctx and o_can preserve the semantic information and the similarity between them. However, during inference, the hash codes h_ctx and h_can, which are roughly converted by the sign(·) function, are used to search the candidates. In order to narrow the gap between h_ctx and o_ctx, and between h_can and o_can, the quantization loss (Wang et al., 2018) is used to ensure that each element of o_ctx and o_can is close to "+1" or "-1":

  L_q = ‖h_ctx − o_ctx‖_2 + ‖h_can − o_can‖_2    (5)

Finally, the overall objective function is obtained as follows:

  L = L_p + L_h + γ_t · L_q,  s.t. γ_t = γ_min + (γ_max − γ_min) / T · t    (6)

where γ_t is a hyperparameter that dynamically balances the optimization of the hash loss and the quantization loss between its minimum value γ_min and maximum value γ_max, T is the number of mini-batches in one epoch, and t ∈ {0, 1, ..., T − 1} is the current running step.
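A minimal PyTorch sketch of the hashing optimizing part is given below; it is not the authors' released implementation. The hidden size, Tanh activations, and batch reduction are assumptions, while the three loss terms follow Eqs. (3)-(5) as written above.

```python
import torch
import torch.nn as nn

class HashingOptimizer(nn.Module):
    """Autoencoder pair that compresses 768-d dense vectors into h-bit hash codes."""
    def __init__(self, dim=768, h=512, hidden=1024):
        super().__init__()
        self.enc_ctx = nn.Sequential(nn.Linear(dim, hidden), nn.Tanh(), nn.Linear(hidden, h))
        self.dec_ctx = nn.Sequential(nn.Linear(h, hidden), nn.Tanh(), nn.Linear(hidden, dim))
        self.enc_can = nn.Sequential(nn.Linear(dim, hidden), nn.Tanh(), nn.Linear(hidden, h))
        self.dec_can = nn.Sequential(nn.Linear(h, hidden), nn.Tanh(), nn.Linear(hidden, dim))
        self.h = h

    def forward(self, e_ctx, e_can, S):
        o_ctx, o_can = self.enc_ctx(e_ctx), self.enc_can(e_can)     # Eq. (1), before sign()
        E_ctx, E_can = self.dec_ctx(o_ctx), self.dec_can(o_can)     # Eq. (2)
        h_ctx, h_can = torch.sign(o_ctx), torch.sign(o_can)         # hash codes at inference

        # Preserved loss: reconstructed embeddings stay close to the inputs (Eq. 3).
        L_p = (e_ctx - E_ctx).norm(dim=-1).mean() + (e_can - E_can).norm(dim=-1).mean()
        # Hash loss: code inner product should be h for positive pairs, 0 otherwise (Eq. 4).
        L_h = (((o_ctx * o_can).sum(-1) - self.h * S) ** 2).mean()
        # Quantization loss: push each element of o towards +1 or -1 (Eq. 5).
        L_q = (h_ctx - o_ctx).norm(dim=-1).mean() + (h_can - o_can).norm(dim=-1).mean()
        return L_p, L_h, L_q

model = HashingOptimizer()
e_ctx, e_can = torch.randn(4, 768), torch.randn(4, 768)   # stand-in dense embeddings
S = torch.tensor([1., 0., 1., 0.])                        # 1 = appropriate (context, response) pair
L_p, L_h, L_q = model(e_ctx, e_can, S)
gamma_t = 1e-4                                            # dynamically scheduled by Eq. (6) in the paper
loss = L_p + L_h + gamma_t * L_q
loss.backward()
```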
In this section, we carefully compare three coarse-grained selection methods: (1) BM25; (2) Dense; (3) our proposed DSHC model.

Effectiveness: As shown in Table 4, our proposed DSHC model significantly outperforms the BM25 method. Besides, the performance of the DSHC model is very close to that of the Dense representation method, which indicates that our proposed DSHC model effectively preserves the rich semantic information and the similarity information between the conversation context and the candidate response. For example, there is only a 2.6% absolute average decline on the Correlation-20 metric for the DSHC-512 model. Given that compressed binary hash codes lose a lot of information, these results are quite good.
Index Storage: Furthermore, as shown in the sixth column of Table 4, the index storage of our proposed DSHC model is much smaller than that of the Dense method, and even smaller than that of the BM25 method when the dimension of the hash codes is 128.
Search Time Cost: Moreover, although the computational complexity of computing the Hamming distance is worse than that of BM25, with the help of the very high computational efficiency of hash codes and the parallel computing provided by GPU devices, our proposed DSHC model still achieves a smaller search time cost, i.e., a faster search speed, than the BM25 method. For example, the DSHC-128 model is nearly 15x faster than the widely used BM25 method.
Human Evaluation: Finally, we also conduct a human evaluation to measure the performance more accurately. As shown in Table 5 (a), the performance of the DSHC and Dense methods is very close. Quite surprisingly, our proposed DSHC model is even better than the Dense method on the LCCC Corpus. Besides, from Table 5 (b), the DSHC model significantly outperforms the widely used BM25 method, since the DSHC model wins most of the time. The very high Cohen's kappa scores demonstrate that the decisions of the annotators are highly consistent.
(a) Experiment results on E-Commerce Corpus.
Methods          Top-20  Top-100  Correlation-20  Correlation-100  Index Storage  Search Time Cost (20/100)
BM25
Dense (gpu)                                                        320 Mb         389.3ms/401.5ms
DSHC-128 (gpu)

(b) Experiment results on Douban Corpus.
Methods          Top-20  Top-100  Correlation-20  Correlation-100  Index Storage  Search Time Cost (20/100)
BM25             0.063
Dense (gpu)
DSHC-128 (gpu)

(c) Experiment results on Zh50w Corpus.
Methods          Top-20  Top-100  Correlation-20  Correlation-100  Index Storage  Search Time Cost (20/100)
BM25             0.0627  0.1031
Dense (gpu)
DSHC-128 (gpu)

(d) Experiment results on LCCC Corpus.
Methods          Top-20  Top-100  Correlation-20  Correlation-100  Index Storage  Search Time Cost (20/100)
BM25             0.0376
Dense (gpu)
DSHC-128 (gpu)                                                     26 Mb          20.4ms/24.4ms
DSHC-512 (gpu)

Table 4: Parameters 128 and 512 are the dimensions of the hash codes h in our proposed DSHC model.

(a) Human evaluation of Dense vs. DSHC.
Dense vs. DSHC   Win      Loss   Tie   Kappa
E-Commerce
Douban
Zh50w            0.395
LCCC

(b) Human evaluation of DSHC vs. BM25.
DSHC vs. BM25    Win      Loss   Tie   Kappa
E-Commerce       0.6017
Douban           0.4767
Zh50w            0.4733
LCCC             0.5317

Table 5: Human evaluation on four datasets. Very high Cohen's kappa scores prove the high consistency among the annotators.
Due to the page limitation, cases are shown in Table 8 in the Appendix. Referring to these cases, it can be found that the retrieval-based chatbots that use the dense representation and our proposed DSHC method provide responses that are more semantically coherent with the given conversation context than the BM25 method. Besides, the responses given by the dense representation method and the DSHC method are both very appropriate, which proves the effectiveness of our proposed DSHC model.
In this paper, we first systematically compare the dense and sparse representation methods in retrieval-based chatbots from four important aspects: (1) effectiveness; (2) search time cost; (3) index storage; (4) human evaluation. Extensive experiment results demonstrate that the dense representation method achieves better performance at the expense of more time cost and higher storage occupation. In order to overcome these fatal weaknesses, we propose a deep semantic hashing based coarse-grained selection method (DSHC). Extensive experiment results prove the effectiveness and the efficiency of the DSHC model.
References
Yonatan Bisk, Ari Holtzman, Jesse Thomason, Jacob Andreas, Yoshua Bengio, Joyce Chai, M. Lapata, A. Lazaridou, Jonathan May, Aleksandr Nisnevich, Nicolas Pinto, and Joseph P. Turian. 2020. Experience grounds language. In EMNLP.

J. Cohen. 1960. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20:37–46.

J. Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT.

Zhenxin Fu, Shaobo Cui, Mingyue Shang, Feng Ji, Dongyan Zhao, H. Chen, and R. Yan. 2020. Context-to-session matching: Utilizing whole session for response selection in information-seeking dialogue systems. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining.

Jia-Chen Gu, Tianda Li, Q. Liu, Xiao-Dan Zhu, Zhenhua Ling, Zhiming Su, and Si Wei. 2020. Speaker-aware BERT for multi-turn response selection in retrieval-based chatbots. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management.

Jia-Chen Gu, Z. Ling, and Q. Liu. 2019. Interactive matching network for multi-turn response selection in retrieval-based chatbots. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management.

Samuel Humeau, Kurt Shuster, Marie-Anne Lachaux, and J. Weston. 2020. Poly-encoders: Architectures and pre-training strategies for fast and accurate multi-sentence scoring. In ICLR.

Jeff Johnson, Matthijs Douze, and Hervé Jégou. 2017. Billion-scale similarity search with GPUs. arXiv preprint arXiv:1702.08734.

V. Karpukhin, Barlas Oğuz, Sewon Min, Patrick Lewis, Ledell Yu Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. 2020. Dense passage retrieval for open-domain question answering. In EMNLP.

C. Liu, Ryan Lowe, I. Serban, Michael Noseworthy, Laurent Charlin, and Joelle Pineau. 2016. How not to evaluate your dialogue system: An empirical study of unsupervised evaluation metrics for dialogue response generation. ArXiv, abs/1603.08023.

Ryan Lowe, Nissan Pow, I. Serban, and Joelle Pineau. 2015. The Ubuntu dialogue corpus: A large dataset for research in unstructured multi-turn dialogue systems. ArXiv, abs/1506.08909.
Yi Luan, Jacob Eisenstein, Kristina Toutanova, and M. Collins. 2020. Sparse, dense, and attentional representations for text retrieval. ArXiv, abs/2005.00181.

S. Robertson and H. Zaragoza. 2009. The probabilistic relevance framework: BM25 and beyond. Found. Trends Inf. Retr., 3:333–389.

Amir Vakili Tahami, Kamyar Ghajar, and A. Shakery. 2020. Distilling knowledge for fast retrieval-based chat-bots. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval.

Chongyang Tao, Lili Mou, Dongyan Zhao, and R. Yan. 2018. RUBER: An unsupervised method for automatic evaluation of open-domain dialog systems. In AAAI.

Chongyang Tao, W. Wu, Can Xu, Wenpeng Hu, Dongyan Zhao, and R. Yan. 2019a. One time of interaction may not be enough: Go deep with an interaction-over-interaction network for response selection in dialogues. In ACL.

Chongyang Tao, Wei Wu, Can Xu, Wenpeng Hu, Dongyan Zhao, and R. Yan. 2019b. Multi-representation fusion network for multi-turn response selection in retrieval-based chatbots. In Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining.

Jingdong Wang, T. Zhang, Jingkuan Song, N. Sebe, and H. Shen. 2018. A survey on learning to hash. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40:769–790.

Yida Wang, Pei Ke, Yinhe Zheng, Kaili Huang, Y. Jiang, X. Zhu, and Minlie Huang. 2020. A large-scale Chinese short-text conversation dataset. ArXiv, abs/2008.03946.

T. Whang, Dongyub Lee, Chanhee Lee, Kisu Yang, Dongsuk Oh, and Heuiseok Lim. 2019. Domain adaptive training BERT for response selection. ArXiv, abs/1908.04812.

Yu Wu, Wei Yu Wu, M. Zhou, and Zhoujun Li. 2017. Sequential match network: A new architecture for multi-turn response selection in retrieval-based chatbots. In ACL.

Lee Xiong, Chenyan Xiong, Ye Li, Kwok-Fung Tang, J. Liu, P. Bennett, Junaid Ahmed, and Arnold Overwijk. 2020. Approximate nearest neighbor negative contrastive learning for dense text retrieval. ArXiv, abs/2007.00808.

Jiaming Xu, P. Wang, Guanhua Tian, Bo Xu, Jun Zhao, Fangyuan Wang, and Hongwei Hao. 2015. Convolutional neural networks for text hashing. In IJCAI.

Chunyuan Yuan, W. Zhou, M. Li, Shangwen Lv, F. Zhu, Jizhong Han, and Songlin Hu. 2019. Multi-hop selector network for multi-turn response selection in retrieval-based chatbots. In EMNLP/IJCNLP.

Zhuosheng Zhang, Jiangtong Li, Pengfei Zhu, Zhao Hai, and G. Liu. 2018. Modeling multi-turn conversation with deep utterance aggregation. In COLING.

Xiangyang Zhou, L. Li, Daxiang Dong, Y. Liu, Ying Chen, Wayne Xin Zhao, D. Yu, and Hua Wu. 2018. Multi-turn response selection for chatbots with deep attention matching network. In ACL.

A Appendices
A.1 Computational Complexity of Search Operation
The computational complexities of the three coarse-grained selection methods are shown in Table 6. Although the BM25 method achieves the best computational complexity by using well designed data structures, such as the inverted index and the skip list, it cannot be accelerated by GPU devices. In real scenarios, with the help of the parallel computing provided by GPU devices, the DSHC method can achieve a much faster search speed.

It should be noted that lots of works have been proposed to optimize the computational complexity of computing the dot product and the Hamming distance, such as the product quantizer and the inverted index, and the computational complexities of the Dense and DSHC methods shown in Table 6 are the worst cases. In this paper, we do not leverage these techniques to search candidates for the Dense and DSHC methods. Brute-force search, i.e., a linear scan that directly examines all of the utterances in the pre-constructed database, is used to find the Top-K (20/100) candidates in the coarse-grained selection module.
Coarse-grained Selection                 Computational Complexity
BM25 (inverted index)                    O(log n)
Dense (dot product, brute-force)         O(d · n)
DSHC (Hamming distance, brute-force)     O(n)

Table 6: The computational complexity of the different coarse-grained selection methods. n is the number of utterances in the pre-constructed database; d is the dimension of the dense vectors.
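For reference, the brute-force Hamming scan used by DSHC can be run with FAISS's binary index as sketched below; the code sizes here are illustrative rather than the indexes used in the paper.

```python
import numpy as np
import faiss

h, n = 128, 1_000_000                           # bits per code, database size
codes = np.random.randint(0, 256, size=(n, h // 8), dtype=np.uint8)  # packed hash codes
queries = codes[:16]                            # a batch of 16 context codes

index = faiss.IndexBinaryFlat(h)                # exact linear-scan Hamming search
index.add(codes)

dist, ids = index.search(queries, 20)           # Top-20 candidates by Hamming distance
print(dist.shape, ids.shape)                    # (16, 20), (16, 20)
```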
A.2 Hyperparameters Analysis

In this section, we analyze the hyperparameter h, i.e., the dimension of the hash codes, in our proposed DSHC model. We separately test hash code dimensions of 16, 32, 48, 64, 128, 256, 512, and 1024. The results are shown in Table 7.

(a) Hyperparameters in E-Commerce Corpus.
Methods     Correlation-20  Correlation-100  Storage  Time Cost
BM25
DSHC-16                                      214 Kb
DSHC-32
DSHC-64
DSHC-128
DSHC-256
DSHC-512
DSHC-1024   0.9473          0.9134           14 Mb    19.4ms/18.4ms
Dense

(b) Hyperparameters in Douban Corpus.
Methods     Correlation-20  Correlation-100  Storage  Time Cost
BM25
DSHC-16
DSHC-32
DSHC-48
DSHC-64
DSHC-128
DSHC-256
DSHC-512
DSHC-1024
Dense

(c) Hyperparameters in Zh50w Corpus.
Methods     Correlation-20  Correlation-100  Storage  Time Cost
BM25
DSHC-16                                      760 Kb
DSHC-32
DSHC-64
DSHC-128
DSHC-256
DSHC-512
DSHC-1024   0.9546          0.9336           48 Mb    50.2ms/64.7ms
Dense

(d) Hyperparameters in LCCC Corpus.
Methods     Correlation-20  Correlation-100  Storage  Time Cost
BM25
DSHC-16
DSHC-32
DSHC-48
DSHC-64
DSHC-128
DSHC-256
DSHC-512
DSHC-1024
Dense

Table 7: The hyperparameter analysis on four datasets.
A.3 Case Study
The whole cases in the four datasets are shown in Table 8, and all of the utterances are translated from Chinese to English. More cases can be found at https://github.com/gmftbyGMFTBY/HashRetrieval.

A.4 Storage Occupation Visualization
The storage occupation of the pre-constructed database and the index is shown in Figure 3.
Figure 3: The index and pre-constructed database storage occupation on the four datasets: (a) BM25 storage, (b) Dense storage, (c) DSHC-128 storage. Each panel compares the database storage with the index storage (in Mb). It should be noted that the scale is nonlinear.

(a) Cases in E-Commerce Corpus.
Conversation Context Ground-truth Dense DSHC BM25
What about this delivery?It’s the slowest delivery ever Hello? I’m sorry that all the EMS items toGuangdong are a little slow recently. Hello, I’m sorry that the courier is off work here,now I will help you leave a message to urge the courier. This is the slowest estimate.If I buy it today, when will it be shipped?35 days after purchaseIt’s so late. Yes I’m really sorrythere are so many orders before new year. Shipped 35 days after purchase When did you buy it?Why didn’t you send it to me? The dried durian has just arrivedWe will send it to you today or tomorrow. Already sent I’m really sorryWe are temporarily out of stockit is estimated that the goodswont be sent until the day after tomorrow It’s out of the warehousewhy are you returning it?I will change the price for you.How long can the goods be delivered to Xinjiang? We send goods from Jiaxing, Zhejiang.Generally, it will cost 4-6 days after delivery.Please wait patiently. About 23 days 4,5 days after delivery I will change the price for you.After you buy it, please send it as soon as possible.Ok, please check the delivery address.It’s Right Okay, we will send it to youas soon as possible. Okay, we will send it to you as soon as possible. Okay, we will send it to you as soon as possible. Ok, please check the delivery address.Why did the laundry soap I boughtarrived in Changshaand transferred to other places? Hello, we will contact the courier tohelp you verify the situation. Hello, we will help you contact the courier. Hello, there should be a problem with the transit,please contact the courier. It’s impossible.What’s the quality of walnuts? The quality is very good. Pretty good. Very good, very creamy. I’m really sorry for the large order quantityI have kept you waitingIs there any discount on this product recently?There will be a discounton the 16th of this month.You can focus it.Okay Yes. Please pay attention to this discount activity. Please pay attention to this discount activity. There will be a discount on the 16th of this month.Please send me some snacks to taste Okay, lol Ok Okay You are a foodie lolWhich kind of courier companyis used to deliver the goods?We use “Tian Tian express” by defaultPlease use EMS for deliveryOkayI have already paid Okay Now, please check the delivery addressWe can send it to you today Now, please check the delivery addressWe can send it to you today We send goods by EMS (b) CasesinDoubanCorpus.
Conversation Context Ground-truth Dense DSHC BM25
I would like to ask where thereare custom-made clothes in Zhengzhou.The leather market opposite the Chengdong Road Art Museummakes custom-made leather goods.Wow, I live nearby and I don’t knowHey, we live so close How close is it? Where do you live? Lol, it’s really close I’m from ZhengzhouEvery day I shit smoothly, but I still get hemorrhoidsSo do I.Take good care of stomach I also have a bad stomachso I don’t eat chili recently You have to take care of your stomach. I have been drinking red bean and barley soup for a month,and I feel better. So do IIt’s said that many companies don’t have year-end awards.Are you talking about my company?I didn’t mean it No, you did it on purpose. I mean mine, not yours, don’t be angry Alas, I’m mad at talking about benefitsOur year-end bonus is too small You are insatiableLol, my sister has passed the examWhich school does your sister go to? Dalian University of Technologyhope not far from you Ludong University Ludong University My sister went to school in 2009I think hairy crabs are the most delicious.What is it? I have never heard of it.You must have eaten it. Maybe, I didn’t recognize it when I ate I ate once a long time agoand never eaten again I have no impression at alleither it’s too long or the taste is not good enough Hairy crabs in the fish marketAfter reading your post, I went to sign up.Did you sign up successfully?YesWhat position did you sign up for? I signed up for a national tax civil servantbut the location is uncertainhow about you? I have already signed up in the first half of the yearbut I haven’t started to sign up yet I signed up for an oilfield civil service institution Have you signed up (c) Cases in Zh50w Corpus.
Conversation Context Ground-truth Dense DSHC BM25
Health is very important,I exercise as much as possible.My leg hurts and I can’t exercise recently.Your leg injury has not recovered, so take care of it slowly!It has been almost three months.Take good care and don’t leave any sequelae! Thank you for your concern! Hmm, I recently went tothe hospital for a review. Well, I have been practicinginterval exercise recently. Well, a natural birth willhelp you recover quicklyWhat is the eighth picture?I haven’t seen it in the north of China.The flower is golden triangle plum with white edges. Thank you very much Thanks! It’s so beautiful. Thanks! It’s so beautiful. I know. Where is this flowerGood morning!Good morning, thank youLol, Why are you so polite? It’s not polite.I want to respect you I’m just kidding.I don’t use twitter very often.Let’s have a chat at WeChat. Lol, It’s not polite. Good morning What do you want to doYour glasses look good!Oh? How about me You look as good as glasses You look better than glasses! You look better than glasses! Your glasses are very artisticSister, why are you still up?I go to bed late every day Yes, many of your Weibo posts are posted early in the morning I have to sleep Insomnia is painful, sister.I didn’t expect you to sleep so late.if I knew I would chat with you Good morning sister!My friend sent me some mytilus edulis.Does anyone know how to cook it?Mytilus edulis, also known as oyster,can be cooked with soup, porridge, braised mushrooms. Thank you very much Ok, thank you. I’ll try it Okay, thank you for sharing Cut it into thin slicesand put it in the soup (d) CasesinLCCCCorpus.
Conversation Context Ground-truth Dense DSHC BM25