Graph-Evolving Meta-Learning for Low-Resource Medical Dialogue Generation
Shuai Lin, Pan Zhou, Xiaodan Liang, Jianheng Tang, Ruihui Zhao, Ziliang Chen, Liang Lin
Sun Yat-sen University, Salesforce Research, DarkMatter AI Inc., Tencent Jarvis Lab
{shuailin97, xdliang328, sqrt3tjh}@gmail.com, [email protected], [email protected], [email protected], [email protected]

Abstract
Human doctors with well-structured medical knowledge can diagnose a disease merely via a few conversations with patients about symptoms. In contrast, existing knowledge-grounded dialogue systems often require a large number of dialogue instances to learn, as they fail to capture the correlations between different diseases and neglect the diagnostic experience shared among them. To address this issue, we propose a more natural and practical paradigm, i.e., low-resource medical dialogue generation, which transfers the diagnostic experience from source diseases to target ones with only a handful of data for adaptation. It is capitalized on a commonsense knowledge graph to characterize the prior disease-symptom relations. Besides, we develop a Graph-Evolving Meta-Learning (GEML) framework that learns to evolve the commonsense graph for reasoning about disease-symptom correlations in a new disease, which effectively alleviates the need for a large number of dialogues. More importantly, by dynamically evolving disease-symptom graphs, GEML also addresses the real-world challenge that the disease-symptom correlations of each disease may vary or evolve along with more diagnostic cases. Extensive experimental results on the CMDD dataset and our newly collected Chunyu dataset testify to the superiority of our approach over state-of-the-art approaches. Besides, our GEML can generate an enriched dialogue-sensitive knowledge graph in an online manner, which could benefit other tasks grounded on knowledge graphs.
Medical dialogue system (MDS) aims to converse with patients to inquire about additional symptoms beyond their self-reports and make a diagnosis automatically, and has gained increasing attention (Lin et al. 2019; Wei et al. 2018; Xu et al. 2019). It has significant potential to simplify the diagnostic process and reduce the cost of collecting information from patients (Kao, Tang, and Chang 2018). Moreover, preliminary diagnosis reports generated by MDS may assist doctors to make a diagnosis more efficiently. Because of these considerable benefits, many researchers have devoted substantial effort to critical sub-problems in MDS, such as natural language understanding (Shi et al. 2020; Lin et al. 2019),
Figure 1: Statistics of 15 diseases in our newly collected Chunyu dataset from the real world. One can observe the notable data-imbalance phenomenon across diseases; it is thus highly desirable to study how to transfer the diagnostic experience among diseases.

dialogue policy learning (Wei et al. 2018), and dialogue management (Xu et al. 2019), and have made promising progress toward building a satisfactory MDS.

Medical dialogue generation (MDG), which generates responses in natural language to request additional symptoms or make a diagnosis, is critical in MDS but rarely studied. Conventional generative dialogue models often employ neural sequence modeling (Sutskever, Vinyals, and Le 2014; Vaswani et al. 2017) and cannot be applied to the medical dialogue scenario directly in the absence of medical knowledge. Recently, large-scale pre-trained language models (Devlin et al. 2018; Radford et al. 2019; Song et al. 2019) over unsupervised corpora have achieved significant success. However, fine-tuning such large language models in the medical domain requires sufficient task-specific data (Bansal, Jha, and McCallum 2019; Dou, Yu, and Anastasopoulos 2019) so as to learn the correlations between diseases and symptoms. Unfortunately, as depicted in Fig. 1, a large portion of diseases have only a few instances in practice, which means that newly arriving diseases in realistic diagnosis scenarios are often under low-resource conditions. Therefore, it is highly desirable to transfer the diagnostic experience from high-resource diseases to those suffering data scarcity. Besides, existing knowledge-grounded approaches (Liu et al. 2018; Lian et al. 2019) may fail to perform such transfer well, as they learn only one unified model for all diseases and ignore the specificity and relationships of different diseases. Finally, in practice, the disease-symptom relations of each disease may vary or evolve along with more cases, which is also not considered in prior works.

Contributions.
To address the above challenges, we first propose an end-to-end dialogue system for low-resource medical dialogue generation. This model integrates three components seamlessly: a hierarchical context encoder, a meta-knowledge graph reasoning (MGR) network, and a graph-guided response generator. Among them, the context encoder encodes the conversation into hierarchical representations. MGR mainly contains a parameterized meta-knowledge graph, which is initialized by a prior commonsense graph and characterizes the correlations among diseases and symptoms. When fed the context information, MGR can adaptively evolve its meta-knowledge graph to reason about disease-symptom correlations and then predict related symptoms of the patient in the next response to further determine the disease. Finally, the response generator produces a response for the symptom request under the guidance of the meta-knowledge graph.

The second contribution is a novel Graph-Evolving Meta-Learning (GEML) framework to transfer the diagnostic experience in the low-resource scenario. Firstly, GEML trains the above medical dialogue model under the meta-learning framework. It regards generating responses to a handful of dialogues as a task and learns a meta-initialization for the dialogue model that can fast adapt to each task of a new disease with limited dialogues. In this way, the learnt initialization contains sufficient meta-knowledge from all source diseases and serves as a good starting point to quickly transfer meta-knowledge to a new disease. More importantly, GEML also learns a good parameterized meta-knowledge graph in the MGR module to characterize the disease-symptom relationships of the source diseases. Concretely, under the meta-learning framework, for each disease GEML enriches the meta-knowledge graph by constructing a global-symptom graph from the online dialogue examples.
In this way, the learnt meta-knowledge graph can bridge the gap between the commonsense medical graph and real diagnostic dialogues and thus can be quickly evolved for a new target disease. Thanks to graph evolving, the dialogue model can query patients about underlying symptoms more efficiently and thus improve diagnostic accuracy. Besides, GEML also addresses the real-world challenge that disease-symptom correlations could vary along with more cases, since the meta-knowledge graph is trainable based on the collected dialogue examples.

Finally, we construct a large medical dialogue dataset, called Chunyu. It covers 15 kinds of diseases and 12,842 dialogue examples in total, and is much larger than the existing CMDD medical dialogue dataset (Lin et al. 2019). This more challenging benchmark can more comprehensively evaluate the performance of medical dialogue systems. Extensive experimental results on both datasets demonstrate the superiority of our method over the state of the art.

We name such knowledge "meta-knowledge" since it is obtained through meta-training from different source diseases. Code and dataset are released at https://github.com/ha-lins/GEML-MDG.
Medical Dialogue System (MDS).
Recent research on MDS mostly focuses on natural language understanding (NLU) or dialogue management (DM) in the line of pipeline-based dialogue systems. Various NLU problems have been studied to improve MDS performance, e.g., entity inference (Du et al. 2019b; Lin et al. 2019; Liu et al. 2020), symptom extraction (Du et al. 2019a) and slot-filling (Shi et al. 2020). For medical dialogue management, most works (Dhingra et al. 2017; Li et al. 2017) focus on reinforcement learning (RL) based task-oriented dialogue systems. Wei et al. (2018) proposed to learn dialogue policy with RL to facilitate automatic diagnosis. Xu et al. (2019) incorporated knowledge inference into dialogue management via RL. However, little attention has been paid to medical dialogue generation, which is a critical recipe in MDS. Differing from existing approaches, we investigate building an end-to-end graph-guided medical dialogue generation model directly.
Knowledge-grounded Dialogue Generation.
Recently, dialogue generation grounded on extra knowledge is emerging as an important step towards human-like conversational AI, where the knowledge could be derived from open-domain knowledge graphs (Zhou et al. 2018; Zhang et al. 2020; Moon et al. 2019) or retrieved from unstructured documents (Lian et al. 2019; Zhao et al. 2019; Kim, Ahn, and Kim 2020). Different from them, our MDG model is built on a dedicated medical-domain knowledge graph and further requires evolving it to satisfy the needs of real-world diagnosis.
Meta-Learning.
By meta-training a model initialization from training tasks with the ability of fast adaptation to new tasks, meta-learning (Finn, Abbeel, and Levine 2017; Zhou et al. 2019, 2020) has achieved promising results in many NLP areas, such as machine translation (Gu et al. 2018), task-oriented dialogues (Qian and Yu 2019; Mi et al. 2019), and text classification (2019; 2019). However, little effort has been devoted to applying meta-learning to MDS, which requires grounding on external medical knowledge and reasoning about disease-symptom correlations. In this work, we employ Reptile (Nichol, Achiam, and Schulman 2018), a first-order model-agnostic meta-learning approach, because of its efficiency and effectiveness, and enhance it with meta-knowledge graph reasoning and evolving.
Grounded on the external medical knowledge graph A, a medical dialogue generation model takes the dialogue context U = {u_1, ..., u_{t−1}} as input and aims to (1) generate the next response R = u_t and (2) predict the disease or symptom entity E = e_t appearing in the next response:

f_θ(R, E | U, A) = p(u_t, e_t | u_{<t}, A; θ).  (1)

Given abundant dialogue examples of K different source diseases S_k, the task of low-resource MDG requires obtaining a good model initialization during the meta-training process:

θ_meta : (U, A) × S_k → (R_source, E).  (2)

For the adaptation to a new target disease T, we fine-tune the model θ_meta with minimal dialogue examples (e.g., 1%∼10% of a source disease) and require the induced model θ_target to perform well on the target disease:

θ_target : (U, A) × T → (R_target, E).  (3)

In this section, we elaborate our end-to-end dialogue model, whose framework is illustrated in Fig. 2. The proposed approach integrates three components seamlessly: a hierarchical context encoder, a meta-knowledge graph reasoning (MGR) network, and a graph-guided response generator. Concretely, the context encoder first encodes the conversation history into hierarchical context representations. Then MGR incorporates the obtained representations into the knowledge graph reasoning process for comprehension of the disease-symptom correlations. Finally, the graph-guided decoder generates informative responses via a well-designed copy mechanism over graph entity nodes. We will introduce them in turn.
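Before detailing the components, the two-level encoding of the context encoder (a token-level LSTM per utterance, followed by an utterance-level LSTM over the dialogue, formalized later in Eqs. (4)-(5)) can be sketched as below. This is a toy NumPy sketch with random weights, not the authors' implementation; all names and dimensions are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def lstm_step(x, h, c, W, U, b):
    """One LSTM cell step; gate pre-activations stacked as [input, forget, output, cell]."""
    H = h.size
    z = W @ x + U @ h + b
    i = 1.0 / (1.0 + np.exp(-z[:H]))           # input gate
    f = 1.0 / (1.0 + np.exp(-z[H:2 * H]))      # forget gate
    o = 1.0 / (1.0 + np.exp(-z[2 * H:3 * H]))  # output gate
    g = np.tanh(z[3 * H:])                     # candidate cell state
    c = f * c + i * g
    return o * np.tanh(c), c

def run_lstm(xs, params, H):
    """Encode a sequence of vectors into the final hidden state."""
    h, c = np.zeros(H), np.zeros(H)
    for x in xs:
        h, c = lstm_step(x, h, c, *params)
    return h

def make_params(D, H):
    return (rng.normal(0, 0.1, (4 * H, D)),
            rng.normal(0, 0.1, (4 * H, H)),
            np.zeros(4 * H))

D, H = 8, 6                        # toy embedding / hidden sizes
tok_params = make_params(D, H)     # utterance-level LSTM (theta_u)
dial_params = make_params(H, H)    # dialogue-level LSTM (theta_d)

# A dialogue of 3 utterances, each a list of 5 token embeddings.
dialogue = [[rng.normal(size=D) for _ in range(5)] for _ in range(3)]
h_utts = [run_lstm(u, tok_params, H) for u in dialogue]  # utterance representations
h_dial = run_lstm(h_utts, dial_params, H)                # dialogue representation
print(h_dial.shape)  # (6,)
```

The utterance vectors `h_utts` play the role of the utterance node features fed to the graph, and `h_dial` initializes the decoder state.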
We first utilize a hierarchical context encoder (Serban et al. 2016) to encode the dialogue history and obtain hierarchical hidden representations of the context. Formally, given a dialogue context U = (u_1, ..., u_l), the hierarchical context encoder first exploits a long short-term memory (LSTM) network to encode each utterance into a hidden representation:

h_i^u = LSTM_{θ_u}(e_{i1}, ..., e_{ij}, ..., e_{il_i}),  (4)

where e_{ij} is the embedding of the j-th token in the i-th utterance. These hidden representations {h_i^u, i = 1, ..., l} of the utterances are then fed into another LSTM to obtain the representation of the entire dialogue history:

h^dial = LSTM_{θ_d}(h_1^u, ..., h_j^u, ..., h_l^u).  (5)

After obtaining utterance-level and dialogue-level representations, as shown in Fig. 2, we use h_i^u to initialize the utterance node features of the knowledge graph in Sec. 4.2, and adopt h^dial as the initial state of the decoder LSTM in Sec. 4.3.

Based on the obtained utterance representations h_i^u, we need to learn the disease-symptom correlations and further ask the patient about the existence of related symptoms to verify them. To this end, we devise a meta-knowledge graph reasoning (MGR) network to learn and reason about these correlations. In practice, one often has a prior commonsense disease-symptom graph that roughly contains such correlations, e.g., cold being indicated by the symptom cough. Our MGR aims to (1) reason about the correlations over diseases and symptoms through conversations with patients, (2) predict possible symptoms in the next inquiry/response to the patient, and (3) evolve this commonsense disease-symptom graph into a meta-knowledge graph with the graph-evolving meta-learning (GEML) framework. Here we focus on the first two points and present our GEML in Sec. 5.

In practice, the commonsense disease-symptom graph can be derived from the Chinese Symptom Library in OpenKG. The library contains a huge number of triples, e.g.,
(Diarrhea, related symptom, Gastroenteritis). Formally, we denote the commonsense graph as G = (V^e, A, X), where V^e = {v_1^e, ..., v_m^e} is the set of entity nodes, A is the corresponding adjacency matrix, and X ∈ R^{|V^e|×F} is the node feature matrix (F is the number of features per node). In the graph G, each entity node v_i^e ∈ V^e denotes a symptom or disease. The feature vector of each entity node, i.e., each row of the feature matrix X, is trainable. Besides, we have an utterance node set denoted as V^u = {v_1^u, ..., v_l^u}, where the input feature of each utterance node v_i^u is initialized by the representation h_i^u obtained in Eqn. (4). To incorporate the context information into the knowledge graph reasoning, we connect each utterance node with all entity nodes it includes.

Now we introduce the graph reasoning process over diseases and symptoms. To boost information propagation among entity nodes, we build a meta-knowledge graph where each entity node indicates a disease or symptom. Inspired by the graph attention network (Veličković et al. 2018), we devise the meta-knowledge graph reasoning (MGR) network, which consists of two graph reasoning layers. In the first layer, entity nodes that occur in the dialogue history are activated by aggregating information from their corresponding utterance nodes. In the second layer, these activated entity nodes diffuse information to their neighborhood nodes for correlation reasoning. Next we present the single graph reasoning layer used to construct MGR (by stacking this layer). Let N_i be the neighbor set of node i according to the adjacency matrix A.
With the input features h_j^e of the neighborhood nodes j ∈ N_i, the graph reasoning layer updates the representation of node i as:

h_i^e = σ(Σ_{j∈N_i} α_ij W h_j^e),  α_ij = softmax_j(e_ij) = exp(e_ij) / Σ_{k∈N_i} exp(e_ik),  (6)

where W ∈ R^{F×F} is a weight matrix and e_ij is the attention coefficient that indicates the importance of entity node j to node i. Following (Bahdanau, Cho, and Bengio 2014), the attention coefficient e_ij is computed as

e_ij = Sigmoid(a^T W [h_i^e || h_j^e]),  (7)

where a ∈ R^H is a trainable vector, W ∈ R^{H×F} is a weight matrix, and || indicates concatenation. Note that we inject the graph structure (i.e., the adjacency matrix A) into the graph reasoning layer, as we only compute e_ij for the neighborhood nodes j of i. In Sec. 5.2, we elaborate how to evolve the meta-knowledge graph structure in a meta-learning paradigm. By stacking two graph reasoning layers, each entity node can gather enough information from the other related nodes. As shown in Fig. 2, we then feed the final entity node representations {h_i^e, i = 1, ..., m} into the response generator to infer possible entities in the next-turn response. To this end, we introduce the entity prediction task beyond response generation.

(OpenKG is a Chinese open knowledge graph project; the symptom library is available at http://openkg.cn/dataset/symptom-in-chinese.)
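The attention-based reasoning of Eqs. (6)-(7) can be sketched as follows. This is a simplified NumPy rendition of the GAT-style layer; the exact weight shapes and the parameterization of the attention score are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def graph_reasoning_layer(Hfeat, A, W, Wc, a):
    """One attention-based reasoning layer over the graph.

    Hfeat: (n, F) node features; A: (n, n) 0/1 adjacency with self-loops.
    W: (F, F) aggregation weights; Wc: (H, 2F) and a: (H,) score the
    concatenated node pair (these shapes are an assumption for this sketch)."""
    n, F = Hfeat.shape
    out = np.zeros((n, W.shape[0]))
    for i in range(n):
        nbrs = np.flatnonzero(A[i])
        if nbrs.size == 0:
            continue
        # attention coefficient e_ij for each neighbor j of node i
        e = np.array([sigmoid(a @ (Wc @ np.concatenate([Hfeat[i], Hfeat[j]])))
                      for j in nbrs])
        alpha = np.exp(e) / np.exp(e).sum()          # softmax over the neighborhood
        out[i] = sigmoid((alpha[:, None] * (Hfeat[nbrs] @ W.T)).sum(axis=0))
    return out

n, F, H = 5, 4, 3
A = np.eye(n, dtype=int)
A[0, 1] = A[1, 0] = A[1, 2] = A[2, 1] = 1            # a small chain plus self-loops
X = rng.normal(size=(n, F))
W = rng.normal(0, 0.3, (F, F))
Wc, a = rng.normal(0, 0.3, (H, 2 * F)), rng.normal(0, 0.3, H)

# Two stacked layers: activation from utterance context, then diffusion to neighbors.
h1 = graph_reasoning_layer(X, A, W, Wc, a)
h2 = graph_reasoning_layer(h1, A, W, Wc, a)
print(h2.shape)  # (5, 4)
```

Masking the score computation to the neighbor set `nbrs` is what injects the (evolving) adjacency matrix into the reasoning.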
Figure 2: Framework overview. Upper: The overview of the proposed GEML-MGR for low-resource medical dialogue generation. GEML-MGR first goes through the meta-training phase to learn and evolve a meta-knowledge graph and then adapts to new target diseases. Lower: The architecture of the end-to-end medical dialogue model, which integrates three components seamlessly: hierarchical context encoder, meta-knowledge graph reasoning, and graph-guided response generator.

Concretely, we feed the final node representations {h_i^e, i = 1, ..., m} into a feed-forward layer and predict possible entities in the next response via binary classification over all graph entity nodes. In this way, our MGR network can mine and reason about the disease-symptom correlations, and thus predict underlying entities in the next response to diagnose more accurately.

To incorporate the knowledge graph into generation, we devise a graph-guided response generator with a copy mechanism adapted from (See, Liu, and Manning 2017). The main modification is that we apply the copy mechanism over the graph node distribution instead of the input source. More concretely, under the guidance of the entity node representations {h_i^e, i = 1, ..., m}, the decoder generates each word at time step t by sampling from the vocabulary or copying directly from the graph entity node set E:

P_out^(t) = g_t · P_V^(t) + (1 − g_t) · P_E^(t),  (8)

where P_V^(t) is the normal vocabulary distribution from the decoder LSTM and P_E^(t) is the attention distribution over graph entity nodes. The soft switch g_t ∈ [0, 1] that chooses between sampling and copying is calculated from the decoder input x_t and the decoder state s_t as:

g_t = σ(W · [x_t; s_t; h_t^a])  with  h_t^a = Σ_i α_i^e · h_i^e,  (9)

where W is a trainable matrix and σ is the Sigmoid function. The aggregation vector h_t^a is computed as the weighted sum over the node representations h_i^e, and α_i^e is the attention weight calculated as in (Bahdanau, Cho, and Bengio 2014). With the above graph-guided copy mechanism, the response generator achieves more accurate symptom requests and disease diagnosis.
In this section, we present the Graph-Evolving Meta-Learning (GEML) framework, which helps the above end-to-end medical dialogue model handle the low-resource setting. This setting is more practical and challenging since many diseases in the real world are rare and costly to annotate, as mentioned in Sec. 1. To address this challenge, GEML uses meta-knowledge transfer and meta-knowledge graph evolving to transfer the diagnostic experience across different diseases. We introduce them in turn.
The methodology of meta-knowledge transfer is to meta-train an end-to-end medical dialogue model f_{θ_meta}, parameterized by θ_meta, with a fast adaptation capacity to new diseases with only limited data. To this end, we follow the meta-learning framework and use the existing dialogue data of the K source diseases to create a task set T = {{T_i^1}_{i=1}^{N_1}, ..., {T_i^K}_{i=1}^{N_K}}, where each task T_i^k represents generating responses to a handful of dialogues of the k-th disease. Each task T_i ∈ T has only a few dialogue samples, which are further split into a training (support) set D_tr^{T_i} and a validation (query) set D_va^{T_i}. In the meta-training stage, given a model initialization θ_meta, we require that θ_meta can fast adapt to any task T_i ∈ T through one gradient update:

θ_i = θ_meta − β ∇_θ L_{D_tr^{T_i}}(f_{θ_meta}),  (10)

where L_{D_tr^{T_i}} is the training loss function of task T_i and β denotes a learning rate. To measure the quality of the adapted parameters θ_i, MAML (Finn, Abbeel, and Levine 2017), an optimization-based meta-learning approach, requires θ_i to have a small validation loss on the validation set D_va^{T_i}. In this way, it can compute the gradient of the validation loss and update the initialization θ_meta as

θ_meta = θ_meta − γ ∇ L_{D_va^{T_i}}(θ_meta − β ∇_θ L_{D_tr^{T_i}}(f_{θ_meta})),  (11)

where γ is a step size. To alleviate the computational cost of the second-order gradient (i.e., the Hessian matrix) in Eqn. (11), Reptile (Nichol, Achiam, and Schulman 2018) approximates the second derivatives of the validation loss as

θ_meta ← θ_meta + (γ / |{T_i}|) Σ_{T_i ∼ p(T)} (θ_i − θ_meta).  (12)

In this work, we use Reptile to update the initialization θ_meta because of its effectiveness and efficiency.
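The Reptile update of Eqs. (10)-(12) can be sketched as follows, with toy quadratic task losses standing in for the dialogue losses L_{D_tr^{T_i}}; hyper-parameter values and function names are illustrative, not the paper's settings.

```python
import numpy as np

def reptile(theta, tasks, inner_steps=3, beta=0.1, gamma=0.5, outer_steps=50):
    """Reptile outer loop: nudge theta toward each task-adapted parameter (Eq. 12)."""
    for _ in range(outer_steps):
        deltas = []
        for target in tasks:
            # inner loop (Eq. 10): a few SGD steps on the task's toy loss
            # 0.5 * ||phi - target||^2, whose gradient is (phi - target)
            phi = theta.copy()
            for _ in range(inner_steps):
                phi -= beta * (phi - target)
            deltas.append(phi - theta)
        theta = theta + gamma * np.mean(deltas, axis=0)  # first-order meta-update
    return theta

# Three toy "diseases", each with its own optimum standing in for a task loss.
tasks = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([1.0, 1.0])]
theta_meta = reptile(np.zeros(2), tasks)
print(np.round(theta_meta, 2))  # [0.67 0.67], near the centroid of the task optima
```

The learnt initialization sits close to all task optima at once, so a few inner gradient steps suffice to specialize it to any one task, which is exactly the fast-adaptation property exploited for new diseases.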
After obtaining the initialization θ_meta, given a new target disease with only a few training data D_tr, we can quickly adapt the model with initialization θ_meta to the disease via a few gradient steps to obtain the disease-adapted parameters. This fast adaptation ability comes from the fact that, in the meta-training phase, we have already simulated fast learning of a new disease via a few steps of gradient descent on few validation data.

Note that this meta-knowledge transfer only considers fast adaptation in terms of model parameters and ignores the sparsity of the commonsense graph. To address this problem, we devise a graph-evolving approach that evolves the commonsense graph such that it can be tailored to the current disease and better integrated with the dialogue instances.

Since the commonsense graph is sparse and does not cover enough symptom entities, there is a gap between this prior graph and the real dialogue examples. For instance, "dysbacteriosis" may appear in a patient consultation while it does not exist in the commonsense graph, since it is comparatively rare. To address this challenge, we propose to evolve the commonsense graph capitalized on the dialogue instances and learn the induced meta-knowledge graph during the meta-training and adaptation phases. Inspired by Lin et al. (2019), which shows that related symptom entities have a certain probability of co-occurrence in the same dialogue, we construct a global-symptom graph G* = (V*, A*, X*), where V* = {v_1, ..., v_n} is the set of nodes, A* is the corresponding adjacency matrix, and X* ∈ R^{|V*|×N} is the node feature matrix. Concretely, the proposed approach first collects all observed dialogue examples in an online manner. Then, if two entities co-occur in a dialogue example, there is an edge between the two nodes in A*.
The meta-knowledge graph is initialized with the adjacency matrix A of the prior commonsense graph and updated as:

A_meta = A ⊕ A*,  (13)

where ⊕ denotes the element-wise logical OR operator. In this way, updating the adjacency matrix A_meta can reason about the existence of edges among entity nodes.
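The evolving step of Eq. (13) amounts to OR-ing the prior adjacency with a co-occurrence adjacency built online from observed dialogues; a minimal sketch with hypothetical entities and edges (the entity names and the prior edge below are invented for illustration):

```python
import numpy as np

def cooccurrence_adjacency(dialogues, entity_ids):
    """Global-symptom graph A*: connect entities that co-occur in one dialogue."""
    n = len(entity_ids)
    A_star = np.zeros((n, n), dtype=bool)
    for entities in dialogues:
        idx = [entity_ids[e] for e in entities if e in entity_ids]
        for i in idx:
            for j in idx:
                if i != j:
                    A_star[i, j] = True
    return A_star

# Hypothetical prior commonsense graph over four entities (one known edge).
entity_ids = {"ileus": 0, "stomachache": 1, "borborygmus": 2, "dysbacteriosis": 3}
A = np.zeros((4, 4), dtype=bool)
A[0, 1] = A[1, 0] = True                  # ileus -- stomachache

# Dialogues collected online, reduced to the sets of entities they mention.
dialogues = [{"ileus", "borborygmus"}, {"stomachache", "dysbacteriosis"}]
A_meta = A | cooccurrence_adjacency(dialogues, entity_ids)  # Eq. (13): element-wise OR
print(A_meta[0, 2], A_meta[1, 3])  # True True
```

Because the OR only adds edges, the prior commonsense structure is preserved while rare entities such as "dysbacteriosis" become reachable once they co-occur with known ones.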
Table 1: Statistics of the CMDD dataset (Lin et al. 2019) and our Chunyu dataset.

The structure of A* is dynamically evolved along with more dialogue cases, which leads to the synchronous enrichment of the meta-knowledge graph, i.e., adding more nodes and edges. The above approach for graph structure evolving can infer the existence of disease-symptom correlations while ignoring their intensity. To characterize such relations more delicately, GEML further learns the weight values of the meta-knowledge graph A_meta with Eqn. (6) during the meta-training and adaptation phases (but not during testing). Finally, GEML utilizes the cross-entropy loss of the entity prediction task (Sec. 4.2), denoted L_e, to guide the learning of A_meta efficiently.

In this section, we introduce the loss function for each task, i.e., L_{D_tr^{T_i}} in Eqn. (10), in detail. The generation loss is the negative log-likelihood of generating the response R = {r_1, ..., r_m} given the input dialogue context U = {u_1, ..., u_n}:

L_g = −(1/|R|) Σ_{i=1}^{|R|} log p(r_i | U; θ_meta).  (14)

The final training objective couples L_g with L_e:

L = L_g + λ L_e,  (15)

where the constant λ balances the generation loss L_g and the entity prediction loss L_e of Sec. 5.2.

We conduct extensive experiments on the CMDD dataset (Lin et al. 2019) and the newly collected Chunyu dataset to demonstrate the benefits of GEML.
Datasets.
The CMDD dataset (Lin et al. 2019) has 2,067 conversations in total covering 4 pediatric diseases with approximately equal counts, and thus neglects the data-imbalance problem among diseases. To pose this challenge, we collect a much larger medical dialogue dataset, namely Chunyu, which contains 15 diseases with distinctly different data ratios. As depicted in Fig. 1, the counts of the diseases in Chunyu vary significantly, and we can thus treat four low-resource diseases as target ones. The data statistics of the two datasets are given in Table 1. The raw data of Chunyu is obtained from the Gastroenterology department of the Chinese online health community
Chunyu. It contains 15 gastrointestinal diseases and 62 symptoms in total. We use hand-crafted rules provided by doctors to label entities for each instance. Instances with very few turns or entities, or containing private information, have been discarded.

Table 2: Results on the two datasets in terms of automatic metrics and human evaluation (on a 5-point scale). Top: For the CMDD dataset, target diseases 1 to 4 refer to "bronchitis", "functional dyspepsia", "infantile diarrhea" and "upper respiratory infection", respectively. Bottom: For the Chunyu dataset, target diseases 1 to 4 refer to "liver cirrhosis", "ileus", "pneumonia", and "pancreatitis". Each disease column reports BLEU / Entity-F1; human evaluation reports Knowledge Rationality and Generation Quality.

CMDD:
| Method | Disease 1 | Disease 2 | Disease 3 | Disease 4 | Average | Rationality | Quality |
|---|---|---|---|---|---|---|---|
| PT-NKD (2018) | 39.16 / 25.71 | 30.1 / 16.39 | 32.02 / 12.5 | 30.64 / 22.22 | 32.98 / 19.21 | 2.38 | 2.5 |
| PT-POKS (2019) | 41.51 / 33.33 | 12.7 / 35.88 | 33.27 / 22.53 | 31.06 / 27.45 | 29.63 / 29.79 | 2.87 | 2.92 |
| PT-MGR | 43.96 / 36.36 | 31.31 / 18.19 | 37.78 / 23.89 | 31.95 / 28.33 | 36.25 / 26.69 | 3.26 | 3.28 |
| FT-NKD | 42.72 / 28.57 | 30.32 / 31.74 | 34.67 / 21.69 | 32.45 / 31.58 | 35.04 / 28.4 | 3.13 | 3.42 |
| FT-POKS | 41.8 / 42.78 | 32.25 / 37.15 | 35.36 / 25 | 32.56 / 25.97 | 35.49 / 32.72 | 3.06 | 2.94 |
| FT-MGR | 45.23 / 45.78 | 36.18 / 38.81 | 34.59 / 23.08 | 33.07 / 29.36 | 37.27 / 34.26 | 3.39 | 3.56 |
| Meta-NKD | 41.23 / 42.5 | 32.5 / 32.27 | 34.61 / 24.17 | 32.28 / 30.17 | 35.16 / 32.29 | 3.11 | 3.39 |
| Meta-POKS | 40.64 / 35.14 | 34.25 / 37.14 | 36.84 / 27.69 | 33.7 / 28.25 | 36.35 / 32.06 | 3.31 | 3.3 |
| Meta-MGR | 45.78 / 40.03 | | | | | | |

Chunyu:
| Method | Disease 1 | Disease 2 | Disease 3 | Disease 4 | Average | Rationality | Quality |
|---|---|---|---|---|---|---|---|
| PT-NKD (2018) | 16.01 / 4.54 | 20.75 / 18.75 | 13.17 / 11.27 | 17.45 / 17.54 | 16.84 / 13.03 | 2.78 | 3.26 |
| PT-POKS (2019) | 15.24 / 10.13 | 21.34 / 18.46 | 15.25 / 12.99 | 18.9 / 21.65 | 17.68 / 15.82 | 2.81 | 2.93 |
| PT-MGR | 15.5 / 7.5 | 25.42 / 21.91 | 18.13 / 13.95 | 19.46 / 29.76 | 19.62 / 25.63 | 3.11 | 3.29 |
| FT-NKD | 16.35 / 4.87 | 19.52 / 20.68 | 15.28 / 18.18 | 18.49 / 26.74 | 17.41 / 17.62 | 2.89 | 3.37 |
| FT-POKS | 14.46 / 17.24 | 21.63 / 32.16 | 16.45 / 23.08 | 18.18 / 27.32 | 17.68 / 24.95 | 3.12 | 2.96 |
| FT-MGR | 18.38 / 28.57 | 25.61 / 38.88 | 19.53 / 22.27 | 20.41 / 30.15 | 20.98 / 29.97 | 3.29 | 3.17 |
| Meta-NKD | 17.28 / 33.96 | 22.20 / 41.31 | 17.54 / 18.2 | 22.71 / 32.39 | 19.93 / 31.47 | 3.18 | 3.34 |
| Meta-POKS | 17.87 / 23.18 | 24.76 / 42.86 | 16.46 / 23.35 | 16.71 / 22.22 | 18.96 / 27.9 | 3.12 | 3.19 |
| Meta-MGR | 19.65 / 32.12 | 26.43 / 42.35 | | | | | |
Experimental Settings.
To apply meta-learning, we consider generating responses to a handful of dialogues of one disease as a task. For Chunyu, as shown in Fig. 1, high-resource diseases with more than 500 training instances are treated as source diseases and the remaining four low-resource ones as target diseases, whose adaptation data sizes start from as few as 80 dialogues. For CMDD, we follow a leave-one-out setup, i.e., using the other diseases for meta-training and the one target disease left for adaptation (with a data size of 150 dialogues).

All experiments are based on the AllenNLP toolkit (Gardner et al. 2018). We implement encoders and decoders with a single-layer LSTM (Hochreiter and Schmidhuber 1997), and use the pkuseg toolkit (Luo et al. 2019) to segment Chinese words. We set the dimensions of both the hidden state and the word embedding to 300 for the LSTM. Adam optimization is adopted with an initial learning rate of 0.005 and a mini-batch size of 16. The maximum number of training epochs is set to 100, with a patience of 10 epochs for early stopping. The best hyper-parameter λ, balancing the generation loss and the entity loss, is 8. All baselines share the same configuration settings.
We first compare our base dialogue model MGR with two knowledge-grounded dialogue systems, NKD (Liu et al. 2018) and POKS (Lian et al. 2019). NKD uses a neural knowledge diffusion module to introduce related entities into dialogue generation. POKS employs both prior and posterior distributions over knowledge to select the appropriate knowledge in response generation. We then introduce several baselines induced from our GEML framework.

• Pre-train Only. We pre-train each base dialogue model f_θ on source-disease data in a multi-task learning paradigm, and then test it directly on target diseases. We test the three base models in this way and denote them PT-NKD (Liu et al. 2018), PT-POKS (Lian et al. 2019), and PT-MGR (Sec. 4). This is a zero-shot learning scenario.

• Fine-tuning. We pre-train f_θ on source diseases with the same multi-task learning paradigm and then fine-tune the pre-trained models on each target disease; these are denoted FT-NKD, FT-POKS, and FT-MGR.

• Meta-Learning. We first meta-train the three base dialogue models over source diseases with the effective meta-learning method Reptile (Nichol, Achiam, and Schulman 2018), and then adapt the derived meta-learners to each target disease via fine-tuning. The resulting models are denoted Meta-NKD, Meta-POKS, and Meta-MGR.

• GEML-MGR. We also employ our GEML framework on the proposed MGR model and denote it GEML-MGR.
Automatic Evaluation.
We adopt two automatic metrics for performance comparison, as shown in Table 2. To evaluate generation quality, we use the average of sentence-level BLEU-1, 2, 3 and 4 (Chen and Cherry 2014), denoted BLEU. To evaluate the success rate on the entity prediction task, we adopt Entity-F1, namely the F1 score between the predicted entities in the generated response and the ground-truth entities.
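Entity-F1 can be computed as below; the paper does not specify the exact averaging scheme, so this micro-averaged variant over response-level entity sets is one plausible reading, with invented entity names for illustration.

```python
def entity_f1(predicted, gold):
    """Micro-averaged F1 between predicted and ground-truth entity sets,
    one pair of sets per generated response."""
    tp = sum(len(p & g) for p, g in zip(predicted, gold))
    n_pred = sum(len(p) for p in predicted)
    n_gold = sum(len(g) for g in gold)
    if tp == 0:
        return 0.0
    precision, recall = tp / n_pred, tp / n_gold
    return 2 * precision * recall / (precision + recall)

preds = [{"ileus", "heartburn"}, {"cough"}]
golds = [{"ileus"}, {"cough", "fever"}]
print(round(entity_f1(preds, golds), 3))  # 0.667
```
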
Figure 3: Visualization of the evolved meta-knowledge graph and examples of generated responses on Chunyu. Our graph evolving enriches the commonsense graph and generates the response containing the correct entity (i.e., ileus).
Table 3: Results of ablation studies on two datasets ( × ).predicted entities in generated response and the ground-truthentities. For the CMDD dataset, comparing to two other basemodels (NKD and POKS), our MGR always achieves thebest performance in terms of both automatic and human eval-uation, indicating the superiority of our end-to-end medicaldialog model. The Fine-tuning method exceeds the
Pretrain-Only in most cases and
Meta-Learning methods often out-perform multi-task learning in terms of BLEU slightly yetEntity-F1 significantly. This means that the Reptile algorithmcan boost the capability of knowledge reasoning and trans-fer over diseases. By integrating our GEML method intothe MGR, we can observe significant improvement for our
GEML-MGR against all baselines, especially on Entity-F1,which demonstrates the stronger knowledge reasoning abil-ity of our model in the medical diagnosis scenario. For theChunyu dataset, we can observe similar results, including thesuperiority of the proposed GEML-MGR approach. Besides,we can see that the BLEU scores of all methods in CMDDdataset are much higher than those in the Chunyu, whichdemonstrates the challenge of the low-resource setting.
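For concreteness, Entity-F1 can be computed as below. Treating entity matching as exact multiset overlap and micro-averaging over dialogues are assumptions of this sketch; the paper does not pin down these implementation details.

```python
from collections import Counter

def entity_f1(predicted, gold):
    """Micro-averaged F1 between predicted and ground-truth entity lists,
    one list per generated response."""
    tp = fp = fn = 0
    for pred, ref in zip(predicted, gold):
        hits = sum((Counter(pred) & Counter(ref)).values())  # multiset overlap
        tp += hits
        fp += len(pred) - hits
        fn += len(ref) - hits
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

score = entity_f1([["ileus", "nausea"]], [["ileus"]])  # precision 0.5, recall 1.0 -> F1 ≈ 0.667
```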
Human Evaluation.
We invited five well-educated graduate students majoring in medicine to score 100 generated replies for each method. For each dataset, the evaluators were asked to grade each case independently in terms of "knowledge rationality" and "generation quality", ranging from 1 (strongly bad) to 5 (strongly good). The right part of Table 2 shows that our GEML-MGR achieves statistically significantly higher scores than both Meta-MGR and FT-MGR (t-test).
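The significance check above rests on a paired t-test over per-case score differences. A minimal sketch of the t statistic is given below (the example scores are hypothetical; in practice one would compare the statistic against the t distribution with n-1 degrees of freedom, e.g. via scipy.stats.ttest_rel):

```python
from statistics import mean, stdev
from math import sqrt

def paired_t_statistic(scores_a, scores_b):
    """t statistic for paired samples (same cases scored under two systems)."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    return mean(diffs) / (stdev(diffs) / sqrt(n))

# Hypothetical human scores for the same four cases under two systems.
t = paired_t_statistic([4, 6, 4, 6], [3, 4, 3, 4])  # ≈ 5.196
```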
Ablation Studies.
To verify the effects of the main components of our GEML and the base dialogue model, we conducted ablation studies on the two datasets. Table 3 shows that all these factors benefit our approach. Additionally, when we drop the graph reasoning module or the graph-guided copy mechanism, there is a remarkable performance degradation on both datasets, which indicates the significance of integrating these components together.
Case Study of Graph Evolving.
Fig. 3 visualizes the evolved meta-knowledge graph and shows cases of generated responses. One can observe a significant gap between the commonsense graph and the conversations in the Chunyu dataset, as the graph cannot cover all entities in the dialogues, e.g., borborygmus and exhaust. Through graph evolving, the learned meta-knowledge graph is enriched with new entities and edges derived from the conversation. For instance, the meta-knowledge graph absorbs the new entities borborygmus and exhaust and enhances the edge weights between the disease node ileus and its neighbor nodes. Additionally, among the generated responses, our GEML-MGR produces a rational and fluent diagnosis in fewer dialogue turns than the other methods.
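The evolving step in this case study can be illustrated schematically: add any dialogue entities missing from the graph, and strengthen edges between co-mentioned entities. The co-occurrence-count update below is an illustrative simplification; GEML actually meta-learns the enriched nodes and edge weights rather than counting.

```python
from itertools import combinations

def evolve_graph(graph, dialogue_entities):
    """Enrich a disease-symptom graph with entities seen in one dialogue.
    graph: {node: {neighbor: weight}}. New nodes start with no edges;
    every pair of co-mentioned entities gets its edge weight increased."""
    for entity in dialogue_entities:
        graph.setdefault(entity, {})
    for a, b in combinations(set(dialogue_entities), 2):
        w = graph[a].get(b, 0) + 1
        graph[a][b] = graph[b][a] = w  # keep the graph undirected
    return graph

# Toy commonsense graph covering only ileus–nausea; the dialogue in
# Fig. 3 additionally mentions borborygmus and exhaust.
commonsense = {"ileus": {"nausea": 1}, "nausea": {"ileus": 1}}
evolved = evolve_graph(commonsense, ["ileus", "borborygmus", "exhaust"])
```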
In this work, we propose an end-to-end low-resource medical dialogue generation model that meta-learns a model initialization from source diseases with the ability of fast adaptation to new diseases. Moreover, we develop a Graph-Evolving Meta-Learning (GEML) framework that learns to quickly evolve a meta-knowledge graph for adapting to new diseases and reasoning about disease-symptom correlations. Accordingly, our dialogue generation model enjoys fast learning ability and can well handle low-resource medical dialogue tasks. Experimental results testify to the advantages of our approach.
Acknowledgements
This work was supported in part by the National Natural Science Foundation of China (NSFC) under Grants No. U19A2073 and No. 61976233, the Guangdong Province Basic and Applied Basic Research (Regional Joint Fund-Key) Grant No. 2019B1515120039, the Nature Science Foundation of Shenzhen under Grant No. 2019191361, the Zhijiang Lab Open Fund (No. 2020AA3AB14), and the CSIG Young Fellow Support Fund.
References
Bahdanau, D.; Cho, K.; and Bengio, Y. 2014. Neural Machine Translation by Jointly Learning to Align and Translate. arXiv preprint arXiv:1409.0473.
Bansal, T.; Jha, R.; and McCallum, A. 2019. Learning to Few-Shot Learn Across Diverse Natural Language Classification Tasks. arXiv preprint arXiv:1911.03863.
Chen, B.; and Cherry, C. 2014. A Systematic Comparison of Smoothing Techniques for Sentence-Level BLEU. In Proceedings of the Ninth Workshop on Statistical Machine Translation, 362–367.
Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K. 2018. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv preprint arXiv:1810.04805.
Dhingra, B.; Li, L.; Li, X.; Gao, J.; Chen, Y.-N.; Ahmed, F.; and Deng, L. 2017. Towards End-to-End Reinforcement Learning of Dialogue Agents for Information Access. In ACL, 484–495.
Dou, Z.-Y.; Yu, K.; and Anastasopoulos, A. 2019. Investigating Meta-Learning Algorithms for Low-Resource Natural Language Understanding Tasks. In EMNLP-IJCNLP, 1192–1197.
Du, N.; Chen, K.; Kannan, A.; Tran, L.; Chen, Y.; and Shafran, I. 2019a. Extracting Symptoms and their Status from Clinical Conversations. In ACL, 915–925.
Du, N.; Wang, M.; Tran, L.; Lee, G.; and Shafran, I. 2019b. Learning to Infer Entities, Properties and their Relations from Clinical Conversations. In EMNLP-IJCNLP, 4978–4989.
Finn, C.; Abbeel, P.; and Levine, S. 2017. Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks. In ICML, 1126–1135.
Gardner, M.; Grus, J.; Neumann, M.; Tafjord, O.; Dasigi, P.; Liu, N.; Peters, M.; Schmitz, M.; and Zettlemoyer, L. 2018. AllenNLP: A Deep Semantic Natural Language Processing Platform. arXiv preprint arXiv:1803.07640.
Gu, J.; Wang, Y.; Chen, Y.; Li, V. O. K.; and Cho, K. 2018. Meta-Learning for Low-Resource Neural Machine Translation. In EMNLP, 3622–3631.
Hochreiter, S.; and Schmidhuber, J. 1997. Long Short-Term Memory. Neural Computation.
Kim, B.; Ahn, J.; and Kim, G. 2020. Sequential Latent Knowledge Selection for Knowledge-Grounded Dialogue. arXiv preprint arXiv:2002.07510.
Li, X.; Chen, Y.-N.; Li, L.; Gao, J.; and Celikyilmaz, A. 2017. End-to-End Task-Completion Neural Dialogue Systems. In IJCNLP, 733–743.
Lian, R.; Xie, M.; Wang, F.; Peng, J.; and Wu, H. 2019. Learning to Select Knowledge for Response Generation in Dialog Systems. In IJCAI.
Lin, X.; He, X.; Chen, Q.; Tou, H.; Wei, Z.; and Chen, T. 2019. Enhancing Dialogue Symptom Diagnosis with Global Attention and Symptom Graph. In EMNLP-IJCNLP, 5032–5041.
Liu, S.; Chen, H.; Ren, Z.; Feng, Y.; Liu, Q.; and Yin, D. 2018. Knowledge Diffusion for Neural Dialogue Generation. In ACL, 1489–1498.
Liu, W.; Tang, J.; Qin, J.; Xu, L.; Li, Z.; and Liang, X. 2020. MedDG: A Large-scale Medical Consultation Dataset for Building Medical Dialogue System. arXiv preprint arXiv:2010.07497.
Luo, R.; Xu, J.; Zhang, Y.; Ren, X.; and Sun, X. 2019. PKUSEG: A Toolkit for Multi-Domain Chinese Word Segmentation. CoRR abs/1906.11455. URL https://arxiv.org/abs/1906.11455.
Mi, F.; Huang, M.; Zhang, J.; and Faltings, B. 2019. Meta-Learning for Low-resource Natural Language Generation in Task-oriented Dialogue Systems. In IJCAI, 3151–3157.
Moon, S.; Shah, P.; Kumar, A.; and Subba, R. 2019. OpenDialKG: Explainable Conversational Reasoning with Attention-based Walks over Knowledge Graphs. In ACL, 845–854.
Nichol, A.; Achiam, J.; and Schulman, J. 2018. On First-Order Meta-Learning Algorithms. arXiv preprint arXiv:1803.02999.
Obamuyide, A.; and Vlachos, A. 2019. Model-Agnostic Meta-Learning for Relation Classification with Limited Supervision. In ACL, 5873–5879.
Qian, K.; and Yu, Z. 2019. Domain Adaptive Dialog Generation via Meta Learning. In ACL, 2639–2649.
Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; and Sutskever, I. 2019. Language Models are Unsupervised Multitask Learners. OpenAI Blog.
See, A.; Liu, P. J.; and Manning, C. D. 2017. Get To The Point: Summarization with Pointer-Generator Networks. In ACL, 1073–1083.
Serban, I. V.; Sordoni, A.; Bengio, Y.; Courville, A.; and Pineau, J. 2016. Building End-to-End Dialogue Systems Using Generative Hierarchical Neural Network Models. In AAAI.
Shi, X.; Hu, H.; Che, W.; Sun, Z.; Liu, T.; and Huang, J. 2020. Understanding Medical Conversations with Scattered Keyword Attention and Weak Supervision from Responses. In AAAI, 8838–8845.
Song, K.; Tan, X.; Qin, T.; Lu, J.; and Liu, T.-Y. 2019. MASS: Masked Sequence to Sequence Pre-training for Language Generation. In ICML, 5926–5936.
Sutskever, I.; Vinyals, O.; and Le, Q. V. 2014. Sequence to Sequence Learning with Neural Networks. In Advances in Neural Information Processing Systems, 3104–3112.
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I. 2017. Attention Is All You Need. In Advances in Neural Information Processing Systems, 5998–6008.
Veličković, P.; Cucurull, G.; Casanova, A.; Romero, A.; Liò, P.; and Bengio, Y. 2018. Graph Attention Networks. In ICLR.
Wei, Z.; Liu, Q.; Peng, B.; Tou, H.; Chen, T.; Huang, X.-J.; Wong, K.-F.; and Dai, X. 2018. Task-Oriented Dialogue System for Automatic Diagnosis. In ACL, 201–207.
Wu, J.; Xiong, W.; and Wang, W. Y. 2019. Learning to Learn and Predict: A Meta-Learning Approach for Multi-Label Classification. In EMNLP-IJCNLP, 4354–4364.
Xu, L.; Zhou, Q.; Gong, K.; Liang, X.; Tang, J.; and Lin, L. 2019. End-to-End Knowledge-Routed Relational Dialogue System for Automatic Diagnosis. In AAAI, 7346–7353.
Zhang, H.; Liu, Z.; Xiong, C.; and Liu, Z. 2020. Grounded Conversation Generation as Guided Traverses in Commonsense Knowledge Graphs. In ACL, 2031–2043.
Zhao, X.; Wu, W.; Tao, C.; Xu, C.; Zhao, D.; and Yan, R. 2019. Low-Resource Knowledge-Grounded Dialogue Generation. In ICLR.
Zhou, H.; Young, T.; Huang, M.; Zhao, H.; Xu, J.; and Zhu, X. 2018. Commonsense Knowledge Aware Conversation Generation with Graph Attention. In IJCAI, 4623–4629.
Zhou, P.; Yuan, X.; Xu, H.; and Yan, S. 2019. Efficient Meta Learning via Minibatch Proximal Update. In NeurIPS.
Zhou, P.; Zou, Y.; Yuan, X.; Feng, J.; Xiong, C.; and Hoi, S. C. 2020. Task Similarity Aware Meta Learning: Theory-inspired Improvement on MAML. In 4th Workshop on Meta-Learning at NeurIPS 2020.