Knowledge Grounded Conversational Symptom Detection with Graph Memory Networks
Hongyin Luo (MIT CSAIL), Shang-Wen Li* (Amazon AI), James Glass (MIT CSAIL)
[email protected], [email protected], [email protected]
Abstract
In this work, we propose a novel goal-oriented dialog task, automatic symptom detection. We build a system that can interact with patients through dialog to detect and collect clinical symptoms automatically, which can save a doctor's time interviewing the patient. Given a set of explicit symptoms provided by the patient to initiate a dialog for diagnosing, the system is trained to collect implicit symptoms by asking questions, in order to collect more information for making an accurate diagnosis. After getting the reply from the patient for each question, the system also decides whether the current information is enough for a human doctor to make a diagnosis. To achieve this goal, we propose two neural models and a training pipeline for the multi-step reasoning task. We also build a knowledge graph as an additional input to further improve model performance. Experiments show that our model significantly outperforms the baseline by 4%, discovering 67% of implicit symptoms on average with a limited number of questions.
In a typical clinical conversation between a patient and a doctor, the patient initiates the dialog by providing a number of explicit symptoms as a self-report. Based on this information, the doctor asks about other possible symptoms, in order to make an accurate diagnosis and suggest treatments. This is a multi-step reasoning process. At each step, the doctor chooses a symptom to ask about or concludes the diagnosis by considering the dialog history and possible diseases.

With recent advances in deep reinforcement learning (Mnih et al., 2013) and task-oriented dialog systems (Bordes et al., 2016; Wen et al., 2016), recent studies have proposed human-computer dialog systems for automatic diagnosis (Wei et al., 2018; Xu et al., 2019).

* Work done before the second author joined Amazon.

I. Automatic Diagnosis (AD)
User: The baby has a Runny Nose.
Agent: Does the baby cough?
User: Yes, the baby is coughing.
Agent: Does the baby have a fever?
User: I am not sure.
Agent: It is upper respiratory infection.

II. Automatic Symptom Detection (ASD)
User: The baby has a Runny Nose.
Agent: Does the baby cough?
User: Yes, the baby is coughing.
Agent: Does the baby sneeze?
User: Yes, the baby is sneezing.
Agent: Does the baby have a fever?
User: I am not sure.
Agent: Does the baby have a headache?
User: Yes, the baby has a headache.
Agent: Does the baby have Phlegm?
User: Yes, the baby has Phlegm.
Agent: Thank you for the information! A report has been sent to your doctor.
Table 1: Two examples of dialog between different systems and a user. Conversation I is generated by an automatic diagnosis system, and conversation II is generated by an automatic symptom detection system. The explicit symptom is highlighted in blue, the implicit symptoms are highlighted in red, and unrelated symptoms are marked in green.

The proposed ASD system concludes the conversation after deciding that the current information is enough for a doctor to make a diagnosis. As shown in Table 1, the number of turns of the ASD system is possibly more than that of an AD system, and it covers more implicit symptoms that are not mentioned in the patient's self-report.

In this work, we focus on the conversational ASD task. We propose a system that predicts implicit symptoms and whether to conclude the conversation with neural networks. To train the neural networks, we borrow the idea of the masked language model (Devlin et al., 2018) and simulate both training and test datasets. To improve the performance of the system, we annotate a medical knowledge graph based on an online medical dictionary. Then we propose a graph memory network (GMemNN) architecture to utilize the external knowledge graph. We also propose two metrics, symptom hit rate and unrelated rate, to evaluate the performance of the system.

We make the following contributions in this paper:
• We propose the conversational symptom detection task and evaluation metrics.
• We annotate a knowledge graph in the medical domain to enrich the current corpus.
• We propose a graph memory network (GMemNN) architecture to build the dialog agent, which produces state-of-the-art performance.
Task-oriented dialog systems aim at completing a specific task by interacting with users through natural language, and the main challenge is learning a dialog policy manager (Papineni et al., 2001). Typical applications include flight booking (Seneff and Polifroni, 2000), movie recommendation (Dodge et al., 2015; Fazel-Zarandi et al., 2017), restaurant reservation (Bordes et al., 2016), and vision grounding (Chattopadhyay et al., 2017). Recently, such systems have been applied in automatic diagnosis (Wei et al., 2018; Xu et al., 2019; Luo et al., 2020). The authors of De Vries et al. (2017) proposed the GuessWhat game, which requires computers to guess a visual object given a natural language description by asking a series of questions. The GuessWhat game is similar to our task in the medical domain.
Many tasks require processing knowledge in different formats. Sukhbaatar et al. (2015) proposed memory networks (MemNNs) for question answering. The context of the question, or knowledge, is stored in an external memory bank, and the model reads information from the memory with an attention mechanism. The MemNN model has also been applied to question answering in the movie domain (Miller et al., 2016), video question answering (Luo et al., 2019), and stance detection (Mohtarami et al., 2018). The neural Turing machine (Graves et al., 2014) and the neural computer (Graves et al., 2016) also applied external memory banks, and enable the models to write into and read from the external memory cells dynamically.

In many tasks, knowledge can be organized as graphs. Recent studies have proposed different neural models for processing graph-structured data. Graph neural networks (GNNs) (Scarselli et al., 2008) use neural networks to perform message propagation on graphs. Graph convolutional networks (GCNs) (Kipf and Welling, 2016) employ a multi-layer architecture to learn node embeddings by integrating the information of the nodes and their neighbors. Graph attention networks (Veličković et al., 2017) integrate node embeddings with an attention mechanism. Shang et al. (2019) proposed a graph augmented memory network (GAMENet) model for medication recommendation. A similar idea that combines graphs and memory networks is proposed in Pham et al. (2018) for molecular activity prediction. In this work, we also propose a memory network architecture that processes graph-structured knowledge, but focus on bipartite graphs.
In this section, we formally define the automatic symptom detection task and describe the corpus used to train and evaluate the model. We first introduce the Muzhi corpus (Wei et al., 2018), then describe the task based on the corpus. Lastly, we describe the medical knowledge graph we annotated and the annotation method.
We train and evaluate our models using the Muzhi corpus. The corpus was collected from an online medical forum (http://muzhi.baidu.com), including 4 common diseases and 66 symptoms. The corpus contains 710 dialog sessions represented as 710 user goals. Each user goal includes a set of explicit symptoms as the user's self-report, and a set of implicit symptoms queried by doctors. An example of a user goal is shown in Table 2.

In the example, 1 means that the patient confirms a symptom, while 0 means that the patient is confident that the symptom does not exist. Other symptoms not listed in the user goal are considered either unrelated to the diagnosis, or the patient is not sure about their existence. In the Muzhi corpus, each user goal contains . explicit symptoms and . implicit symptoms on average.

Disease tag    Bronchiolitis
Exp Sym        Runny Nose: 1; Cough: 1
Imp Sym        Sore Throat: 1; Emesis: 0; Harsh Breath: 1; Fever: 0

Table 2: An example of a user goal in the Muzhi corpus, containing explicit symptoms and implicit symptoms. 1 means a symptom is confirmed by the patient, while 0 means that a symptom is denied by the patient.

The goal of the automatic conversational symptom detection (ASD) task is to detect as many implicit symptoms as possible through dialogs with the patients, while limiting the number of dialog turns. The initial input of a dialog agent is the set of explicit symptoms. Based on the query and user response of each step, the system decides on a new symptom to ask about, or stops the dialog.

All implicit symptoms, including the positive and negative ones, are considered targets of the system. The user goals are collected from real doctor-patient conversations, so we consider every queried symptom a necessary step of making an accurate diagnosis. The systems are evaluated with two metrics. We say model A outperforms model B if model A discovers more implicit symptoms and queries fewer unrelated symptoms.
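A user goal like the one in Table 2 can be represented as a small dictionary; a minimal sketch follows, where the field names are illustrative (not the corpus's actual schema):

```python
# A user goal following Table 2. Field names here are our own,
# hypothetical stand-ins for whatever schema the Muzhi release uses.
user_goal = {
    "disease_tag": "Bronchiolitis",
    "explicit_symptoms": {"Runny Nose": 1, "Cough": 1},
    "implicit_symptoms": {"Sore Throat": 1, "Emesis": 0,
                          "Harsh Breath": 1, "Fever": 0},
}

def symptom_status(goal, symptom):
    """1 = confirmed, 0 = denied; None = not listed (unrelated/unknown)."""
    status = {**goal["explicit_symptoms"], **goal["implicit_symptoms"]}
    return status.get(symptom)
```

Symptoms absent from both sets are exactly the ones the text treats as unrelated or unknown, which is why `symptom_status` returns `None` for them.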
We annotate a medical knowledge graph to provide information about the relations among symptoms and diseases, based on the symptoms included in the Muzhi corpus. As described above, we have 66 symptoms in total. We regard each symptom and disease as a node in the graph, and annotate symptom-symptom and symptom-disease edges based on the A-Hospital website, which contains webpages for both symptoms and diseases.

We propose a novel annotation method to build the medical knowledge graph considering complications. Each symptom page in A-Hospital describes a series of diseases that can cause the symptom. Meanwhile, it also lists the most probable symptoms to appear if the target symptom is caused by a certain disease. We regard these symptoms as complications and make use of this information. In practice, we annotate the knowledge graph with the following method:

1. For each symptom s and its related disease d, add edge (s, d).
2. For each symptom s, its related disease d, and complication c, add edge (s, c).

Figure 1: An example of an annotated symptom in the knowledge graph. Red blocks represent symptoms and blue blocks stand for diseases. "Cough" is the target symptom and the other symptoms are complications.

An example of the annotated knowledge is shown in Figure 1, and Table 3 summarizes the
Items            Statistics
Num. Sym.        66
Num. Dise.       28
Num. Edge        1094
Num. S-D Edge    284
Num. S-C Edge    810
Table 3: Statistics of the annotated knowledge graph of symptoms, diseases, and complications. Note that both symptom-disease and symptom-complication edges exist.

knowledge graph we annotated. In the table, S-D edge stands for symptom-disease edge and S-C edge stands for symptom-disease-symptom edge. The number of S-C edges is lower than the product of the number of symptoms per disease and the number of diseases per symptom because only a subset of the symptoms caused by a disease are regarded as significant complications of a given symptom.
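The two annotation rules above can be sketched as a short procedure. This is our own illustration, assuming the parsed A-Hospital pages are available as a hypothetical mapping from each symptom to its causing diseases and their listed complications:

```python
# Sketch of the two edge-annotation rules. `pages` maps
# symptom -> {disease -> [complication symptoms]} (hypothetical format).
def annotate_edges(pages):
    sd_edges, sc_edges = set(), set()
    for symptom, diseases in pages.items():
        for disease, complications in diseases.items():
            sd_edges.add((symptom, disease))       # rule 1: edge (s, d)
            for comp in complications:
                if comp != symptom:
                    sc_edges.add((symptom, comp))  # rule 2: edge (s, c)
    return sd_edges, sc_edges

# Toy example mirroring Figure 1 ("Cough" as the target symptom).
pages = {"Cough": {"Common Cold": ["Fever", "Runny Nose"],
                   "Pneumonia": ["Fever", "Chest Pain"]}}
sd, sc = annotate_edges(pages)
```

Because `sc_edges` is a set, a complication shared by several diseases (e.g. "Fever" here) contributes only one S-C edge, consistent with the S-C edge count in Table 3 being smaller than the raw product of per-node degrees.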
In this section, we introduce the structure and pipeline of the proposed automatic symptom detection system, including the dialog state representation, the neural models for predicting symptoms and dialog actions, the training strategy, and the evaluation metrics.
Automatic symptom detection is a multi-step reasoning task handled by action and symptom predictions. Both tasks are accomplished with neural networks based on the current dialog state. The first step of building such a system is representing dialog states with vectors that can be processed by the neural networks. Following the method applied in Wei et al. (2018) for vectorizing the dialog states, each dialog state consists of 4 parts:
I. UserAction: The user action of the previous dialog turn. Possible actions are:
• SelfReport: A user sends a self-report containing a set of explicit symptoms.
• Confirm: A user confirms that a queried symptom exists.
• Deny: A user indicates that a queried symptom does not exist.
• NotSure: A user replies "not sure" when an unrelated symptom is queried.

II. AgentAction: The previous action of the dialog agent. Possible actions are:
• Initiate: The system initiates the dialog and asks the user to send the self-report.
• Request: The system queries about the existence of a symptom.

III. Slots: Contains all symptoms that appeared in the dialog history and their status. Each symptom has 4 possible statuses:
• Confirmed: Confirmed by the user.
• Denied: Denied by the user.
• Unrelated: The symptom is not necessary for the doctor to make an accurate diagnosis.
• NotQueried: A symptom that has not been queried by the agent.
IV. NumTurns: Indicates the length of the dialog history, in other words, the current number of turns.

In each step, only one value is selected for UserAction, AgentAction, and NumTurns, and we represent them with one-hot vectors a^u, a^r, and n respectively. We use a 66-dimension vector s to represent the Slots, where each dimension indicates the status of a symptom. If a symptom is confirmed, the corresponding dimension is set to 1. If a symptom is denied, the corresponding dimension is set to −1. If a symptom is unrelated to the diagnosis process, the dimension is set to −2. All other dimensions are set to 0. The final input of the neural networks at the t-th step is represented as

x_t = [a^u_t, a^r_t, n_t, s_t]    (1)

which is generated by concatenating all the vectors described above.

Figure 2: The 4 steps for processing an input dialog state with a graph memory network (GMemNN): (a) initialize the patient embedding with edges to the input slots; (b) integrate disease information with attention; (c) integrate symptom information with attention; (d) predict the action and symptom with linear transformations. The gray nodes stand for the patient, the red nodes represent symptoms, and the blue nodes represent diseases. The edges with arrows, which are labeled with the same color as their source nodes, indicate the direction of message propagation.
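The state vectorization in Eq. 1 can be sketched as follows. This is a minimal illustration: the maximum-turn dimension and the exact slot values (1, −1, −2 for confirmed/denied/unrelated) are plausible instantiations of the scheme described above, not values taken from a code release:

```python
import numpy as np

# Dialog state vector x_t = [a_u, a_r, n, s] (Eq. 1). Helper names and
# MAX_TURNS are our assumptions, not the paper's actual implementation.
USER_ACTIONS = ["SelfReport", "Confirm", "Deny", "NotSure"]
AGENT_ACTIONS = ["Initiate", "Request"]
NUM_SYMPTOMS, MAX_TURNS = 66, 22

def one_hot(index, size):
    v = np.zeros(size)
    v[index] = 1.0
    return v

def encode_state(user_action, agent_action, num_turns, slots):
    """slots: dict symptom_id -> 'confirmed' | 'denied' | 'unrelated'."""
    value = {"confirmed": 1.0, "denied": -1.0, "unrelated": -2.0}
    s = np.zeros(NUM_SYMPTOMS)          # NotQueried symptoms stay 0
    for sym_id, status in slots.items():
        s[sym_id] = value[status]
    return np.concatenate([
        one_hot(USER_ACTIONS.index(user_action), len(USER_ACTIONS)),
        one_hot(AGENT_ACTIONS.index(agent_action), len(AGENT_ACTIONS)),
        one_hot(num_turns, MAX_TURNS),
        s,
    ])

x = encode_state("Confirm", "Request", 3, {0: "confirmed", 5: "denied"})
```

With these dimensions, x has 4 + 2 + 22 + 66 = 94 entries; the slot block starts at index 28.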
The first neural model we apply in this work is a multi-layer perceptron (MLP) with 1 hidden layer. The same neural network is applied in Wei et al. (2018) and Xu et al. (2019) for the automatic diagnosis task. With input x, the feed-forward process of the MLP is as follows:

h = ReLU(W_1 · x + b_1)
y = Softmax(W_2 · h + b_2)    (2)

where Softmax calculates a probability distribution by

Softmax(a_i) = e^{a_i} / Σ_j e^{a_j}    (3)

The MLP is used for both implicit symptom and dialog action predictions. Note that the MLP model only uses the dialogs in the training set, and does not use the knowledge graph we annotated.

Limited by its structure, the MLP cannot directly utilize the knowledge graph, which contains necessary medical knowledge for clinical diagnosis. Inspired by previous studies on processing knowledge and graphs (Sukhbaatar et al., 2015; Veličković et al., 2017), we propose graph memory networks (GMemNN) that utilize the medical knowledge graph to improve the performance of the automatic symptom detection system.

The knowledge graph is stored in an external memory bank. In each step, we regard a patient as a node connected with the known symptoms in the graph. Our purpose is to learn the embedding of the patient node and predict dialog actions and symptoms based on it. The prediction using GMemNN contains 4 steps: 1. encoding dialog states, 2. integrating potential disease information, 3. integrating complication symptoms, and 4. predicting the action/symptom. The 4 steps are illustrated in Figure 2.
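Eqs. 2-3 can be written out directly; a minimal NumPy sketch, with illustrative shapes (the paper's action-prediction MLP uses a 128-unit hidden layer):

```python
import numpy as np

# One-hidden-layer MLP predictor (Eqs. 2-3). Weights are random here
# purely to demonstrate shapes; real weights come from SGD training.
rng = np.random.default_rng(0)

def softmax(a):
    e = np.exp(a - a.max())           # subtract max for numerical stability
    return e / e.sum()

def mlp_forward(x, W1, b1, W2, b2):
    h = np.maximum(0.0, W1 @ x + b1)  # Eq. 2: ReLU hidden layer
    return softmax(W2 @ h + b2)       # distribution over symptoms/actions

x = rng.normal(size=94)               # a dialog state vector (Eq. 1)
W1, b1 = rng.normal(size=(128, 94)) * 0.1, np.zeros(128)
W2, b2 = rng.normal(size=(66, 128)) * 0.1, np.zeros(66)
y = mlp_forward(x, W1, b1, W2, b2)    # probabilities over 66 symptoms
```

The output is a valid probability distribution: non-negative and summing to 1, as Eq. 3 guarantees.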
Dialog State Encoding
The GMemNN encodes the input dialog states with a lookup matrix, or a linear transformation. Given an input dialog state representation x, the network encodes the dialog state with

u = W_x · x + b_x    (4)

Note that no non-linear activation is applied to u at this step, and u is considered the initial embedding of the patient node in the graph.

Integrating Disease Information
After encoding the dialog state, we update the patient embedding using the embeddings of possible diseases. We calculate an embedding to summarize potential diseases using the attention mechanism for reading from the memory bank applied in memory networks (Sukhbaatar et al., 2015).

Similar to the method applied in the MemNN, we first calculate two sets of embeddings for the diseases based on their neighbors, or related symptoms, in the knowledge graph. In this paper, we use W_sm to denote the symptom embedding matrix for calculating attentions on memory, and W_sc to denote the symptom embeddings for calculating outputs. The related symptoms are summarized with the adjacency matrix A_d between symptoms and diseases:

d̂_{i,m} = d_{i,m} + A^i_d W_sm D^{-1}_{d,i}
d̂_{i,c} = d_{i,c} + A^i_d W_sc D^{-1}_{d,i}    (5)

where d̂_{i,·} represents the updated embedding of disease i, d_{i,·} is the initial disease embedding, and W_{s·} stands for the symptom embedding matrix for updating disease embeddings. A^i_d is the i-th row of A_d, and D_{d,i} is the disease node degree used for normalization. This is a variant of the normalization method proposed in Kipf and Welling (2016).

Then we summarize the potential diseases using d̂_m, d̂_c, and the initial input embedding u:

e_d = Σ_i α^d_i · d̂_{i,c}
α^d_i = Softmax(u · d̂_{i,m})    (6)

Then we update the initial patient embedding u by integrating the disease embeddings:

u_d = ReLU(u + e_d)    (7)

Integrating Symptom Information
After integrating the information of possible diseases, the model continues by integrating the complication symptom information to produce the final patient embedding. For symptom i, given the initial symptom embeddings s_{i,·} and the adjacency matrix A_s between symptoms, we calculate the symptom embeddings with

ŝ_{i,m} = s_{i,m} + A^i_s W_sm D^{-1}_{s,i} + A^{·,i}_d W_dm D^{-1}_{d,·,i}
ŝ_{i,c} = s_{i,c} + A^i_s W_sc D^{-1}_{s,i} + A^{·,i}_d W_dc D^{-1}_{d,·,i}    (8)

where W_{s·} is the complication symptom embedding matrix and W_{d·} is the disease embedding matrix. D_{s,i} is the number of neighbor symptoms of symptom i, and D_{d,·,i} is the number of neighbor diseases of symptom i.

Similarly, we summarize the complication symptoms by

e_s = Σ_i α^s_i · ŝ_{i,c}
α^s_i = Softmax(u_d · ŝ_{i,m})    (9)

Then we get the final patient embedding by integrating u_d with the complication symptom embedding:

u_{d,s} = ReLU(u_d + e_s)    (10)

u_{d,s} stands for a patient embedding that has integrated both disease and symptom information.

Action/Symptom Prediction
The GMemNN model predicts both dialog actions and symptoms with linear transformations based on the same patient embedding u_{d,s}:

y_act = W_act · u_{d,s} + b_act
y_sym = W_sym · u_{d,s} + b_sym    (11)

The action and symptom distributions are calculated from y_act and y_sym with the Softmax function. The available dialog actions are Conclude and Query, and the prediction space of the symptom prediction network is the 66 symptoms except the known symptoms.

The Muzhi dataset does not contain any dialog history to mimic. Inspired by the masked language model training pipeline proposed by Devlin et al. (2018), we construct our own training set by randomly masking and sampling symptoms.
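A compact sketch of the GMemNN forward pass helps make the message flow concrete. For brevity it implements only the disease-integration hop (Eqs. 4-7); all shapes, initializations, and the sparse random adjacency are illustrative assumptions, not the paper's hyper-parameters:

```python
import numpy as np

# Sketch of GMemNN Eqs. 4-7 with NumPy. Sizes: 66 symptoms, 28 diseases,
# 64-dim embeddings (illustrative). Real parameters are learned with SGD.
rng = np.random.default_rng(0)
n_sym, n_dis, dim = 66, 28, 64

A_d = (rng.random((n_dis, n_sym)) < 0.1).astype(float)  # disease-symptom adjacency
W_sm = rng.normal(size=(n_sym, dim)) * 0.1              # symptom emb. (attention)
W_sc = rng.normal(size=(n_sym, dim)) * 0.1              # symptom emb. (output)
d_m = rng.normal(size=(n_dis, dim)) * 0.1               # initial disease embeddings
d_c = rng.normal(size=(n_dis, dim)) * 0.1
W_x, b_x = rng.normal(size=(dim, 94)) * 0.1, np.zeros(dim)

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def forward(x):
    u = W_x @ x + b_x                                   # Eq. 4: initial patient emb.
    deg = A_d.sum(axis=1, keepdims=True).clip(min=1.0)  # disease degrees D_d
    dm = d_m + (A_d @ W_sm) / deg                       # Eq. 5: aggregate symptoms
    dc = d_c + (A_d @ W_sc) / deg
    alpha = softmax(dm @ u)                             # Eq. 6: attention over diseases
    e_d = alpha @ dc
    return np.maximum(0.0, u + e_d)                     # Eq. 7: u_d = ReLU(u + e_d)

u_d = forward(rng.normal(size=94))
```

The symptom-integration hop (Eqs. 8-10) repeats the same aggregate-attend-update pattern with the symptom-symptom adjacency, and Eq. 11 is a final pair of linear heads on the resulting embedding.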
Symptom Prediction
We build the training set by simulating dialog states from user goals in the original training set of the Muzhi corpus. We consider user goal g_i with explicit symptom set S_e and implicit symptom set S_i as an example, where |S_e| = n_e and |S_i| = n_i. We simulate t dialog states based on g_i with the following steps:
• Select the entire explicit symptom set S_e.
• Randomly select n'_i ∈ [0, n_i) and sample n'_i implicit symptoms to construct S'_i ⊂ S_i.
• Randomly select n_u ∈ [0, T_max − n'_i) and sample n_u unrelated symptoms to construct set S_u. T_max stands for the maximum number of symptoms that can be queried.
• Set the number of turns to t = n'_i + n_u.
• If n'_i = n_u = 0, set AgentAction to "Initiate". Otherwise, set AgentAction to "Request".
• Randomly select a symptom s ∈ S_i ∪ S_u. If s ∈ S_u, set UserAction to "NotSure"; otherwise set it to "Confirm" or "Deny" based on g_i.
• Set the current slots to S_e ∪ S'_i ∪ S_u.
• Randomly select an implicit symptom s_l ∈ S_i − S'_i as the prediction label.

Action Prediction

We simulate dialog states for the dialog action prediction task with the same procedure as described above, except that we can involve all implicit symptoms. If all implicit symptoms are included, the training label will be set to "Conclude"; otherwise the label will be "Query". We train MLPs and GMemNNs on both tasks after the training sets are generated. The models are trained with the simulated dialog states and labels using the stochastic gradient descent (SGD) algorithm.
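The symptom-prediction simulation steps above can be sketched as follows; the symptom names and the T_max value are placeholders, and the helper is our own reading of the sampling procedure rather than the paper's code:

```python
import random

# Sketch of the training-state simulation for symptom prediction.
# T_MAX (the cap on queried symptoms) is an assumed value.
T_MAX = 10

def simulate_state(explicit, implicit, all_symptoms, rng=random):
    """explicit: set; implicit: dict symptom -> 0/1. Returns (slots, label)."""
    n_i = rng.randrange(len(implicit))                 # n'_i in [0, n_i)
    known_imp = rng.sample(sorted(implicit), n_i)      # S'_i subset of S_i
    unrelated_pool = [s for s in all_symptoms
                      if s not in implicit and s not in explicit]
    n_u = rng.randrange(T_MAX - n_i)                   # n_u in [0, T_max - n'_i)
    unrelated = rng.sample(unrelated_pool, min(n_u, len(unrelated_pool)))
    slots = set(explicit) | set(known_imp) | set(unrelated)
    # label: an implicit symptom not yet revealed (s_l in S_i - S'_i)
    label = rng.choice([s for s in sorted(implicit) if s not in known_imp])
    return slots, label

explicit = {"Runny Nose", "Cough"}
implicit = {"Sore Throat": 1, "Emesis": 0, "Harsh Breath": 1, "Fever": 0}
all_syms = set(implicit) | explicit | {"Headache", "Phlegm", "Sneeze"}
slots, label = simulate_state(explicit, implicit, all_syms)
```

Because n'_i is drawn from [0, n_i), at least one implicit symptom is always left out, so a valid prediction label always exists.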
We train and evaluate our models on the Muzhi corpus. The symptom predictor and the dialog action predictor are trained separately. Using the same strategy of simulating the training set, we also generate test sets for symptom prediction and action prediction respectively from the test user goals. The generated test sets are used for evaluating the performance of our models on both unit tasks.

Unit Task             Model     Acc (%)   Stdv (%)
Action Prediction     MLP       94.14     0.27
Action Prediction     GMemNN    94.50     0.42
Symptom Prediction    MLP       45.10     0.62
Symptom Prediction    GMemNN    47.88     1.18

Table 4: Unit task evaluation results of the action and symptom prediction tasks. Acc stands for average accuracy, and Stdv stands for the standard deviation of the accuracies. The statistics are obtained by running 10 experiments for each model on each task.
After evaluating the models with the unit tasks, we conduct conversational evaluations using the trained models and a user simulator. We evaluate the performance of the models by counting the number of implicit and unrelated symptoms queried in the conversations.
For action prediction, we simulate 20 dialog states for each user goal in both the training and test sets. All simulated states contain the entire explicit symptom sets. 10 of the 20 states also contain the complete implicit symptom sets, and thus they are labeled with "Conclude", meaning that the dialog system should conclude the dialog given these states. The other states only contain a proper subset of implicit symptoms. These states are labeled with "Query", meaning that the agent should continue querying symptoms. We have 11,360 training states and 2,840 test states.

We train an MLP and a GMemNN model on the simulated training sets. The MLP model has one hidden layer with 128 neurons, while the size of the hidden layers of the GMemNN is set to 64. The models are trained with the stochastic gradient descent (SGD) algorithm. The learning rate for training the MLP is 0.025, and is set to 0.035 for training the GMemNN. A weight decay rate of 0.001 is applied for training both models. Both models are trained for 40 epochs.

The experimental results are shown in Table 4. All experimental results are obtained by running 5 independent experiments for each model, starting from data simulation. The GMemNN model outperformed the MLP model by a small margin. The experimental results indicate that action prediction is not a hard classification task, so external knowledge and more complex neural networks do not help much.

For implicit symptom prediction, we simulate 10 dialog states for each user goal in both the training and test sets. All dialog states contain the complete explicit symptom set and a proper subset of implicit symptoms. A random number of unrelated symptoms are also included. The label for the training set is randomly sampled from implicit symptoms that are not included in the dialog state. We train the neural networks for the implicit symptom prediction task with SGD.
The architectures of the MLP and GMemNN are the same as the models applied for action prediction. We also apply the same hyper-parameter settings for training as in the previous task. The experimental results of symptom prediction are shown in Table 4, which are also collected by running 5 independent experiments from data simulation. The GMemNN model significantly outperformed the basic MLP model by 2.78% on average, and the performance is more stable. Compared with the action prediction task, symptom prediction is much more difficult. As a result, domain-specific knowledge can improve the performance more significantly.

We also evaluate our model by conducting dialogs using the original test split of user goals in the Muzhi corpus. For each test user goal, we generate a conversation using the dialog action predictor, the implicit symptom predictor, and a rule-based user simulator.

The user simulator initiates a dialog by providing a set of explicit symptoms as the initial state. In each dialog step, the action predictor decides if the current state is informative enough to conclude the dialog. If a conclusion action is predicted, the system stops the conversation. Otherwise, the system queries the user simulator with a symptom selected by the symptom predictor. If the selected symptom is positive in the implicit symptom set, the user simulator confirms the query. If it is negative in the implicit symptom set, the user simulator denies the query. If the selected symptom is not in the implicit symptom set, the user simulator responds "NotSure". The dialog continues until the "Conclude" action is selected, or the maximum limit of dialog turns is reached.

For each test user goal, we calculate the number of unrelated symptoms queried N_u, the number of dialog turns N, and the ratio of detected implicit symptoms R_d. Given the number of all implicit symptoms N_i and the number of detected implicit symptoms N'_i, we calculate the hit rate R_h, the unrelated rate R_u, and the F1 score by

R_h = N'_i / N_i
R_u = N_u / N
F1 = 2 · R_h · (1 − R_u) / (R_h + 1 − R_u)    (12)

Model      Hit (%)   UnRel (%)   F1 (%)
MLP-AD     9.62      83.37       18.75
MLP-ASD    63.26     81.88       31.35
GMemNN

Table 5: The experimental results of the conversational evaluation. MLP-AD stands for the pretrained state-of-the-art MLP model for automatic diagnosis (AD) provided by the authors of Xu et al. (2019). MLP-ASD stands for the MLP model for automatic symptom detection (ASD) in this work. Hit stands for average hit rate R_h, and UnRel stands for average unrelated rate R_u.

We evaluate the models by calculating and comparing R_h, R_u, and the F1 score averaged over the number of conversations. The experimental results are shown in Table 5. The experiments are conducted by setting the tolerate rate (TolR) to 10, meaning that the agent is allowed to query up to 10 symptoms. The experimental results show that the MLP-ASD and GMemNN models detect significantly more implicit symptoms than the MLP-AD model (Xu et al., 2019), which makes a diagnosis by querying only 9.62% of the implicit symptoms that a human doctor would ask about. Comparing the MLP-ASD and GMemNN models, the GMemNN model significantly outperformed the MLP model in hit rate while also achieving a lower unrelated rate, and thus improved the F1 score.

We use the tolerate rate (TolR) to limit the number of dialog turns. If the symptom predictor is completely random and the TolR equals the number of symptoms, the hit rate R_h will be 100%. However, querying all symptoms costs too much time for the patient. Since each user goal contains only a few symptoms on average, the average unrelated rate R_u of such a system will be around 95%, and the F1 score will be correspondingly low.

To understand the effect of the tolerate rate, we visualize the relation between R_h, R_u, and TolR in Figure 3.

Figure 3: The effect of the tolerate rate on hit rate and unrelated rate for the MLP and the GMemNN models.

The plot indicates that increasing TolR from 1 to 10 can significantly improve the hit rates. However, the improvement vanishes after the 15th query because having too many queried symptoms makes the dialog states noisy. When the TolR is less than 10, the performance gap between the MLP and GMemNN models is not as large as in the cases where TolR is larger than 10. There are two reasons for this phenomenon: I. some symptoms are queried by human doctors very frequently and they are equally easy for both models to predict; II. the GMemNN has a better ability to model and process noisy inputs.
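The metrics in Eq. 12 are straightforward to compute per conversation; a minimal sketch, with variable names mirroring the text (N_i implicit symptoms, N'_i detected, N_u unrelated queries, N dialog turns):

```python
# Conversational metrics from Eq. 12.
def hit_rate(n_detected, n_implicit):
    return n_detected / n_implicit            # R_h = N'_i / N_i

def unrelated_rate(n_unrelated, n_turns):
    return n_unrelated / n_turns              # R_u = N_u / N

def f1(r_h, r_u):
    # Harmonic mean of the hit rate and (1 - unrelated rate),
    # so both high coverage and low wasted queries are rewarded.
    return 2 * r_h * (1 - r_u) / (r_h + 1 - r_u)
```

For example, a conversation that detects 2 of 4 implicit symptoms in 4 turns, 1 of which is unrelated, gives R_h = 0.5, R_u = 0.25, and F1 = 0.6.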
In this work, we propose a new task: detecting implicit symptoms of patients with an automatic dialog system. We construct the system with a dialog action prediction module and a symptom query module. We first implement and evaluate a baseline system based on multi-layer perceptrons (MLPs). To improve the performance of the system, we annotate a medical-domain knowledge graph and propose the graph memory network (GMemNN) model. We systematically evaluate and compare both models with unit tasks and conversations. We also study how the number of dialog turns affects the performance of the systems. Experiments show that both models can detect more than 60% of the implicit symptoms using a limited number of dialog turns, which significantly outperforms the state-of-the-art automatic diagnosis system. In future work, we will expand the knowledge graph and aim to assist human doctors by making the clinical interview process more efficient.

References
Antoine Bordes, Y-Lan Boureau, and Jason Weston. 2016. Learning end-to-end goal-oriented dialog. arXiv preprint arXiv:1605.07683.

Prithvijit Chattopadhyay, Deshraj Yadav, Viraj Prabhu, Arjun Chandrasekaran, Abhishek Das, Stefan Lee, Dhruv Batra, and Devi Parikh. 2017. Evaluating visual conversational agents via cooperative human-AI games. In Fifth AAAI Conference on Human Computation and Crowdsourcing.

Harm De Vries, Florian Strub, Sarath Chandar, Olivier Pietquin, Hugo Larochelle, and Aaron Courville. 2017. GuessWhat?! Visual object discovery through multi-modal dialogue. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5503–5512.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Jesse Dodge, Andreea Gane, Xiang Zhang, Antoine Bordes, Sumit Chopra, Alexander Miller, Arthur Szlam, and Jason Weston. 2015. Evaluating prerequisite qualities for learning end-to-end dialog systems. arXiv preprint arXiv:1511.06931.

Maryam Fazel-Zarandi, Shang-Wen Li, Jin Cao, Jared Casale, Peter Henderson, David Whitney, and Alborz Geramifard. 2017. Learning robust dialog policies in noisy environments.

Alex Graves, Greg Wayne, and Ivo Danihelka. 2014. Neural Turing machines. arXiv preprint arXiv:1410.5401.

Alex Graves, Greg Wayne, Malcolm Reynolds, Tim Harley, Ivo Danihelka, Agnieszka Grabska-Barwińska, Sergio Gómez Colmenarejo, Edward Grefenstette, Tiago Ramalho, John Agapiou, et al. 2016. Hybrid computing using a neural network with dynamic external memory. Nature, 538(7626):471.

Thomas N Kipf and Max Welling. 2016. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907.

Hongyin Luo, Shang-Wen Li, and James Glass. 2020. Prototypical Q networks for automatic conversational diagnosis and few-shot new disease adaption. arXiv preprint arXiv:2005.11153.

Hongyin Luo, Mitra Mohtarami, James Glass, Karthik Krishnamurthy, and Brigitte Richardson. 2019. Integrating video retrieval and moment detection in a unified corpus for video question answering. Proc. Interspeech 2019, pages 599–603.

Alexander Miller, Adam Fisch, Jesse Dodge, Amir-Hossein Karimi, Antoine Bordes, and Jason Weston. 2016. Key-value memory networks for directly reading documents. arXiv preprint arXiv:1606.03126.

Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. 2013. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602.

Mitra Mohtarami, Ramy Baly, James Glass, Preslav Nakov, Lluís Màrquez, and Alessandro Moschitti. 2018. Automatic stance detection using end-to-end memory networks. arXiv preprint arXiv:1804.07581.

Kishore A Papineni, Salim Roukos, and Robert T Ward. 2001. Natural language task-oriented dialog manager and method. US Patent 6,246,981.

Trang Pham, Truyen Tran, and Svetha Venkatesh. 2018. Graph memory networks for molecular activity prediction. Pages 639–644. IEEE.

Franco Scarselli, Marco Gori, Ah Chung Tsoi, Markus Hagenbuchner, and Gabriele Monfardini. 2008. The graph neural network model. IEEE Transactions on Neural Networks, 20(1):61–80.

Stephanie Seneff and Joseph Polifroni. 2000. Dialogue management in the Mercury flight reservation system. In Proceedings of the 2000 ANLP/NAACL Workshop on Conversational Systems - Volume 3, pages 11–16. Association for Computational Linguistics.

Junyuan Shang, Cao Xiao, Tengfei Ma, Hongyan Li, and Jimeng Sun. 2019. GAMENet: Graph augmented memory networks for recommending medication combination. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 1126–1133.

Sainbayar Sukhbaatar, Jason Weston, Rob Fergus, et al. 2015. End-to-end memory networks. In Advances in Neural Information Processing Systems, pages 2440–2448.

Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. 2017. Graph attention networks. arXiv preprint arXiv:1710.10903.

Zhongyu Wei, Qianlong Liu, Baolin Peng, Huaixiao Tou, Ting Chen, Xuanjing Huang, Kam-Fai Wong, and Xiangying Dai. 2018. Task-oriented dialogue system for automatic diagnosis. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 201–207.

Tsung-Hsien Wen, David Vandyke, Nikola Mrksic, Milica Gasic, Lina M Rojas-Barahona, Pei-Hao Su, Stefan Ultes, and Steve Young. 2016. A network-based end-to-end trainable task-oriented dialogue system. arXiv preprint arXiv:1604.04562.

Lin Xu, Qixian Zhou, Ke Gong, Xiaodan Liang, Jianheng Tang, and Liang Lin. 2019. End-to-end knowledge-routed relational dialogue system for automatic diagnosis. arXiv preprint arXiv:1901.10623.