Diagnosis of Acute Poisoning Using Explainable Artificial Intelligence

Abstract

Medical toxicology is the clinical specialty that treats the toxic effects of substances, be it an overdose, a medication error, or a scorpion sting. The volume of toxicological knowledge and research has, as with other medical specialties, outstripped the ability of the individual clinician to entirely master and stay current with it. The application of machine learning techniques to medical toxicology is challenging because initial treatment decisions are often based on a few pieces of textual data and rely heavily on prior knowledge. ML techniques often do not represent knowledge in a way that is transparent for the physician, raising barriers to usability. Rule-based systems and decision tree learning are more transparent approaches, but often generalize poorly and require expert curation to implement and maintain. Here, we construct a probabilistic logic network to represent a portion of the knowledge base of a medical toxicologist. Our approach transparently mimics the knowledge representation and clinical decision-making of practicing clinicians. The software, dubbed Tak, performs comparably to humans on straightforward cases and intermediate difficulty cases, but is outperformed by humans on challenging clinical cases. Tak outperforms a decision tree classifier at all levels of difficulty. Probabilistic logic provides one form of explainable artificial intelligence that may be more acceptable for use in healthcare, if it can achieve acceptable levels of performance.

arXiv [cs.AI]

Michael Chary, MD PhD*; Ed W Boyer, MD PhD; Michele Burns, MD, MS

Weill Cornell Medical Centre, NY, NY; Brigham and Women’s Hospital, Boston, Massachusetts; Boston Children’s Hospital, Boston, Massachusetts

* Corresponding author: [email protected]

Keywords: Explainable Artificial Intelligence, Machine Learning, Medical Toxicology

Abstract

Medical toxicology is the clinical specialty that treats the toxic effects of substances, for example, an overdose, a medication error, or a scorpion sting. The volume of toxicological knowledge and research has, as with other medical specialties, outstripped the ability of the individual clinician to entirely master and stay current with it. The application of machine learning/artificial intelligence (ML/AI) techniques to medical toxicology is challenging because initial treatment decisions are often based on a few pieces of textual data and rely heavily on prior knowledge, experience, and expertise. ML/AI techniques, moreover, often do not represent knowledge in a way that is transparent for the physician, raising barriers to usability. Rule-based systems are more transparent approaches, but often generalize poorly and require expert curation to implement and maintain. Here, we construct a probabilistic logic network to represent a portion of the knowledge base of a medical toxicologist. Our approach transparently mimics the knowledge representation and clinical decision-making of practicing clinicians and requires minimal maintenance. The software, dubbed Tak, performs comparably to humans on straightforward cases and intermediate difficulty cases, but is outperformed by humans on challenging clinical cases. Tak outperforms a decision tree classifier at all levels of difficulty. Probabilistic logic provides one form of explainable artificial intelligence that may be acceptable for use in healthcare.

The goal of this project is to represent physician decision making in acute conditions in medical toxicology using explainable and transparent computational techniques. The scope of biomedical knowledge is too vast and the rate of increase of that knowledge too rapid for an individual physician to bring all relevant knowledge to bear on the diagnosis and treatment of an illness. The traditional response is the formation of teams of specialists, but this can promote fragmentation of care and increases the cost of delivering healthcare as well as the chances for miscommunication. Nor does specialization remove elements of human fatigue or latent bias. Machine learning and artificial intelligence (ML/AI) approaches are trained on data sets larger than any physician could encounter in training. ML/AI algorithms can outperform physicians on specific tasks, such as predicting the likelihood of response to a chemotherapeutic regimen [1] or diagnosing pneumonia from a chest X-ray [2], but perform dismally in poorly-defined tasks such as constructing a differential diagnosis, a list of diagnoses ranked by the likelihood of explaining a patient’s current condition.

A barrier to integrating ML/AI into healthcare is the difference between how ML/AI and physicians evaluate clinical data. Many current ML/AI approaches look for quantitative patterns across large data sets. They ignore prior knowledge (i.e., information and experience physicians acquire in medical school and residency). Pretrained models and transfer learning incorporate statistical relationships from prior knowledge, but do not explicitly represent these relationships in terms readily interpretable by physicians. Reliability plots and counterfactual reasoning provide a way to understand the internal reasoning of algorithms that classify pictures of biopsies [3], but it is not clear how to apply these methods to textual data. Probabilistic logic provides a way to combine statistical learning with symbolic reasoning, a way to combine machine capacity with human intuition.

The diagnosis and treatment of a poisoned patient begins with a rapid determination of whether the patient requires immediate intervention to prevent death. This determination is usually made at the patient’s bedside by a physical examination and, if the patient’s mental status allows, a brief discussion with the patient. In these critical situations, laboratory tests (e.g., serum or urine drug concentrations) are rarely available readily enough to inform this determination. Opioids can slow breathing within minutes of ingestion. The metabolites are not detected in urine for hours, but the effect of the drug needs to be immediately reversed to prevent death from lack of oxygen. Serum concentrations of opioids and sedatives often have to be sent to specialized labs and the results are not available for days to weeks.

The medical toxicologist relies on patterns of findings on bedside evaluation, termed toxidromes, that suggest a life-threatening ingestion from a class of drugs, e.g., a sedative or a stimulant. Table 1 displays the six canonical toxidromes, the features physicians extract, and the feature values for each toxidrome. The bedside evaluation a medical toxicologist performs resembles, in the language of ML/AI, feature extraction and then multinomial classification with a complex loss function. From this lens, a toxidrome is a set of ranges of values of features that define decision regions for class membership. The word toxidrome refers to a decision region. Toxidromes are intended to accurately identify severe poisonings that will respond to treatment, but may misclassify milder poisonings. This misclassification is acceptable clinically because mild poisonings, in general, do not require any specific immediate treatment.

The names of the toxidromes reflect the biochemical pathways excessively activated or blocked by classes of drugs. The anticholinergic toxidrome results from blockade of the acetylcholine receptor family; the cholinergic toxidrome from activation of the acetylcholine receptor family; the opioid toxidrome from activation of the μ-opioid receptors; the sedative-hypnotic toxidrome from activation of the GABA (γ-aminobutyric acid) receptors or blockade of glutamate receptors; and the sympathomimetic toxidrome from activation at adrenaline or noradrenaline receptors [4]. Serotonin toxicity is thought to result from excess activation at the 5-HT A receptor [5]. The clinical findings from excess serotonin activation are canonically termed serotonin syndrome or serotonin toxicity, but the term is used equivalently to a toxidrome.

Table 1. Six canonical toxidromes. Signs, in column order: HR, BP, Pup, Sec, Temp, RR, MS.

Anticholinergic: ⇑ • ⇓ ⇑ D
Cholinergic: ⇓ • ⇑ ⇓ S
Opioid: • ⇓ S
Sedative-Hypnotic: S
Serotonin Toxicity: ⇑ ⇑ ⇑ A
Sympathomimetic: ⇑ ⇑ • ⇑ ⇑ A

HR, heart rate; BP, blood pressure; Pup, pupil diameter, size of bullet represents increased or decreased pupil diameters; Sec, secretions; Temp, temperature; RR, respiratory rate; MS, mental status; D, delirious; S, sedated; A, agitated. Empty cell indicates expectation of no abnormality for that sign.
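The decision-region view of a toxidrome can be made concrete with a small sketch. The feature names and values below are illustrative assumptions drawn loosely from the signs discussed in the text; they are not the knowledge base used by Tak. A toxidrome is treated as a set of required feature values, and a patient falls inside the region when every constrained feature matches.

```python
# Illustrative sketch only: a toxidrome as a region in a discretized feature space.
# Feature names and values are assumptions, not the study's rule set.
TOXIDROMES = {
    "opioid": {
        "pupils": "small",
        "respiratory_rate": "decreased",
        "mental_status": "sedated",
    },
    "cholinergic": {
        "pupils": "small",
        "secretions": "increased",
        "respiratory_rate": "decreased",
        "mental_status": "sedated",
    },
    "sedative_hypnotic": {"mental_status": "sedated"},
}


def in_region(findings, region):
    """Return True if every feature the toxidrome constrains has the required value."""
    return all(findings.get(feature) == value for feature, value in region.items())


def candidate_toxidromes(findings):
    """List, alphabetically, every toxidrome whose region contains the findings."""
    return sorted(name for name, region in TOXIDROMES.items() if in_region(findings, region))


patient = {
    "pupils": "small",
    "respiratory_rate": "decreased",
    "mental_status": "sedated",
    "secretions": "usual",
}
candidates = candidate_toxidromes(patient)
```

Because regions can overlap, a single patient may fall into more than one toxidrome, which is one reason a probabilistic ranking is useful on top of a purely logical match.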

Machine learning techniques (e.g., naive Bayesian classifiers, neural networks, decision trees) have been applied to diagnosis in many fields of medicine [6]. Here we review the application of probabilistic logic networks that analyze text to perform medical diagnosis. We exclude algorithms that analyze images, as might be done in radiology or pathology. Images may be helpful in identifying whether flora or fauna are poisonous. Support vector machines have been productively applied to the automatic identification of poisonous mushrooms [7] and plants [8, 9] from images. But image data are often not available to the toxicologist.

A combination of Bayesian networks and an ontology has been used to diagnose osteoporosis, achieving a 72% accuracy [10]. A Markov logic network has been implemented for diagnosis of medical conditions from Chinese-language medical records [11], but there was no assessment of its performance. The construction of a fuzzy Bayesian network for medical diagnosis has also been proposed, but its performance was not assessed [12].

Software that combined rules and probabilities was developed for medicine as early as the 1970s, for example MYCIN [13]. Research on neural networks eclipsed work on rule-based systems because neural networks could operate with inexact matching, performed more accurately, and scaled more easily and rapidly. The use of neural networks in clinical practice is limited, however, owing, in part, to the difficulty of a clinician interacting with something that “doesn’t speak my language” [14].

To the authors’ knowledge there have been no prior publications on the application of probabilistic logic networks to the diagnosis of the poisoned patient. The Chemical Hazards and Emergency Medical Management branch (CHEMM) of the US Department of Health and Human Services has developed the CHEMM Intelligent Syndromes Tool. This tool is available via a web-based interface, but no application programming interface or other scalable endpoint is provided. There are no publications describing its implementation, but it appears to be based on FALCON [15], a deterministic decision tree to coordinate responses against attacks with chemical weapons.

Probabilistic logic networks (PLNs) aim to represent knowledge about the world and allow inference under uncertainty using a combination of predicate logic, symbolic reasoning, and statistical inference [16]. A PLN consists of a set of pairs of a probability and a logical statement.

$$0.4 :: \mathrm{somnolent}(x) \Rightarrow \mathrm{sedative\_hypnotic}(x) \tag{1}$$

Equation (1) represents the concept that in 40% of possible worlds if a patient is somnolent (excessively sleepy and lethargic) then the patient may be poisoned by a medication from the sedative/hypnotic class.

The fraction associated with each statement represents the fraction of worlds in which the logical statement is true. The statement is assumed to be false in all other worlds. The Supplementary Material provides more detail on the underlying mathematics and our software implementation.

Rule-based systems need experts to create and curate the rules as well as to adapt the rules to include new knowledge or apply the system to unfamiliar types of data. This need for curation may limit the speed of development. It also provides an opportunity for physicians to contribute to software development, which may increase use in clinical practice. For a more complete introduction to probabilistic logic we refer the reader to [17] and for software implementation to [18].

A significant advantage of rule-based approaches over other machine learning approaches for many areas of medicine is that rule-based approaches do not require large training data sets. Many subspecialties within medicine diagnose and treat rare diseases for which large data sets to explore all methods of diagnosis and treatment are unlikely to exist. The rules, in addition, can be a distillation of the received knowledge of a field, or a combination of this distillation and relationships inferred from large data sets.

Decision tree (DT) learning provides a competitive alternative to probabilistic logic networks. DT classifiers have been used to predict the risk of breast cancer [19] and heart disease [20], diagnose diabetes [21], and classify electroencephalogram outputs in patients with epilepsy [22]. DTs are robust against collinearity. This is important in toxicology, where poisonings share overlapping features. For example, an elevated heart rate can be seen in the anticholinergic and sympathomimetic toxidromes as well as in the serotonin syndrome. The sympathomimetic toxidrome and serotonin syndrome also both have elevated blood pressure.

DTs and PLNs also require less training data than neural networks. This is advantageous for medical applications, where curated data sets are often small. All of the studies discussed above were developed on 150 or fewer patient presentations.

A limitation of decision trees is the tendency to overfit, which would correspond in clinical practice to bending diagnostic criteria so that each patient has at least one diagnosis. PLNs avoid this overfitting because they are not trying to minimize the number of undiagnosed patients.

We created 34 probabilistic logic rules based on the consensus of three medical toxicologists to describe the medical knowledge base used to diagnose acute poisoning. We named these rules and the underlying implementation in ProbLog, Tak. We restricted ourselves to developing rules that described features that could be observed during one evaluation at a patient’s bedside without laboratory testing. We followed these restrictions to assess our algorithm’s performance in the most time-sensitive aspect of medical toxicology.

The rules were constructed as follows. We treated each finding elicited by the toxicologist as a predicate. A predicate is a function that returns only True or False, depending on its input.

    0.10::salivation(X,decreased);
    0.10::salivation(X,increased);
    0.80::salivation(X,usual).

Listing 1. Example rule in probabilistic logic that represents knowledge from medical toxicology. Fraction preceding each function denotes the fraction of worlds in which that function is true. Numbers sum to one across values.

Listing 2. Example rule in probabilistic logic linking a symptom to toxidromes. Expression preceding each goal in disjunction (sequence of statements separated by semicolons) is evaluated when conditions of goal are satisfied.

    hasToxidrome(X,cholinergic) :-
        salivation(X, increased),
        urination(X, increased),
        pupilDiameter(X,small).

Listing 3. Example Expression of Diagnosis of Toxidrome as Prolog Goal.
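The possible-worlds reading of these listings can be illustrated by brute-force enumeration. The sketch below is written in Python rather than ProbLog and is not the authors’ implementation. The salivation distribution copies Listing 1; the urination and pupilDiameter distributions, and the helper names `cholinergic` and `query_probability`, are hypothetical placeholders. Each world picks one value per finding, its weight is the product of the chosen probabilities, and the query probability is the total weight of worlds in which the Listing 3 conjunction holds.

```python
from itertools import product

# One annotated disjunction per finding. "salivation" copies Listing 1;
# the other two distributions are hypothetical placeholders.
findings = {
    "salivation": {"decreased": 0.10, "increased": 0.10, "usual": 0.80},
    "urination": {"increased": 0.20, "usual": 0.80},
    "pupilDiameter": {"small": 0.25, "usual": 0.50, "large": 0.25},
}


def cholinergic(world):
    """Conjunction from Listing 3: all three findings must hold in the world."""
    return (
        world["salivation"] == "increased"
        and world["urination"] == "increased"
        and world["pupilDiameter"] == "small"
    )


def query_probability(findings, query):
    """Sum the weights of all possible worlds in which the query is true."""
    names = list(findings)
    total = 0.0
    # Iterating a dict yields its keys, so each world is one value per finding.
    for values in product(*(findings[name] for name in names)):
        world = dict(zip(names, values))
        weight = 1.0
        for name, value in world.items():
            weight *= findings[name][value]
        if query(world):
            total += weight
    return total


p_cholinergic = query_probability(findings, cholinergic)
```

Enumeration grows exponentially with the number of findings, so it is useful only as a way to check intuition on small examples; probabilistic logic engines use more efficient inference.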

For example, the predicate salivation(X,increased) is true if patient X demonstrates increased salivation. We considered all predicates representing clinical findings to require two inputs, the patient and the value of the feature, usually {present|absent} or {increased|normal|decreased}. These values reflect a discretization of underlying continuous variables that mirrors a common pattern of communication with information compression between physicians.

In Listing 1, the number before the two colons represents the probability with which the predicate is true. The semicolon represents logical exclusive disjunction, i.e., A;B means “A or B but not both”. A period terminates each logical statement. Listing 2 demonstrates assigning the likelihood of one toxidrome over another given that the patient is manifesting a symptom. A colon followed by a dash represents unidirectional implication, i.e., A :- B means that A is true if B is true. The function mentalStatus(X,agitated) is true if patient X is agitated. The function hasToxidrome(X,Y) is true if patient X manifests toxidrome Y.

The relative probabilities across rules were chosen to reflect the perceived relative prevalence of each clinical finding. We used the most recent annual report from the American Association of Poison Control Centers on the relative prevalence of each poisoning in the US to estimate the prior probability of each toxidrome. The prevalence of many physical findings, for example hypersalivation in the general population, is not known. Nor is it known that a patient is exactly four times more likely to suffer from a sympathomimetic toxidrome as opposed to serotonin toxicity if the patient becomes agitated after an unknown ingestion. The magnitudes were chosen, in conjunction with the consensus of experts, to reflect implicit components of clinical reasoning.
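As a rough illustration of why relative magnitudes matter more than absolute ones, the sketch below combines prior weights with rule weights and normalizes the result. The equal priors and the 0.8/0.2 weights are hypothetical numbers chosen only to reproduce the four-to-one ratio mentioned above; they are not values from Tak, and `posterior` is a helper introduced for this example.

```python
def posterior(prior, weight):
    """Normalize prior * weight over the candidate toxidromes."""
    unnormalized = {name: prior[name] * weight[name] for name in prior}
    total = sum(unnormalized.values())
    return {name: value / total for name, value in unnormalized.items()}


# Hypothetical equal priors over two candidate toxidromes.
prior = {"sympathomimetic": 0.5, "serotonin_toxicity": 0.5}

# Hypothetical rule weights for an agitated patient, in a 4:1 ratio.
agitated = {"sympathomimetic": 0.8, "serotonin_toxicity": 0.2}

# Rescaling every weight by the same constant leaves the posterior unchanged.
agitated_rescaled = {name: 0.1 * value for name, value in agitated.items()}

post = posterior(prior, agitated)
post_rescaled = posterior(prior, agitated_rescaled)
```

Under these assumptions the ranking of toxidromes depends only on the ratios between weights, which is the sense in which expert-chosen magnitudes can still be informative when absolute prevalences are unknown.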

We generated 300 simulated toxidrome presentations as follows. We chose 300 as the minimum number needed to detect whether the inter-rater reliability between human consensus and Tak was greater than 0.2. We took a difference in inter-rater reliability of 0 . k . One toxidrome was the intended toxidrome. The other became the distractor toxidrome. We created the presentation by choosing 5 − k signs from the intended toxidrome and k signs from the distractor toxidrome. The parameter k models the variability of clinical presentation. A difficulty of 0 simulates an unequivocal presentation, where all findings are canonically associated with the intended toxidrome. A difficulty of 2 simulates a mixed picture, as might result from the ingestion of many substances with conflicting effects. To the authors’ knowledge, this is the first data set created to evaluate the inter-rater reliability of diagnosis in medical toxicology.
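The mixing step described above can be sketched as follows. This is a minimal illustration, not the authors’ Algorithm 1: the sign lists are abstract placeholders (for example `opioid_sign_1`) rather than clinical findings, and the names `SIGNS`, `N_SIGNS`, and `simulate_presentation` are assumptions introduced here.

```python
import random

TOXIDROME_NAMES = [
    "anticholinergic",
    "cholinergic",
    "opioid",
    "sedative_hypnotic",
    "serotonin_toxicity",
    "sympathomimetic",
]

# Placeholder signs: six abstract tokens per toxidrome, not real clinical findings.
SIGNS = {name: [f"{name}_sign_{i}" for i in range(1, 7)] for name in TOXIDROME_NAMES}

N_SIGNS = 5  # signs per presentation, as in "5 - k from intended, k from distractor"


def simulate_presentation(intended, distractor, k, rng):
    """Draw N_SIGNS - k signs from the intended toxidrome and k from the distractor."""
    if intended == distractor:
        raise ValueError("intended and distractor toxidromes must differ")
    if not 0 <= k <= N_SIGNS:
        raise ValueError("k must lie between 0 and N_SIGNS")
    own = rng.sample(SIGNS[intended], N_SIGNS - k)
    other = rng.sample(SIGNS[distractor], k)
    return own + other


rng = random.Random(0)  # fixed seed so the example is reproducible
case = simulate_presentation("opioid", "cholinergic", 2, rng)
```

A realistic generator would also need to resolve conflicts when the two toxidromes assign different values to the same feature; this sketch sidesteps that by using abstract tokens.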

Tak with expert consensus, a decision tree, and recovery ofthe intended toxidrome generated by Algorithm 1. We presented the same 300 cases describedabove to

Tak as well as two human medical toxicologists.

Tak assigned the most likely rating toeach toxidrome based on the posterior probabilities calculated by the probabilistic logic engine.The human raters labeled each presentation with the toxidrome that most accurately captured thepresentation. We excluded presentations if either rater felt that no single toxidrome captured thepresentation or that the presentation was not clinically plausible.We use the term inferred toxidrome to denote the toxidrome that

Tak , the decision tree, orthe human raters inferred from the case presentation. We quantiﬁed inter-rater reliability usinga multinomial extension of Cohen’s κ . Cohen’s κ ranges between 0 and 1 where 0 indicates thelevel of agreement expected by chance and 1 indicates perfect agreement. Cohen’s κ is moreclinically relevant than accuracy because Cohen’s κ also considers the discordance in errors betweenraters. We felt it important that Tak perform similarly to humans in both accurate and inaccuratediagnoses.We used the inter-rater reliability between

Tak ’s inferred toxidrome and the intended toxidromeas a measure of best performance and between

Tak ’s inferred toxidrome and the consensus of thehuman raters as a measure of actual performance. We took the inter-rater reliability between theconsensus of the human raters and the intended toxidrome as a benchmark for actual performance.To provide a machine learning benchmark we trained a decision tree classiﬁer on the samecases. We trained one decision tree classiﬁer for each level of diﬃculty. Training one DT foreach level of diﬃculty is likely to lead to overﬁtting, but also will overestimate the decision tree’sperformance, providing a more stringent benchmark against which to evaluate

Tak . The treeswere not averaged nor was any random forest method used. We used the sklearn implementation,DecisionTreeClassiﬁer, with the maximum depth set to 3.
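For readers unfamiliar with the statistic, the following is a minimal sketch of Cohen’s κ for two raters and any number of categories, computed as (observed agreement − chance agreement) / (1 − chance agreement). It is a generic textbook formulation, not necessarily the exact multinomial extension used in the study, and the label lists are made-up examples.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labeling the same cases with nominal categories."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both raters must label the same non-empty set of cases")
    n = len(labels_a)
    # Observed agreement: fraction of cases on which the raters agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the raters' marginal frequencies, summed over labels.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return float("nan")  # undefined when both raters use one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Made-up labels for four cases.
rater_1 = ["opioid", "opioid", "cholinergic", "cholinergic"]
rater_2 = ["opioid", "cholinergic", "cholinergic", "cholinergic"]
kappa = cohens_kappa(rater_1, rater_2)
```

In this toy example the raters agree on three of four cases (0.75), chance agreement is 0.5, and κ is therefore 0.5.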

Figure 1 shows the overall organization of our study. Eighteen cases were omitted because the human raters could not reach consensus on them and both judged them medically implausible, decreasing the number of cases from 300 to 282.

[Figure 1: flow diagram. Generate data: 300 cases generated according to Algorithm 1; included (n = 282); excluded as clinically implausible (n = 18). Evaluate Tak’s performance: Actual, Tak vs. human consensus; Best, Tak vs. intended toxidrome; Benchmarks, human consensus vs. intended toxidrome and DT vs. intended toxidrome.]

Figure 1. Study flow. Left: Generation of data using Algorithm 1. Right: Evaluation of data. Actual denotes actual performance; best, best performance; DT, decision tree; Tak, name of combination of probabilistic logic algorithm and knowledge base.

Figure 2 summarizes the performances.

Tak’s peak and actual performance were comparable. As the complexity of the case increased Tak’s performance decreased, as did the performance of human experts. In the most difficult cases, Tak’s accuracy approached that of the decision tree classifier, approximately half of human performance. In all cases, Tak’s performance was better than chance (κ = 0) and at least as good as the decision tree, our benchmark for current approaches.

The inter-rater reliabilities between the toxidromes Tak inferred and the intended toxidromes were κ = 0 . . . κ = 0 . . . Tak developed difficulty distinguishing among the anticholinergic, cholinergic, and sedative-hypnotic toxidromes as well as between the opioid and cholinergic toxidromes.

Our benchmark for usual performance was the inter-rater reliability between the consensus of human raters and the intended toxidromes. The inter-rater reliability between the consensus of the human raters and the labels predicted by Tak was κ = 0 . . .

To compare Tak’s performance against other machine learning approaches, we calculated the inter-rater reliability between the ground truth labels and a decision tree classifier. In straightforward presentations (difficulty, 0) Tak outperformed the decision tree ( κ DT = 0 . κ Tak = . Tak outperformed the decision tree ( κ DT = 0 . κ Tak = . Tak performed comparably to the decision tree ( κ DT = 0 . κ Tak = . .

Figure 2. Human and Computer Agreement. Y-axis denotes Cohen’s κ. X-axis denotes pair: Human raters, Intended (actual) labels, and Inferred (Tak) labels. Hue indicates difficulty of presentation, which refers to the number of noncanonical symptoms in the presentation.

As the difficulty of presentations increased both Tak and the human raters decreased in accuracy. This decrease in accuracy reflects the construction of the synthetic data set and the limits of resolution of toxidromes. The synthetic data were constructed to have three levels of difficulty, corresponding to clinical reality. Some patients may ingest or be exposed to a large amount of one substance, leading to an unequivocal presentation. Others may ingest or be exposed to a mixture of substances with a variety of stimulating and sedating effects, the balance of which shifts over time as the chemicals are distributed throughout the body and metabolized at different rates.

The authors could find no published formal analysis of the discriminative limits of toxidromes, but it stands to reason that 6 features may not be able to accurately classify 6 categories if all features do not have values for all categories.

Tak confused the anticholinergic and sympathomimetic toxidromes and the cholinergic, opioid, and sedative-hypnotic toxidromes. This mimics difficulties that medical toxicologists have. The anticholinergic and sympathomimetic toxidromes share overlapping features (increased heart rate, increased blood pressure, and agitated mental status). The cholinergic, opioid, and sedative-hypnotic toxidromes share overlapping features (sedated mental status, and in the case of the cholinergic and opioid toxidromes slowed breathing and small pupils). Tak was able to distinguish serotonergic toxicity from the anticholinergic and cholinergic toxidromes because serotonergic toxicity has unique features. Future work can explore the sensitivity of classification to each rule.

The goal of this study was to use probabilistic logic to model medical decision-making in medical toxicology. We derived probabilistic logic rules from expert consensus to represent background medical knowledge and constructed a probabilistic logic network, Tak. We evaluated Tak’s performance on a synthetic data set and compared its performance against the consensus of expert clinicians and a decision tree classifier.

The misclassification errors made by Tak resemble those made by humans. The decrease in inter-rater reliability for more difficult cases arose, in part, from the increased chance of the introduction of nondiscriminative physical findings. Toxidromes share overlapping features. For example, an elevated heart rate can be a sign of the anticholinergic or sympathomimetic toxidrome and, indeed, it can be difficult for clinicians to distinguish these two processes without further information. Moreover, in a poisoned patient the heart rate and blood pressure usually move in tandem, both rising or both falling. This collinearity is such a hallmark of poisoned patients that its absence can prompt medical toxicologists to consider nontoxicological causes of the patient’s condition.

Tak’s error rate needs improvement before clinical use. Even if Tak never reaches accuracy comparable to human physicians across all levels of difficulty, it could be used to automate the processing of more routine cases, where its performance is comparable to humans, freeing up physician time to deal with more complex cases.

The most significant limitation of this paper is the use of synthetic rather than actual data to evaluate our approach. We used synthetic data because no clinical data set currently exists. We took steps to generate realistic data, creating cases at 3 levels of complexity to capture the heterogeneity of clinical data. We reviewed all cases with medical toxicologists for plausibility. Our data set does not fully evaluate Tak’s performance. It would require … unique cases to fully evaluate all possible patient presentations, even clinically implausible ones. Our evaluation of actual performance is also limited by the omission of cases on which the raters could not reach consensus. The full value of our approach in helping physicians treat poisoned patients will not be known until our approach can be evaluated on actual clinical data.

For most variables, empiric probability distributions were not available. This renders the absolute values of the calculated posterior probabilities uninformative even if the relative magnitude is still informative.

Physicians must trust an AI-based system to include it in their evaluation and treatment of patients. An algorithm can earn that trust through proficiency on complex cases and transparency. Tak demonstrates transparent clinical reasoning. This transparency, if preserved in more accurate models, may remove barriers to the use of AI approaches in clinical decision making. Even if a more detailed analysis of the limits of PLNs suggests an unimprovably poor performance on complex cases, a transparent AI system may be useful by automating aspects of routine cases and in doing so freeing up expert time for more complicated cases.

The main contribution of this paper is the demonstration that probabilistic logic networks can model toxicologic knowledge in a way that transparently mimics physician thought. An additional contribution of this paper is that it is one of the first, to the authors’ knowledge, to quantify the inter-rater reliability of physicians in diagnosing a specific type of poisoned patient. Yet another contribution of this paper is the development of a data set that unsupervised or weakly supervised techniques could use to explore other ways of representing knowledge in this domain.

References

[1] A. Rajkomar, J. Dean, and I. Kohane, “Machine learning in medicine,” New England Journal of Medicine, vol. 380, no. 14, pp. 1347–1358, 2019.
[2] P. Rajpurkar, J. Irvin, K. Zhu, B. Yang, H. Mehta, T. Duan, D. Ding, A. Bagul, C. Langlotz, K. Shpanskaya, et al., “Chexnet: Radiologist-level pneumonia detection on chest x-rays with deep learning,” arXiv preprint arXiv:1711.05225, 2017.
[3] J. J. Thiagarajan, P. Sattigeri, D. Rajan, and B. Venkatesh, “Calibrating healthcare ai: Towards reliable and interpretable deep predictive models,” arXiv preprint arXiv:2004.14480, 2020.
[4] C. P. Holstege and H. A. Borek, “Toxidromes,” Critical Care Clinics, vol. 28, no. 4, pp. 479–498, 2012.
[5] E. W. Boyer and M. Shannon, “The serotonin syndrome,” New England Journal of Medicine, vol. 352, no. 11, pp. 1112–1120, 2005.
[6] I. Kononenko, “Machine learning for medical diagnosis: history, state of the art and perspective,” Artificial Intelligence in Medicine, vol. 23, no. 1, pp. 89–109, 2001.
[7] A. Wibowo, Y. Rahayu, A. Riyanto, and T. Hidayatulloh, “Classification algorithm for edible mushroom identification,” in , pp. 250–253, IEEE, 2018.
[8] D. S. Prasvita and Y. Herdiyeni, “Medleaf: mobile application for medicinal plant identification based on leaf image,” International Journal on Advanced Science, Engineering and Information Technology, vol. 3, no. 2, pp. 5–8, 2013.
[9] S. Sharma and C. Gupta, “A review of plant recognition methods and algorithms,” International Journal of Innovative Research in Advanced Engineering, vol. 2, no. 6, pp. 111–116, 2015.
[10] P. Agarwal, R. Verma, and A. Mallik, “Ontology based disease diagnosis system with probabilistic inference,” in , pp. 1–5, IEEE, 2016.
[11] J. Jiang, X. Li, C. Zhao, Y. Guan, and Q. Yu, “Learning and inference in knowledge-based probabilistic model for medical diagnosis,” Knowledge-Based Systems, vol. 138, pp. 58–68, 2017.
[12] V. Zarikas, E. Papageorgiou, D. Pernebayeva, and N. Tursynbek, “Medical decision support tool from a fuzzy-rules driven bayesian network,” in ICAART (2), pp. 539–549, 2018.
[13] E. H. Shortliffe, “Mycin: a rule-based computer program for advising physicians regarding antimicrobial therapy selection,” tech. rep., STANFORD UNIV CALIF DEPT OF COMPUTER SCIENCE, 1974.
[14] C. D. Naylor, “On the prospects for a (deep) learning health care system,” JAMA, vol. 320, no. 11, pp. 1099–1100, 2018.
[15] S. P. Frysinger, M. L. Deaton, A. G. Gonzalo, A. M. VanHorn, and M. A. Kirk, “The falcon decision support system: Preparing communities for weapons of opportunity,” Environmental Modelling & Software, vol. 22, no. 4, pp. 431–435, 2007.
[16] B. Goertzel, M. Iklé, I. F. Goertzel, and A. Heljakka, Probabilistic logic networks: A comprehensive framework for uncertain inference. Springer Science & Business Media, 2008.
[17] A. Kimmig, B. Demoen, L. De Raedt, V. S. Costa, and R. Rocha, “On the implementation of the probabilistic logic programming language problog,” Theory and Practice of Logic Programming, vol. 11, no. 2-3, pp. 235–262, 2011.
[18] L. De Raedt and K. Kersting, “Probabilistic logic learning,” ACM SIGKDD Explorations Newsletter, vol. 5, no. 1, pp. 31–48, 2003.
[19] D. Lavanya and K. U. Rani, “Ensemble decision tree classifier for breast cancer data,” International Journal of Information Technology Convergence and Services, vol. 2, no. 1, p. 17, 2012.
[20] J. Soni, U. Ansari, D. Sharma, and S. Soni, “Predictive data mining for medical diagnosis: An overview of heart disease prediction,” International Journal of Computer Applications, vol. 17, no. 8, pp. 43–48, 2011.
[21] A. A. Al Jarullah, “Decision tree discovery for the diagnosis of type ii diabetes,” in , pp. 303–307, IEEE, 2011.
[22] K. Polat and S. Güneş, “Classification of epileptiform eeg using a hybrid system based on decision tree classifier and fast fourier transform,” Applied Mathematics and Computation, vol. 187, no. 2, pp. 1017–1026, 2007.
[23] J. Larsen, M. B. Mycyk, and T. M. Thompson, “Reviewing the record: Medical record reviews for medical toxicology research,” 2018.
[24] L. De Raedt, A. Kimmig, and H. Toivonen, “Problog: A probabilistic prolog and its application in link discovery,” in IJCAI, vol. 7, pp. 2462–2467, Hyderabad, 2007.

Supplemental Material

In the Supplemental Material we provide further background on the Mathematics of Probabilistic Logic Networks and our Generation of Simulated Patients.

Mathematics of Probabilistic Logic Networks

Equation 2 shows how the probability associated with a query Q is related to the logical rules and their associated probabilities. The sum ranges over all worlds in which the supplied facts, F, and stated rules, R, imply that the query Q is true. The products range over the facts in each such world. The first product calculates the joint probability of the supplied facts being true. The second product calculates the joint probability of other facts being false. In our case, R is the set of rules specifying the relationships between signs and toxidromes. The supplied facts, F, correspond to signs of a particular patient. The query Q is for each toxidrome. The metaquery or directive is to find the query (toxidrome) that maximizes Equation 2.

$$P(Q) = \sum_{F \cup R \models Q} \; \prod_{f \in F} p(f) \prod_{f \notin F} \bigl(1 - p(f)\bigr) \tag{2}$$

Generation of Simulated Patients

Algorithm 1 describes how we generated each patient presentation from an intended and distractor toxidrome according to the mixing parameter k.

Algorithm 1 Generation of simulated toxidrome

Precondition: n ← ⊲ Max signs per presentation