Integrating Learning from Examples into the Search for Diagnostic Policies
Journal of Artificial Intelligence Research 24 (2005) 263-303. Submitted 06/04; published 08/05.

Valentina Bayer-Zubek ([email protected])
Thomas G. Dietterich ([email protected])
School of Electrical Engineering and Computer Science, Dearborn Hall 102, Oregon State University, Corvallis, OR 97331-3102 USA

Abstract

This paper studies the problem of learning diagnostic policies from training examples. A diagnostic policy is a complete description of the decision-making actions of a diagnostician (i.e., tests followed by a diagnostic decision) for all possible combinations of test results. An optimal diagnostic policy is one that minimizes the expected total cost, which is the sum of measurement costs and misdiagnosis costs. In most diagnostic settings, there is a tradeoff between these two kinds of costs.

This paper formalizes diagnostic decision making as a Markov Decision Process (MDP). The paper introduces a new family of systematic search algorithms based on the AO* algorithm to solve this MDP. To make AO* efficient, the paper describes an admissible heuristic that enables AO* to prune large parts of the search space. The paper also introduces several greedy algorithms, including some improvements over previously-published methods. The paper then addresses the question of learning diagnostic policies from examples. When the probabilities of diseases and test results are computed from training data, there is a great danger of overfitting. To reduce overfitting, regularizers are integrated into the search algorithms. Finally, the paper compares the proposed methods on five benchmark diagnostic data sets. The studies show that in most cases the systematic search methods produce better diagnostic policies than the greedy methods. In addition, the studies show that for training sets of realistic size, the systematic search algorithms are practical on today's desktop computers.

1. Introduction

A patient arrives at a doctor's office complaining of symptoms such as fatigue, frequent urination, and frequent thirst. The doctor performs a sequence of measurements. Some of the measurements are simple questions (e.g., asking the patient's age, medical history, and family history of medical conditions), others are simple tests (e.g., measuring body mass index or blood pressure), and others are expensive tests (e.g., blood tests). After each measurement, the doctor analyzes what is known so far and decides whether there is enough information to make a diagnosis or whether more tests are needed. When making a diagnosis, the doctor must take into account the likelihood of each possible disease and the costs of misdiagnoses. For example, diagnosing a diabetic patient as healthy incurs the cost of aggravating the patient's medical condition and delaying treatment; diagnosing a healthy patient as having diabetes incurs the costs of unnecessary treatments. When the information that has been gathered is sufficiently conclusive, the doctor then makes a diagnosis.

We can formalize this diagnostic task as follows. Given a patient, the doctor can execute a set of N possible measurements x_1, ..., x_N. When measurement x_n is executed, the result is an observed value v_n. For example, if x_1 is "patient's age", then v_1 could be 36 (years). Each measurement x_n has an associated cost C(x_n). The doctor also can choose one of K diagnosis actions.
Diagnosis action f_k diagnoses the patient as suffering from disease k. We will denote the correct diagnosis of the patient by y. The misdiagnosis cost of predicting disease k when the correct diagnosis is y is denoted by MC(f_k, y).

The process of diagnosis consists of a sequence of decisions. In the starting state, no measurements or diagnoses have been made. We denote this by the empty set {}. Suppose that in this starting "knowledge state", the doctor chooses measurement x_1 and receives the result that x_1 = 36 at a cost of $0.50. This is modeled as a transition to the knowledge state {x_1 = 36} with a cost of C(x_1) = 0.5. Now suppose the doctor chooses x_3, which measures body mass index, and receives a result x_3 = small at a cost of $1. This changes the knowledge state to {x_1 = 36, x_3 = small} at a cost of C(x_3) = 1. Finally, the doctor makes the diagnosis "healthy". Suppose that the correct diagnosis is y = diabetes. For illustrative purposes,¹ suppose that the cost of this misdiagnosis is MC(healthy, diabetes) = $100. The diagnosis action terminates the process, with a total cost of 0.5 + 1 + 100 = 101.5.

¹ The true cost of misdiagnosing diabetes would depend on the age of the patient and the degree of progression of the disease, but in any case, it would probably be much higher than $100.

We can summarize the decision-making process of the doctor in terms of a diagnostic policy, π. The diagnostic policy specifies, for each possible knowledge state s, what action π(s) to take, where the action can be one of the N measurement actions or one of the K diagnosis actions. Every diagnostic policy has an expected total cost, which depends on the joint probability distribution P(x_1, ..., x_N, y) over the test results and the true disease of the patients, and on the costs C(x_n) and MC(f_k, y). The optimal diagnostic policy minimizes this expected total cost by choosing the best tradeoff point between the cost of performing more measurements and the cost of misdiagnosis. Every measurement gathers information, which reduces the risk of a costly misdiagnosis. But every measurement incurs a measurement cost.

Diagnostic decision making is most challenging when the costs of measurement and misdiagnosis have similar magnitudes. If measurement costs are very small compared to misdiagnosis costs, then the optimal diagnostic policy is to measure everything and then make a diagnostic decision. Conversely, if misdiagnosis costs are very small compared to measurement costs, then the best policy is to measure nothing and just diagnose based on misdiagnosis costs and prior probabilities of the diseases.

Learning cost-sensitive diagnostic policies is important in many domains, from medicine to automotive troubleshooting to network fault detection and repair (Littman et al., 2004).

We note that this formulation of optimal diagnosis assumes that all costs can be expressed on a single numerical scale that, although it need not correspond to economic cost, must support the principle of choosing actions by minimizing expected total cost. In medical diagnosis, there is a large body of work on methods for eliciting the patient's preferences and summarizing them as a utility or cost function (e.g., Lenert & Soetikno, 1997).

This paper studies the problem of learning diagnostic policies from training examples. We assume that we are given a representative set of complete training examples drawn from P(x_1, ..., x_N, y) and that we are told the measurement costs and misdiagnosis costs. This kind of training data could be collected, for example, through a clinical trial in which all measurements were performed on all patients. Because of the costs involved in collecting such data, we assume that the training data sets will be relatively small (hundreds or a few thousands of patients, not millions). The goal of this paper is to develop learning algorithms for finding good diagnostic policies from such modest-sized training data sets. Unlike other work on test selection for diagnosis (Heckerman, Horvitz, & Middleton, 1993; van der Gaag & Wessels, 1993; Madigan & Almond, 1996; Dittmer & Jensen, 1997), we do not assume that a Bayesian network or influence diagram is provided; instead we directly learn a diagnostic policy from the data.
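The bookkeeping in the walkthrough above can be written down directly. The following is a minimal Python sketch, with hypothetical cost tables (the names measurement_cost and misdiagnosis_cost are ours, not the paper's), representing a knowledge state as a dictionary of observed results:

```python
# Reproduce the episode above: measure x1 and x3, then diagnose "healthy"
# while the true diagnosis is "diabetes" (all numbers from the example).
measurement_cost = {"x1": 0.50, "x3": 1.00}               # C(x_n)
misdiagnosis_cost = {("healthy", "diabetes"): 100.0}      # MC(f_k, y)

state = {}                        # start state: no measurements yet
total_cost = 0.0

state["x1"] = 36                  # patient's age
total_cost += measurement_cost["x1"]

state["x3"] = "small"             # body mass index
total_cost += measurement_cost["x3"]

prediction, truth = "healthy", "diabetes"
total_cost += misdiagnosis_cost[(prediction, truth)]      # diagnosis ends the episode

print(state)        # {'x1': 36, 'x3': 'small'}
print(total_cost)   # 101.5
```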
This framework of diagnosis ignores several issues that we hope to address in future research. First, it assumes that each measurement action has no effect on the patient. Each measurement action is a pure observation action. In real medical and equipment diagnosis situations, some actions may also be attempted therapies or attempted repairs. These repairs may help cure the patient or fix the equipment, in addition to gathering information. Our approach does not handle attempted repair actions.

Second, this framework assumes that measurement actions are chosen and executed one at a time and that the cost of an action does not depend on the order in which the actions are executed. This is not always true in medical diagnosis. For example, when ordering blood tests, the physician can choose to order several different tests as a group, which costs much less than if the tests are ordered individually.

Third, the framework assumes that the result of each measurement action is available before the diagnostician must choose the next action. In medicine, there is often a (stochastic) delay between the time a test is ordered and the time the results are available. Fragmentary results may arrive over time, which may lead the physician to order more tests before all previously-ordered results are available.

Fourth, the framework assumes that measurement actions are noise-free. That is, repeating a measurement action will obtain exactly the same result. Therefore, once a measurement action is executed, it never needs to be repeated.

Fifth, the framework assumes that the results of the measurements have discrete values. We enforce this via a pre-processing discretization step.

These assumptions allow us to represent the doctor's knowledge state by the set of partial measurement results, {x_1 = v_1, x_3 = v_3, ...}, and to represent the entire diagnostic process as a Markov Decision Process (MDP). Any optimal solution to this MDP provides an optimal diagnostic policy.

Given this formalization, there are conceptually two problems that must be addressed in order to learn good diagnostic policies. First, we must learn the joint probability distribution P(x_1, ..., x_N, y). Second, we must solve the resulting MDP for an optimal policy.

In this paper, we begin by addressing the second problem. We show how to apply the AO* algorithm to solve the MDP for an optimal policy. We define an admissible heuristic for AO* that allows it to prune large parts of the state space, so that this search becomes more efficient.
This addresses the second conceptual problem.

However, instead of solving the first conceptual problem (learning the joint distribution P(x_1, ..., x_N, y)) directly, we argue that the best approach is to integrate the learning process into the AO* search. There are three reasons to pursue this integration. First, by integrating learning into the search, we ensure that the probabilities computed during learning are the probabilities relevant to the task. If instead we had just separately learned some model of the joint distribution P(x_1, ..., x_N, y), those probabilities would have been learned in a task-independent way, and long experience in machine learning has shown that it is better to exploit the task in guiding the learning process (e.g., Friedman & Goldszmidt, 1996; Friedman, Geiger, & Goldszmidt, 1997).

Second, by integrating learning into the search, we can introduce regularization methods that reduce the risk of overfitting. The more thoroughly a learning algorithm searches the space of possible policies, the greater the risk of overfitting the training data, which results in poor performance on new cases. The main contribution of this paper (in addition to showing how to model diagnosis as an MDP) is the development and careful experimental evaluation of several methods for regularizing the combined learning and AO* search process.

Third, the integration of learning with AO* provides additional opportunities to prune the AO* search and thereby improve the computational efficiency of the learning process. We introduce a pruning technique, called "statistical pruning", that simultaneously reduces the AO* search space and also regularizes the learning procedure.

In addition to applying the AO* algorithm to perform a systematic search of the space of diagnostic policies, we also consider greedy algorithms for constructing diagnostic policies. These algorithms are much more efficient than AO*, but we show experimentally that they give worse performance in several cases. Our experiments also show that AO* is feasible on all five diagnostic benchmark problems that we studied.

The remainder of the paper is organized as follows. First, we discuss the relationship between the problem of learning minimum-cost diagnostic policies and previous work in cost-sensitive learning and diagnosis. In Section 3, we formulate this diagnostic learning problem as a Markov Decision Problem. Section 4 presents systematic and greedy search algorithms for finding good diagnostic policies. In Section 5, we take up the question of learning good diagnostic policies and describe our various regularization methods. Section 6 presents a series of experiments that measure the effectiveness and efficiency of the various methods on real-world data sets. Section 7 summarizes the contributions of the paper and discusses future research directions.

2. Relationship to Previous Research

The problem of learning diagnostic policies is related to several areas of previous research, including cost-sensitive learning, test sequencing, and troubleshooting. We discuss each of these in turn.

2.1 Cost-Sensitive Learning

The term "cost-sensitive learning" denotes any learning algorithm that is sensitive to one or more costs. Turney (2000) provides an excellent overview.
Cost-sensitive learning employs classification terminology, in which a class is a possible outcome of the classification process. This corresponds in our case to the diagnosis. The forms of cost-sensitive learning most relevant to our work concern methods sensitive to misclassification costs, methods sensitive to measurement costs, and methods sensitive to both kinds of costs.

Learning algorithms sensitive to misclassification costs have received significant attention. In this setting, the learning algorithm is given (at no cost) the results of all possible measurements, (v_1, ..., v_N). It must then make a prediction ŷ of the class of the example, and it pays a cost MC(ŷ, y) when the correct class is y. Important work in this setting includes the papers of Breiman et al. (1984), Pazzani et al. (1994), Fawcett and Provost (1997), Bradford et al. (1998), Domingos (1999), Zadrozny and Elkan (2001), and Provost and Fawcett (2001).

A few researchers in machine learning have studied application problems in which there is a cost for measuring each attribute (Norton, 1989; Nunez, 1991; Tan, 1993). In this setting, the goal is to minimize the number of misclassification errors while biasing the learning algorithm in favor of less-expensive attributes. From a formal point of view, this problem is ill-defined, because there is no explicit definition of an objective function that trades off the cost of measuring attributes against the number of misclassification errors. Nonetheless, several interesting heuristics were implemented and tested in these papers.

More recently, researchers have begun to consider both measurement and misclassification costs (Turney, 1995; Greiner, Grove, & Roth, 2002). The objective is identical to the one studied in this paper: to minimize the expected total cost of measurements and misclassifications. Both algorithms learn from data as well.

Turney developed ICET, an algorithm that employs genetic search to tune parameters that control a classification-tree learning algorithm. Each classification tree is built using a criterion that selects attributes greedily, based on their information gain and estimated costs. The measurement costs are adjusted in order to build different classification trees; these trees are evaluated on an internal holdout set using the real measurement and misclassification costs. The best set of measurement costs found by the genetic search is employed to build the final classification tree on the entire training data set.

Greiner et al.'s paper provides a PAC-learning analysis of the problem of learning an optimal diagnostic policy, provided that the policy makes no more than L measurements, where L is a fixed constant. Recall that N is the total number of measurements. They prove that there exists an algorithm that runs in time polynomial in N, consumes a number of training examples polynomial in N, and finds a diagnostic policy that, with high probability, is close to optimal. Unfortunately, the running time and the required number of examples are exponential in L. In effect, their algorithm works by estimating, with high confidence, the transition probabilities and the class probabilities in states where at most L of the values x_1 = v_1, ..., x_N = v_N have been observed.
Then the value iteration dynamic programming algorithm is applied to compute the best diagnostic policy with at most L measurements. In theory, this works well, but it is difficult to convert this algorithm to work in practice. This is because the theoretical algorithm chooses the space of possible policies and then computes the number of training examples needed to guarantee good performance, whereas in a real setting, the number of available training examples is fixed, and it is the space of possible policies that must be adapted to avoid overfitting.

2.2 Test Sequencing

The field of electronic systems testing has formalized and studied a problem called the test sequencing problem (Pattipati & Alexandridis, 1990). An electronic system is viewed as being in one of K possible states. These states include one fault-free state and K − 1 faulty states. The relationship between tests (measurements) and system states is specified as a binary diagnostic matrix, which tells whether test x_n detects fault f_i or not. The probabilities of the different system states y are specified by a known distribution P(y).

A test sequencing policy performs a series of measurements to identify the state of the system. In test sequencing, it is assumed that the measurements are sufficient to determine the system state with probability 1. The objective is to find the test sequencing policy that achieves this while minimizing the expected number of tests. Hence, misdiagnosis costs are irrelevant, because the test sequencing policy must guarantee zero misdiagnoses. Several heuristics for AO* have been applied to compute the optimal test sequencing policy (Pattipati & Alexandridis, 1990).

The test sequencing problem does not involve learning from examples. The required probabilities are provided by the diagnostic matrix and the fault distribution P(y).

2.3 Troubleshooting

Another task related to our work is the task of troubleshooting (Heckerman, Breese, & Rommelse, 1994). Troubleshooting begins with a system that is known to be functioning incorrectly and ends when the system has been restored to a correctly-functioning state. The troubleshooter has two kinds of actions: pure observation actions (identical to our measurement actions) and repair actions (e.g., removing and replacing a component, replacing batteries, filling the gas tank, rebooting the computer, etc.). Each action has a cost, and the goal is to find a troubleshooting policy that minimizes the expected cost of restoring the system to a correctly-functioning state.

Heckerman et al. (1994, 1995) show that for the case where the only actions are pure repair actions and there is only one broken component, there is a very efficient greedy algorithm that computes the optimal troubleshooting policy. They incorporate pure observation actions via a one-step value of information (VOI) heuristic. According to this heuristic, they compare the expected cost of a repair-only policy with the expected cost of a policy that makes exactly one observation action and then executes a repair-only policy. If an observe-once-and-then-repair-only policy is better, they execute the chosen observation action, obtain the result, and then again compare the best repair-only policy with the best observe-once-and-then-repair-only policy. Below, we define a variant of this VOI heuristic and compare it to the other greedy and systematic search algorithms developed in this paper.
Heckerman et al. consider only the case where the joint distribution P(x_1, ..., x_N, y) is provided by a known Bayesian network. To convert their approach into a learning approach, they could first learn the Bayesian network and then compute the troubleshooting policy. But we suspect that an approach that integrates the learning of probabilities into the search for good policies, along the lines described in this paper, would give better results. Exploring this question is an important direction for future research.

3. Formalizing Diagnosis as a Markov Decision Problem

The process of diagnosis is a sequential decision-making process. After every decision, the diagnostician must decide what to do next (perform another measurement, or terminate by making a diagnosis). This can be modeled as a Markov Decision Problem (MDP).

An MDP is a mathematical model for describing the interaction of an agent with an environment. An MDP is defined by a set of states S (including the start state), an action set A, the transition probabilities P_tr(s' | s, a) of moving from state s to state s' after executing action a, and the (expected immediate) costs C(s, a, s') associated with these transitions. Because the state representation contains all the relevant information for future decisions, it is said to exhibit the Markov property.

A policy π maps states into actions. The value of a state s under policy π, V^π(s), is the expected sum of future costs incurred when starting in state s and following π afterwards (Sutton & Barto, 1999, chapter 3). The value function V^π of a policy π satisfies the following recursive relationship, known as the Bellman equation for V^π:

    V^π(s) = Σ_{s' ∈ S} P_tr(s' | s, π(s)) · [C(s, π(s), s') + V^π(s')],  for all π and all s.    (1)

This can be viewed as a one-step lookahead from state s to each of the next states s' reached after executing action π(s). Given a policy π, the value of state s can be computed from the values of its successor states s', by adding the expected costs of the transitions and then weighting them by the transition probabilities.

Solving the MDP means finding a policy with the smallest value. Such a policy is called the optimal policy π*, and its value is the optimal value function V*. Value iteration is an algorithm that solves MDPs by iteratively computing V* (Puterman, 1994).
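Because the diagnosis MDP defined in the rest of this section is acyclic and always reaches a terminal state, Equation 1 can be evaluated by direct recursion rather than by iterative dynamic programming. The following is a minimal Python sketch under that assumption; pi(s), successors(s, a) (yielding (s', P_tr(s' | s, a)) pairs), cost(s, a, s'), and is_terminal(s) are hypothetical interfaces, not part of the paper:

```python
# Policy evaluation via Equation 1 on an acyclic MDP (interfaces hypothetical).
def policy_value(s, pi, successors, cost, is_terminal):
    """Return V^pi(s) = sum_{s'} P_tr(s'|s, pi(s)) * [C(s, pi(s), s') + V^pi(s')]."""
    if is_terminal(s):
        return 0.0                       # the terminal state has value zero
    a = pi(s)
    return sum(p * (cost(s, a, sp) + policy_value(sp, pi, successors, cost, is_terminal))
               for sp, p in successors(s, a))
```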
The problem of learning diagnostic policies can be represented as an MDP. We first define the actions of this MDP, then the states, and finally the transition probabilities and costs. All costs are positive.

As discussed above, we assume that there are N measurement actions (tests) and K diagnosis actions. Measurement action n (denoted x_n) returns the value of attribute x_n, which we assume is a discrete variable with V_n possible values. Diagnosis action k (denoted f_k) is the act of predicting that the correct diagnosis of the example is k. An action (measurement or diagnosis) is denoted by a.

In our diagnostic setting, a case is completely described by the results of all N measurement actions and the correct diagnosis y: (v_1, ..., v_N, y). In our framework, each case is drawn independently according to an (unknown) joint distribution P(x_1, ..., x_N, y). Once a case is drawn, all the values defining it stay constant. Test x_n reveals to the diagnostic agent the value x_n = v_n of this case. As a consequence, once a case has been drawn, the order in which the tests are performed does not change the values that will be observed. That is, the joint distribution P(x_i = v_i, x_j = v_j) is independent of the order of the tests x_i and x_j.

It follows that we can define the state of the MDP as the set of all attribute-value pairs observed thus far. This state representation has the Markov property because it contains all relevant past information. There is a unique start state, s_0 = {}, in which no attributes have been measured. The set of all states S contains one state for each possible combination of measured attributes, as found in the training data. Each training example provides evidence for the reachability of 2^N states. The set A(s) of actions executable in state s consists of those attributes not yet measured plus all of the diagnosis actions.

We also define a special terminal state s_f. Every diagnosis action makes a transition to s_f with probability 1 (i.e., once a diagnosis is made, the task terminates). By definition, no actions are executable in the terminal state, and its value function is zero. Note that the terminal state is always reached, because there are only finitely many measurement actions, after which a diagnosis action must be executed.

We now define the transition probabilities and the immediate costs of the MDP. For measurement action x_n executed in state s, the result state will be s' = s ∪ {x_n = v_n}, where v_n is the observed value of x_n. The expected cost of this transition is denoted C(x_n), since we assume it depends only on which measurement action x_n is executed, and not on the state in which it is executed nor on the resulting value v_n that is observed. The probability of this transition is P_tr(s' | s, x_n) = P(x_n = v_n | s).

The misdiagnosis cost of diagnosis action f_k depends on the correct diagnosis y of the example. Let MC(f_k, y) be the misdiagnosis cost of guessing diagnosis k when the correct diagnosis is y. Because the correct diagnosis y of an example is not part of the state representation, the cost of a diagnosis action (which depends on y) performed in state s must be viewed as a random variable whose value is MC(f_k, y) with probability P(y | s), which is the probability that the correct diagnosis is y given the current state s. Hence, our MDP has a stochastic cost function for diagnosis actions. This does not lead to any difficulties, because all that is required to compute the optimal policy for an MDP is the expected cost of each action. In our case, the expected cost of diagnosis action f_k in state s is

    C(s, f_k) = Σ_y P(y | s) · MC(f_k, y),    (2)

which is independent of y.

For uniformity of notation, we will write the expected immediate cost of action a in state s as C(s, a), where a can be either a measurement action or a diagnosis action.
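In code, Equation 2 and the selection of the cheapest diagnosis action (called f_best in Section 4) are one-liners. A sketch with hypothetical data structures: p_y_given_s maps each diagnosis y to P(y | s), and MC maps a (prediction, truth) pair to its misdiagnosis cost:

```python
# Expected cost of diagnosis action k in state s (Equation 2), and f_best.
def diagnosis_cost(k, p_y_given_s, MC):
    return sum(p * MC[(k, y)] for y, p in p_y_given_s.items())

def best_diagnosis(diagnoses, p_y_given_s, MC):
    return min(diagnoses, key=lambda k: diagnosis_cost(k, p_y_given_s, MC))
```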
For a given start state s_0, the diagnostic policy π is a decision tree (Raiffa, 1968). Figure 1 illustrates a simple example of a diagnostic policy. The root is the starting state s_0 = {}. Each node is labeled with a state s and a corresponding action π(s). If the action is a measurement action, x_n, the possible results are the different possible observed values v_n, leading to child nodes. If the action is a diagnosis action, f_k, the possible results are the diagnoses y. If π(s) is a measurement action, the node is called an internal node, and if π(s) is a diagnosis action, the node is called a leaf node. Each branch in the tree is labeled with its probability of being followed (conditioned on reaching its parent node). Each node s is labeled with V^π(s), the expected total cost of executing the diagnostic policy starting at node s. Notice that the value of a leaf is the expected cost of diagnosis, C(s, f_k).

The fact that a diagnostic policy is a decision tree is potentially confusing, because a similar data structure, the classification tree (often also called a decision tree), has been the focus of so much work in the machine learning literature (e.g., Quinlan, 1993). It is important to remember that whereas the evaluation criterion for a classification tree is the misclassification error rate, the evaluation criterion for a decision-tree diagnostic policy is the expected total cost of diagnosis. One way of clarifying this difference is to note that a given classification tree can be transformed into many equivalent classification trees by changing the order in which the tests are performed (see Utgoff's work on tree manipulation operators; Utgoff, 1989). These equivalent classifiers all implement the same classification function y = f(x_1, ..., x_N). But if we consider these "equivalent" trees as diagnostic policies, they will have different expected total diagnosis costs, because tests closer to the root of the tree will be executed more often, so their measurement costs will make a larger contribution to the total diagnosis cost. For example, the policy π in Figure 1 first performs a cheap test, BMI. This policy has a value of 28.99. The tree π_2 in Figure 2 makes the same classification decisions (with an error rate of 19%), but it first tests Insulin, which is more expensive, and this increases the policy value to 40.138.
Figure 1: An example of a diagnostic policy π for diabetes. Body Mass Index (BMI) is tested first. If it is small, a Healthy diagnosis is made. If BMI is large, Insulin is tested before making a diagnosis. The costs of measurements (BMI and Insulin) are written below the name of the variable. The costs of misdiagnoses are written next to the solid squares. Probabilities are written on the branches. The values of the states are written below each state. The value of the start state, V^π(s_0) = 28.99, can be computed in a single sweep, starting at the leaves, as follows. First, the expected costs of the diagnosis actions are computed (e.g., the upper-most Diabetes diagnosis action has an expected cost of 0.7 × 0 + 0.3 × 80 = 24). Then the value of the Insulin subtree is computed as the cost of measuring Insulin (22.78) + 0.8 × 24 + 0.2 × 20 = 45.98. Finally, the value of the whole tree is computed as the cost of measuring BMI (1) + 0.5 × 45.98 + 0.5 × 10 = 28.99.
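The single sweep described in the caption is easy to check. The sketch below hard-codes the numbers of Figure 1 (the leaf values 20 and 10 are read directly off the figure) and recovers V^π(s_0) = 28.99:

```python
# Bottom-up evaluation of the Figure 1 policy (all numbers from the figure).
diabetes_leaf = 0.7 * 0 + 0.3 * 80                         # expected misdiagnosis cost = 24
insulin_subtree = 22.78 + 0.8 * diabetes_leaf + 0.2 * 20   # cost of Insulin + weighted children
root_value = 1 + 0.5 * insulin_subtree + 0.5 * 10          # cost of BMI + weighted children
print(round(insulin_subtree, 2), round(root_value, 2))     # 45.98 28.99
```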
Figure 2: Another diagnostic policy, π_2, making the same classification decisions as π in Figure 1, but with a changed order of tests, and therefore with a different policy value.

4. Searching for Good Diagnostic Policies

We now consider systematic and greedy search algorithms for computing diagnostic policies. In this section, we will assume that all necessary probabilities are known. We defer the question of learning those probabilities to Section 5. We note that this is exactly what all previous uses of AO* have done. They have always assumed that the required probabilities and costs were known.

Given the MDP formulation of the diagnostic process, we could proceed by constructing the entire state space and then applying dynamic programming algorithms (e.g., value iteration or policy iteration) to find the optimal policy. However, the size of the state space is exponential: given N measurement actions, each with V possible outcomes, there are (V + 1)^N + 1 states in the MDP (counting the special terminal state s_f, and taking into account that each measurement may not have been performed yet). We seek search algorithms that only consider a small fraction of this huge space. In this section, we will study two general approaches to dealing with this combinatorial explosion of states: systematic search using the AO* algorithm and various greedy search algorithms.

4.1 Systematic Search

When an MDP has a unique start state and no (directed) cycles, the space of policies can be represented as an AND/OR graph (Qi, 1994; Washington, 1997; Hansen, 1998). An AND/OR graph is a directed acyclic graph that alternates between two kinds of nodes: AND nodes and OR nodes. Each OR node represents a state s in the MDP state space. Each child of an OR node is an AND node that represents one possible action a executed in state s. Each child of an AND node is an OR node that represents a state s' that results from executing action a in state s. Figure 3 shows an example of an AND/OR graph for a diabetes diagnosis problem with three tests (BMI, Glucose, and Insulin) and two diagnosis actions (Diabetes and Healthy).

In our diagnostic setting, the root OR node corresponds to the starting state s_0 = {}. Each OR node s has one AND child (s, x_n) for each measurement action (test) x_n that can be executed in s. Each OR node could also have one child for each possible diagnosis action f_k that could be performed in s, but to save time and memory, we include only the one diagnosis action f_k that has the minimum expected cost. We will denote this by f_best. Each time an OR node is created, an AND child for f_best is created immediately. This leaf AND node stores the action-value function Q(s, f_best) = C(s, f_best).
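One possible in-memory encoding of the two node types is sketched below; the field names are ours, not the paper's, and mirror the quantities the implementation is described as storing:

```python
# A sketch of AND/OR graph nodes for the diagnostic MDP (field names hypothetical).
from dataclasses import dataclass, field

@dataclass
class OrNode:                     # represents a knowledge state s
    state: frozenset              # the (attribute, value) pairs observed so far
    children: dict = field(default_factory=dict)   # action -> AndNode
    policy_action: object = None  # pi(s): a test or a diagnosis action
    value: float = float("inf")   # V(s)

@dataclass
class AndNode:                    # represents executing action a in state s
    action: object
    outcome_probs: dict = field(default_factory=dict)  # value v -> P(x = v | s)
    children: dict = field(default_factory=dict)       # value v -> OrNode
    q_value: float = float("inf")                      # Q(s, a)
```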
Note that multiple paths from the root may lead to the same OR node, by changing the order of the tests.

In our implementation, each OR node stores a representation of the state s, a current policy π(s) which specifies a test or a diagnosis action, and a current value function V^π(s). Each AND node (s, x_n) stores a probability distribution over the outcomes of x_n, and an action-value function Q^π(s, x_n), the expected cost of measuring x_n and then continuing with policy π.

Every possible policy π corresponds to a subtree of the full AND/OR graph. Each OR node s in this subtree (starting at the root) contains only the one AND child (s, a) corresponding to the action a = π(s) chosen by policy π.

The AO* algorithm (Nilsson, 1980) computes the optimal policy for an AND/OR graph. AO* is guided by a heuristic function. We describe the heuristic function in terms of state-action pairs, h(s, a), instead of in terms of states.
Figure 3: An example of an AND/OR graph. The root OR node corresponds to the state s_0 = {}. There is a child AND node for each of the test actions (BMI, Glucose, and Insulin), and also for the diagnosis actions (Healthy and Diabetes). The choice of the BMI test in the root node leads to the AND node (s_0, BMI), which specifies the expectation over the outcomes of the test BMI (small and large). If BMI is small, the child of AND node (s_0, BMI) is the OR node with state {BMI = small}; in this OR node, there is a choice among the actions Healthy, Diabetes, Glucose, and Insulin.

The heuristic function is admissible if h(s, a) ≤ Q*(s, a) for all states s and actions a. This means that h underestimates the total cost of executing action a in state s and following the optimal policy afterwards. The admissible heuristic allows the AO* algorithm to safely ignore an action a' if there is another action a for which it is known that Q*(s, a) < h(s, a'). Under these conditions, (s, a') cannot be part of any optimal policy.

The AO* search begins with an AND/OR graph containing only the root node. It then repeats the following steps: In the current best policy, it selects an AND node to expand; it expands it (expanding an AND node creates its children OR nodes); and then it recomputes (bottom-up) the optimal value function and policy of the revised graph. The algorithm terminates when the best policy has no unexpanded AND nodes (in other words, the leaf OR nodes of the policy specify diagnosis actions, so this policy is a complete diagnostic policy).
Figure 4: Q^opt(s, x) for an unexpanded AND node (s, x) is computed using one-step lookahead and h^opt to evaluate the resulting states s'. Here x is an attribute not yet measured in state s, and v is one of its values.

During AO* search, we maintain two policies, whose actions and value functions are stored in the nodes of the AND/OR graph. We call the first policy the optimistic policy, π^opt. As we show below, its value function V^opt is a lower bound on the optimal value function V*. This is the policy that appears in Nilsson's original description of AO*, and it provides enough information to compute an optimal policy π* (Martelli & Montanari, 1973). During the search, the optimistic policy π^opt is an incomplete policy, because it includes some unexpanded AND nodes; when π^opt becomes a complete policy, it is in fact an optimal policy.

The second policy that we maintain is called the realistic policy, π^real. We will show that its value function, V^real, is an upper bound on the optimal value function V*. The realistic policy is always a complete policy, so it is executable after each iteration of AO*. By maintaining the realistic policy, AO* becomes an anytime algorithm.

We now define these two policies in more detail and introduce our admissible heuristic.

4.1.1 Admissible Heuristic

Our admissible heuristic provides an optimistic estimate, Q^opt(s, x), of the expected cost of an unexpanded AND node (s, x). It is based on an incomplete two-step lookahead search (see Figure 4). The first step of the lookahead search computes Q^opt(s, x) = C(s, x) + Σ_{s'} P_tr(s' | s, x) · h^opt(s'). Here s' iterates over the states resulting from measuring test x. The second step of the lookahead is defined by the function

    h^opt(s') = min_{a' ∈ A(s')} C(s', a'),

which is the minimum over the cost of the diagnosis action f_best and the cost of each of the remaining tests x' in s'. That is, rather than considering the states s'' that would result from measuring x', we only consider the cost of measuring x' itself. It follows immediately that h^opt(s') ≤ V*(s') for all s', because C(s', x') ≤ Q*(s', x') = C(s', x') + Σ_{s''} P_tr(s'' | s', x') · V*(s''). The key thing to notice is that the cost of a single measurement x' is less than or equal to the cost of any policy that begins by measuring x', because the policy must pay the cost of at least one more action (diagnosis or measurement) before entering the terminal state s_f. Consequently, Q^opt(s, x) ≤ Q*(s, x), so Q^opt is an admissible heuristic for state s and action x.

4.1.2 Optimistic Values and Optimistic Policy

The definition of the optimistic action value Q^opt can be extended to all AND nodes in the AND/OR graph through the following recursion:

    Q^opt(s, a) = C(s, a),                                        if a = f_k (a diagnosis action)
                  C(s, a) + Σ_{s'} P_tr(s' | s, a) · h^opt(s'),   if (s, a) is unexpanded
                  C(s, a) + Σ_{s'} P_tr(s' | s, a) · V^opt(s'),   if (s, a) is expanded,    (3)

where V^opt(s) := min_{a ∈ A(s)} Q^opt(s, a). Recall that A(s) consists of all attributes not yet measured in s and all diagnosis actions.

The optimistic policy is π^opt(s) = argmin_{a ∈ A(s)} Q^opt(s, a). Every OR node s stores its optimistic value V^opt(s) and policy π^opt(s), and every AND node (s, a) stores Q^opt(s, a). Theorem 4.1 proves that Q^opt and V^opt form an admissible heuristic. The proofs for all theorems in this paper appear in the thesis of Bayer-Zubek (2003).

Theorem 4.1 For all states s and all actions a ∈ A(s), Q^opt(s, a) ≤ Q*(s, a), and V^opt(s) ≤ V*(s).
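The unexpanded case of Equation 3 is the heart of the heuristic. A sketch, assuming hypothetical helpers expected_cost(s, a) for C(s, a), actions(s) for A(s), and outcomes(s, x) yielding (v, P(x = v | s)) pairs; states are frozensets of (attribute, value) pairs:

```python
# Admissible heuristic h^opt and the unexpanded case of Equation 3 (a sketch).
def h_opt(s, actions, expected_cost):
    # h^opt(s) = min over a in A(s) of C(s, a): any policy from s must pay for
    # at least one more action, so this is a lower bound on V*(s).
    return min(expected_cost(s, a) for a in actions(s))

def q_opt_unexpanded(s, x, actions, expected_cost, outcomes):
    # Q^opt(s, x) = C(s, x) + sum over v of P(x = v | s) * h^opt(s + {x = v}).
    return expected_cost(s, x) + sum(
        p * h_opt(s | frozenset([(x, v)]), actions, expected_cost)
        for v, p in outcomes(s, x))
```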
4.1.3 Realistic Values and Realistic Policy

In the current graph constructed by AO*, suppose that we delete all unexpanded AND nodes (s, a). We call the resulting graph the realistic graph, because every leaf node will select a diagnosis action. The optimal policy computed from this graph is called the realistic policy, π^real. It is a complete policy: its leaves specify diagnosis actions of minimum expected misdiagnosis cost.

Every OR node s stores the realistic value V^real(s) and policy π^real(s), and every AND node (s, a) stores a realistic action value, Q^real(s, a). For a ∈ A(s), define

    Q^real(s, a) = C(s, a),                                        if a = f_k (a diagnosis action)
                   C(s, a) + Σ_{s'} P_tr(s' | s, a) · V^real(s'),  if (s, a) is expanded
                   ignored,                                        if (s, a) is unexpanded,    (4)

and V^real(s) = min_{a ∈ A'(s)} Q^real(s, a), where the set A'(s) is A(s) without the unexpanded actions. The realistic policy is π^real(s) = argmin_{a ∈ A'(s)} Q^real(s, a).

Theorem 4.2 The realistic value function V^real is an upper bound on the optimal value function: V*(s) ≤ V^real(s) for all s.

4.1.4 Selecting a Node for Expansion

In the current optimistic policy π^opt, we choose to expand the unexpanded AND node (s, π^opt(s)) with the largest impact on the root node. This is defined as

    argmax_s [V^real(s) − V^opt(s)] · P_reach(s | π^opt),

where P_reach(s | π^opt) is the probability of reaching state s from the start state while following policy π^opt. The difference V^real(s) − V^opt(s) is an upper bound on how much the value of state s could change if π^opt(s) is expanded.

The rationale for this selection is based on the observation that AO* terminates when V^opt(s_0) = V^real(s_0). Therefore, we want to expand the node that makes the biggest step toward this goal.

4.1.5 Our Implementation of AO* (High Level)

Our implementation of AO* is the following:

    repeat
        select an AND node (s, a) to expand (using π^opt, V^opt, V^real).
        expand (s, a).
        do bottom-up updates of Q^opt, V^opt, π^opt and of Q^real, V^real, π^real.
    until there are no unexpanded nodes reachable by π^opt.

The updates of the value functions are based on one-step lookaheads (Equations 3 and 4), using the value functions of the children. At each iteration, we start from the newly expanded AND node (s, a), compute its Q^opt(s, a) and Q^real(s, a), then compute V^opt(s), π^opt(s), V^real(s), and π^real(s) in its parent OR node, and propagate these changes up in the AND/OR graph all the way to the root. Full details on our implementation of AO* appear in the thesis of Bayer-Zubek (2003).

As more nodes are expanded, the optimistic values V^opt increase, becoming tighter lower bounds on the optimal values V*, and the realistic values V^real decrease, becoming tighter upper bounds. V^opt and V^real converge to the value of the optimal policy: V^opt(s) = V^real(s) = V*(s) for all states s reached by π*.
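Rendered as code, the repeat/until loop above might look like the following; graph is a hypothetical object bundling the AND/OR graph with the bookkeeping of Sections 4.1.2-4.1.4:

```python
# High-level AO* loop (a sketch over a hypothetical graph interface).
def ao_star(graph):
    while True:
        node = graph.select_unexpanded_and_node()  # Section 4.1.4: uses pi_opt, V_opt, V_real
        if node is None:
            break                                  # pi_opt is complete, hence optimal
        graph.expand(node)                         # create the children OR nodes
        graph.update_bottom_up(node)               # Equations 3 and 4, propagated to the root
    return graph.realistic_policy()                # equals the optimal policy at termination
```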
The admissible heuristic avoids exploring expensive parts of the AND/OR graph; indeed, when V^real(s) < Q^opt(s, a), action a does not need to be expanded (this is a heuristic cutoff). Initially, V^real(s) = C(s, f_best), and this explains why measurement costs that are large relative to misdiagnosis costs produce many cutoffs.

4.2 Greedy Search

Now that we have considered the AO* algorithm for systematic search, we turn our attention to several greedy search algorithms for finding good diagnostic policies. Greedy search algorithms grow a decision tree starting at the root, with state s_0 = {}. Each node in the tree corresponds to a state s in the MDP, and it stores the corresponding action a = π(s) chosen by the greedy algorithm. The children of node s correspond to the states that result from executing action a in state s. If a diagnosis action f_k is chosen in state s, then the node has no children in the decision tree (it is a leaf node).

All of the greedy algorithms considered in this paper share the same general template, which is shown as pseudo-code in Table 1. At each state s, the greedy algorithm performs a limited lookahead search and then commits to the choice of an action a to execute in s, which thereby defines π(s) = a. It then generates child nodes corresponding to the states that could result from executing action a in state s. The algorithm is then invoked recursively on each of these child nodes.

Once a greedy algorithm has committed to x_n = π(s), that choice is final. Note, however, that some of our regularization methods may prune the policy by replacing a measurement action (and its descendants) with a diagnosis action. In general, greedy policies are not optimal, because they do not perform a complete analysis of the expected total cost of executing x_n in s before committing to an action. Nevertheless, they are efficient because of their greediness.

Table 1: The Greedy search algorithm. Initially, the function Greedy() is called with the start state s_0.

    function Greedy(state s) returns a policy π (in the form of a decision tree)
    (1)  if (stopping conditions are not met)
    (2)      select measurement action x_n to execute
             set π(s) := x_n
             for each resulting value v_n of the test x_n, add the subtree Greedy(state s ∪ {x_n = v_n})
         else
    (3)      select diagnosis action f_best; set π(s) := f_best.
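A compact Python rendering of this template passes the three numbered choice points in as functions (stop, select_test, and best_diagnosis are hypothetical and are supplied by each concrete method below); outcomes(s, x) yields (value, probability) pairs, and the policy is returned as a dictionary from states (frozensets) to actions:

```python
# The greedy template of Table 1 (a sketch; choice points are parameters).
def greedy(s, stop, select_test, best_diagnosis, outcomes):
    policy = {}
    if not stop(s):
        x = select_test(s)                         # step (2): commit to a test
        policy[s] = x
        for v, _ in outcomes(s, x):               # recurse on each result value
            policy.update(greedy(s | frozenset([(x, v)]),
                                 stop, select_test, best_diagnosis, outcomes))
    else:
        policy[s] = best_diagnosis(s)              # step (3): choose a diagnosis
    return policy
```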
In the following discussion, we describe several different greedy algorithms. We define each one by describing how it refines the numbered lines in the template of Table 1.

4.2.1 InfoGainCost Methods

InfoGainCost methods are inspired by the C4.5 algorithm for constructing classification trees (Quinlan, 1993). C4.5 chooses the attribute x_n with the highest conditional mutual information with the class labels in the training examples. In our diagnostic setting, the analogous criterion is to choose the measurement action that is most predictive of the correct diagnosis. Specifically, let x_n be a proposed measurement action, and define P(x_n, y | s) to be the joint distribution of x_n and the correct diagnosis y conditioned on the information that has already been collected in state s. The conditional mutual information between x_n and y, I(x_n; y | s), is defined as

    I(x_n; y | s) = H(y | s) − H(y | s, x_n)
                  = H(y | s) − Σ_{v_n} P(x_n = v_n | s) · H(y | s ∪ {x_n = v_n}),

where H(y) = −Σ_y P(y) log P(y) is the Shannon entropy of random variable y.

The mutual information is also called the information gain, because it quantifies the average amount of information we gain about y by measuring x_n.

The InfoGainCost methods penalize the information gain by dividing it by the cost of the test. Specifically, they choose the action x_n that maximizes I(x_n; y | s)/C(x_n). This criterion was introduced by Norton (1989). Other researchers have considered various monotonic transformations of the information gain prior to dividing by the measurement cost (Tan, 1993; Nunez, 1991). This defines step (2) of the algorithm template.

All of the InfoGainCost methods employ the stopping conditions defined in C4.5. The first stopping condition applies if P(y | s) is 1 for some value y = k. In this case, the diagnosis action is chosen to be f_best = k. The second stopping condition applies if no more measurement actions are available (i.e., all tests have been performed). In this case, the diagnosis action is set to the most likely diagnosis: f_best := argmax_y P(y | s).

Notice that the InfoGainCost methods do not make any use of the misdiagnosis costs MC(f_k, y).

4.2.2 Modified InfoGainCost Methods (MC+InfoGainCost)

We propose extending the InfoGainCost methods so that they consider misdiagnosis costs in the stopping conditions. Specifically, in step (3), the MC+InfoGainCost methods set f_best to be the diagnosis action with minimum expected cost:

    π(s) := f_best = argmin_{f_k} Σ_y P(y | s) · MC(f_k, y).

4.2.3 One-step Value of Information (VOI)

While the previous greedy methods either ignore the misdiagnosis costs or only consider them when choosing the final diagnosis actions, the VOI approach considers misdiagnosis costs (and measurement costs) at each step.

Traditionally, the value of information of a measurement is defined as the difference between the expected value of the best action after performing the measurement and the expected value of the best action before performing the measurement. Since our objective is cost minimization, we need to reverse the sign in the above definition. However, we still keep the notation VOI instead of "cost of information". Instead of taking into account all future decisions, we make a greedy approximation to VOI, called one-step VOI, in which we only consider the cost of the best diagnosis action before and after performing the measurement x_n in state s:

    1-step-VOI(s, x_n) = min_{f_k} Σ_y P(y | s) · MC(f_k, y)
                         − Σ_{v_n} P(x_n = v_n | s) · [ min_{f_k} Σ_y P(y | s ∪ {x_n = v_n}) · MC(f_k, y) ].
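The one-step VOI criterion can be sketched directly; p_y(s) returns the distribution P(y | s) as a dictionary, outcomes(s, x) yields (v, P(x = v | s)) pairs, and MC maps (prediction, truth) pairs to costs (all hypothetical interfaces):

```python
# One-step value of information of test x in state s (a sketch).
def best_diagnosis_cost(s, diagnoses, p_y, MC):
    # Expected misdiagnosis cost of the best immediate diagnosis in s.
    return min(sum(p * MC[(k, y)] for y, p in p_y(s).items()) for k in diagnoses)

def one_step_voi(s, x, diagnoses, p_y, MC, outcomes):
    before = best_diagnosis_cost(s, diagnoses, p_y, MC)
    after = sum(p * best_diagnosis_cost(s | frozenset([(x, v)]), diagnoses, p_y, MC)
                for v, p in outcomes(s, x))
    return before - after    # expected reduction in misdiagnosis cost
```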
The Laplace regularizerhelps AO(cid:3), both in the anytime graph and in the value of the last policy learned.For example, if AO(cid:3) believes that P (xn = vnjs) = 0, then it will not expand this branchfurther in the tree. Even more serious, if AO(cid:3) believes that P (y = cjs) = 0, then it will notconsider the potential misdiagnosis cost M C(fk; y = c) when computing the expected costsof diagnosis actions fk in state s.Figure 6 shows that AO(cid:3) with the Laplace regularizer gives worse performance on thetraining data but better performance on the test data than AO(cid:3). Despite this improvement,AO(cid:3) with Laplace still over(cid:12)ts: a better policy that was learned early on is discarded laterfor a worse one.5.1.2 Statistical Pruning (SP)Our second regularization technique reduces the size of the AO(cid:3) search space by pruningsubtrees that are unlikely to improve the current realistic policy.The statistical motivation is the following: given a small training data sample, there aremany pairs of diagnostic policies that are statistically indistinguishable. Ideally, we wouldlike to prune all policies in the AND/OR graph that are statistically indistinguishable fromthe optimal policies. Since this is not possible without (cid:12)rst expanding the graph, we needa heuristic that approximately implements the following indi(cid:11)erence principle:282earning Diagnostic Policies from Examples state s optimistic policyunexpanded realisticpolicyrealVVopt