Publication


Featured research published by Thomas G. Dietterich.


Multiple Classifier Systems | 2000

Ensemble Methods in Machine Learning

Thomas G. Dietterich

Ensemble methods are learning algorithms that construct a set of classifiers and then classify new data points by taking a (weighted) vote of their predictions. The original ensemble method is Bayesian averaging, but more recent algorithms include error-correcting output coding, bagging, and boosting. This paper reviews these methods and explains why ensembles can often perform better than any single classifier. Some previous studies comparing ensemble methods are reviewed, and some new experiments are presented to uncover the reasons that AdaBoost does not overfit rapidly.
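
The mechanism the abstract describes is simple to state in code. Below is a minimal sketch of a weighted ensemble vote, assuming scikit-learn-style classifiers with a `.predict` method; the classifier list, weights, and function name are illustrative placeholders, not the paper's experimental setup.

```python
import numpy as np

def weighted_vote(classifiers, weights, X):
    """Label each row of X by a weighted vote over the ensemble's predictions."""
    score = {}  # label -> accumulated weight per sample
    for clf, w in zip(classifiers, weights):
        for i, label in enumerate(clf.predict(X)):
            score.setdefault(label, np.zeros(len(X)))[i] += w
    labels = list(score)
    # For each sample, return the label with the largest total weight.
    return np.array(labels)[np.argmax([score[l] for l in labels], axis=0)]
```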


Neural Computation | 1998

Approximate statistical tests for comparing supervised classification learning algorithms

Thomas G. Dietterich

This article reviews five approximate statistical tests for determining whether one learning algorithm outperforms another on a particular learning task. These tests are compared experimentally to determine their probability of incorrectly detecting a difference when no difference exists (type I error). Two widely used statistical tests are shown to have high probability of type I error in certain situations and should never be used: a test for the difference of two proportions and a paired-differences t test based on taking several random train-test splits. A third test, a paired-differences t test based on 10-fold cross-validation, exhibits somewhat elevated probability of type I error. A fourth test, McNemar's test, is shown to have low type I error. The fifth test is a new test, 5×2 cv, based on five iterations of twofold cross-validation. Experiments show that this test also has acceptable type I error. The article also measures the power (ability to detect algorithm differences when they do exist) of these tests. The cross-validated t test is the most powerful. The 5×2 cv test is shown to be slightly more powerful than McNemar's test. The choice of the best test is determined by the computational cost of running the learning algorithm. For algorithms that can be executed only once, McNemar's test is the only test with acceptable type I error. For algorithms that can be executed 10 times, the 5×2 cv test is recommended, because it is slightly more powerful and because it directly measures variation due to the choice of training set.
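
The 5×2 cv test lends itself to a compact implementation: five replications of twofold cross-validation, with a t statistic on 5 degrees of freedom built from the per-replication error differences. A sketch, assuming scikit-learn-style estimators; the helper name is my own, not from the paper.

```python
import numpy as np
from scipy import stats
from sklearn.model_selection import train_test_split

def five_by_two_cv_t(clf_a, clf_b, X, y, seed=0):
    """5x2 cv paired t test for the error-rate difference of two learners."""
    rng = np.random.RandomState(seed)
    variances, first_diff = [], None
    for _ in range(5):
        # One replication: split the data into two halves (the two folds).
        X1, X2, y1, y2 = train_test_split(
            X, y, test_size=0.5, random_state=rng.randint(1_000_000))
        diffs = []
        for Xtr, ytr, Xte, yte in [(X1, y1, X2, y2), (X2, y2, X1, y1)]:
            err_a = np.mean(clf_a.fit(Xtr, ytr).predict(Xte) != yte)
            err_b = np.mean(clf_b.fit(Xtr, ytr).predict(Xte) != yte)
            diffs.append(err_a - err_b)
        if first_diff is None:
            first_diff = diffs[0]  # numerator uses the very first fold's difference
        mean = (diffs[0] + diffs[1]) / 2.0
        variances.append((diffs[0] - mean) ** 2 + (diffs[1] - mean) ** 2)
    t = first_diff / np.sqrt(np.mean(variances))
    p = 2 * stats.t.sf(abs(t), df=5)  # two-sided p-value, 5 degrees of freedom
    return t, p
```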


Machine Learning | 2000

An Experimental Comparison of Three Methods for Constructing Ensembles of Decision Trees: Bagging, Boosting, and Randomization

Thomas G. Dietterich

Bagging and boosting are methods that generate a diverse ensemble of classifiers by manipulating the training data given to a “base” learning algorithm. Breiman has pointed out that they rely for their effectiveness on the instability of the base learning algorithm. An alternative approach to generating an ensemble is to randomize the internal decisions made by the base algorithm. This general approach has been studied previously by Ali and Pazzani and by Dietterich and Kong. This paper compares the effectiveness of randomization, bagging, and boosting for improving the performance of the decision-tree algorithm C4.5. The experiments show that in situations with little or no classification noise, randomization is competitive with (and perhaps slightly superior to) bagging but not as accurate as boosting. In situations with substantial classification noise, bagging is much better than boosting, and sometimes better than randomization.
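
The randomization idea the abstract contrasts with bagging and boosting can be sketched as a one-line change to split selection: pick uniformly among the top-scoring candidate tests rather than always taking the single best. The `gain` function, candidate enumeration, and the default k below are illustrative assumptions, not C4.5 internals.

```python
import random

def choose_split_randomized(candidate_splits, gain, k=20, rng=random):
    """Pick a split uniformly at random from the k highest-scoring candidates.

    candidate_splits: iterable of candidate test descriptions
    gain: function mapping a candidate split to its score (e.g. information gain)
    """
    ranked = sorted(candidate_splits, key=gain, reverse=True)
    return rng.choice(ranked[:k])
```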


Artificial Intelligence | 1997

Solving the multiple instance problem with axis-parallel rectangles

Thomas G. Dietterich; Richard H. Lathrop; Tomás Lozano-Pérez

The multiple instance problem arises in tasks where the training examples are ambiguous: a single example object may have many alternative feature vectors (instances) that describe it, and yet only one of those feature vectors may be responsible for the observed classification of the object. This paper describes and compares three kinds of algorithms that learn axis-parallel rectangles to solve the multiple instance problem. Algorithms that ignore the multiple instance problem perform very poorly. An algorithm that directly confronts the multiple instance problem (by attempting to identify which feature vectors are responsible for the observed classifications) performs best, giving 89% correct predictions on a musk odor prediction task. The paper also illustrates the use of artificial data to debug and compare these algorithms.
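
The bag-level decision rule behind axis-parallel rectangle methods is easy to illustrate: a bag is labeled positive if at least one of its instances falls inside the learned rectangle (a per-feature interval). The sketch below shows only this prediction step, with hypothetical bounds; learning the rectangle is the hard part the paper addresses.

```python
import numpy as np

def bag_is_positive(bag, lower, upper):
    """bag: array (n_instances, n_features); lower/upper: per-feature bounds."""
    inside = np.all((bag >= lower) & (bag <= upper), axis=1)
    return bool(inside.any())  # one instance inside suffices

# Illustrative usage: a two-instance bag in a 2-D feature space.
# bag_is_positive(np.array([[0.1, 5.0], [0.4, 2.0]]),
#                 lower=np.array([0.0, 1.0]), upper=np.array([0.5, 3.0]))
```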


Journal of Artificial Intelligence Research | 2000

Hierarchical reinforcement learning with the MAXQ value function decomposition

Thomas G. Dietterich

This paper presents a new approach to hierarchical reinforcement learning based on decomposing the target Markov decision process (MDP) into a hierarchy of smaller MDPs and decomposing the value function of the target MDP into an additive combination of the value functions of the smaller MDPs. The decomposition, known as the MAXQ decomposition, has both a procedural semantics, as a subroutine hierarchy, and a declarative semantics, as a representation of the value function of a hierarchical policy. MAXQ unifies and extends previous work on hierarchical reinforcement learning by Singh, Kaelbling, and Dayan and Hinton. It is based on the assumption that the programmer can identify useful subgoals and define subtasks that achieve these subgoals. By defining such subgoals, the programmer constrains the set of policies that need to be considered during reinforcement learning. The MAXQ value function decomposition can represent the value function of any policy that is consistent with the given hierarchy. The decomposition also creates opportunities to exploit state abstractions, so that individual MDPs within the hierarchy can ignore large parts of the state space. This is important for the practical application of the method. This paper defines the MAXQ hierarchy, proves formal results on its representational power, and establishes five conditions for the safe use of state abstractions. The paper presents an online model-free learning algorithm, MAXQ-Q, and proves that it converges with probability 1 to a kind of locally-optimal policy known as a recursively optimal policy, even in the presence of the five kinds of state abstraction. The paper evaluates the MAXQ representation and MAXQ-Q through a series of experiments in three domains and shows experimentally that MAXQ-Q (with state abstractions) converges to a recursively optimal policy much faster than flat Q learning. The fact that MAXQ learns a representation of the value function has an important benefit: it makes it possible to compute and execute an improved, non-hierarchical policy via a procedure similar to the policy improvement step of policy iteration. The paper demonstrates the effectiveness of this nonhierarchical execution experimentally. Finally, the paper concludes with a comparison to related work and a discussion of the design tradeoffs in hierarchical reinforcement learning.
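
The core of the decomposition can be stated in a few lines: the value of doing child action a inside task i decomposes as Q(i, s, a) = V(a, s) + C(i, s, a), where C is the "completion" value of finishing task i after a terminates. A minimal recursive sketch, in which the dictionaries for the subtask structure, completion values, and primitive rewards are hypothetical stand-ins for the quantities MAXQ-Q would learn:

```python
def V(task, s, children, C, primitive_reward):
    """Value of performing `task` starting in state s."""
    if task in primitive_reward:   # primitive action: expected one-step reward
        return primitive_reward[task][s]
    # Composite subtask: best child according to the decomposed Q values.
    return max(Q(task, s, a, children, C, primitive_reward)
               for a in children[task])

def Q(task, s, a, children, C, primitive_reward):
    """Value of doing child `a` in state s, then completing `task`."""
    return V(a, s, children, C, primitive_reward) + C[(task, s, a)]
```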


Artificial Intelligence | 1994

Learning Boolean concepts in the presence of many irrelevant features

Hussein Almuallim; Thomas G. Dietterich

In many domains, an appropriate inductive bias is the MIN-FEATURES bias, which prefers consistent hypotheses definable over as few features as possible. This paper defines and studies this bias in Boolean domains. First, it is shown that any learning algorithm implementing the MIN-FEATURES bias requires Θ((ln(1/δ) + 2^p + p ln n)/ε) training examples to guarantee PAC-learning a concept having p relevant features out of n available features. This bound is only logarithmic in the number of irrelevant features. For implementing the MIN-FEATURES bias, the paper presents five algorithms that identify a subset of features sufficient to construct a hypothesis consistent with the training examples. FOCUS-1 is a straightforward algorithm that returns a minimal and sufficient subset of features in quasi-polynomial time. FOCUS-2 does the same task as FOCUS-1 but is empirically shown to be substantially faster than FOCUS-1. Finally, the Simple-Greedy, Mutual-Information-Greedy, and Weighted-Greedy algorithms are three greedy heuristics that trade optimality for computational efficiency. Experimental studies are presented that compare these exact and approximate algorithms to two well-known algorithms, ID3 and FRINGE, in learning situations where many irrelevant features are present. These experiments show that, contrary to expectations, the ID3 and FRINGE algorithms do not implement good approximations of MIN-FEATURES. The sample complexity and generalization performance of the FOCUS algorithms are substantially better than either ID3 or FRINGE on learning problems where the MIN-FEATURES bias is appropriate. These experiments also show that, among our three heuristics, the Weighted-Greedy algorithm provides an excellent approximation to the FOCUS algorithms.
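
A FOCUS-style search is easy to sketch: enumerate feature subsets in order of increasing size and return the first subset on which no two differently labeled examples agree, i.e. a subset sufficient for a consistent hypothesis. This mirrors the exhaustive flavor of FOCUS-1 only loosely; the data layout and function name are illustrative.

```python
from itertools import combinations

def focus(examples, labels, n_features):
    """Return a minimal feature subset sufficient for a consistent hypothesis."""
    for size in range(n_features + 1):
        for subset in combinations(range(n_features), size):
            projected = {}   # projected example -> label seen so far
            sufficient = True
            for x, y in zip(examples, labels):
                key = tuple(x[j] for j in subset)
                if projected.setdefault(key, y) != y:
                    sufficient = False   # two labels collide on this projection
                    break
            if sufficient:
                return subset
    return tuple(range(n_features))
```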


AI Magazine | 1997

Machine-Learning Research

Thomas G. Dietterich

Machine-learning research has been making great progress in many directions. This article summarizes four of these directions and discusses some current open problems. The four directions are (1) the improvement of classification accuracy by learning ensembles of classifiers, (2) methods for scaling up supervised learning algorithms, (3) reinforcement learning, and (4) the learning of complex stochastic models.


Lecture Notes in Computer Science | 2002

Machine Learning for Sequential Data: A Review

Thomas G. Dietterich

Statistical learning problems in many fields involve sequential data. This paper formalizes the principal learning tasks and describes the methods that have been developed within the machine learning research community for addressing these problems. These methods include sliding window methods, recurrent sliding windows, hidden Markov models, conditional random fields, and graph transformer networks. The paper also discusses some open research issues.
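
The sliding window method, the simplest of the techniques listed, reduces sequence labeling to ordinary classification by turning each position into a fixed-length window of its neighbors, to which any standard classifier can then be applied. A minimal sketch, with an illustrative padding convention:

```python
def sliding_windows(sequence, w, pad=None):
    """Return one window of width 2*w+1 centered on each sequence position."""
    padded = [pad] * w + list(sequence) + [pad] * w
    return [tuple(padded[i:i + 2 * w + 1]) for i in range(len(sequence))]

# Example: sliding_windows("acgt", 1) ->
# [(None,'a','c'), ('a','c','g'), ('c','g','t'), ('g','t',None)]
```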


AI EDAM: Artificial Intelligence for Engineering Design, Analysis and Manufacturing | 1988

A model of the mechanical design process based on empirical data

David G. Ullman; Thomas G. Dietterich; Larry A. Stauffer

This paper describes the task/episode accumulation model (TEA model) of non-routine mechanical design, which was developed after detailed analysis of the audio and video protocols of five mechanical designers. The model is able to explain the behavior of designers at a much finer level of detail than previous models. The key features of the model are (a) the design is constructed by incrementally refining and patching an initial conceptual design, (b) design alternatives are not considered outside the boundaries of design episodes (which are short stretches of problem solving aimed at specific goals), (c) the design process is controlled locally, primarily at the level of individual episodes. Among the implications of the model are the following: (a) CAD tools should be extended to represent the state of the design at more abstract levels, (b) CAD tools should help the designer manage constraints, and (c) CAD tools should be designed to give cognitive support to the designer.


Machine Learning: An Artificial Intelligence Approach, Volume I | 1983

A Comparative Review of Selected Methods for Learning from Examples

Thomas G. Dietterich; Ryszard S. Michalski

Research in the area of learning structural descriptions from examples is reviewed, giving primary attention to methods of learning characteristic descriptions of single concepts. In particular, we examine methods for finding the maximally-specific conjunctive generalizations (MSC-generalizations) that cover all of the training examples of a given concept. Various important aspects of structural learning in general are examined, and several criteria for evaluating structural learning methods are presented. Briefly, these criteria include (i) adequacy of the representation language, (ii) generalization rules employed, (iii) computational efficiency, and (iv) flexibility and extensibility. Selected learning methods developed by Buchanan et al., Hayes-Roth, Vere, Winston, and the authors are analyzed according to these criteria. Finally, some goals are suggested for future research.
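
For flat attribute vectors, the MSC-generalization idea reduces to keeping the attribute values that all positive examples share and generalizing the rest to a wildcard, much like the classic Find-S procedure. The structural-description setting the chapter actually reviews is far richer; this sketch is only a toy illustration.

```python
WILDCARD = "?"

def msc_generalization(positive_examples):
    """Most specific conjunction of attribute constraints covering all examples."""
    gen = list(positive_examples[0])
    for ex in positive_examples[1:]:
        for i, value in enumerate(ex):
            if gen[i] != value:
                gen[i] = WILDCARD   # values disagree: drop this constraint
    return tuple(gen)

# msc_generalization([("red", "round", "small"), ("red", "square", "small")])
# -> ("red", "?", "small")
```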
