Network


Latest external collaborations at the country level. Dive into the details by clicking on the dots.

Hotspot


Dive into the research topics where David J. Hand is active.

Publication


Featured research published by David J. Hand.


Knowledge and Information Systems | 2007

Top 10 algorithms in data mining

Xindong Wu; Vipin Kumar; J. Ross Quinlan; Joydeep Ghosh; Qiang Yang; Hiroshi Motoda; Geoffrey J. McLachlan; Angus F. M. Ng; Bing Liu; Philip S. Yu; Zhi-Hua Zhou; Michael Steinbach; David J. Hand; Dan Steinberg

This paper presents the top 10 data mining algorithms identified by the IEEE International Conference on Data Mining (ICDM) in December 2006: C4.5, k-Means, SVM, Apriori, EM, PageRank, AdaBoost, kNN, Naive Bayes, and CART. These top 10 algorithms are among the most influential data mining algorithms in the research community. With each algorithm, we provide a description of the algorithm, discuss the impact of the algorithm, and review current and further research on the algorithm. These 10 algorithms cover classification, clustering, statistical learning, association analysis, and link mining, which are all among the most important topics in data mining research and development.
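
As a flavour of the listed algorithms, here is a minimal k-Means sketch in plain NumPy. This is an illustrative reconstruction of the textbook algorithm, not code from the paper; the function name and defaults are our own.

```python
import numpy as np

def k_means(X, k, n_iters=100, seed=0):
    """Minimal k-Means: alternate assignment and centroid-update steps."""
    rng = np.random.default_rng(seed)
    # Initialise centroids as k distinct points drawn from the data.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: label each point with its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its assigned points,
        # keeping the old centroid if a cluster happens to be empty.
        centroids_new = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(centroids_new, centroids):
            break  # converged
        centroids = centroids_new
    return centroids, labels
```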


Machine Learning | 2001

A Simple Generalisation of the Area Under the ROC Curve for Multiple Class Classification Problems

David J. Hand; Robert Till

The area under the ROC curve, or the equivalent Gini index, is a widely used measure of performance of supervised classification rules. It has the attractive property that it side-steps the need to specify the costs of the different kinds of misclassification. However, the simple form is only applicable to the case of two classes. We extend the definition to the case of more than two classes by averaging pairwise comparisons. This measure reduces to the standard form in the two class case. We compare its properties with the standard measure of proportion correct and an alternative definition of proportion correct based on pairwise comparison of classes for a simple artificial case and illustrate its application on eight data sets. On the data sets we examined, the measures produced similar, but not identical results, reflecting the different aspects of performance that they were measuring. Like the area under the ROC curve, the measure we propose is useful in those many situations where it is impossible to give costs for the different kinds of misclassification.
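
The construction is simple to compute from a matrix of estimated class-membership probabilities. A minimal sketch of the pairwise-averaging measure (commonly written M), assuming integer labels 0..c-1 that index the probability columns; the helper names are illustrative:

```python
import numpy as np
from itertools import combinations
from scipy.stats import rankdata

def auc(pos_scores, neg_scores):
    """Two-class AUC via the Mann-Whitney rank-sum statistic (midranks for ties)."""
    n_pos, n_neg = len(pos_scores), len(neg_scores)
    ranks = rankdata(np.concatenate([pos_scores, neg_scores]))
    return (ranks[:n_pos].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def multiclass_auc(y, prob):
    """Average A-hat(i, j) = (A(i|j) + A(j|i)) / 2 over all unordered class pairs.

    y    : (n,) integer class labels
    prob : (n, c) estimated class-membership probabilities
    """
    pair_values = []
    for i, j in combinations(np.unique(y), 2):
        # A(i|j): how well the class-i probability column separates i from j.
        a_ij = auc(prob[y == i, i], prob[y == j, i])
        # A(j|i): the same, using the class-j probability column.
        a_ji = auc(prob[y == j, j], prob[y == i, j])
        pair_values.append((a_ij + a_ji) / 2)
    # The mean over pairs equals 2 / (c(c-1)) times the sum over i < j.
    return float(np.mean(pair_values))
```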


Statistical Science | 2006

Classifier Technology and the Illusion of Progress

David J. Hand

A great many tools have been developed for supervised classification, ranging from early methods such as linear discriminant analysis through to modern developments such as neural networks and support vector machines. A large number of comparative studies have been conducted in attempts to establish the relative superiority of these methods. This paper argues that these comparisons often fail to take into account important aspects of real problems, so that the apparent superiority of more sophisticated methods may be something of an illusion. In particular, simple methods typically yield performance almost as good as more sophisticated methods, to the extent that the difference in performance may be swamped by other sources of uncertainty that generally are not considered in the classical supervised classification paradigm.
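
The point is easy to reproduce: on many benchmark data sets a plain linear classifier comes close to a far more flexible one. A quick sketch using scikit-learn; the data set and the two models are illustrative choices of ours, not the paper's experiments:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

simple = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
flexible = RandomForestClassifier(n_estimators=500, random_state=0)

for name, model in [("logistic regression", simple), ("random forest", flexible)]:
    scores = cross_val_score(model, X, y, cv=10)
    print(f"{name}: {scores.mean():.3f} (+/- {scores.std():.3f})")
# The gap between the two accuracies is typically within the cross-validation
# noise, which is the kind of comparison the paper cautions against over-reading.
```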


The American Statistician | 1998

Data Mining: Statistics and More?

David J. Hand

Data mining is a new discipline lying at the interface of statistics, database technology, pattern recognition, machine learning, and other areas. It is concerned with the secondary analysis of large databases in order to find previously unsuspected relationships which are of interest or value to the database owners. New problems arise, partly as a consequence of the sheer size of the data sets involved, and partly because of issues of pattern matching. However, since statistics provides the intellectual glue underlying the effort, it is important for statisticians to become involved. There are very real opportunities for statisticians to make significant contributions.


Quality of Life Research | 1997

Factor analysis, causal indicators and quality of life

Peter Fayers; David J. Hand

Exploratory factor analysis (EFA) remains one of the standard and most widely used methods for demonstrating construct validity of new instruments. However, the model for EFA makes assumptions which may not be applicable to all quality of life (QOL) instruments, and as a consequence the results from EFA may be misleading. In particular, EFA assumes that the underlying construct of QOL (and any postulated subscales or ‘factors’) may be regarded as being reflected by the items in those factors or subscales. QOL instruments, however, frequently contain items such as diseases, symptoms or treatment side effects, which are ‘causal indicators.’ These items may cause reduction in QOL for those patients experiencing them, but the reverse relationship need not apply: not all patients with a poor QOL need be experiencing the same set of symptoms. Thus a high level of a symptom item may imply that a patient's QOL is likely to be poor, but a poor level of QOL need not imply that the patient probably suffers from that symptom. This is the reverse of the common EFA model, in which it is implicitly assumed that changes in QOL and any subscales ‘cause’ or are likely to be reflected by corresponding changes in all their constituent items; thus the items in EFA are called ‘effect indicators.’ Furthermore, disease-related clusters of symptoms, or treatment-induced side-effects, may result in different studies finding different sets of items being highly correlated; for example, a study involving lung cancer patients receiving surgery and chemotherapy might find one set of highly correlated symptoms, whilst prostate cancer patients receiving hormone therapy would have a very different symptom correlation structure. Since EFA is based upon analyzing the correlation matrix and assuming all items to be effect indicators, it will extract factors representing consequences of the disease or treatment. These factors are likely to vary between different patient subgroups, according to the mode of treatment or the disease type and stage. Such factors contain little information about the relationship between the items and any underlying QOL constructs. Factor analysis is largely irrelevant as a method of scale validation for those QOL instruments that contain causal indicators, and should only be used with items which are effect indicators.
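
A small simulation makes the distinction concrete. The item structure below is invented for illustration: three symptom items act as causal indicators of a latent QOL variable, and three mood-style items reflect it as effect indicators.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000

# Causal indicators: symptoms occur largely independently of one another,
# and each occurrence *lowers* the latent QOL.
symptoms = rng.binomial(1, 0.3, size=(n, 3)).astype(float)
qol = -symptoms.sum(axis=1) + rng.normal(scale=0.5, size=n)

# Effect indicators: items that *reflect* the latent QOL (e.g. mood ratings).
effects = qol[:, None] + rng.normal(scale=0.5, size=(n, 3))

items = np.column_stack([symptoms, effects])
print(np.round(np.corrcoef(items, rowvar=False), 2))
# The effect indicators are strongly inter-correlated (they share QOL as a
# common cause), while the symptom items are nearly uncorrelated with each
# other. A factor analysis of this matrix therefore recovers the effect
# cluster but says little about how the symptoms relate to QOL.
```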


Intelligent Data Analysis | 2000

Advances in intelligent data analysis

David J. Hand; Douglas H. Fisher; Michael R. Berthold

David J. Hand: Department of Mathematics, Imperial College, 180 Queen's Gate, London, SW7 2BZ, UK. E-mail: [email protected]; URL: http://www.ma.ic.ac.uk/statistics/djhand.html
Douglas H. Fisher: Department of Computer Science, Box 1679, Station B, Vanderbilt University, Nashville, TN 37235, USA. E-mail: [email protected]; URL: http://cswww.vuse.vanderbilt.edu/~dfisher/
Michael R. Berthold: Berkeley Initiative in Soft Computing (BISC), Department of EECS, CS Division, University of California, Berkeley, CA 94720, USA. E-mail: [email protected]; URL: http://www.cs.berkeley.edu/~berthold


Quality of Life Research | 1997

Causal indicators in quality of life research.

Peter Fayers; David J. Hand; Kristin Bjordal; Mogens Groenvold

Quality of Life (QOL) questionnaires contain two different types of items. Some items, such as assessments of symptoms of disease, may be called causal indicators because the occurrence of these symptoms can cause a change in QOL. A severe state of even a single symptom may suffice to cause impairment of QOL, although a poor QOL need not necessarily imply that a patient suffers from all the symptoms. Other items, for example anxiety and depression, can be regarded as effect indicators which reflect the level of QOL. These indicators usually have a more uniform relationship with QOL, and therefore a patient with poor QOL is likely to have low scores on all effect indicators. In extreme cases it may seem intuitively obvious which items are causal and which are effect indicators, but often it is less clear. We propose a model which includes these two types of indicators and show that they behave in markedly different ways. Formal quantitative methods are developed for distinguishing them. We also discuss the impact of this distinction upon instrument validation and the design and analysis of summary subscales.


The Statistician | 1996

A k-nearest-neighbour classifier for assessing consumer credit risk

William Henley; David J. Hand

The last 30 years have seen the development of credit scoring techniques for assessing the creditworthiness of consumer loan applicants. Traditional credit scoring methodology has involved the use of techniques such as discriminant analysis, linear or logistic regression, linear programming and decision trees. In this paper we look at the application of the k-nearest-neighbour (k-NN) method, a standard technique in pattern recognition and nonparametric statistics, to the credit scoring problem. We propose an adjusted version of the Euclidean distance metric which attempts to incorporate knowledge of class separation contained in the data. Our k-NN methodology is applied to a real data set and we discuss the selection of optimal values of the parameters k and D included in the method. To assess the potential of the method we make comparisons with linear and logistic regression and decision trees and graphs. We end by discussing a practical implementation of the proposed k-NN classifier.
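
A rough sketch of the general idea: an ordinary k-NN rule whose Euclidean distance is stretched by a factor D along an estimated class-separation direction. The Fisher-style choice of direction and the helper names are our own illustration; the paper's exact adjusted metric and tuning procedure are in the text.

```python
import numpy as np

def separation_direction(X, y):
    """One plausible class-separation direction: a Fisher-style discriminant
    computed from the two class means and the pooled within-class covariance."""
    mu0, mu1 = X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)
    S_w = np.cov(X[y == 0], rowvar=False) + np.cov(X[y == 1], rowvar=False)
    w = np.linalg.solve(S_w, mu1 - mu0)
    return w / np.linalg.norm(w)

def knn_predict(X_train, y_train, x, k=11, D=4.0, w=None):
    """k-NN vote with extra weight D on distance components along w."""
    diff = X_train - x
    d2 = (diff ** 2).sum(axis=1)
    if w is not None:
        d2 = d2 + D * (diff @ w) ** 2  # inflate distances along the separation axis
    nearest = np.argsort(d2)[:k]
    return int(y_train[nearest].mean() >= 0.5)  # majority vote over 0/1 labels
```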


Journal of the Royal Statistical Society, Series A (Statistics in Society) | 2002

Causal variables, indicator variables and measurement scales: an example from quality of life

Peter Fayers; David J. Hand

There is extensive literature on the development and validation of multi‐item measurement scales. Much of this is based on principles derived from psychometric theory and assumes that the individual items form parallel tests, so that simple weighted or unweighted summation is an appropriate method of aggregation. More recent work either continues to promulgate these methods or places emphasis on modern techniques centred on item response theory. In fact, however, clinical measuring instruments often have different underlying principles, so adopting such approaches is inappropriate. We illustrate, using health‐related quality of life, that clinimetric and psychometric ideas need to be combined to yield a suitable measuring instrument. We note the fundamental distinction between indicator and causal variables and propose that this distinction suffices to explain fully the need for both clinimetric and psychometric techniques, and identifies their respective roles in scale development, validation and scoring.


Knowledge Discovery and Data Mining | 1999

The impact of changing populations on classifier performance

Mark Kelly; David J. Hand; Niall M. Adams

An assumption fundamental to almost all work on supervised classification is that the probabilities of class membership, conditional on the feature vectors, are stationary. However, in many situations this assumption is untenable. We give examples of such population drift, examine its nature, show how the impact of population drift depends on the chosen measure of classification performance, and propose a strategy for dynamically updating classification rules.
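
One simple dynamic-updating strategy, sketched here for illustration (it is not the paper's proposal): keep only a sliding window of the most recent labelled cases and refit the rule periodically, so the classifier tracks drifting class-membership probabilities.

```python
from collections import deque

import numpy as np
from sklearn.linear_model import LogisticRegression

class WindowedClassifier:
    """Refit on the most recent `window` labelled cases every `refit_every` steps."""

    def __init__(self, window=1000, refit_every=100):
        self.X = deque(maxlen=window)   # old cases fall out as new ones arrive
        self.y = deque(maxlen=window)
        self.refit_every = refit_every
        self.model = LogisticRegression(max_iter=1000)
        self.seen = 0
        self.fitted = False

    def observe(self, x, label):
        self.X.append(x)
        self.y.append(label)
        self.seen += 1
        # Refit periodically, once both classes are present in the window.
        if self.seen % self.refit_every == 0 and len(set(self.y)) > 1:
            self.model.fit(np.array(self.X), np.array(self.y))
            self.fitted = True

    def predict(self, x):
        # Fall back to a default class until the first fit has happened.
        return int(self.model.predict(np.array([x]))[0]) if self.fitted else 0
```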

Collaboration


Dive into David J. Hand's collaborations.

Top Co-Authors

Padhraic Smyth

University of California

Paul Allin

Imperial College London
