Eibe Frank
University of Waikato
Network
Latest external collaboration on country level. Dive into details by clicking on the dots.
Publication
Featured researches published by Eibe Frank.
Sigkdd Explorations | 2009
Mark A. Hall; Eibe Frank; Geoffrey Holmes; Bernhard Pfahringer; Peter Reutemann; Ian H. Witten
More than twelve years have elapsed since the first public release of WEKA. In that time, the software has been rewritten entirely from scratch, evolved substantially and now accompanies a text on data mining [35]. These days, WEKA enjoys widespread acceptance in both academia and business, has an active community, and has been downloaded more than 1.4 million times since being placed on Source-Forge in April 2000. This paper provides an introduction to the WEKA workbench, reviews the history of the project, and, in light of the recent 3.6 stable release, briefly discusses what has been added since the last stable version (Weka 3.4) released in 2003.
international conference on management of data | 2002
Ian H. Witten; Eibe Frank
1. Whats It All About? 2. Input: Concepts, Instances, Attributes 3. Output: Knowledge Representation 4. Algorithms: The Basic Methods 5. Credibility: Evaluating Whats Been Learned 6. Implementations: Real Machine Learning Schemes 7. Moving On: Engineering The Input And Output 8. Nuts And Bolts: Machine Learning Algorithms In Java 9. Looking Forward
Machine Learning | 2011
Jesse Read; Bernhard Pfahringer; Geoff Holmes; Eibe Frank
The widely known binary relevance method for multi-label classification, which considers each label as an independent binary problem, has often been overlooked in the literature due to the perceived inadequacy of not directly modelling label correlations. Most current methods invest considerable complexity to model interdependencies between labels. This paper shows that binary relevance-based methods have much to offer, and that high predictive performance can be obtained without impeding scalability to large datasets. We exemplify this with a novel classifier chains method that can model label correlations while maintaining acceptable computational complexity. We extend this approach further in an ensemble framework. An extensive empirical evaluation covers a broad range of multi-label datasets with a variety of evaluation metrics. The results illustrate the competitiveness of the chaining method against related and state-of-the-art methods, both in terms of predictive performance and time complexity.
acm international conference on digital libraries | 1999
Ian H. Witten; Gordon W. Paynter; Eibe Frank; Carl Gutwin; Craig G. Nevill-Manning
Keyphrases provide semantic metadata that summarize and characterize documents. This paper describes Kea, an algorithm for automatically extracting keyphrases from text. Kea identifies candidate keyphrases using lexical methods, calculates feature values for each candidate, and uses a machinelearning algorithm to predict which candidates are good keyphrases. The machine learning scheme first builds a prediction model using training documents with known keyphrases, and then uses the model to find keyphrases in new documents. We use a large test corpus to evaluate Kea’s effectiveness in terms of how many author-assigned keyphrases are correctly identified. The system is simple, robust, and publicly available.
Machine Learning | 2005
Niels Landwehr; Mark A. Hall; Eibe Frank
Tree induction methods and linear models are popular techniques for supervised learning tasks, both for the prediction of nominal classes and numeric values. For predicting numeric quantities, there has been work on combining these two schemes into ‘model trees’, i.e. trees that contain linear regression functions at the leaves. In this paper, we present an algorithm that adapts this idea for classification problems, using logistic regression instead of linear regression. We use a stagewise fitting process to construct the logistic regression models that can select relevant attributes in the data in a natural way, and show how this approach can be used to build the logistic regression models at the leaves by incrementally refining those constructed at higher levels in the tree. We compare the performance of our algorithm to several other state-of-the-art learning schemes on 36 benchmark UCI datasets, and show that it produces accurate and compact classifiers.
Bioinformatics | 2004
Eibe Frank; Mark A. Hall; Leonard E. Trigg; Geoffrey Holmes; Ian H. Witten
UNLABELLED The Weka machine learning workbench provides a general-purpose environment for automatic classification, regression, clustering and feature selection-common data mining problems in bioinformatics research. It contains an extensive collection of machine learning algorithms and data pre-processing methods complemented by graphical user interfaces for data exploration and the experimental comparison of different machine learning techniques on the same problem. Weka can process data given in the form of a single relational table. Its main objectives are to (a) assist users in extracting useful information from data and (b) enable them to easily identify a suitable algorithm for generating an accurate predictive model from it. AVAILABILITY http://www.cs.waikato.ac.nz/ml/weka.
discovery science | 2010
Albert Bifet; Eibe Frank
Micro-blogs are a challenging new source of information for data mining techniques. Twitter is a micro-blogging service built to discover what is happening at any moment in time, anywhere in the world. Twitter messages are short, and generated constantly, and well suited for knowledge discovery using data stream mining. We briefly discuss the challenges that Twitter data streams pose, focusing on classification problems, and then consider these streams for opinion mining and sentiment analysis. To deal with streaming unbalanced classes, we propose a sliding window Kappa statistic for evaluation in time-changing data streams. Using this statistic we perform a study on Twitter data using learning algorithms for data streams.
european conference on machine learning | 2001
Eibe Frank; Mark A. Hall
Machine learning methods for classification problems commonly assume that the class values are unordered. However, in many practical applications the class values do exhibit a natural order--for example, when learning how to grade. The standard approach to ordinal classification converts the class value into a numeric quantity and applies a regression learner to the transformed data, translating the output back into a discrete class value in a post-processing step. A disadvantage of this method is that it can only be applied in conjunction with a regression scheme. In this paper we present a simple method that enables standard classification algorithms to make use of ordering information in class attributes. By applying it in conjunction with a decision tree learner we show that it outperforms the naive approach, which treats the class values as an unordered set. Compared to special-purpose algorithms for ordinal classification our method has the advantage that it can be applied without any modification to the underlying learning scheme.
Machine Learning | 1998
Eibe Frank; Yong Wang; Stuart J. Inglis; Geoffrey Holmes; Ian H. Witten
Model trees, which are a type of decision tree with linear regression functions at the leaves, form the basis of a recent successful technique for predicting continuous numeric values. They can be applied to classification problems by employing a standard method of transforming a classification problem into a problem of function approximation. Surprisingly, using this simple transformation the model tree inducer M5′, based on Quinlans M5, generates more accurate classifiers than the state-of-the-art decision tree learner C5.0, particularly when most of the attributes are numeric.
Computational Biology and Chemistry | 2005
Yu Wang; Igor V. Tetko; Mark A. Hall; Eibe Frank; Axel Facius; Klaus F. X. Mayer; Hans-Werner Mewes
A DNA microarray can track the expression levels of thousands of genes simultaneously. Previous research has demonstrated that this technology can be useful in the classification of cancers. Cancer microarray data normally contains a small number of samples which have a large number of gene expression levels as features. To select relevant genes involved in different types of cancer remains a challenge. In order to extract useful gene information from cancer microarray data and reduce dimensionality, feature selection algorithms were systematically investigated in this study. Using a correlation-based feature selector combined with machine learning algorithms such as decision trees, naïve Bayes and support vector machines, we show that classification performance at least as good as published results can be obtained on acute leukemia and diffuse large B-cell lymphoma microarray data sets. We also demonstrate that a combined use of different classification and feature selection approaches makes it possible to select relevant genes with high confidence. This is also the first paper which discusses both computational and biological evidence for the involvement of zyxin in leukaemogenesis.