Geoffrey I. Webb | Researchain

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Geoffrey I. Webb is active.

Explore More

Publication

Featured researches published by Geoffrey I. Webb.

Machine Learning | 2007

Discovering significant patterns

Geoffrey I. Webb

AbstractnPattern discovery techniques, such as association rule discovery, explore large search spaces of potential patterns to find those that satisfy some user-specified constraints. Due to the large number of patterns considered, they suffer from an extreme risk of type-1 error, that is, of finding patterns that appear due to chance alone to satisfy the constraints on the sample data. This paper proposes techniques to overcome this problem by applying well-established statistical practices. These allow the user to enforce a strict upper limit on the risk of experimentwise error. Empirical studies demonstrate that standard pattern discovery techniques can discover numerous spurious patterns when applied to random data and when applied to real-world data result in large numbers of patterns that are rejected when subjected to sound statistical evaluation. They also reveal that a number of pragmatic choices about how such tests are performed can greatly affect their power.n

Journal of Artificial Intelligence Research | 1995

OPUS: an efficient admissible algorithm for unordered search

Geoffrey I. Webb

OPUS is a branch and bound search algorithm that enables efficient admissible search through spaces for which the order of search operator application is not significant. The algorithms search efficiency is demonstrated with respect to very large machine learning search spaces. The use of admissible search is of potential value to the machine learning community as it means that the exact learning biases to be employed for complex learning tasks can be precisely specified and manipulated. OPUS also has potential for application in other areas of artificial intelligence, notably, truth maintenance.

knowledge discovery and data mining | 2000

Efficient search for association rules

Geoffrey I. Webb

This paper argues that for some applications direct search for association rules can be more e cient than the tw o stage process of the Apriori algorithm which rst nds large itemsets whic hare then used to iden tify associations. In particular, it is argued, Apriori can impose large computational overheads when the number of frequen titemsets is very large. This will often be the case when association rule analysis is performed on domains other than basket analysis or when it is performed for basket analysis with basket information augmented b y other customer information. An algorithm is presented that is computationally e cient for association rule analyses during which the n umber of rules to be found can be constrained and all data can be maintained in memory.

Data Mining and Knowledge Discovery | 2005

K-Optimal Rule Discovery

Geoffrey I. Webb; Songmao Zhang

K-optimal rule discovery finds the K rules that optimize a user-specified measure of rule value with respect to a set of sample data and user-specified constraints. This approach avoids many limitations of the frequent itemset approach of association rule discovery. This paper presents a scalable algorithm applicable to a wide range of K-optimal rule discovery tasks and demonstrates its efficiency.

knowledge discovery and data mining | 2003

On detecting differences between groups

Geoffrey I. Webb; Shane M. Butler; Douglas A. Newlands

Understanding the differences between contrasting groups is a fundamental task in data analysis. This realization has led to the development of a new special purpose data mining technique, contrast-set mining. We undertook a study with a retail collaborator to compare contrast-set mining with existing rule-discovery techniques. To our surprise we observed that straightforward application of an existing commercial rule-discovery system, Magnum Opus, could successfully perform the contrast-set-mining task. This led to the realization that contrast-set mining is a special case of the more general rule-discovery task. We present the results of our study together with a proof of this conclusion.

knowledge discovery and data mining | 2006

Discovering significant rules

Geoffrey I. Webb

In many applications, association rules will only be interesting if they represent non-trivial correlations between all constituent items. Numerous techniques have been developed that seek to avoid false discoveries. However, while all provide useful solutions to aspects of this problem, none provides a generic solution that is both flexible enough to accommodate varying definitions of true and false discoveries and powerful enough to provide strict control over the risk of false discoveries. This paper presents generic techniques that allow definitions of true and false discoveries to be specified in terms of arbitrary statistical hypothesis tests and which provide strict control over the experiment wise risk of false discoveries.

knowledge discovery and data mining | 2001

Discovering associations with numeric variables

Geoffrey I. Webb

This paper further develops Aumann and Lindells [3] proposal for a variant of association rules for which the consequent is a numeric variable. It is argued that these rules can discover useful interactions with numeric data that cannot be discovered directly using traditional association rules with discretization. Alternative measures for identifying interesting rules are proposed. Efficient algorithms are presented that enable these rules to be discovered for dense data sets for which application of Auman and Lindells algorithm is infeasible.

european conference on principles of data mining and knowledge discovery | 2002

The Need for Low Bias Algorithms in Classification Learning from Large Data Sets

Damien Brain; Geoffrey I. Webb

This paper reviews the appropriateness for application to large data sets of standard machine learning algorithms, which were mainly developed in the context of small data sets. Sampling and parallelisation have proved useful means for reducing computation time when learning from large data sets. However, such methods assume that algorithms that were designed for use with what are now considered small data sets are also fundamentally suitable for large data sets. It is plausible that optimal learning from large data sets requires a different type of algorithm to optimal learning from small data sets. This paper investigates one respect in which data set size may affect the requirements of a learning algorithm - the bias plus variance decomposition of classification error. Experiments show that learning from large data sets may be more effective when using an algorithm that places greater emphasis on bias management, rather than variance management.

Bioinformatics | 2015

GlycoMine: a machine learning-based approach for predicting N-, C- and O-linked glycosylation in the human proteome

Fuyi Li; Chen Li; Mingjun Wang; Geoffrey I. Webb; Yang Zhang; James C. Whisstock; Jiangning Song

MOTIVATIONnGlycosylation is a ubiquitous type of protein post-translational modification (PTM) in eukaryotic cells, which plays vital roles in various biological processes (BPs) such as cellular communication, ligand recognition and subcellular recognition. It is estimated that >50% of the entire human proteome is glycosylated. However, it is still a significant challenge to identify glycosylation sites, which requires expensive/laborious experimental research. Thus, bioinformatics approaches that can predict the glycan occupancy at specific sequons in protein sequences would be useful for understanding and utilizing this important PTM.nnnRESULTSnIn this study, we present a novel bioinformatics tool called GlycoMine, which is a comprehensive tool for the systematic in silico identification of C-linked, N-linked, and O-linked glycosylation sites in the human proteome. GlycoMine was developed using the random forest algorithm and evaluated based on a well-prepared up-to-date benchmark dataset that encompasses all three types of glycosylation sites, which was curated from multiple public resources. Heterogeneous sequences and functional features were derived from various sources, and subjected to further two-step feature selection to characterize a condensed subset of optimal features that contributed most to the type-specific prediction of glycosylation sites. Five-fold cross-validation and independent tests show that this approach significantly improved the prediction performance compared with four existing prediction tools: NetNGlyc, NetOGlyc, EnsembleGly and GPP. We demonstrated that this tool could identify candidate glycosylation sites in case study proteins and applied it to identify many high-confidence glycosylation target proteins by screening the entire human proteome.nnnAVAILABILITY AND IMPLEMENTATIONnThe webserver, Java Applet, user instructions, datasets, and predicted glycosylation sites in the human proteome are freely available at http://www.structbioinfor.org/Lab/GlycoMine/[email protected] or [email protected] or [email protected] INFORMATIONnSupplementary data are available at Bioinformatics online.

Wiley Interdisciplinary Reviews-Data Mining and Knowledge Discovery | 2011

Filtered-top-k association discovery

Geoffrey I. Webb

Association mining has been one of the most intensively researched areas of data mining. However, direct uptake of the resulting technologies has been relatively low. This paper examines some of the reasons why the dominant paradigms in association mining have not lived up to their promise, and argues that a powerful alternative is provided by top‐k techniques coupled with appropriate statistical and other filtering.

Explore More