Peter Willett | Researchain

Publication

Featured research published by Peter Willett.

Journal of Chemical Information and Computer Sciences | 1998

Chemical Similarity Searching

Peter Willett; John M. Barnard; Geoffrey M. Downs

This paper reviews the use of similarity searching in chemical databases. It begins by introducing the concept of similarity searching, differentiating it from the more common substructure searching, and then discusses the current generation of fragment-based measures that are used for searching chemical structure databases. The next sections focus upon two of the principal characteristics of a similarity measure: the coefficient that is used to quantify the degree of structural resemblance between pairs of molecules and the structural representations that are used to characterize molecules that are being compared in a similarity calculation. New types of similarity measure are then compared with current approaches, and examples are given of several applications that are related to similarity searching.
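As a minimal sketch of the kind of fragment-based similarity search the abstract describes, the example below ranks database molecules against a query using the Tanimoto coefficient. The abstract does not name a specific coefficient, so the choice of Tanimoto is an assumption, as are the function names and the representation of a fingerprint as a set of "on" bit positions.

```python
def tanimoto(a, b):
    """Tanimoto coefficient between two fingerprints (sets of on-bit positions)."""
    if not a and not b:
        return 0.0  # convention assumed here for two empty fingerprints
    common = len(a & b)
    return common / (len(a) + len(b) - common)


def similarity_search(query, database, top_k=3):
    """Rank database entries by decreasing similarity to the query.

    `database` is a hypothetical mapping from molecule name to fingerprint.
    Ties are broken alphabetically so the ranking is deterministic.
    """
    scored = [(tanimoto(query, fp), name) for name, fp in database.items()]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(name, score) for score, name in scored[:top_k]]


if __name__ == "__main__":
    db = {"mol_a": {1, 2, 3, 4}, "mol_b": {2, 3}, "mol_c": {9, 10}}
    print(similarity_search({1, 2, 3}, db))
```

Unlike a substructure search, which returns only exact partial matches, this returns every molecule in ranked order, so the user decides where to cut off the list.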

Information Processing and Management | 1988

Recent trends in hierarchic document clustering: a critical review

Peter Willett

This article reviews recent research into the use of hierarchic agglomerative clustering methods for document retrieval. After an introduction to the calculation of interdocument similarities and to clustering methods that are appropriate for document clustering, the article discusses algorithms that can be used to allow the implementation of these methods on databases of nontrivial size. The validation of document hierarchies is described using tests based on the theory of random graphs and on empirical characteristics of document collections that are to be clustered. A range of search strategies is available for retrieval from document hierarchies and the results are presented of a series of research projects that have used these strategies to search the clusters resulting from several different types of hierarchic agglomerative clustering method. It is suggested that the complete linkage method is probably the most effective method in terms of retrieval performance; however, it is also difficult to implement in an efficient manner. Other applications of document clustering techniques are discussed briefly; experimental evidence suggests that nearest neighbor clusters, possibly represented as a network model, provide a reasonably efficient and effective means of including interdocument similarity information in document retrieval systems.
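The complete linkage method mentioned above can be illustrated with a deliberately naive sketch. It assumes a precomputed, symmetric interdocument similarity matrix and recomputes every cluster-to-cluster link on each pass, which is exactly the kind of cost that makes the method hard to run efficiently on large collections. The function name and return format are hypothetical.

```python
from itertools import combinations


def complete_linkage(sim, n_clusters):
    """Naive agglomerative clustering with complete linkage.

    `sim[i][j]` is the similarity between documents i and j. Under complete
    linkage the similarity of two clusters is that of their LEAST similar
    pair of members; the two clusters with the highest such value are merged.
    """
    clusters = [[i] for i in range(len(sim))]
    merges = []  # (members of the new cluster, link similarity) per merge
    while len(clusters) > n_clusters:
        best = None
        for x, y in combinations(range(len(clusters)), 2):
            link = min(sim[i][j] for i in clusters[x] for j in clusters[y])
            if best is None or link > best[0]:
                best = (link, x, y)
        link, x, y = best
        merged = clusters[x] + clusters[y]
        merges.append((sorted(merged), link))
        clusters = [c for k, c in enumerate(clusters) if k not in (x, y)]
        clusters.append(merged)
    return clusters, merges


if __name__ == "__main__":
    sim = [
        [1.0, 0.9, 0.1, 0.2],
        [0.9, 1.0, 0.3, 0.1],
        [0.1, 0.3, 1.0, 0.8],
        [0.2, 0.1, 0.8, 1.0],
    ]
    print(complete_linkage(sim, 2))
```

The recorded merge sequence is what defines the document hierarchy; stopping at `n_clusters` simply cuts that hierarchy at one level.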

Journal of Computer-aided Molecular Design | 2002

Maximum common subgraph isomorphism algorithms for the matching of chemical structures

John W. Raymond; Peter Willett

The maximum common subgraph (MCS) problem has become increasingly important in those aspects of chemoinformatics that involve the matching of 2D or 3D chemical structures. This paper provides a classification and a review of the many MCS algorithms, both exact and approximate, that have been described in the literature, and makes recommendations regarding their applicability to typical chemoinformatics tasks.

Journal of Computer-aided Molecular Design | 1995

A genetic algorithm for flexible molecular overlay and pharmacophore elucidation

Gareth Jones; Peter Willett; Robert C. Glen

A genetic algorithm (GA) has been developed for the superimposition of sets of flexible molecules. Molecules are represented by a chromosome that encodes angles of rotation about flexible bonds and mappings between hydrogen-bond donor proton, acceptor lone pair and ring centre features in pairs of molecules. The molecule with the smallest number of features in the data set is used as a template, onto which the remaining molecules are fitted with the objective of maximising structural equivalences. The fitness function of the GA is a weighted combination of: (i) the number and the similarity of the features that have been overlaid in this way; (ii) the volume integral of the overlay; and (iii) the van der Waals energy of the molecular conformations defined by the torsion angles encoded in the chromosomes. The algorithm has been applied to a number of pharmacophore elucidation problems, i.e., angiotensin II receptor antagonists, Leu-enkephalin and a hybrid morphine molecule, 5-HT1D agonists, benzodiazepine receptor ligands, 5-HT3 antagonists, dopamine D2 antagonists, dopamine reuptake blockers and FKBP12 ligands. The resulting pharmacophores are generated rapidly and are in good agreement with those derived from alternative means.

Journal of the Association for Information Science and Technology | 1991

The Limitations of Term Co-Occurrence Data for Query Expansion in Document Retrieval Systems

Helen J. Peat; Peter Willett

Term cooccurrence data has been extensively used in document retrieval systems for the identification of indexing terms that are similar to those that have been specified in a user query: these similar terms can then be used to augment the original query statement. Despite the plausibility of this approach to query expansion, the retrieval effectiveness of the expanded queries is often no greater than, or even less than, the effectiveness of the unexpanded queries. This article demonstrates that the similar terms identified by cooccurrence data in a query expansion system tend to occur very frequently in the database that is being searched. Unfortunately, frequent terms tend to discriminate poorly between relevant and nonrelevant documents, and the general effect of query expansion is thus to add terms that do little or nothing to improve the discriminatory power of the original query.
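The expansion step discussed above might look roughly like the following sketch. It assumes documents are represented as sets of index terms and uses the Dice coefficient over document co-occurrence counts to pick the most similar term for each query term; the abstract does not specify the similarity measure, so Dice and all names here are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def expand_query(query_terms, docs, per_term=1):
    """Append to the query the terms that co-occur most strongly with it."""
    # Document frequency of each term, and co-occurrence count of each pair.
    df = Counter(t for d in docs for t in set(d))
    co = Counter()
    for d in docs:
        for a, b in combinations(sorted(set(d)), 2):
            co[(a, b)] += 1

    def dice(a, b):
        key = (a, b) if a < b else (b, a)
        return 2 * co[key] / (df[a] + df[b])

    expanded = list(query_terms)
    for q in query_terms:
        if q not in df:
            continue  # term never occurs, so it has no co-occurrence data
        candidates = [t for t in df if t != q and t not in expanded]
        candidates.sort(key=lambda t: (-dice(q, t), t))
        for t in candidates[:per_term]:
            if dice(q, t) > 0:
                expanded.append(t)
    return expanded


if __name__ == "__main__":
    docs = [
        {"protein", "structure", "graph"},
        {"protein", "structure"},
        {"graph", "clique"},
        {"protein", "folding"},
    ]
    print(expand_query(["protein"], docs))
```

Inspecting the document frequencies of the added terms in a real collection is one way to observe the effect the article reports, namely that the selected terms tend to be frequent and therefore weak discriminators.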

The Computer Journal | 2002

RASCAL: Calculation of Graph Similarity using Maximum Common Edge Subgraphs

John W. Raymond; Eleanor J. Gardiner; Peter Willett

A new graph similarity calculation procedure is introduced for comparing labeled graphs. Given a minimum similarity threshold, the procedure consists of an initial screening process to determine whether it is possible for the measure of similarity between the two graphs to exceed the minimum threshold, followed by a rigorous maximum common edge subgraph (MCES) detection algorithm to compute the exact degree and composition of similarity. The proposed MCES algorithm is based on a maximum clique formulation of the problem and is a significant improvement over other published algorithms. It presents new approaches to both lower and upper bounding as well as vertex selection.

Journal of Molecular Biology | 1990

Use of techniques derived from graph theory to compare secondary structure motifs in proteins

Eleanor M. Mitchell; Peter J. Artymiuk; David W. Rice; Peter Willett

A substructure matching algorithm is described that can be used for the automatic identification of secondary structural motifs in three-dimensional protein structures from the Protein Data Bank. The proteins and motifs are stored for searching as labelled graphs, with the nodes of a graph corresponding to linear representations of helices and strands and the edges to the inter-line angles and distances. A modification of Ullmann's subgraph isomorphism algorithm is described that can be used to search these graph representations. Tests with patterns from the protein structure literature demonstrate both the efficiency and the effectiveness of the search procedure, which has been implemented in FORTRAN 77 on a MicroVAX-II system, coupled to the molecular fitting program FRODO on an Evans and Sutherland PS300 graphics system.

Journal of Chemical Information and Computer Sciences | 2004

Comparison of fingerprint-based methods for virtual screening using multiple bioactive reference structures

Jérôme Hert; Peter Willett; David J. Wilton; Pierre Acklin; Kamal Azzaoui; Edgar Jacoby; Ansgar Schuffenhauer

Fingerprint-based similarity searching is widely used for virtual screening when only a single bioactive reference structure is available. This paper reviews three distinct ways of carrying out such searches when multiple bioactive reference structures are available: merging the individual fingerprints into a single combined fingerprint; applying data fusion to the similarity rankings resulting from individual similarity searches; and approximations to substructural analysis. Extended searches on the MDL Drug Data Report database suggest that fusing similarity scores is the most effective general approach, with the best individual results coming from the binary kernel discrimination technique.
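One way to picture score-based data fusion with multiple reference structures is the sketch below: each database molecule is scored against every reference, and the per-reference scores are combined into one value before ranking. The use of the Tanimoto coefficient and of the maximum as the fusion rule are assumptions for illustration, as are the function names; the abstract states only that fusing similarity scores worked well in general.

```python
def tanimoto(a, b):
    """Tanimoto coefficient between two fingerprints (sets of on-bit positions)."""
    if not a and not b:
        return 0.0  # convention assumed here for two empty fingerprints
    common = len(a & b)
    return common / (len(a) + len(b) - common)


def fused_ranking(references, database, fuse=max):
    """Rank database molecules by a fused similarity to several references.

    `references` is a non-empty list of fingerprints of known actives and
    `database` a hypothetical mapping from molecule name to fingerprint.
    `fuse` combines the per-reference scores (max by default).
    """
    scores = {
        name: fuse(tanimoto(ref, fp) for ref in references)
        for name, fp in database.items()
    }
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


if __name__ == "__main__":
    refs = [{1, 2, 3}, {7, 8}]
    db = {"a": {1, 2, 3, 4}, "b": {7, 8}, "c": {5, 6}}
    print(fused_ranking(refs, db))
```

With the maximum rule a molecule ranks highly if it resembles any one of the references, which suits sets of structurally diverse actives; passing a different `fuse` callable, such as `sum`, gives a rule that rewards resemblance to several references at once.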

Organic and Biomolecular Chemistry | 2004

Comparison of topological descriptors for similarity-based virtual screening using multiple bioactive reference structures

Jérôme Hert; Peter Willett; David J. Wilton; Pierre Acklin; Kamal Azzaoui; Edgar Jacoby; Ansgar Schuffenhauer

This paper reports a detailed comparison of a range of different types of 2D fingerprints when used for similarity-based virtual screening with multiple reference structures. Experiments with the MDL Drug Data Report database demonstrate the effectiveness of fingerprints that encode circular substructure descriptors generated using the Morgan algorithm. These fingerprints are notably more effective than fingerprints based on a fragment dictionary, on hashing and on topological pharmacophores. The combination of these fingerprints with data fusion based on similarity scores provides both an effective and an efficient approach to virtual screening in lead-discovery programmes.

Program: Electronic Library and Information Systems | 2006

The Porter stemming algorithm: then and now

Peter Willett

Purpose – In 1980, Porter presented a simple algorithm for stemming English language words. This paper summarises the main features of the algorithm, and highlights its role not just in modern information retrieval research, but also in a range of related subject domains. Design/methodology/approach – Review of literature and research involving use of the Porter algorithm. Findings – The algorithm has been widely adopted and extended so that it has become the standard approach to word conflation for information retrieval in a wide range of languages. Originality/value – The 1980 paper in Program by Porter describing his algorithm has been highly cited. This paper provides a context for the original paper as well as an overview of its subsequent use.
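To give a flavour of how the stemmer works, the sketch below implements only step 1a of the 1980 algorithm, which handles plural endings by applying the longest matching suffix rule (SSES to SS, IES to I, SS unchanged, S removed). The full algorithm has several further steps that depend on a measure of the stem's consonant-vowel structure, and they are not shown here. The function name is hypothetical.

```python
def porter_step_1a(word):
    """Apply step 1a of the Porter stemmer to a lower-case English word."""
    if word.endswith("sses"):
        return word[:-2]  # caresses -> caress
    if word.endswith("ies"):
        return word[:-2]  # ponies -> poni
    if word.endswith("ss"):
        return word       # caress -> caress
    if word.endswith("s"):
        return word[:-1]  # cats -> cat
    return word


if __name__ == "__main__":
    for w in ["caresses", "ponies", "ties", "caress", "cats"]:
        print(w, "->", porter_step_1a(w))
```

The output stems, such as "poni", need not be real words; for conflation in retrieval it only matters that related word forms map to the same string.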