Frances M. G. Pearl
University College London
Publication
Featured research published by Frances M. G. Pearl.
Nucleic Acids Research | 2007
Lesley H. Greene; Tony E. Lewis; Sarah Addou; Alison L. Cuff; Timothy Dallman; Mark Dibley; Oliver Redfern; Frances M. G. Pearl; Rekha Nambudiry; Adam J. Reid; Ian Sillitoe; Corin Yeats; Janet M. Thornton; Christine A. Orengo
We report the latest release (version 3.0) of the CATH protein domain database. There has been a 20% increase in the number of structural domains classified in CATH, up to 86 151 domains. Release 3.0 comprises 1110 fold groups and 2147 homologous superfamilies. To cope with the increases in diverse structural homologues being determined by the structural genomics initiatives, more sensitive methods have been developed for identifying boundaries in multi-domain proteins and for recognising homologues. The CATH classification update is now being driven by an integrated pipeline that links these automated procedures with validation steps that have been made easier by the provision of information-rich web pages summarising comparison scores and relevant links to external sites for each domain being classified. An analysis of the population of domains in the CATH hierarchy and several domain characteristics are presented for version 3.0. We also report an update of the CATH Dictionary of homologous structures (CATH-DHS), which now contains multiple structural alignments, consensus information and functional annotations for 1459 well-populated superfamilies in CATH. CATH is directly linked to the Gene3D database, which is a projection of CATH structural data onto ∼2 million sequences in completed genomes and UniProt.
Nucleic Acids Research | 2004
Frances M. G. Pearl; Annabel E. Todd; Ian Sillitoe; Mark Dibley; Oliver Redfern; Tony E. Lewis; Christopher G. Bennett; Russell L. Marsden; Alastair Grant; David A. Lee; Adrian Akpor; Michael Maibaum; Andrew P. Harrison; Timothy Dallman; Gabrielle A. Reeves; Ilhem Diboun; Sarah Addou; Stefano Lise; Caroline E. Johnston; Antonio Sillero; Janet M. Thornton; Christine A. Orengo
The CATH database of protein domain structures (http://www.biochem.ucl.ac.uk/bsm/cath/) currently contains 43 229 domains classified into 1467 superfamilies and 5107 sequence families. Each structural family is expanded with sequence relatives from GenBank and completed genomes, using a variety of efficient sequence search protocols and reliable thresholds. This extended CATH protein family database contains 616 470 domain sequences classified into 23 876 sequence families. This results in the significant expansion of the CATH HMM model library to include models built from the CATH sequence relatives, giving a 10% increase in coverage for detecting remote homologues. An improved Dictionary of Homologous superfamilies (DHS) (http://www.biochem.ucl.ac.uk/bsm/dhs/) containing specific sequence, structural and functional information for each superfamily in CATH considerably assists manual validation of homologues. Information on sequence relatives in CATH superfamilies, GenBank and completed genomes is presented in the CATH associated DHS and Gene3D resources. Domain partnership information can be obtained from Gene3D (http://www.biochem.ucl.ac.uk/bsm/cath/Gene3D/). A new CATH server has been implemented (http://www.biochem.ucl.ac.uk/cgi-bin/cath/CathServer.pl) providing automatic classification of newly determined sequences and structures using a suite of rapid sequence and structure comparison methods. The statistical significance of matches is assessed and links are provided to the putative superfamily or fold group to which the query sequence or structure is assigned.
Nucleic Acids Research | 2003
Frances M. G. Pearl; C. F. Bennett; James E. Bray; Andrew P. Harrison; Nigel J. Martin; Adrian J. Shepherd; Ian Sillitoe; Janet M. Thornton; Christine A. Orengo
The CATH database of protein domain structures (http://www.biochem.ucl.ac.uk/bsm/cath_new) currently contains 34 287 domain structures classified into 1383 superfamilies and 3285 sequence families. Each structural family is expanded with domain sequence relatives recruited from GenBank using a variety of efficient sequence search protocols and reliable thresholds. This extended resource, known as the CATH-protein family database (CATH-PFDB), contains a total of 310 000 domain sequences classified into 26 812 sequence families. New sequence search protocols have been designed, based on these intermediate sequence libraries, to allow more regular updating of the classification. Further developments include the adaptation of a recently developed method for rapid structure comparison, based on secondary structure matching, for domain boundary assignment (CATHEDRAL). The philosophy behind CATHEDRAL is the recognition of recurrent folds already classified in CATH. Benchmarking of CATHEDRAL, using manually validated domain assignments, demonstrated that 43% of domain boundaries could be completely automatically assigned. This is an improvement on a previous consensus approach for which only 10-20% of domains could be reliably processed in a completely automated fashion. Since domain boundary assignment is a significant bottleneck in the classification of new structures, CATHEDRAL will also help to increase the frequency of CATH updates.
Nucleic Acids Research | 2000
Frances M. G. Pearl; David A. Lee; James E. Bray; Ian Sillitoe; Annabel E. Todd; Andrew P. Harrison; Janet M. Thornton; Christine A. Orengo
We report the latest release (version 1.6) of the CATH protein domains database (http://www.biochem.ucl.ac.uk/bsm/cath). This is a hierarchical classification of 18 577 domains into evolutionary families and structural groupings. We have identified 1028 homologous superfamilies in which the proteins have both structural, and sequence or functional similarity. These can be further clustered into 672 fold groups and 35 distinct architectures. Recent developments of the database include the generation of 3D templates for recognising structural relatives in each fold group, which has led to significant improvements in the speed and accuracy of updating the database and also means that less manual validation is required. We also report the establishment of the CATH-PFDB (Protein Family Database), which associates 1D sequences with the 3D homologous superfamilies. Sequences showing identifiable homology to entries in CATH have been extracted from GenBank using PSI-BLAST. A CATH-PSIBLAST server has been established, which allows you to scan a new sequence against the database. The CATH Dictionary of Homologous Superfamilies (DHS), which contains validated multiple structural alignments annotated with consensus functional information for evolutionary protein superfamilies, has been updated to include annotations associated with sequence relatives identified in GenBank. The DHS is a powerful tool for considering the variation of functional properties within a given CATH superfamily and in deciding what functional properties may be reliably inherited by a newly identified relative.
Nature Reviews Cancer | 2015
Laurence H. Pearl; Amanda C. Schierz; Simon E. Ward; Bissan Al-Lazikani; Frances M. G. Pearl
The DNA damage response (DDR) is essential for maintaining the genomic integrity of the cell, and its disruption is one of the hallmarks of cancer. Classically, defects in the DDR have been exploited therapeutically in the treatment of cancer with radiation therapies or genotoxic chemotherapies. More recently, protein components of the DDR systems have been identified as promising avenues for targeted cancer therapeutics. Here, we present an in-depth analysis of the function, role in cancer and therapeutic potential of 450 expert-curated human DDR genes. We discuss the DDR drugs that have been approved by the US Food and Drug Administration (FDA) or that are under clinical investigation. We examine large-scale genomic and expression data for 15 cancers to identify deregulated components of the DDR, and we apply systematic computational analysis to identify DDR proteins that are amenable to modulation by small molecules, highlighting potential novel therapeutic targets.
Journal of Molecular Biology | 2002
Andrew P. Harrison; Frances M. G. Pearl; Richard Mott; Janet M. Thornton; Christine A. Orengo
We have used GRATH, a graph-based structure comparison algorithm, to map the similarities between the different folds observed in the CATH domain structure database. Statistical analysis of the distributions of the fold similarities has allowed us to assess the significance for any similarity. Therefore we have examined whether it is best to represent folds as discrete entities or whether, in fact, a more accurate model would be a continuum wherein folds overlap via common motifs. To do this we have introduced a new statistical measure of fold similarity, termed gregariousness. For a particular fold, gregariousness measures how many other folds have a significant structural overlap with that fold, typically comprising 40% or more of the larger structure. Gregarious folds often contain commonly occurring super-secondary structural motifs, such as beta-meanders, greek keys, alpha-beta plait motifs or alpha-hairpins, which are matching similar motifs in other folds. Apart from one example, all the most gregarious folds, matching 20% or more of the other folds in the database, are alpha-beta proteins. They also occur in highly populated architectural regions of fold space, adopting sandwich-like arrangements containing two or more layers of alpha-helices and beta-strands. Domains that exhibit low gregariousness are those that have very distinctive folds, with few common motifs or motifs that are packed in unusual arrangements. Most of the superhelices exhibit low gregariousness despite containing some commonly occurring super-secondary structural motifs. In these folds, these common motifs are combined in an unusual way and represent a small proportion of the fold (<10%). Our results suggest that fold space may be considered as continuous for some architectural arrangements (e.g. alpha-beta sandwiches), in that super-secondary motifs can be used to link neighbouring fold groups.
However, in other regions of fold space much more discrete topologies are observed with little similarity between folds.
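The gregariousness measure described in this abstract can be illustrated with a minimal Python sketch. It assumes a precomputed table of pairwise overlaps, each expressed as the fraction of the larger structure that is matched. The function name `gregariousness`, the fold labels, the overlap values, the use of 40% as a hard cutoff and the normalisation to a fraction of the other folds are all illustrative assumptions; the paper assesses significance statistically rather than with a fixed threshold.

```python
from typing import Dict, FrozenSet

# Hypothetical pairwise overlaps: fraction of the larger structure that is matched.
Overlaps = Dict[FrozenSet[str], float]


def gregariousness(fold: str, overlaps: Overlaps, threshold: float = 0.4) -> float:
    """Fraction of the other folds whose overlap with `fold` meets the threshold."""
    # Collect every fold label that appears in the overlap table.
    folds = {name for pair in overlaps for name in pair}
    others = folds - {fold}
    if not others:
        return 0.0
    hits = sum(
        1
        for other in others
        if overlaps.get(frozenset((fold, other)), 0.0) >= threshold
    )
    return hits / len(others)


# Invented toy data, not values from the paper.
example: Overlaps = {
    frozenset(("sandwich", "plait")): 0.55,
    frozenset(("sandwich", "meander")): 0.42,
    frozenset(("sandwich", "superhelix")): 0.08,
    frozenset(("plait", "meander")): 0.45,
    frozenset(("plait", "superhelix")): 0.05,
    frozenset(("meander", "superhelix")): 0.03,
}

print(gregariousness("sandwich", example))
print(gregariousness("superhelix", example))
```

In this toy table the "sandwich" fold overlaps significantly with two of the three other folds, while "superhelix" overlaps with none, mirroring the contrast between gregarious and distinctive folds drawn in the abstract.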
Bioinformatics | 2003
Andrew P. Harrison; Frances M. G. Pearl; Ian Sillitoe; Tim Slidel; Richard Mott; Janet M. Thornton; Christine A. Orengo
This paper reports a graph-theoretic program, GRATH, that rapidly and accurately matches a novel structure against a library of domain structures to find the most similar ones. GRATH generates distributions of scores by comparing the novel domain against the different types of folds that have been classified previously in the CATH database of structural domains. GRATH uses a measure of similarity that details the geometric information, number of secondary structures and number of residues within secondary structures, that any two protein structures share. Although GRATH builds on well-established approaches for secondary structure comparison, a novel scoring scheme has been introduced to allow ranking of any matches identified by the algorithm. More importantly, we have benchmarked the algorithm using a large dataset of 1702 non-redundant structures from the CATH database which have already been classified into fold groups, with manual validation. This has facilitated introduction of further constraints, optimization of parameters and identification of reliable thresholds for fold identification. Following these benchmarking trials, the correct fold can be identified with the top score with a frequency of 90%. It is identified within the ten most likely assignments with a frequency of 98%. GRATH has been implemented for use via a server (http://www.biochem.ucl.ac.uk/cgi-bin/cath/Grath.pl). GRATH's speed and accuracy mean that it can be used as a reliable front-end filter for the more accurate, but computationally expensive, residue-based structure comparison algorithm SSAP, currently used to classify domain structures in the CATH database. With an increasing number of structures being solved by the structural genomics initiatives, the GRATH server also provides an essential resource for determining whether newly determined structures are related to any known structures from which functional properties may be inferred.
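The benchmark figures quoted above (correct fold ranked first, or within the ten best assignments) are top-k identification frequencies. A minimal Python sketch of how such a frequency could be computed follows. The function name `top_k_frequency`, the query names and the ranked lists are hypothetical; this is not GRATH's scoring scheme, only the bookkeeping applied once ranked matches are available.

```python
from typing import Dict, List


def top_k_frequency(ranked: Dict[str, List[str]], truth: Dict[str, str], k: int) -> float:
    """Fraction of queries whose correct fold is among the k best-scoring matches."""
    if not ranked:
        return 0.0
    hits = sum(1 for query, folds in ranked.items() if truth[query] in folds[:k])
    return hits / len(ranked)


# Invented toy data: each query maps to fold labels sorted from best to worst score.
ranked_matches = {
    "query1": ["foldA", "foldB"],
    "query2": ["foldB", "foldA"],
    "query3": ["foldC", "foldA"],
}
correct_fold = {"query1": "foldA", "query2": "foldA", "query3": "foldB"}

print(top_k_frequency(ranked_matches, correct_fold, k=1))
print(top_k_frequency(ranked_matches, correct_fold, k=2))
```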
Proteomics | 2002
Christine A. Orengo; James E. Bray; Daniel W. A. Buchan; Andrew P. Harrison; David A. Lee; Frances M. G. Pearl; Ian Sillitoe; Annabel E. Todd; Janet M. Thornton
Over the last decade, there have been huge increases in the numbers of protein sequences and structures determined. In parallel, many methods have been developed for recognising similarities between these proteins, arising from their common evolutionary background, and for clustering such relatives into protein families. Here we review some of the protein family resources available to the biologist and describe how these can be used to provide structural and functional annotations for newly determined sequences. In particular we describe recent developments to the CATH domain database of protein structural families which have facilitated genome annotation and which have also revealed important caveats that must be considered when transferring functional data between homologous proteins.
Nucleic Acids Research | 2001
M. Mar Albà; David A. Lee; Frances M. G. Pearl; Adrian J. Shepherd; Nigel J. Martin; Christine A. Orengo; Paul Kellam
VIDA is a new virus database that organizes open reading frames (ORFs) from partial and complete genomic sequences from animal viruses. Currently VIDA includes all sequences from GenBank for Herpesviridae, Coronaviridae and Arteriviridae. The ORFs are organized into homologous protein families, which are identified on the basis of sequence similarity relationships. Conserved sequence regions of potential functional importance are identified and can be retrieved as sequence alignments. We use a controlled taxonomical and functional classification for all the proteins and protein families in the database. When available, protein structures that are related to the families have also been included. The database is available for online search and sequence information retrieval at http://www.biochem.ucl.ac.uk/bsm/virus_database/VIDA.html.
PLOS Computational Biology | 2007
Oliver Redfern; Andrew P. Harrison; Timothy Dallman; Frances M. G. Pearl; Christine A. Orengo
We present CATHEDRAL, an iterative protocol for determining the location of previously observed protein folds in novel multidomain protein structures. CATHEDRAL builds on the features of a fast secondary-structure–based method (using graph theory) to locate known folds within a multidomain context and a residue-based, double-dynamic programming algorithm, which is used to align members of the target fold groups against the query protein structure to identify the closest relative and assign domain boundaries. To increase the fidelity of the assignments, a support vector machine is used to provide an optimal scoring scheme. Once a domain is verified, it is excised, and the search protocol is repeated in an iterative fashion until all recognisable domains have been identified. We have performed an initial benchmark of CATHEDRAL against other publicly available structure comparison methods using a consensus dataset of domains derived from the CATH and SCOP domain classifications. CATHEDRAL shows superior performance in fold recognition and alignment accuracy when compared with many equivalent methods. If a novel multidomain structure contains a known fold, CATHEDRAL will locate it in 90% of cases, with <1% false positives. For nearly 80% of assigned domains in a manually validated test set, the boundaries were correctly delineated within a tolerance of ten residues. For the remaining cases, previously classified domains were very remotely related to the query chain so that embellishments to the core of the fold caused significant differences in domain sizes and manual refinement of the boundaries was necessary. To put this performance in context, a well-established sequence method based on hidden Markov models was only able to detect 65% of domains, with 33% of the subsequent boundaries assigned within ten residues. 
Since, on average, 50% of newly determined protein structures contain more than one domain unit, and typically 90% or more of these domains are already classified in CATH, CATHEDRAL will considerably facilitate the automation of protein structure classification.
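The iterative "locate, verify, excise, repeat" protocol described in this abstract can be sketched in Python as below. This is only an outline of the control flow under simplifying assumptions: a structure is reduced to a set of residue numbers, and the fold search is a hypothetical callback (`find_best`) standing in for the graph-based scan, residue-level alignment and support vector machine scoring. The names `assign_domains`, `toy_finder`, `fold_a` and `fold_b` are illustrative and do not come from CATHEDRAL itself.

```python
from typing import Callable, Dict, List, Optional, Set, Tuple

Match = Tuple[str, Set[int]]  # (fold label, residue numbers covered by the match)


def assign_domains(
    residues: Set[int],
    find_best: Callable[[Set[int]], Optional[Match]],
) -> List[Match]:
    """Repeatedly find the best-matching fold, excise it, and search what remains."""
    remaining = set(residues)
    assigned: List[Match] = []
    while remaining:
        match = find_best(remaining)
        if match is None:
            break  # no recognisable fold is left
        fold, covered = match
        covered = covered & remaining
        if not covered:
            break  # guard against a match that excises nothing
        assigned.append((fold, covered))
        remaining -= covered  # excise the verified domain before the next pass
    return assigned


def toy_finder(library: Dict[str, Set[int]]) -> Callable[[Set[int]], Optional[Match]]:
    """Hypothetical stand-in for the fold search: picks the largest template
    that lies entirely within the residues still unassigned."""
    def find_best(remaining: Set[int]) -> Optional[Match]:
        candidates = [(fold, res) for fold, res in library.items() if res <= remaining]
        if not candidates:
            return None
        return max(candidates, key=lambda item: len(item[1]))
    return find_best


# Invented two-domain chain of 399 residues.
library = {"fold_a": set(range(1, 121)), "fold_b": set(range(130, 380))}
domains = assign_domains(set(range(1, 400)), toy_finder(library))
print([(fold, min(res), max(res)) for fold, res in domains])
```

The loop stops either when every residue has been assigned or when the search callback reports no further recognisable fold, which corresponds to the point at which manual refinement would take over.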