Anne-Lise Veuthey
Swiss Institute of Bioinformatics
Publication
Featured research published by Anne-Lise Veuthey.
Bioinformatics | 2010
Anaïs Mottaz; Fabrice Pierre André David; Anne-Lise Veuthey; Yum Lina Yip
Summary: The SwissVar portal provides access to a comprehensive collection of single amino acid polymorphisms and diseases in the UniProtKB/Swiss-Prot database via a unique search engine. In particular, it gives direct access to the newly improved Swiss-Prot variant pages. The key strength of this portal is that it provides a possibility to query for similar diseases, as well as the underlying protein products and the molecular details of each variant. In the context of the recently proposed molecular view on diseases, the SwissVar portal should be in a unique position to provide valuable information for researchers and to advance research in this area. Availability: The SwissVar portal is available at www.expasy.org/swissvar Contact: [email protected]; [email protected] Supplementary information: Supplementary data are available at Bioinformatics online.
BMC Bioinformatics | 2008
Anaïs Mottaz; Yum Lina Yip; Patrick Ruch; Anne-Lise Veuthey
Background: Although the UniProt KnowledgeBase is not a medical-oriented database, it contains information on more than 2,000 human proteins involved in pathologies. However, these annotations are not standardized, which impairs the interoperability between biological and clinical resources. In order to make these data easily accessible to clinical researchers, we have developed a procedure to link diseases described in the UniProtKB/Swiss-Prot entries to the MeSH disease terminology. Results: We mapped disease names extracted either from the UniProtKB/Swiss-Prot entry comment lines or from the corresponding OMIM entry to the MeSH. Different methods were assessed on a benchmark set of 200 disease names manually mapped to MeSH terms. The performance of the retained procedure in terms of precision and recall was 86% and 64%, respectively. Using the same procedure, more than 3,000 disease names in Swiss-Prot were mapped to MeSH with comparable efficiency. Conclusions: This study is a first attempt to link proteins in UniProtKB to the medical resources. The indexing we provided will help clinicians and researchers navigate from diseases to genes and from genes to diseases in an efficient way. The mapping is available at: http://research.isb-sib.ch/unimed.
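As an illustration of how precision and recall figures like those above are typically computed against a manually mapped benchmark, here is a minimal Python sketch. The abstract does not spell out the exact definitions used, so the convention below (precision over names that received a mapping, recall over all benchmark names) is an assumption, and the function name and example terms are hypothetical.

```python
def precision_recall(predicted, gold):
    """Score automatic disease-name-to-MeSH mappings against a manual benchmark.

    predicted: dict of disease name -> proposed MeSH term (unmapped names absent)
    gold:      dict of disease name -> manually assigned MeSH term
    Assumed convention: precision is measured over the names that received
    a mapping, recall over every name in the benchmark.
    """
    proposed = [name for name in predicted if name in gold]
    correct = sum(1 for name in proposed if predicted[name] == gold[name])
    precision = correct / len(proposed) if proposed else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Toy benchmark with made-up entries (not data from the study).
gold = {
    "cystic fibrosis": "Cystic Fibrosis",
    "marfan syndrome": "Marfan Syndrome",
    "gaucher disease": "Gaucher Disease",
    "wilson disease": "Hepatolenticular Degeneration",
}
predicted = {
    "cystic fibrosis": "Cystic Fibrosis",
    "marfan syndrome": "Marfan Syndrome",
    "wilson disease": "Wilms Tumor",  # a wrong mapping lowers precision
}
precision, recall = precision_recall(predicted, gold)
```

A procedure that declines to map uncertain names keeps precision high at the cost of recall, which is consistent with the asymmetry between the two figures reported in the abstract.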
Bioinformatics | 2005
Violaine Pillet; Marc Zehnder; Alexander K. Seewald; Anne-Lise Veuthey; Johann Petrak
We present a new database, GPSDB (Gene and Protein Synonyms DataBase), which collects gene/protein names, in a species-specific way, from 14 main biological resources. A web-based search interface gives access to the database: given a gene/protein name, it retrieves all synonyms for this entity and queries Medline with a set of user-selected terms. Availability: GPSDB is freely available from http://biomint.oefai.at/ Contact: [email protected].
Journal of Bioinformatics and Computational Biology | 2007
Yum Lina Yip; Nathalie Lachenal; Violaine Pillet; Anne-Lise Veuthey
The UniProt/Swiss-Prot Knowledgebase records about 30,500 variants in 5,664 proteins (Release 52.2). Most of these variants are manually curated single amino acid polymorphisms (SAPs) with references to the literature. In order to keep the list of published documents related to SAPs up to date, an automatic information retrieval method was developed to recover texts mentioning SAPs. The method is based on the use of regular expressions (patterns) and rules for the detection and validation of mutations. When evaluated using a corpus of 9,820 PubMed references, the precision of the retrieval was determined to be 89.5% over all variants. It was also found that the use of nonstandard mutation nomenclature and sequence positional correction is necessary to retrieve a significant number of relevant articles. The method was applied to the 5,664 proteins with variants. This was performed by first submitting a PubMed query to retrieve articles using gene or protein names and a list of mutation-related keywords; the SAP detection procedure was then used to recover relevant documents. The method was found to be efficient in retrieving new references on known polymorphisms. New references on known SAPs will be rendered accessible to the public via the Swiss-Prot variant pages.
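The general idea of detecting mutation mentions with regular expressions and then validating them against the protein sequence can be sketched in Python as follows. This is not the authors' implementation: the two patterns, the function names, and the offset-based positional correction are illustrative assumptions, and the published method covers far more nomenclature variants.

```python
import re

# Three-letter to one-letter amino acid codes.
AA3_TO_1 = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}
AA1 = "".join(sorted(AA3_TO_1.values()))
AA3 = "|".join(AA3_TO_1)

# Hypothetical patterns: "Arg117His" or "p.Arg117His", and "R117H".
THREE_LETTER = re.compile(rf"\b(?:p\.)?({AA3})(\d+)({AA3})\b")
ONE_LETTER = re.compile(rf"\b([{AA1}])(\d+)([{AA1}])\b")


def find_sap_mentions(text):
    """Return (wild_type, position, variant) tuples in one-letter code.

    Three-letter matches are listed first, then one-letter matches.
    """
    hits = []
    for m in THREE_LETTER.finditer(text):
        hits.append((AA3_TO_1[m.group(1)], int(m.group(2)), AA3_TO_1[m.group(3)]))
    for m in ONE_LETTER.finditer(text):
        hits.append((m.group(1), int(m.group(2)), m.group(3)))
    return hits


def validate(mention, sequence, offsets=(0,)):
    """Accept a mention only if the wild-type residue sits at the stated position.

    `offsets` is an assumed, simplified form of positional correction,
    e.g. (0, -1) to tolerate numbering that counts one extra residue.
    """
    wild_type, position, _ = mention
    for offset in offsets:
        index = position + offset - 1  # convert 1-based position to 0-based index
        if 0 <= index < len(sequence) and sequence[index] == wild_type:
            return True
    return False
```

The validation step matters because short patterns such as the one-letter form also match unrelated strings (cell line names, for example); checking the wild-type residue against the sequence filters many of these out.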
Journal of General Virology | 2000
Karim Abid; Rafael Quadri; Anne-Lise Veuthey; Antoine Hadengue; Francesco Negro
Hepatitis C virus (HCV) sequences from throughout the world have been grouped into six clades, based on recently proposed criteria. Here, the partial sequences and clade assignment are reported for three HCV isolates from chronic hepatitis C patients from Somalia, for whom conventional assays failed to identify the genotype. Phylogenetic analysis of the sequences of the core, envelope 1 and part of the non-structural 5b regions suggests that all three isolates belong to a distinct HCV genetic group, tentatively classified as subtype 3h. This novel HCV subtype shows the highest sequence similarity with HCV isolates from Indonesia. Despite the fact that these patients were infected with HCV clade 3, none of them responded to standard interferon treatment.
IEEE International Conference on Information Technology and Applications in Biomedicine | 2009
Julien Gobeill; E. Patsche; D. Theodoro; Anne-Lise Veuthey; Christian Lovis; Patrick Ruch
Biomedical professionals have at their disposal a huge amount of data, such as literature, i.e. textual contents, or databases, i.e. structured contents. But when they have a question, they often have to deal with too many documents to find the appropriate answer in a reasonable time. We have developed a Question Answering system which aims to analyze the user's question, to retrieve the most relevant documents from MEDLINE, and to extract from these retrieved documents a list of candidate answers, ranked by confidence. These candidate answers are concepts drawn from biomedical controlled vocabularies, such as the Medical Subject Headings (MeSH) for a first step, and are extracted from the most relevant documents with pattern matching strategies. For evaluation purposes, we applied the system to two biological databases, UniProt and DrugBank. From these resources, we generated two large benchmarks of 200 questions dealing respectively with diseases and proteins, and with diseases and drugs. For these two sets, the first candidate answer proposed by our system is correct in 57% and 68% of cases, respectively, while 70% and 75% of all answers to find are contained in the first ten proposed candidate answers. Despite the use of simple Information Extraction strategies, our system exploits the redundancy of information in the literature in order to provide a powerful Question Answering system.
BMC Bioinformatics | 2008
Julien Gobeill; Imad Tbahriti; Frédéric Ehrler; Anaïs Mottaz; Anne-Lise Veuthey; Patrick Ruch
Background: This paper describes and evaluates a sentence selection engine that extracts a GeneRiF (Gene Reference into Functions) as defined in ENTREZ-Gene based on a MEDLINE record. Inputs for this task include both a gene and a pointer to a MEDLINE reference. In the suggested approach we merge two independent sentence extraction strategies. The first proposed strategy (LASt) uses argumentative features, inspired by discourse-analysis models. The second extraction scheme (GOEx) uses an automatic text categorizer to estimate the density of Gene Ontology categories in every sentence, thus providing a full ranking of all possible candidate GeneRiFs. A combination of the two approaches is proposed, which also aims at reducing the size of the selected segment by filtering out non-content bearing rhetorical phrases. Results: Based on the TREC-2003 Genomics collection for GeneRiF identification, the LASt extraction strategy is already competitive (52.78%). When used in a combined approach, the extraction task clearly shows improvement, achieving a Dice score of over 57% (+10%). Conclusions: Argumentative representation levels and conceptual density estimation using Gene Ontology contents appear complementary for functional annotation in proteomics.
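For readers unfamiliar with the Dice score cited above, the following Python sketch computes the classic set-based Dice coefficient between a candidate sentence and a reference GeneRiF. The lowercase whitespace tokenization is an assumption made for brevity; the TREC evaluation used its own preprocessing and several Dice variants that are not reproduced here.

```python
def dice(candidate, reference):
    """Set-based Dice coefficient: 2 * |A & B| / (|A| + |B|).

    Tokens are lowercase, whitespace-separated words (an assumed,
    simplified tokenization).
    """
    a = set(candidate.lower().split())
    b = set(reference.lower().split())
    if not a and not b:
        return 1.0  # two empty strings are treated as identical
    return 2 * len(a & b) / (len(a) + len(b))


# Two of the three distinct tokens on each side are shared.
score = dice("p53 binds DNA", "p53 binds mdm2")
```

A score of 1 means the two token sets are identical and 0 means they share nothing, so the measure rewards selecting a sentence whose wording overlaps heavily with the curated GeneRiF.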
BMC Bioinformatics | 2013
Anne-Lise Veuthey; Alan Bridge; Julien Gobeill; Patrick Ruch; Johanna McEntyre; Lydie Bougueleret; Ioannis Xenarios
Background: The annotation of protein post-translational modifications (PTMs) is an important task of UniProtKB curators and, with continuing improvements in experimental methodology, an ever greater number of articles are being published on this topic. To help curators cope with this growing body of information we have developed a system which extracts information from the scientific literature for the most frequently annotated PTMs in UniProtKB. Results: The procedure uses a pattern-matching and rule-based approach to extract sentences with information on the type and site of modification. A ranked list of protein candidates for the modification is also provided. For PTM extraction, precision varies from 57% to 94%, and recall from 75% to 95%, according to the type of modification. The procedure was used to track new publications on PTMs and to recover potential supporting evidence for phosphorylation sites annotated based on the results of large scale proteomics experiments. Conclusions: The information retrieval and extraction method we have developed in this study forms the basis of a simple tool for the manual curation of protein post-translational modifications in UniProtKB/Swiss-Prot. Our work demonstrates that even simple text-mining tools can be effectively adapted for database curation tasks, provided that a thorough understanding of the working process and requirements is first obtained. This system can be accessed at http://eagl.unige.ch/PTM/.
International Journal of Medical Informatics | 2005
Pavel B. Dobrokhotov; Cyril Goutte; Anne-Lise Veuthey; Eric Gaussier
Bio-medical knowledge bases are valuable resources for the research community. Original scientific publications are the main source used to annotate them. Medical annotation in Swiss-Prot is specifically targeted at finding and extracting data about human genetic diseases and polymorphisms. Curators have to scan through hundreds of publications to select the relevant ones. This workload can be greatly reduced by using bio-text mining techniques. Using a combination of natural language processing (NLP) techniques and statistical classifiers, we achieve recall points of up to 84% on the potentially interesting documents and a precision of more than 96% in detecting irrelevant documents. Careful analysis of the document pre-processing chain allows us to measure the impact of some steps on the overall result, as well as test different classifier configurations. The best combination was used to create a prototype of a search and classification tool that is currently tested by the database curators.
Bioinformatics | 2014
Christophe Charpilloz; Anne-Lise Veuthey; Bastien Chopard; Jean-Luc Falcone
Motivation: Post-translational modifications (PTMs) are important steps in the maturation of proteins. Several models exist to predict specific PTMs, from manually detected patterns to machine learning methods. On one hand, the manual detection of patterns does not provide the most efficient classifiers and requires a substantial workload, and on the other hand, models built by machine learning methods are hard to interpret and do not increase biological knowledge. Therefore, we developed a novel method based on pattern discovery and decision trees to predict PTMs. The proposed algorithm builds a decision tree, by coupling the C4.5 algorithm with genetic algorithms, producing high-performance white box classifiers. Our method was tested on the initiator methionine cleavage (IMC) and N(α)-terminal acetylation (N-Ac), two of the most common PTMs. Results: The resulting classifiers perform well when compared with existing models. On a set of eukaryotic proteins, they display a cross-validated Matthews correlation coefficient of 0.83 (IMC) and 0.65 (N-Ac). When used to predict potential substrates of N-terminal acetyltransferase B and N-terminal acetyltransferase C, our classifiers display better performance than the state of the art. Moreover, we present an analysis of the model predicting IMC for Homo sapiens proteins and demonstrate that we are able to extract experimentally known facts without prior knowledge. Those results validate the fact that our method produces white box models. Availability and implementation: Predictors for IMC and N-Ac and all datasets are freely available at http://terminus.unige.ch/.
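The Matthews correlation coefficient reported above summarizes a binary classifier's confusion matrix in a single value between -1 and 1. The Python sketch below shows the standard formula only; the function name is illustrative, returning 0 for an undefined denominator is a common convention rather than something stated in the abstract, and the cross-validation procedure is not reproduced.

```python
import math


def matthews_corrcoef(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts.

    MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))
    Returns 0.0 when the denominator is zero (assumed convention).
    """
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0
    return (tp * tn - fp * fn) / denominator


# Made-up counts for illustration, not results from the study.
perfect = matthews_corrcoef(tp=50, tn=50, fp=0, fn=0)
chance = matthews_corrcoef(tp=25, tn=25, fp=25, fn=25)
```

Unlike plain accuracy, the coefficient stays informative when the two classes are unbalanced, which is a frequent situation in PTM site prediction.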