Arek Kasprzyk
Ontario Institute for Cancer Research
Network
Latest external collaboration on country level. Dive into details by clicking on the dots.
Publication
Featured researches published by Arek Kasprzyk.
Nucleic Acids Research | 2002
Tim Hubbard; Darren Barker; Ewan Birney; Graham Cameron; Yuan Chen; L. Clark; Tony Cox; James Cuff; V. Curwen; Thomas A. Down; Richard Durbin; E. Eyras; James Gilbert; Martin Hammond; L. Huminiecki; Arek Kasprzyk; Heikki Lehväslaiho; Philip Lijnzaad; Craig Melsopp; Emmanuel Mongin; R. Pettett; M. Pocock; Simon Potter; A. Rust; Esther Schmidt; Stephen M. J. Searle; Guy Slater; J. Smith; W. Spooner; A. Stabenau
The Ensembl (http://www.ensembl.org/) database project provides a bioinformatics framework to organise biology around the sequences of large genomes. It is a comprehensive source of stable automatic annotation of the human genome sequence, with confirmed gene predictions that have been integrated with external data sources, and is available as either an interactive web site or as flat files. It is also an open source software engineering project to develop a portable system able to handle very large genomes and associated requirements from sequence analysis to data storage and visualisation. The Ensembl site is one of the leading sources of human genome sequence annotation and provided much of the analysis for publication by the international human genome project of the draft genome. The Ensembl system is being installed around the world in both companies and academic sites on machines ranging from supercomputers to laptops.
Bioinformatics | 2005
Steffen Durinck; Yves Moreau; Arek Kasprzyk; Sean Davis; Bart De Moor; Alvis Brazma; Wolfgang Huber
biomaRt is a new Bioconductor package that integrates BioMart data resources with data analysis software in Bioconductor. It can annotate a wide range of gene or gene product identifiers (e.g. Entrez-Gene and Affymetrix probe identifiers) with information such as gene symbol, chromosomal coordinates, Gene Ontology and OMIM annotation. Furthermore biomaRt enables retrieval of genomic sequences and single nucleotide polymorphism information, which can be used in data analysis. Fast and up-to-date data retrieval is possible as the package executes direct SQL queries to the BioMart databases (e.g. Ensembl). The biomaRt package provides a tight integration of large, public or locally installed BioMart databases with data analysis in Bioconductor creating a powerful environment for biological data mining.
Nucleic Acids Research | 2009
Syed Haider; Benoit Ballester; Damian Smedley; Junjun Zhang; Peter A. Rice; Arek Kasprzyk
BioMart Central Portal (www.biomart.org) offers a one-stop shop solution to access a wide array of biological databases. These include major biomolecular sequence, pathway and annotation databases such as Ensembl, Uniprot, Reactome, HGNC, Wormbase and PRIDE; for a complete list, visit, http://www.biomart.org/biomart/martview. Moreover, the web server features seamless data federation making cross querying of these data sources in a user friendly and unified way. The web server not only provides access through a web interface (MartView), it also supports programmatic access through a Perl API as well as RESTful and SOAP oriented web services. The website is free and open to all users and there is no login requirement.
Database | 2011
Arek Kasprzyk
Biological data management is a challenging undertaking. It is challenging for database designers, because biological concepts are complex and not always well defined, and therefore the data models that are used to represent them are constantly changing as new techniques are developed and new information becomes available. It is challenging for collaborating groups based in different geographical locations who wish to have unified access to their distributed data sources, because combining and presenting their data creates logistical difficulties. Finally, it is challenging for users of biological databases, because in order to correctly interpret the experimental data located in one database, additional information from other databases is frequently needed, requiring the user to learn multiple systems. The BioMart project (www.biomart.org) was initiated to address these challenges. The BioMart software is based on two fundamental concepts: data agnostic modelling and data federation. Data agnostic modelling simplifies the difficult and time-consuming task of data modelling. In BioMart, this is achieved by using a predefined, query-optimized relational schema that can be used to represent any kind of data (1). Data federation makes it possible to organize multiple, disparate and distributed database systems into what appears to be a single integrated virtual database. It therefore allows users to access and cross reference data from these data sources using a single user interface, without the need for database administrators to physically collate the data in one location. Using these fundamental concepts, the BioMart project has driven a change in the biological data management paradigm, where individual biological databases are managed by different custom built systems. To give more control to both the users and the data providers, a new, innovative solution was required. BioMart started by adapting data warehousing ideas to create one universal software system for biological data management and empower biologists with the ability to create complex, customized datasets through a web interface without the need for bioinformatics support (1). It subsequently introduced a new innovative way of creating large multi-database repositories that avoid the need to store all the data in a single location (2), and finally it proved that large-scale projects involving next generation sequencing data can be managed efficiently in a distributed environment (3). BioMart has successfully adapted data warehousing ideas such as data marts, dimensional modelling (4), and query optimization into the world of biological databases (5–13). BioMarts ability to quickly deploy a website hosting any type of data, user-friendly graphical user interface, several programmatic interfaces and support for third party tools contributed to its success and adoption by many different types of projects around the world as their data management platform (14). During the 10 years of its existence, BioMart has grown from humble beginnings as a ‘data mining extension’ for the Ensembl website (1), to become an international collaboration involving large number of different organizations located on five continents: Asia, Australia, Europe, North America and South America (3,15). It has a large community of users and developers and it has been successfully used in both academia and industry. The latest version of the BioMart software has been significantly enhanced with numerous graphical user interfaces that are tailored to different user groups. In addition, it has been further improved by parallel query processing, it is now extensible with different analysis tools and the installation process can be effortlessly completed with just a few mouse clicks (16). Building on the wealth of information that has become accessible through the BioMart interface, the BioMart Central Portal (15) has introduced an innovative alternative to the large data stores maintained by specialized organizations such as The European Bioinformatics Institute (EBI) or The National Center for Biotechnology Information (NCBI). BioMart Central Portal is a first-of-its-kind, community-driven effort to provide unified access to dozens of biological databases. All development and maintenance of individual databases is left to the individual data providers, making it a very cost-effective approach. The groups that maintain individual sources can do so at their own location without the necessity of any data exchange procedures. In addition, they can draw on the wealth of information available through the portal to expose their data in the context of third party annotations. The BioMart Central Portal approach is very democratic: everyone can join or remove their data source at any time. BioMart Central Portal is effectively a ‘Virtual Bioinformatics Institute’ with no onsite personnel, minimal administration, and a very ‘green’ footprint. More recently, the International Cancer Genome Consortium (ICGC) Data Portal has demonstrated how BioMart can scale to manage large collaborative projects involving next generation sequencing data (3). The ICGC is generating data on an unprecedented scale by sequencing 500 cancer genomes and matched normal control genomes for 50 different cancer types (17). The effort is distributed between multiple participating countries and sequencing centres. Given the scale of the effort, moving all of the data to a single location is impractical. Instead, the ICGC Data Portal relies on BioMart data federation. By replicating and distributing the data model across different centres that produce the same type of data according to the same recipe, the scalability of the effort is greatly improved. Each centre is only responsible for managing their own data while data access to all of the consortium data is managed by the BioMart software. This presents a scalable approach, not only in the traditional sense of parallelizing data processing and storage, but also in a more general sense of outsourcing the external annotation expertise by federating annotations from additional, independently-maintained databases that are available in the BioMart Central Portal. The future developments for BioMart involve specialized ‘pre-packaged’ and reusable data portals. One example already in development is the OncoPortal, aimed at researchers managing cancer data. It will include preconfigured access to sources of annotations that are useful for cancer research such as Ensembl (5), Reactome (12), COSMIC (9), Pancreatic Expression Database (10) and others. It will also include a set of tools that are specifically designed for cancer data analysis. There are plans to build other preconfigured portals for different research areas, such as a mouse portal and a model organism portal. It is an ambition of the BioMart community that the BioMart project remains at the forefront of innovative solutions for biological data management in the years to come. By creating these specialized solutions and further reducing the barriers to entry, the aim is to encourage more groups to share their data through BioMart, thereby further enhancing the entire BioMart community.
Nucleic Acids Research | 2003
Michele Clamp; D. Andrews; Darren Barker; Paul Bevan; Graham Cameron; Yuting Chen; Louise Clark; Tony Cox; James Cuff; Val Curwen; Thomas A. Down; Richard Durbin; Eduardo Eyras; James Gilbert; Martin Hammond; Tim Hubbard; Arek Kasprzyk; Damian Keefe; Heikki Lehväslaiho; Vishwanath R. Iyer; Craig Melsopp; Emmanuel Mongin; Roger Pettett; Simon Potter; Alistair G. Rust; Esther Schmidt; Steve Searle; Guy Slater; James Smith; William Spooner
The Ensembl (http://www.ensembl.org/) database project provides a bioinformatics framework to organise biology around the sequences of large genomes. It is a comprehensive source of stable automatic annotation of human, mouse and other genome sequences, available as either an interactive web site or as flat files. Ensembl also integrates manually annotated gene structures from external sources where available. As well as being one of the leading sources of genome annotation, Ensembl is an open source software engineering project to develop a portable system able to handle very large genomes and associated requirements. These range from sequence analysis to data storage and visualisation and installations exist around the world in both companies and at academic sites. With both human and mouse genome sequences available and more vertebrate sequences to follow, many of the recent developments in Ensembl have focusing on developing automatic comparative genome analysis and visualisation.
Database | 2011
Junjun Zhang; Joachim Baran; Anthony Cros; Jonathan M. Guberman; Syed Haider; Jack Hsu; Yong Liang; Elena Rivkin; Jianxin Wang; Brett Whitty; Marie Wong-Erasmus; Long Yao; Arek Kasprzyk
The International Cancer Genome Consortium (ICGC) is a collaborative effort to characterize genomic abnormalities in 50 different cancer types. To make this data available, the ICGC has created the ICGC Data Portal. Powered by the BioMart software, the Data Portal allows each ICGC member institution to manage and maintain its own databases locally, while seamlessly presenting all the data in a single access point for users. The Data Portal currently contains data from 24 cancer projects, including ICGC, The Cancer Genome Atlas (TCGA), Johns Hopkins University, and the Tumor Sequencing Project. It consists of 3478 genomes and 13 cancer types and subtypes. Available open access data types include simple somatic mutations, copy number alterations, structural rearrangements, gene expression, microRNAs, DNA methylation and exon junctions. Additionally, simple germline variations are available as controlled access data. The Data Portal uses a web-based graphical user interface (GUI) to offer researchers multiple ways to quickly and easily search and analyze the available data. The web interface can assist in constructing complicated queries across multiple data sets. Several application programming interfaces are also available for programmatic access. Here we describe the organization, functionality, and capabilities of the ICGC Data Portal. Database URL: http://dcc.icgc.org
Database | 2011
Jonathan M. Guberman; J. Ai; Olivier Arnaiz; Joachim Baran; Andrew Blake; Richard Baldock; Claude Chelala; David Croft; Anthony Cros; Rosalind J. Cutts; A. Di Génova; Simon A. Forbes; T. Fujisawa; Emanuela Gadaleta; David Goodstein; Gunes Gundem; Bernard Haggarty; Syed Haider; Matthew Hall; Todd W. Harris; Robin Haw; Songnian Hu; Simon J. Hubbard; Jack Hsu; Vivek Iyer; Philip Jones; Toshiaki Katayama; Rhoda Kinsella; Lei Kong; Daniel Lawson
BioMart Central Portal is a first of its kind, community-driven effort to provide unified access to dozens of biological databases spanning genomics, proteomics, model organisms, cancer data, ontology information and more. Anybody can contribute an independently maintained resource to the Central Portal, allowing it to be exposed to and shared with the research community, and linking it with the other resources in the portal. Users can take advantage of the common interface to quickly utilize different sources without learning a new system for each. The system also simplifies cross-database searches that might otherwise require several complicated steps. Several integrated tools streamline common tasks, such as converting between ID formats and retrieving sequences. The combination of a wide variety of databases, an easy-to-use interface, robust programmatic access and the array of tools make Central Portal a one-stop shop for biological data querying. Here, we describe the structure of Central Portal and show example queries to demonstrate its capabilities. Database URL: http://central.biomart.org.
Database | 2011
Junjun Zhang; Syed Haider; Joachim Baran; Anthony Cros; Jonathan M. Guberman; Jack Hsu; Yong Liang; Long Yao; Arek Kasprzyk
BioMart is a freely available, open source, federated database system that provides a unified access to disparate, geographically distributed data sources. It is designed to be data agnostic and platform independent, such that existing databases can easily be incorporated into the BioMart framework. BioMart allows databases hosted on different servers to be presented seamlessly to users, facilitating collaborative projects between different research groups. BioMart contains several levels of query optimization to efficiently manage large data sets and offers a diverse selection of graphical user interfaces and application programming interfaces to ensure that queries can be performed in whatever manner is most convenient for the user. The software has now been adopted by a large number of different biological databases spanning a wide range of data types and providing a rich source of annotation available to bioinformaticians and biologists alike. Database URL: http://www.biomart.org
Radiotherapy and Oncology | 2012
Maud H. W. Starmans; Kenneth C. Chu; Syed Haider; Francis Nguyen; Renaud Seigneuric; Michaël G. Magagnin; Marianne Koritzinsky; Arek Kasprzyk; Paul C. Boutros; Bradly G. Wouters; Philippe Lambin
BACKGROUND AND PURPOSE Recent data suggest that in vitro and in vivo derived hypoxia gene-expression signatures have prognostic power in breast and possibly other cancers. However, both tumour hypoxia and the biological adaptation to this stress are highly dynamic. Assessment of time-dependent gene-expression changes in response to hypoxia may thus provide additional biological insights and assist in predicting the impact of hypoxia on patient prognosis. MATERIALS AND METHODS Transcriptome profiling was performed for three cell lines derived from diverse tumour-types after hypoxic exposure at eight time-points, which include a normoxic time-point. Time-dependent sets of co-regulated genes were identified from these data. Subsequently, gene ontology (GO) and pathway analyses were performed. The prognostic power of these novel signatures was assessed in parallel with previous in vitro and in vivo derived hypoxia signatures in a large breast cancer microarray meta-dataset (n=2312). RESULTS We identified seven recurrent temporal and two general hypoxia signatures. GO and pathway analyses revealed regulation of both common and unique underlying biological processes within these signatures. None of the new or previously published in vitro signatures consisting of hypoxia-induced genes were prognostic in the large breast cancer dataset. In contrast, signatures of repressed genes, as well as the in vivo derived signatures of hypoxia-induced genes showed clear prognostic power. CONCLUSIONS Only a subset of hypoxia-induced genes in vitro demonstrates prognostic value when evaluated in a large clinical dataset. Despite clear evidence of temporal patterns of gene-expression in vitro, the subset of prognostic hypoxia regulated genes cannot be identified based on temporal pattern alone. In vivo derived signatures appear to identify the prognostic hypoxia induced genes. The prognostic value of hypoxia-repressed genes is likely a surrogate for the known importance of proliferation in breast cancer outcome.
very large data bases | 2004
Byron Choi; Wenfei Fan; Xibei Jia; Arek Kasprzyk
XML has become the prime standard for data exchange on the Web. To exchange data residing in Databases, one needs to publish it in XML. Data publishing is often done with a predefined “schema.” The chapter proposes a new approach for schema-directed publishing of relational data in XML, based on the novel notion of attribute transformation grammars (ATGs). ATGs provide guidance on how to define views conforming to target schemas (DTDs), and automatically ensure schema conformance. The chapter also develops an incremental algorithm for maintaining XML views produced by ATGs, based on new incremental computation techniques that capitalize on the hierarchical structure of XML data and unique features of the ATGs. Taking real-life data from Gene Ontology (GO), this chapter also demonstrates how this system can efficiently publish the GO data in XML with respect to a predefined recursive DTD, and how it incrementally updates the target XML data in response to changes to the underlying GO database.