Publication


Featured research published by Raja Mazumder.


BMC Bioinformatics | 2003

The COG database: an updated version includes eukaryotes

Roman L. Tatusov; Natalie D. Fedorova; John D. Jackson; Aviva R. Jacobs; Boris Kiryutin; Eugene V. Koonin; Dmitri M. Krylov; Raja Mazumder; Sergei L. Mekhedov; Anastasia N. Nikolskaya; B Sridhar Rao; Sergei Smirnov; Alexander V. Sverdlov; Sona Vasudevan; Yuri I. Wolf; Jodie J. Yin; Darren A. Natale

Background: The availability of multiple, essentially complete genome sequences of prokaryotes and eukaryotes spurred both the demand and the opportunity for the construction of an evolutionary classification of genes from these genomes. Such a classification system based on orthologous relationships between genes appears to be a natural framework for comparative genomics and should facilitate both functional annotation of genomes and large-scale evolutionary studies.

Results: We describe here a major update of the previously developed system for delineation of Clusters of Orthologous Groups of proteins (COGs) from the sequenced genomes of prokaryotes and unicellular eukaryotes and the construction of clusters of predicted orthologs for 7 eukaryotic genomes, which we named KOGs after eukaryotic orthologous groups. The COG collection currently consists of 138,458 proteins, which form 4873 COGs and comprise 75% of the 185,505 (predicted) proteins encoded in 66 genomes of unicellular organisms. The eukaryotic orthologous groups (KOGs) include proteins from 7 eukaryotic genomes: three animals (the nematode Caenorhabditis elegans, the fruit fly Drosophila melanogaster and Homo sapiens), one plant, Arabidopsis thaliana, two fungi (Saccharomyces cerevisiae and Schizosaccharomyces pombe), and the intracellular microsporidian parasite Encephalitozoon cuniculi. The current KOG set consists of 4852 clusters of orthologs, which include 59,838 proteins, or ~54% of the analyzed eukaryotic 110,655 gene products. Compared to the coverage of the prokaryotic genomes with COGs, a considerably smaller fraction of eukaryotic genes could be included into the KOGs; addition of new eukaryotic genomes is expected to result in substantial increase in the coverage of eukaryotic genomes with KOGs. Examination of the phyletic patterns of KOGs reveals a conserved core represented in all analyzed species and consisting of ~20% of the KOG set. This conserved portion of the KOG set is much greater than the ubiquitous portion of the COG set (~1% of the COGs). In part, this difference is probably due to the small number of included eukaryotic genomes, but it could also reflect the relative compactness of eukaryotes as a clade and the greater evolutionary stability of eukaryotic genomes.

Conclusion: The updated collection of orthologous protein sets for prokaryotes and eukaryotes is expected to be a useful platform for functional annotation of newly sequenced genomes, including those of complex eukaryotes, and genome-wide evolutionary studies.
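The coverage percentages quoted in this abstract can be re-derived from the protein counts it reports. The following is a minimal arithmetic sketch in Python; the variable and function names are illustrative and do not come from the paper.

```python
# Counts taken from the abstract above; the names are illustrative only.
COG_PROTEINS_IN_CLUSTERS = 138_458   # proteins assigned to COGs
COG_PROTEINS_TOTAL = 185_505         # predicted proteins in 66 unicellular genomes
KOG_PROTEINS_IN_CLUSTERS = 59_838    # proteins assigned to KOGs
KOG_PROTEINS_TOTAL = 110_655         # analyzed eukaryotic gene products


def coverage_percent(in_clusters: int, total: int) -> float:
    """Return the share of proteins that fall into clusters, as a percentage."""
    return 100.0 * in_clusters / total


cog_coverage = coverage_percent(COG_PROTEINS_IN_CLUSTERS, COG_PROTEINS_TOTAL)
kog_coverage = coverage_percent(KOG_PROTEINS_IN_CLUSTERS, KOG_PROTEINS_TOTAL)

print(f"COG coverage: {cog_coverage:.1f}%")
print(f"KOG coverage: {kog_coverage:.1f}%")
```

Rounded to whole percentages, the two values agree with the 75% and ~54% stated in the abstract.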


Nucleic Acids Research | 2006

The Universal Protein Resource (UniProt): an expanding universe of protein information

Cathy H. Wu; Rolf Apweiler; Amos Marc Bairoch; Darren A. Natale; Winona C. Barker; Brigitte Boeckmann; Serenella Ferro; Elisabeth Gasteiger; Hongzhan Huang; Rodrigo Lopez; Michele Magrane; María Martín; Raja Mazumder; Claire O'Donovan; Nicole Redaschi; Baris E. Suzek

The Universal Protein Resource (UniProt) provides a central resource on protein sequences and functional annotation with three database components, each addressing a key need in protein bioinformatics. The UniProt Knowledgebase (UniProtKB), comprising the manually annotated UniProtKB/Swiss-Prot section and the automatically annotated UniProtKB/TrEMBL section, is the preeminent storehouse of protein annotation. The extensive cross-references, functional and feature annotations and literature-based evidence attribution enable scientists to analyse proteins and query across databases. The UniProt Reference Clusters (UniRef) speed similarity searches via sequence space compression by merging sequences that are 100% (UniRef100), 90% (UniRef90) or 50% (UniRef50) identical. Finally, the UniProt Archive (UniParc) stores all publicly available protein sequences, containing the history of sequence data with links to the source databases. UniProt databases continue to grow in size and in availability of information. Recent and upcoming changes to database contents, formats, controlled vocabularies and services are described. New download availability includes all major releases of UniProtKB, sequence collections by taxonomic division and complete proteomes. A bibliography mapping service has been added, and an ID mapping service will be available soon. UniProt databases can be accessed online at or downloaded at .


Bioinformatics | 2007

UniRef: comprehensive and non-redundant UniProt reference clusters

Baris E. Suzek; Hongzhan Huang; Peter B. McGarvey; Raja Mazumder; Cathy H. Wu

MOTIVATION Redundant protein sequences in biological databases hinder sequence similarity searches and make interpretation of search results difficult. Clustering of protein sequence space based on sequence similarity helps organize all sequences into manageable datasets and reduces sampling bias and overrepresentation of sequences. RESULTS The UniRef (UniProt Reference Clusters) provide clustered sets of sequences from the UniProt Knowledgebase (UniProtKB) and selected UniProt Archive records to obtain complete coverage of sequence space at several resolutions while hiding redundant sequences. Currently covering >4 million source sequences, the UniRef100 database combines identical sequences and subfragments from any source organism into a single UniRef entry. UniRef90 and UniRef50 are built by clustering UniRef100 sequences at the 90 or 50% sequence identity levels. UniRef100, UniRef90 and UniRef50 yield a database size reduction of approximately 10, 40 and 70%, respectively, from the source sequence set. The reduced redundancy increases the speed of similarity searches and improves detection of distant relationships. UniRef entries contain summary cluster and membership information, including the sequence of a representative protein, member count and common taxonomy of the cluster, the accession numbers of all the merged entries and links to rich functional annotation in UniProtKB to facilitate biological discovery. UniRef has already been applied to broad research areas ranging from genome annotation to proteomics data analysis. AVAILABILITY UniRef is updated biweekly and is available for online search and retrieval at http://www.uniprot.org, as well as for download at ftp://ftp.uniprot.org/pub/databases/uniprot/uniref. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
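The UniRef100 step described in this abstract, combining identical sequences and subfragments into one entry, can be illustrated with a toy sketch. This is not the UniRef pipeline: the function name, the accessions and the sequences are invented for illustration, and the substring test is only a simple stand-in for whatever procedure the authors actually use.

```python
from typing import Dict, List


def collapse_identical_and_subfragments(seqs: Dict[str, str]) -> Dict[str, List[str]]:
    """Group sequences that are identical to, or contained in, a longer one.

    Returns a mapping from a representative accession to its member accessions.
    """
    clusters: Dict[str, List[str]] = {}
    # Visit the longest sequences first so that each representative is seen
    # before any of its fragments; ties are broken by accession for determinism.
    for acc, seq in sorted(seqs.items(), key=lambda kv: (-len(kv[1]), kv[0])):
        for rep in clusters:
            if seq in seqs[rep]:          # identical or a subfragment
                clusters[rep].append(acc)
                break
        else:
            clusters[acc] = [acc]         # no match: start a new cluster
    return clusters


# Hypothetical toy input: P2 duplicates P1, and P3 is a fragment of P1.
toy = {"P1": "MKTAYIAKQR", "P2": "MKTAYIAKQR", "P3": "TAYIAK", "P4": "MSSHLLE"}
clusters = collapse_identical_and_subfragments(toy)
reduction = 1 - len(clusters) / len(toy)

print(clusters)
print(f"size reduction: {reduction:.0%}")
```

The `reduction` value follows the same idea as the size reductions quoted in the abstract: one minus the ratio of cluster count to source sequence count. The sketch compares every sequence against every representative, which is adequate for a toy input but would not scale to millions of sequences.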


Current Opinion in Structural Biology | 2002

Trends in protein evolution inferred from sequence and structure analysis

L. Aravind; Raja Mazumder; Sona Vasudevan; Eugene V. Koonin

Complementary developments in comparative genomics, protein structure determination and in-depth comparison of protein sequences and structures have provided a better understanding of the prevailing trends in the emergence and diversification of protein domains. The investigation of deep relationships among different classes of proteins involved in key cellular functions, such as nucleic acid polymerases and other nucleotide-dependent enzymes, indicates that a substantial set of diverse protein domains evolved within the primordial, ribozyme-dominated RNA world.


Bioinformatics | 2012

Toward community standards in the quest for orthologs

Christophe Dessimoz; Toni Gabaldón; David S. Roos; Erik L. L. Sonnhammer; Javier Herrero; Adrian M. Altenhoff; Rolf Apweiler; Michael Ashburner; Judith A. Blake; Brigitte Boeckmann; Alan Bridge; Elspeth Bruford; Mike Cherry; Matthieu Conte; Dannie Durand; Ruchira S. Datta; Jean-Baka Domelevo Entfellner; Ingo Ebersberger; Michael Y. Galperin; Jacob M. Joseph; Tina Koestler; Evgenia V. Kriventseva; Odile Lecompte; Jack Leunissen; Suzanna E. Lewis; Benjamin Linard; Michael S. Livstone; Hui-Chun Lu; María Martín; Raja Mazumder

The identification of orthologs (gene pairs descended from a common ancestor through speciation, rather than duplication) has emerged as an essential component of many bioinformatics applications, ranging from the annotation of new genomes to experimental target prioritization. Yet, the development and application of orthology inference methods is hampered by the lack of consensus on source proteomes, file formats and benchmarks. The second ‘Quest for Orthologs’ meeting brought together stakeholders from various communities to address these challenges. We report on achievements and outcomes of this meeting, focusing on topics of particular relevance to the research community at large. The Quest for Orthologs consortium is an open community that welcomes contributions from all researchers interested in orthology research and applications.


PLOS ONE | 2011

Representative Proteomes: A Stable, Scalable and Unbiased Proteome Set for Sequence Analysis and Functional Annotation

Chuming Chen; Darren A. Natale; Robert D. Finn; Hongzhan Huang; Jian Zhang; Cathy H. Wu; Raja Mazumder

The accelerating growth in the number of protein sequences taxes both the computational and manual resources needed to analyze them. One approach to dealing with this problem is to minimize the number of proteins subjected to such analysis in a way that minimizes loss of information. To this end we have developed a set of Representative Proteomes (RPs), each selected from a Representative Proteome Group (RPG) containing similar proteomes calculated based on co-membership in UniRef50 clusters. A Representative Proteome is the proteome that can best represent all the proteomes in its group in terms of the majority of the sequence space and information. RPs at 75%, 55%, 35% and 15% co-membership threshold (CMT) are provided to allow users to decrease or increase the granularity of the sequence space based on their requirements. We find that a CMT of 55% (RP55) most closely follows standard taxonomic classifications. Further analysis of this set reveals that sequence space is reduced by more than 80% relative to UniProtKB, while retaining both sequence diversity (over 95% of InterPro domains) and annotation information (93% of experimentally characterized proteins). All sets can be browsed and are available for sequence similarity searches and download at http://www.proteininformationresource.org/rps, while the set of 637 RPs determined using a 55% CMT are also available for text searches. Potential applications include sequence similarity searches, protein classification and targeted protein annotation and characterization.


Journal of Microbiological Methods | 1999

Determining chemotactic responses by two subsurface microaerophiles using a simplified capillary assay method

Raja Mazumder; Tommy J. Phelps; Noel R. Krieg; Robert E. Benoit

A simplified capillary chemotaxis assay utilizing a hypodermic needle, syringe, and disposable pipette tip was developed to measure bacterial tactic responses. The method was applied to two strains of subsurface microaerophilic bacteria. This method was more convenient than the Adler method and required less practice. Isolate VT10 was a strain of Pseudomonas syringae, which was isolated from the shallow subsurface. It was chemotactically attracted toward dextrose, glycerol, and phenol, which could be used as sole carbon sources, and toward maltose, which could not be used. Isolate MR100 was phylogenetically related to Pseudomonas mendocina and was isolated from the deep subsurface. It showed no tactic response to these compounds, although it could use dextrose, maltose, and glycerol as carbon sources. The chemotaxis results obtained by the new method were verified by using the swarm plate assay technique. The simplified technique may be useful for routine chemotactic testing.


Database | 2014

A framework for organizing cancer-related variations from existing databases, publications and NGS data using a High-performance Integrated Virtual Environment (HIVE)

Tsung-Jung Wu; Amirhossein Shamsaddini; Yang Pan; Krista Smith; Daniel J. Crichton; Vahan Simonyan; Raja Mazumder

Years of sequence feature curation by UniProtKB/Swiss-Prot, PIR-PSD, NCBI-CDD, RefSeq and other database biocurators have led to a rich repository of information on functional sites of genes and proteins. This information along with variation-related annotation can be used to scan human short sequence reads from next-generation sequencing (NGS) pipelines for the presence of non-synonymous single-nucleotide variations (nsSNVs) that affect functional sites. This and similar workflows are becoming more important because thousands of NGS data sets are being made available through projects such as The Cancer Genome Atlas (TCGA), and researchers want to evaluate their biomarkers in genomic data. BioMuta, an integrated sequence feature database, provides a framework for automated and manual curation and integration of cancer-related sequence features so that they can be used in NGS analysis pipelines. Sequence feature information in BioMuta is collected from the Catalogue of Somatic Mutations in Cancer (COSMIC), ClinVar, UniProtKB and through biocuration of information available from publications. Additionally, nsSNVs identified through automated analysis of NGS data from TCGA are also included in the database. Because of the petabytes of data and information present in NGS primary repositories, a platform, HIVE (High-performance Integrated Virtual Environment), for storing, analyzing, computing and curating NGS data and associated metadata has been developed. Using HIVE, 31 979 nsSNVs were identified in TCGA-derived NGS data from breast cancer patients. All variations identified through this process are stored in a Curated Short Read archive, and the nsSNVs from the tumor samples are included in BioMuta. Currently, BioMuta has 26 cancer types with 13 896 small-scale and 308 986 large-scale study-derived variations. Integration of variation data allows identification of novel or common nsSNVs that can be prioritized in validation studies.
Database URL: BioMuta: http://hive.biochemistry.gwu.edu/tools/biomuta/index.php; CSR: http://hive.biochemistry.gwu.edu/dna.cgi?cmd=csr; HIVE: http://hive.biochemistry.gwu.edu


Database | 2015

BioXpress: an integrated RNA-seq-derived gene expression database for pan-cancer analysis

Quan Wan; Hayley Dingerdissen; Yu Fan; Naila Gulzar; Yang Pan; Tsung-Jung Wu; Cheng Yan; Haichen Zhang; Raja Mazumder

BioXpress is a gene expression and cancer association database in which the expression levels are mapped to genes using RNA-seq data obtained from The Cancer Genome Atlas, International Cancer Genome Consortium, Expression Atlas and publications. The BioXpress database includes expression data from 64 cancer types, 6361 patients and 17 469 genes with 9513 of the genes displaying differential expression between tumor and normal samples. In addition to data directly retrieved from RNA-seq data repositories, manual biocuration of publications supplements the available cancer association annotations in the database. All cancer types are mapped to Disease Ontology terms to facilitate a uniform pan-cancer analysis. The BioXpress database is easily searched using HUGO Gene Nomenclature Committee gene symbol, UniProtKB/RefSeq accession or, alternatively, can be queried by cancer type with specified significance filters. This interface along with availability of pre-computed downloadable files containing differentially expressed genes in multiple cancers enables straightforward retrieval and display of a broad set of cancer-related genes. Database URL: http://hive.biochemistry.gwu.edu/tools/bioxpress


Bioinformatics | 2011

A comprehensive protein-centric ID mapping service for molecular data integration

Hongzhan Huang; Peter B. McGarvey; Baris E. Suzek; Raja Mazumder; Jian Zhang; Yongxing Chen; Cathy H. Wu

MOTIVATION Identifier (ID) mapping establishes links between various biological databases and is an essential first step for molecular data integration and functional annotation. ID mapping allows diverse molecular data on genes and proteins to be combined and mapped to functional pathways and ontologies. We have developed comprehensive protein-centric ID mapping services providing mappings for 90 IDs derived from databases on genes, proteins, pathways, diseases, structures, protein families, protein interaction, literature, ontologies, etc. The services are widely used and have been regularly updated since 2006. AVAILABILITY www.uniprot.org/mapping and proteininformationresource.org/pirwww/search/idmapping.shtml

Collaboration


Dive into Raja Mazumder's collaborations.

Top Co-Authors

Vahan Simonyan
Center for Biologics Evaluation and Research

Cathy H. Wu
University of Delaware

Hayley Dingerdissen
Washington University in St. Louis

Sona Vasudevan
Georgetown University Medical Center

Darren A. Natale
Georgetown University Medical Center

Konstantinos Karagiannis
Washington University in St. Louis

Yang Pan
Washington University in St. Louis