Vahan Simonyan
Center for Biologics Evaluation and Research
Nucleic Acids Research | 2004
John B. Anderson; Praveen F. Cherukuri; Carol DeWeese-Scott; Lewis Y. Geer; Marc Gwadz; Siqian He; David I. Hurwitz; John D. Jackson; Zhaoxi Ke; Christopher J. Lanczycki; Cynthia A. Liebert; Chunlei Liu; Fu Lu; Gabriele H. Marchler; Mikhail Mullokandov; Benjamin A. Shoemaker; Vahan Simonyan; James S. Song; Paul A. Thiessen; Roxanne A. Yamashita; Jodie J. Yin; Dachuan Zhang; Stephen H. Bryant
The Conserved Domain Database (CDD) is the protein classification component of NCBI's Entrez query and retrieval system. CDD is linked to other Entrez databases such as Proteins, Taxonomy and PubMed®. CD-Search is a fast, interactive tool to identify conserved domains in new protein sequences. CD-Search results for protein sequences in Entrez are pre-computed to provide links between proteins and domain models, and computational annotation visible upon request. Protein–protein queries submitted to NCBI's BLAST search service are scanned for the presence of conserved domains by default. While CDD started out as essentially a mirror of publicly available domain alignment collections, such as SMART, Pfam and COG, we have continued an effort to update, and in some cases replace, these models with domain hierarchies curated at the NCBI. Here, we report on the progress of the curation effort and associated improvements in the functionality of the CDD information retrieval system.
Database | 2014
Tsung-Jung Wu; Amirhossein Shamsaddini; Yang Pan; Krista Smith; Daniel J. Crichton; Vahan Simonyan; Raja Mazumder
Years of sequence feature curation by UniProtKB/Swiss-Prot, PIR-PSD, NCBI-CDD, RefSeq and other database biocurators have led to a rich repository of information on functional sites of genes and proteins. This information, along with variation-related annotation, can be used to scan human short sequence reads from next-generation sequencing (NGS) pipelines for the presence of non-synonymous single-nucleotide variations (nsSNVs) that affect functional sites. This and similar workflows are becoming more important because thousands of NGS data sets are being made available through projects such as The Cancer Genome Atlas (TCGA), and researchers want to evaluate their biomarkers in genomic data. BioMuta, an integrated sequence feature database, provides a framework for automated and manual curation and integration of cancer-related sequence features so that they can be used in NGS analysis pipelines. Sequence feature information in BioMuta is collected from the Catalogue of Somatic Mutations in Cancer (COSMIC), ClinVar, UniProtKB and through biocuration of information available from publications. Additionally, nsSNVs identified through automated analysis of NGS data from TCGA are included in the database. Because of the petabytes of data and information present in NGS primary repositories, a platform, HIVE (High-performance Integrated Virtual Environment), for storing, analyzing, computing and curating NGS data and associated metadata has been developed. Using HIVE, 31 979 nsSNVs were identified in TCGA-derived NGS data from breast cancer patients. All variations identified through this process are stored in a Curated Short Read archive, and the nsSNVs from the tumor samples are included in BioMuta. Currently, BioMuta has 26 cancer types with 13 896 small-scale and 308 986 large-scale study-derived variations. Integration of variation data allows identification of novel or common nsSNVs that can be prioritized in validation studies. Database URL: BioMuta:; CSR:; HIVE:
Genomics, Proteomics & Bioinformatics | 2013
Phuc Vinh Nguyen Lam; Radoslav Goldman; Konstantinos Karagiannis; Tejas Narsule; Vahan Simonyan; Valerii Soika; Raja Mazumder
The asparagine-X-serine/threonine (NXS/T) motif, where X is any amino acid except proline, is the consensus motif for N-linked glycosylation. Significant numbers of high-resolution crystal structures of glycosylated proteins allow us to carry out structural analysis of the N-linked glycosylation sites (NGS). Our analysis shows that there is enough structural information from diverse glycoproteins to allow the development of rules which can be used to predict NGS. A Python-based tool was developed to investigate asparagines implicated in N-glycosylation in five species: Homo sapiens, Mus musculus, Drosophila melanogaster, Arabidopsis thaliana and Saccharomyces cerevisiae. Our analysis shows that 78% of all asparagines of NXS/T motif involved in N-glycosylation are localized in the loop/turn conformation in the human proteome. Similar distribution was revealed for all the other species examined. Comparative analysis of the occurrence of NXS/T motifs not known to be glycosylated and their reverse sequence (S/TXN) shows a similar distribution across the secondary structural elements, indicating that the NXS/T motif in itself is not biologically relevant. Based on our analysis, we have defined rules to determine NGS. Using machine learning methods based on these rules we can predict with 93% accuracy if a particular site will be glycosylated. If structural information is not available the tool uses structural prediction results resulting in 74% accuracy. The tool was used to identify glycosylation sites in 108 human proteins with structures and 2247 proteins without structures that have acquired NXS/T site/s due to non-synonymous variation. The tool, Structure Feature Analysis Tool (SFAT), is freely available to the public at
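The consensus motif described in this abstract can be illustrated with a minimal Python sketch. It only locates candidate NXS/T sequons in a protein sequence. It is not SFAT and does not apply the structural rules or machine learning the abstract reports. The function name `find_sequons` and the example sequence are hypothetical.

```python
import re

# N, then any residue except proline, then S or T.
# The lookahead lets overlapping motifs (e.g. "NNTS") all be reported.
SEQUON = re.compile(r"(?=(N[^P][ST]))")

def find_sequons(protein_seq):
    """Return (1-based position of the asparagine, motif) for each NXS/T match."""
    seq = protein_seq.upper()
    return [(m.start() + 1, m.group(1)) for m in SEQUON.finditer(seq)]

if __name__ == "__main__":
    # Hypothetical sequence, for illustration only.
    for position, motif in find_sequons("MKNGSANPTLNNTS"):
        print(position, motif)
```

As the abstract notes, a motif match alone does not establish glycosylation, so the output should be read as a list of candidate sites only.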
Molecular Phylogenetics and Evolution | 2012
Dmitriy V. Volokhov; Vahan Simonyan; Maureen K. Davidson; Vladimir E. Chizhikov
Conventional classification of the species in the family Mycoplasmataceae is mainly based on phenotypic criteria, which are complicated, can be difficult to measure, and have the potential to be hampered by phenotypic deviations among the isolates. The number of biochemical reactions suitable for phenotypic characterization of the Mycoplasmataceae is also very limited and therefore the strategy for the final identification of the Mycoplasmataceae species is based on comparative serological results. However, serological testing of the Mycoplasmataceae species requires a performance panel of hyperimmune sera containing antiserum to each known species of the family and a high level of technical expertise, and it can only be properly performed by mycoplasma-reference laboratories. In addition, the existence of uncultivated and fastidious Mycoplasmataceae species/isolates in clinical materials significantly complicates, or even makes impossible, the application of conventional bacteriological tests. The analysis of available genetic markers is an additional approach for the primary identification and phylogenetic classification of cultivable species and uncultivable or fastidious organisms in standard microbiological laboratories. The partial nucleotide sequences of the RNA polymerase β-subunit gene (rpoB) and the 16S-23S rRNA intergenic transcribed spacer (ITS) were determined for all known type strains and the available non-type strains of the Mycoplasmataceae species. In addition to the available 16S rRNA gene data, the ITS and rpoB sequences were used to infer phylogenetic relationships among these species and to enable identification of the Mycoplasmataceae isolates to the species level.
The comparison of the ITS and rpoB phylogenetic trees with the 16S rRNA reference phylogenetic tree revealed similar clustering patterns for the Mycoplasmataceae species, with minor discrepancies for a few species that demonstrated higher divergence of their ITS and rpoB in comparison to their neighbor species. Overall, our results demonstrated that the ITS and rpoB gene could be useful complementary phylogenetic markers to infer phylogenetic relationships among the Mycoplasmataceae species and provide useful background information for the choice of appropriate metabolic and serological tests for the final classification of isolates. In summary, three-target sequence analysis, which includes the ITS, rpoB, and 16S rRNA genes, was demonstrated to be a reliable and useful taxonomic tool for species differentiation within the family Mycoplasmataceae based on their phylogenetic relatedness and pairwise sequence similarities. We believe that this approach might also become a valuable tool for routine analysis and primary identification of new isolates in medical and veterinary microbiological laboratories.
Vaccine | 2016
Edward T. Mee; Mark D. Preston; Philip D. Minor; Silke Schepelmann; Xuening Huang; Jenny Nguyen; David Wall; Stacey Hargrove; Thomas Fu; George Xu; Li Li; Colette Cote; Eric Delwart; Linlin Li; Indira Hewlett; Vahan Simonyan; Viswanath Ragupathy; Alin Voskanian-Kordi; Nicolas Mermod; Christiane Hill; Birgit Ottenwälder; Daniel C. Richter; Arman Tehrani; Jacqueline Weber-Lehmann; Jean-Pol Cassart; Carine Letellier; Olivier Vandeputte; Jean-Louis Ruelle; Avisek Deyati; Fabio La Neve
Background: Unbiased deep sequencing offers the potential for improved adventitious virus screening in vaccines and biotherapeutics. Successful implementation of such assays will require appropriate control materials to confirm assay performance and sensitivity. Methods: A common reference material containing 25 target viruses was produced and 16 laboratories were invited to process it using their preferred adventitious virus detection assay. Results: Fifteen laboratories returned results, obtained using a wide range of wet-lab and informatics methods. Six of 25 target viruses were detected by all laboratories, with the remaining viruses detected by 4–14 laboratories. Six non-target viruses were detected by three or more laboratories. Conclusion: The study demonstrated that a wide range of methods are currently used for adventitious virus detection screening in biological products by deep sequencing and that they can yield significantly different results. This underscores the need for common reference materials to ensure satisfactory assay performance and enable comparisons between laboratories.
PLOS ONE | 2014
Luis V. Santana-Quintero; Hayley Dingerdissen; Jean Thierry-Mieg; Raja Mazumder; Vahan Simonyan
Due to the size of Next-Generation Sequencing data, the computational challenge of sequence alignment has been vast. Inexact alignments can take up to 90% of total CPU time in bioinformatics pipelines. High-performance Integrated Virtual Environment (HIVE), a cloud-based environment optimized for storage and analysis of extra-large data, presents an algorithmic solution: the HIVE-hexagon DNA sequence aligner. HIVE-hexagon implements novel approaches to exploit both characteristics of sequence space and CPU, RAM and Input/Output (I/O) architecture to quickly compute accurate alignments. Key components of HIVE-hexagon include non-redundification and sorting of sequences; floating diagonals of linearized dynamic programming matrices; and consideration of cross-similarity to minimize computations.
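The first component named in this abstract, non-redundification and sorting of sequences, can be illustrated with a minimal Python sketch. It is not the HIVE-hexagon implementation, whose details the abstract does not give. It only shows the general idea of collapsing identical reads so each distinct sequence needs to be aligned once. The function name `collapse_reads` and the example reads are hypothetical.

```python
from collections import Counter

def collapse_reads(reads):
    """Collapse identical reads and return sorted (sequence, count) pairs.

    Assumption for this sketch: reads are compared case-insensitively and
    sorted lexicographically; a real aligner may use a different ordering.
    """
    counts = Counter(read.upper() for read in reads)
    return sorted(counts.items())

if __name__ == "__main__":
    # Hypothetical reads, for illustration only.
    reads = ["ACGT", "acgt", "AAGT", "ACGT"]
    for sequence, count in collapse_reads(reads):
        print(sequence, count)
```

Keeping the count alongside each unique sequence means an alignment result can later be weighted by how many original reads it represents.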
Genes | 2014
Vahan Simonyan; Raja Mazumder
The High-performance Integrated Virtual Environment (HIVE) is a high-throughput cloud-based infrastructure developed for the storage and analysis of genomic and associated biological data. HIVE consists of a web-accessible interface for authorized users to deposit, retrieve, share, annotate, compute and visualize Next-generation Sequencing (NGS) data in a scalable and highly efficient fashion. The platform contains a distributed storage library and a distributed computational powerhouse linked seamlessly. Resources available through the interface include algorithms, tools and applications developed exclusively for the HIVE platform, as well as commonly used external tools adapted to operate within the parallel architecture of the system. HIVE is composed of a flexible infrastructure, which allows for simple implementation of new algorithms and tools. Currently, available HIVE tools include sequence alignment and nucleotide variation profiling tools, metagenomic analyzers, phylogenetic tree-building tools using NGS data, clone discovery algorithms, and recombination analysis algorithms. In addition to tools, HIVE also provides knowledgebases that can be used in conjunction with the tools for NGS sequence and metadata analysis.
Nucleic Acids Research | 2014
Yang Pan; Konstantinos Karagiannis; Haichen Zhang; Hayley Dingerdissen; Amirhossein Shamsaddini; Quan Wan; Vahan Simonyan; Raja Mazumder
Identification of non-synonymous single nucleotide variations (nsSNVs) has exponentially increased due to advances in Next-Generation Sequencing technologies. The functional impacts of these variations have been difficult to ascertain because the corresponding knowledge about sequence functional sites is quite fragmented. It is clear that mapping of variations to sequence functional features can help us better understand the pathophysiological role of variations. In this study, we investigated the effect of nsSNVs on more than 17 common types of post-translational modification (PTM) sites, active sites and binding sites. Out of 1 705 285 distinct nsSNVs on 259 216 functional sites we identified 38 549 variations that significantly affect 10 major functional sites. Furthermore, we found distinct patterns of site disruptions due to germline and somatic nsSNVs. Pan-cancer analysis across 12 different cancer types led to the identification of 51 genes with 106 nsSNV affected functional sites found in 3 or more cancer types. 13 of the 51 genes overlap with previously identified Significantly Mutated Genes (Nature. 2013 Oct 17;502(7471)). 62 mutations in these 13 genes affecting functional sites such as DNA, ATP binding and various PTM sites occur across several cancers and can be prioritized for additional validation and investigations.
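The mapping of variations to functional sites described in this abstract can be sketched in Python as a position lookup. This is an illustration under stated assumptions, not the authors' pipeline: it reports positional overlap only and does not perform the statistical assessment the abstract refers to. The record layouts, function names and accession strings are hypothetical.

```python
from collections import defaultdict

def index_sites(sites):
    """Index functional sites by (protein, position).

    `sites` is an iterable of (protein, position, site_type) tuples;
    this layout is an assumption made for the sketch.
    """
    index = defaultdict(list)
    for protein, position, site_type in sites:
        index[(protein, position)].append(site_type)
    return index

def affected_sites(variants, site_index):
    """Return variants that fall on an annotated functional site.

    `variants` is an iterable of (protein, position, ref_aa, alt_aa) tuples.
    """
    hits = []
    for protein, position, ref_aa, alt_aa in variants:
        for site_type in site_index.get((protein, position), []):
            hits.append((protein, position, ref_aa, alt_aa, site_type))
    return hits

if __name__ == "__main__":
    # Hypothetical annotations and variants, for illustration only.
    sites = [("PROT_A", 15, "phosphorylation"), ("PROT_A", 42, "active site")]
    variants = [("PROT_A", 15, "S", "A"), ("PROT_A", 20, "G", "V")]
    print(affected_sites(variants, index_sites(sites)))
```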
Database | 2016
Vahan Simonyan; Konstantin Chumakov; Hayley Dingerdissen; William J. Faison; Scott Goldweber; Anton Golikov; Naila Gulzar; Konstantinos Karagiannis; Phuc Vinh Nguyen Lam; Thomas Maudru; Olesja Muravitskaja; Ekaterina Osipova; Yang Pan; Alexey Pschenichnov; Alexandre Rostovtsev; Luis V. Santana-Quintero; Krista Smith; Elaine E. Thompson; Valery Tkachenko; John Torcivia-Rodriguez; Alin Voskanian; Quan Wan; Jing Wang; Tsung-Jung Wu; Carolyn A. Wilson; Raja Mazumder
The High-performance Integrated Virtual Environment (HIVE) is a distributed storage and compute environment designed primarily to handle next-generation sequencing (NGS) data. This multicomponent cloud infrastructure provides secure web access for authorized users to deposit, retrieve, annotate and compute on NGS data, and to analyse the outcomes using web interface visual environments appropriately built in collaboration with research and regulatory scientists and other end users. Unlike many massively parallel computing environments, HIVE uses a cloud control server which virtualizes services, not processes. It is both very robust and flexible due to the abstraction layer introduced between computational requests and operating system processes. The novel paradigm of moving computations to the data, instead of moving data to computational nodes, has proven to be significantly less taxing for both hardware and network infrastructure. The honeycomb data model developed for HIVE integrates metadata into an object-oriented model. Its distinction from other object-oriented databases is in the additional implementation of a unified application program interface to search, view and manipulate data of all types. This model simplifies the introduction of new data types, thereby minimizing the need for database restructuring and streamlining the development of new integrated information systems. The honeycomb model employs a highly secure hierarchical access control and permission system, allowing determination of data access privileges in a finely granular manner without flooding the security subsystem with a multiplicity of rules. HIVE infrastructure will allow engineers and scientists to perform NGS analysis in a manner that is both efficient and secure. HIVE is actively supported in public and private domains, and project collaborations are welcomed. Database URL:
FEBS Journal | 2013
Hayley Dingerdissen; Mona Motwani; Konstantinos Karagiannis; Vahan Simonyan; Raja Mazumder
An enzyme's active site is essential to normal protein activity such that any disruptions at this site may lead to dysfunction and disease. Nonsynonymous single‐nucleotide variations (nsSNVs), which alter the amino acid sequence, are one type of disruption that can alter the active site. When this occurs, it is assumed that enzyme activity will vary because of the criticality of the site to normal protein function. We integrate nsSNV data and active site annotations from curated resources to identify all active‐site‐impacting nsSNVs in the human genome and search for all pathways observed to be associated with this data set to assess the likely consequences. We find that there are 934 unique nsSNVs that occur at the active sites of 559 proteins. Analysis of the nsSNV data shows an over‐representation of arginine and an under‐representation of cysteine, phenylalanine and tyrosine when comparing the list of nsSNV‐impacted active site residues with the list of all possible proteomic active site residues, implying a potential bias for or against variation of these residues at the active site. Clustering analysis shows an abundance of hydrolases and transferases. Pathway and functional analysis shows several pathways over‐ or under‐represented in the data set, with the most significantly affected pathways involved in carbohydrate metabolism. We provide a table of 32 variation–substrate/product pairs that can be used in targeted metabolomics experiments to assay the effects of specific variations. In addition, we report the significant prevalence of aspartic acid to histidine variation in eight proteins associated with nine diseases including glycogen storage diseases, lacrimo‐auriculo‐dento‐digital syndrome, Parkinson's disease and several cancers.