Publication
Featured research published by Sandeep Chakraborty.
bioRxiv | 2016
Sandeep Chakraborty
Saffron (Crocus sativus) is a spice with immense economic and medicinal relevance, due to its anticancer and chemopreventive properties. Although the genomic sequence of saffron is not publicly available, the RNA-seq based transcriptome of saffron from Jammu and Kashmir provides several, as yet unexplored, insights into the metagenome of the plant from that region. In the current work, sequence databases were created in the YeATS suite from the NCBI and Ensembl databases to enable faster comparisons. These were used to determine the metagenome of saffron. Soybean mosaic virus, a potyvirus, was found to be abundantly expressed in all five tissues analyzed. Recent studies have highlighted that issues arising from latent potyvirus infections in saffron are severely underestimated. Bacterial and fungal identification is made complex by symbiogenesis, especially in the absence of the endogenous genome. Symbiogenesis results in transcripts having significant homology to both bacterial and eukaryotic genomes. A stringent criterion based on homology comparison was used to identify bacterial and fungal transcripts, and inferences were constrained to the genus level. Leifsonia, Elizabethkingia and Staphylococcus were some of the identified bacteria, while Mycosphaerella and Pyrenophora were among the fungi detected. Among species of these bacterial genera, L. xyli is the causal agent of ratoon stunting disease in sugarcane, while E. meningoseptica and S. haemolyticus, having acquired multiresistance against available antimicrobial agents, are important in clinical settings. Mycosphaerella and Pyrenophora include several pathogenic species. It is shown that a transcript of a heat shock protein from the fungus Cladosporium cladosporioides has been erroneously annotated as a saffron gene. The detection of these pathogens should enable proper strategies for ensuring better yields.
The functional annotation of proteins in the absence of a genome is subject to errors due to the existence of significantly homologous proteins in organisms from different branches of life.
bioRxiv | 2016
Sandeep Chakraborty
An open reading frame (ORF) is a genomic sequence that can be translated into amino acids and does not contain any stop codon. Previously, YeATSAM analyzed ORFs from the RNA-seq derived transcriptome of walnut, and revealed several genes that were not annotated by widely-used methods. Here, a similar ORF-based method is applied to the chloroplast genome from walnut (Accid:KT963008). This revealed, in addition to the ~84 protein coding genes, ~100 additional putative protein coding genes with homology to RefSeq proteins. Some of these genes have corresponding transcripts in the previously derived transcriptome from twenty different tissues, establishing these as bona fide genes. Other genes have introns, and need to be manually annotated. Importantly, this analysis revealed the mis-annotation of the rps12 gene in several organisms that have used an automated annotation flow. This gene has three exons (exon1 is ~28 kbp away from exon2 and exon3) and is assembled by trans-splicing. Automated annotation tools are more likely to select an ORF closer to exon2 to complete a possible protein, and are unlikely to properly annotate trans-spliced genes. A database of trans-spliced genes would greatly benefit annotations. Thus, the current work continues previous work establishing the proper identification of ORFs as a simple and important step in many applications, and the need to validate annotations.
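The ORF definition used above (a stretch that can be translated and contains no stop codon) can be illustrated with a short six-frame scan. This is a minimal sketch, not the YeATSAM implementation: the function names, the use of the standard stop codons, and the `min_codons` threshold are assumptions made for illustration.

```python
# Illustrative sketch only; not the YeATSAM code.
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code (assumed)
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def stop_free_orfs(seq: str, min_codons: int = 100):
    """Yield (strand, frame, start, end) for stop-free stretches in six frames.

    start/end are 0-based, end-exclusive offsets on the strand that was
    scanned; min_codons is an arbitrary length threshold (an assumption).
    """
    seq = seq.upper()
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = frame
            # Walk the frame codon by codon and cut at every stop codon.
            for i in range(frame, len(s) - 2, 3):
                if s[i:i + 3] in STOP_CODONS:
                    if (i - start) // 3 >= min_codons:
                        yield strand, frame, start, i
                    start = i + 3
            # The last stretch runs to the end of the last complete codon.
            end = frame + ((len(s) - frame) // 3) * 3
            if (end - start) // 3 >= min_codons:
                yield strand, frame, start, end
```

Each reported stretch would then be compared against a protein database to decide whether it is a putative gene. Note that a scan of this kind only finds contiguous ORFs, so a trans-spliced gene such as rps12 would still need separate handling.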
bioRxiv | 2018
Sandeep Chakraborty
The emergence of drug-resistance in Enterococcus faecalis V583 through acquisition of resistance genes has been correlated to the absence of CRISPR loci. Here, the presence of a bona fide CRISPR locus in E. faecalis V583 (Accid:NC_004668.1) at position 2238156 with a single 20 nt repeat is demonstrated. The presence of a putative endonuclease Cas1 13538 nucleotides away from the repeat substantiates this claim. This Cas1 (628 aa) is highly homologous (Eval:5e-34) to a Cas1 from Pseudanabaena biceps (Accid:WP_009625648.1, 697 aa), which belongs to the enigmatic family of RT-CRISPR loci. Such significant similarity to a Cas protein, the presence of a topoisomerase and other DUF (domain of unknown function) proteins, as is often seen in CRISPR loci, and other hypothetical proteins indicate that this is a bona fide CRISPR locus. Further corroboration is provided by expression of both the repeat and the Cas1 gene in existing RNA-seq data (SRX3438611). Since so little is known of even well-studied species like E. faecalis V583, which has many hypothetical proteins, computational absence of evidence should not be taken as evidence of absence (neither CRISPRFinder nor PILER-CR reports this as a CRISPR locus). It is unlikely that bacteria would completely give up defense against their primeval enemies (viruses) to bolster their fight against the newly introduced antibiotics.
bioRxiv | 2018
Sandeep Chakraborty
Clustered regularly interspaced short palindromic repeats (CRISPR) is a prokaryotic adaptive defense system that assimilates short sequences of invading genomes (spacers) within repeats, and uses nearby effector proteins (Cas), one of which is an endonuclease (Cas9), to cleave homologous nucleic acid during future infections from the same or closely related organisms. Here, a novel CRISPR locus with uncharacterized Cas proteins is reported in Symbiobacterium thermophilum (Accid:NC_006177.1) around loc.1248561. Credence to this assertion is provided by four arguments. First, an exact repeat (CACGTGGGGTTCGGGTCGGACTG, 23 nucleotides) occurs eight times, encompassing fragments about 83 nucleotides long. Second, comparison to a known CRISPR-Cas locus in the same organism (loc.355482) with an endonuclease Cas3 (WP_011194444.1, 729 aa) ∼10000 nt upstream shows the presence of a known MutS2 endonuclease (WP_011195247.1, 801 aa) at approximately the same distance in loc.1248561. Third, and remarkably, an uncharacterized protein (1357 aa long) is uncannily close in length to known Cas9 proteins (1368 aa for Streptococcus pyogenes). Lastly, the presence of transposases and a reverse transcriptase (RT) downstream of the repeat indicates that this is one of the enigmatic RT-CRISPR loci. Also, the MutS2 endonuclease is not characterized as a CRISPR endonuclease, to the best of my knowledge. Interestingly, this locus was not among the four loci (three confirmed, one probable) reported by CRISPRFinder (http://crispr.i2bc.paris-saclay.fr/Server), indicating that the search algorithm needs to be revisited. This finding raises the question: how many such CRISPR-Cas loci and Cas9-like proteins lie undiscovered within bacterial genomes?
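The first argument above (an exact 23-nt repeat occurring eight times, separated by fragments of about 83 nt) is the kind of check that can be reproduced with a plain string search. The sketch below is only an illustration under stated assumptions: the function name is hypothetical, the toy sequence is synthetic rather than the real S. thermophilum locus, and the "fragments" are interpreted here as the gaps between consecutive repeat copies.

```python
def exact_repeat_spacers(genome: str, repeat: str):
    """Return (start positions of exact repeat copies, gap lengths between them).

    Gap length = distance from the end of one copy to the start of the next;
    overlapping copies would give negative gaps.
    """
    positions = []
    i = genome.find(repeat)
    while i != -1:
        positions.append(i)
        i = genome.find(repeat, i + 1)
    spacers = [b - (a + len(repeat)) for a, b in zip(positions, positions[1:])]
    return positions, spacers


REPEAT = "CACGTGGGGTTCGGGTCGGACTG"       # 23-nt repeat quoted in the abstract
toy_locus = ("A" * 83).join([REPEAT] * 8)  # synthetic array, NOT the real locus
positions, spacers = exact_repeat_spacers(toy_locus, REPEAT)
print(len(positions), spacers)
```

On a real genome the same function would be run on the sequence around the reported coordinate. Repeat finders that tolerate mismatches use more elaborate scoring, which this exact-match sketch does not attempt.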
bioRxiv | 2017
Sandeep Chakraborty
The use of RNA-seq from tumor-educated platelets (TEP) as a ‘liquid biopsy’ source [1] has been refuted recently (http://biorxiv.org/content/early/2017/06/05/146134, not peer-reviewed). The TEP study also mentioned that mutant epidermal growth factor receptor (EGFR) was ‘accurately distinguished using surrogate TEP mRNA profiles’, which is contested here. It is shown that only 10 out of 24 (a smaller sample set, the original study has 60) non-small cell lung carcinoma (NSCLC) samples here have any expression at all. Even there, the numbers of reads (101 bp) are [1, 4, 1, 14, 9, 1, 2, 19, 21, 6], and do not even add up to one complete EGFR gene (about 6000 bp). EGFR mutations have been painstakingly collated at www.mycancergenome.org/content/disease/lung-cancer/egfr. In stark contrast, the TEP study has no specification of the EGFR mutant used. The TEP study found EGFR mutations in 17/21 (81%), and EGFR wild-type in 4/39 (10%) for NSCLC samples (Table S7, reflected in Fig 3, Panel E in percentages). A major flaw is the assumption that a non “EGFR wild-type” is an “EGFR mutant”, since cases with zero EGFR reads (which are almost half of the samples) could be either. The application of the Kappa statistic to this data is erroneous for two reasons. First, the Kappa statistic does not handle “unknowns”, as is the case for samples with zero expression. Secondly, ‘interobserver variation can be measured in any situation in which two or more independent observers are evaluating the same thing’ [2]. The 90% (Fig 3, Panel E) is just the percentage of samples (35/39) that are not “EGFR WT” in one observation. It is not qualified to be in the Kappa matrix, where it translates to 35, leading to a Kappa=0.707, which implies “substantial agreement” [2]. The other observation (looking for EGFR mutation) is in a different set.
To summarize, this work reiterates the negligible expression of EGFR reads in NSCLC samples, and finds serious shortcomings in the statistical analysis of subsequent mutational analysis from these reads in the TEP study.
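For readers who want to see where a value such as Kappa=0.707 can come from, the sketch below computes Cohen's kappa from a 2x2 table. The table layout is an assumption reconstructed from the counts quoted above (17 of 21, and 35 of 39); Table S7 of the TEP study is not reproduced here, and the function name is hypothetical.

```python
def cohen_kappa(table):
    """Cohen's kappa for a square agreement table (rows: rater A, cols: rater B)."""
    n = sum(sum(row) for row in table)
    observed = sum(table[i][i] for i in range(len(table))) / n
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    # Agreement expected by chance from the marginal totals.
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / n ** 2
    return (observed - expected) / (1 - expected)


# ASSUMED layout: 17 and 35 on the diagonal, the remaining 4 + 4 off-diagonal,
# 60 samples in total. This is a reconstruction, not the published table.
assumed_table = [[17, 4],
                 [4, 35]]
print(round(cohen_kappa(assumed_table), 3))  # 0.707
```

The calculation only shows that placing 35 on the diagonal reproduces the quoted value. It does not address the objection raised above, namely that this count comes from a different observation and that samples with zero reads are unknowns rather than agreements.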
bioRxiv | 2017
Sandeep Chakraborty
The prevailing excitement in the scientific and medical community about ‘liquid biopsy’, a minimally invasive diagnostic involving body fluid, is understandable, since it has the possibility of detecting pre-malignant and early-stage cancers, and enables the assessment of response to treatments [1]. Supplementing previous techniques that sample circulating cell-free tumor DNA, tumor cells, and microvesicles [2], a recent work has shown using RNA-seq data that tumor-educated platelets (TEP) can distinguish 228 patients with localized and metastasized tumors from 55 healthy individuals with 96% accuracy [3]. However, as demonstrated in the current work, over-expression of MET genes in non-small cell lung carcinoma (NSCLC), and of HER2/ERBB2 genes in breast cancer, is grossly misreported. Based on an analysis of a smaller subset of samples, it is shown that there is little expression, let alone over-expression, of these genes in the samples with the specified disease. Confirmation that this is bona fide platelet mRNA is provided by high levels of the platelet marker TMSB4X. A kmer-based method (k=32) has been used here (KEATS) to detect homologous transcripts, although the results are easily verified by a BLAST search: BLAST’ing the MET gene (Accid:NM_001127500.2) against an NSCLC sample (https://trace.ncbi.nlm.nih.gov/Traces/sra/?run=SRR1982781) shows almost no expression. This is in contrast to the expected expression of the MET gene in a different NSCLC sample (SRR3475320). Similar contradictions apply for HER2/ERBB2 genes with respect to breast cancer samples.
This work emphasizes the necessity of a more stringent verification framework for bioinformatic analyses, and raises serious doubts on using TEP as a possible ‘liquid biopsy’ candidate.
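The abstract above mentions a kmer-based method (k=32) for detecting homologous transcripts but does not describe the algorithm. As a rough illustration of the general idea only, and not of KEATS itself, the following sketch counts reads that share at least one exact 32-mer with a reference transcript on either strand. All function names and the `min_shared` parameter are assumptions.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def kmers(seq: str, k: int = 32) -> set:
    """Return the set of all k-mers in seq (empty if seq is shorter than k)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def count_matching_reads(reference: str, reads, k: int = 32, min_shared: int = 1) -> int:
    """Count reads sharing at least min_shared exact k-mers with the reference.

    Both strands of the reference are indexed, since reads may come from
    either strand. Exact matching is a simplification: reads with sequencing
    errors in every k-mer window would be missed.
    """
    ref_index = kmers(reference, k) | kmers(reverse_complement(reference.upper()), k)
    hits = 0
    for read in reads:
        read = read.upper()
        shared = sum(1 for i in range(len(read) - k + 1) if read[i:i + k] in ref_index)
        if shared >= min_shared:
            hits += 1
    return hits
```

In use, `reference` would be a transcript such as the MET mRNA and `reads` the sequences of one RNA-seq run, so that a count near zero corresponds to the "almost no expression" observation. A BLAST search of the same reference against the same run serves as an independent check, as the abstract notes.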
bioRxiv | 2017
Sandeep Chakraborty
The ability of Epstein Barr Virus (EBV) to transform resting B-cells into immortalized lymphoblastoid cell lines (LCL) provides a continuous source of peripheral blood lymphocytes that are used to model conditions in which these lymphocytes play a key role. Here, the PacBio generated transcriptome of three LCLs from a parent-daughter trio (SRAid:SRP036136) provided by a previous study [1] was analyzed using a kmer-based version of YeATS (KEATS). The set of over-expressed genes in these cell lines was determined based on a comparison with the PacBio transcriptome of twenty tissues provided by another study (hOPTRS) [2]. MIR155 long non-coding RNA (MIR155HG), Fc fragment of IgE receptor II (FCER2), T-cell leukemia/lymphoma 1A (TCL1A), and germinal center associated signaling and motility (GCSAM) were the genes having the highest expression counts in the three LCLs with no expression in hOPTRS. Other over-expressed genes, having low expression in hOPTRS, were membrane spanning 4-domains A1 (MS4A1) and ribosomal protein S2 pseudogene 55 (RPS2P55). While some of these genes are known to be over-expressed in LCLs, this study provides a comprehensive cataloguing of such genes. A recent work reported a patient with EBV-positive large B-cell lymphoma that was ‘unusually lacking various B-cell markers’ but over-expressing CD30 [3], a gene ranked 79 among uniquely expressed genes here. Hypomethylation of chromosome 1 observed in EBV immortalized LCLs [4, 5] is also corroborated here by mapping the genes to chromosomes. Extending previous work identifying un-annotated genes [6], 80 genes were identified which are expressed in the three LCLs, not in hOPTRS, and missing in the GENCODE, RefSeq and RefSeqGene databases. KEATS introduces a method of determining expression counts based on a partitioning of the known annotated genes, has runtimes of a few hours on a personal workstation, and provides detailed reports enabling proper debugging.
bioRxiv | 2017
Sandeep Chakraborty
PacBio sequencing generates much longer reads compared to second-generation sequencing technologies, with a trade-off of lower throughput, higher error rate and higher cost per base. The PacBio transcriptome of the breast cancer cell line MCF-7 was found to have ∼300 transcripts un-annotated in the current GENCODE (v25) or RefSeq, and missing in the liver, heart and brain PacBio transcriptomes [1]. RACE-sequencing (RACE-seq [2]) extends a well-established method of characterizing cDNA molecules generated by rapid amplification of cDNA ends (RACE [3]) using high-throughput sequencing technologies, reducing costs compared to PacBio. Here, shorter fragments of ∼150 transcripts were found to be present in seven tissues analyzed in a recent RACE-seq study (Accid:ERP012249) [4]. These transcripts were not among the ∼2500 novel transcripts reported in that study, tested separately here using the genomic coordinates provided, although ‘all curated novel isoforms were incorporated into the human GENCODE set (v22)’ in that study. Non-redundancy analysis of the exclusive transcripts identified one transcript that maps to Chr1 with seven different splice variants, but is erroneously mapped to Chr15 (PAC clone 15q11-q13) from the Prader-Willi/Angelman Syndrome region (Accid:AC004137.1). Finally, there are ∼100 non-redundant transcripts missing in the seven tissues, in addition to the three other tissues analyzed previously. Their absence in the GENCODE and RefSeq databases rules them out as commonly transcribed regions, further increasing their likelihood as biomarkers.
bioRxiv | 2017
Sandeep Chakraborty
Illuminating the ‘dark’ regions of the human genome remains an ongoing effort, a decade and a half after the human genome was sequenced, with RefSeq and GENCODE being two of the major annotation databases. Pacific Biosciences (PacBio) has provided open access to the transcriptome of MCF-7, a breast cancer cell line that has contributed to significant therapeutic advances in breast cancer research since the 1970s. PacBio sequencing generates much longer reads compared to second-generation sequencing technologies, with a trade-off of lower throughput, higher error rate and higher cost per base. Here, this transcriptome was analyzed using the YeATS pipeline, with additionally introduced kmer-based algorithms, reducing computational times to a few hours on a simple workstation. Out of ~300 transcripts that have no match in either RefSeq or GENCODE, ~250 are absent in the transcriptomes of the heart, liver and brain, also provided by PacBio. Also, ~200 transcripts are absent in a recent catalogue of un-annotated long non-coding RNAs from 6,503 samples (~43 Terabases of sequence data) [1], and among the 2,556 novel transcripts reported in an experimental workflow, RACE-Seq [2]. 65 transcripts have >100 amino acid open reading frames, and have the potential of being protein coding genes. ORF based annotation also identified a few bacterial transcripts in the PacBio database mapped to the human genome, and one human transcript that has been annotated as bacterial in the NCBI database. The current work reiterates the under-utilization of transcriptomes for annotating genomes. It also provides new leads for investigating breast cancer by virtue of transcripts not expressed in other tissues, which are prospective breast cancer biomarkers, subject to further investigation.
bioRxiv | 2017
Sandeep Chakraborty
Final amendment note: This paper had proposed a plausible way for detecting large quantities of MET, which the authors have clarified was not done: “the possible explanation proposed for this erroneous MET gene expression does bypass the filtering step we perform in the data processing pipeline, i.e. selection of intron-spanning reads, as can be read in the main text” (comments in http://www.biorxiv.org/content/early/2017/07/02/146134, where the critique of the TEP study continues). Please consider this pre-print closed.
Original abstract: The reported over-expression of MET genes in non-small cell lung carcinoma (NSCLC) from an analysis of the RNA-seq data from tumor-educated platelets (TEP), intended to supplement existing ‘liquid biopsy’ techniques [1], has been refuted recently (http://biorxiv.org/content/early/2017/06/05/146134, not peer-reviewed). The MET proto-oncogene (Accid:NG_008996.1, RefSeqGene LRG_662 on chromosome 7, METwithintrons) encodes 21 exons resulting in a 6710 bp MET gene (Accid:NM_001127500.2, METonlyexons). METwithintrons has multiple matches in the RNA-seq derived reads of lung cancer samples (for example: SRR1982756.11853382). Unfortunately, these are non-specific sequences in the intronic regions, matching multiple genes on different chromosomes with 100% identity (KIF6 on chr6, COL6A6 on chr3, MYO16 on chr13, etc. for SRR1982756.11853382). In contrast, METonlyexons has few matches in the reads, if any [2]. However, even RNA-seq from healthy donors has similar matches for METwithintrons, so the computation behind the over-expression statistic remains obscure, even if METwithintrons was used as the search gene. In summary, this work reiterates the lack of reproducibility in the bioinformatic analysis that establishes TEP as a possible source for ‘liquid biopsy’.