Featured Research

Genomics

An annotated list of bivalent chromatin regions in human ES cells: a new tool for cancer epigenetic research

CpG islands (CGI) marked by bivalent chromatin in stem cells are believed to be more prone to aberrant DNA methylation in tumor cells. The robustness and genome-wide extent of this instructive program in different cancer types remain to be determined. To address this issue, we developed a user-friendly approach to integrate the stem cell chromatin signature into customized DNA methylation analyses. We used publicly available ChIP-sequencing datasets from several human embryonic stem cell (hESC) lines to determine the extent of bivalent chromatin genome-wide. We then created annotated lists of high-confidence bivalent, H3K4me3-only and H3K27me3-only chromatin regions. The main features of bivalent regions included localization in CGI/promoters, depletion in retroelements and enrichment in specific histone modifications, including the poorly characterized H3K23me2 mark. Moreover, bivalent promoters could be classified into three clusters based on occupancy by the PRC2 and Pol II complexes. Genes with bivalent promoters of the PRC2-defined cluster displayed the lowest expression upon differentiation. As a proof of concept, we assessed the DNA methylation pattern of eight types of tumors and confirmed that aberrant cancer-associated DNA hypermethylation preferentially targets CGI characterized by bivalent chromatin in hESCs. We also found that such aberrant DNA hypermethylation particularly affected bivalent CGI/promoters associated with genes that tend to remain repressed upon differentiation. Strikingly, bivalent CGI were the most affected by aberrant DNA hypermethylation in both CpG Island Methylator Phenotype-positive (CIMP+) and CIMP-negative tumors, suggesting that, besides transcriptional silencing in the pre-tumorigenic cells, the bivalent chromatin signature in hESCs is a key determinant of the instructive program for aberrant DNA methylation.
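The three region classes named in the abstract can be illustrated with a simple interval overlap. The sketch below is not the authors' pipeline: it assumes that a bivalent region is the intersection of an H3K4me3 peak and an H3K27me3 peak, and the function names and the (chromosome, start, end) tuple layout are hypothetical.

```python
def overlaps(a, b):
    """Return True if two half-open intervals (chrom, start, end) share a base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def classify_peaks(k4_peaks, k27_peaks):
    """Split peaks into bivalent, H3K4me3-only and H3K27me3-only regions.

    Assumption (not from the abstract): a bivalent region is the
    intersection of an H3K4me3 peak with an H3K27me3 peak.
    """
    bivalent, k4_only = [], []
    for p in k4_peaks:
        hits = [q for q in k27_peaks if overlaps(p, q)]
        if hits:
            for q in hits:
                # Keep only the shared part of the two peaks.
                bivalent.append((p[0], max(p[1], q[1]), min(p[2], q[2])))
        else:
            k4_only.append(p)
    k27_only = [q for q in k27_peaks
                if not any(overlaps(p, q) for p in k4_peaks)]
    return bivalent, k4_only, k27_only


# Toy peaks, invented for illustration only.
k4 = [("chr1", 100, 200), ("chr1", 500, 600)]
k27 = [("chr1", 150, 300), ("chr2", 100, 200)]
bivalent, k4_only, k27_only = classify_peaks(k4, k27)
```

The quadratic scan is only suitable for small inputs; genome-scale peak sets would normally be sorted or indexed first.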

Genomics

An observation of circular RNAs in bacterial RNA-seq data

Circular RNAs (circRNAs) are a class of RNA with an important role in microRNA (miRNA) regulation, recently discovered in humans and various other eukaryotes as well as in archaea. Here, we have analyzed RNA-seq data obtained from Enterococcus faecalis and Escherichia coli in a way similar to previous studies performed on eukaryotes. We report observations of circRNAs in RNA-seq data that are reproducible across multiple experiments performed with different protocols or growth conditions.

Genomics

Analysis of Chromosome 20 - A Study

Since the arrival of next-generation sequencing technologies, the amount of genetic sequencing data has increased dramatically. This has fueled an increase in human genetics research. At the same time, many of the recent technologies for processing large data sets are proving valuable and efficient in analyzing these huge datasets. In this paper we use some of these technologies to analyze genetic sequencing data from the 1000 Genomes Project, produce and evaluate a framework to process that sequencing data, and look into structural variations with respect to population groups.

Genomics

Analysis of Extremely Obese Individuals Using Deep Learning Stacked Autoencoders and Genome-Wide Genetic Data

The aetiology of polygenic obesity is multifactorial, which indicates that lifestyle and environmental factors may influence multiple genes to aggravate this disorder. Several low-risk single nucleotide polymorphisms (SNPs) have been associated with BMI. However, identified loci only explain a small proportion of the variation observed for this phenotype. The linear nature of genome-wide association studies (GWAS) used to identify associations between genetic variants and the phenotype has had limited success in explaining the heritability variation of BMI and has shown low predictive capacity in classification studies. GWAS ignores the epistatic interactions that less significant variants have on the phenotypic outcome. In this paper we utilise a novel deep learning-based methodology to reduce the high-dimensional space in GWAS and find epistatic interactions between SNPs for classification purposes. SNPs were filtered based on the strength of their association with BMI. Since Bonferroni adjustment for multiple testing is highly conservative, an important proportion of SNPs involved in SNP-SNP interactions are ignored. Therefore, only SNPs with p-values < 1x10^-2 were considered for subsequent epistasis analysis using stacked autoencoders (SAE). This allows the nonlinearity present in SNP-SNP interactions to be discovered through progressively smaller hidden layer units and to initialise a multi-layer feedforward artificial neural network (ANN) classifier. The classifier is fine-tuned to classify extremely obese and non-obese individuals. The best results were obtained with 2000 compressed units (SE=0.949153, SP=0.933014, Gini=0.949936, Logloss=0.1956, AUC=0.97497 and MSE=0.054057). Using 50 compressed units it was possible to achieve SE=0.785311, SP=0.799043, Gini=0.703566, Logloss=0.476864, AUC=0.85178 and MSE=0.156315.

Genomics

Analysis of Gene Interaction Graphs as Prior Knowledge for Machine Learning Models

Gene interaction graphs aim to capture various relationships between genes and can represent decades of biology research. When trying to make predictions from genomic data, those graphs could be used to overcome the curse of dimensionality by making machine learning models sparser and more consistent with biological common knowledge. In this work, we focus on assessing how well those graphs capture dependencies seen in gene expression data to evaluate the adequacy of the prior knowledge provided by those graphs. We propose a condition graphs should satisfy to provide good prior knowledge and test it using 'Single Gene Inference' tasks. We also compare with randomly generated graphs, aiming to measure the true benefit of using biologically relevant graphs in this context, and validate our findings with five clinical tasks. We find some graphs capture relevant dependencies for most genes while being very sparse. Our analysis with random graphs finds that dependencies can be captured almost as well at random, which suggests that, in terms of gene expression levels, the relevant information about the state of the cell is spread across many genes.

Genomics

Analysis of Genomic and Transcriptomic Variations as Prognostic Signature for Lung Adenocarcinoma

Lung cancer is the leading cause of cancer deaths worldwide, and lung adenocarcinoma (LUAD) is the most common form of lung cancer. In this study, we carried out an integrated meta-analysis of mutations, including single-nucleotide variations (SNVs) and copy number variations (CNVs), together with RNA-seq and clinical data of patients with LUAD downloaded from The Cancer Genome Atlas (TCGA). We integrated significant SNV and CNV genes, differentially expressed genes (DEGs) and the DEGs in active subnetworks to construct a prognosis signature. A Cox proportional hazards model with Lasso penalty and leave-one-out cross-validation (LOOCV) was used to identify the best gene signature among different gene categories. The patients in both training and test data were clustered into high-risk and low-risk groups using risk scores calculated from the selected gene signature. We generated a 12-gene signature (DEPTOR, ZBTB16, BCHE, MGLL, MASP2, TNNI2, RAPGEF3, SGK2, MYO1A, CYP24A1, PODXL2, CCNA1) for overall survival prediction. The survival time of the high-risk and low-risk groups was significantly different. This 12-gene signature could predict prognosis, and its genes are potential predictors of survival in patients with LUAD.
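The grouping step can be illustrated with a short sketch. It assumes the common convention that a patient's risk score is the sum of each signature gene's expression weighted by its Cox coefficient, and that patients are split at the median score; the abstract states neither the formula nor the cutoff. The coefficients, expression values and function names below are hypothetical and only two of the twelve genes are used.

```python
from statistics import median


def risk_score(expression, coefficients):
    """Linear risk score: sum over genes of coefficient * expression."""
    return sum(coefficients[g] * expression[g] for g in coefficients)


def split_by_risk(patients, coefficients):
    """Label each patient 'high' or 'low' relative to the median score.

    Assumption: a median cutoff; the abstract does not state the rule used.
    """
    scores = {pid: risk_score(expr, coefficients)
              for pid, expr in patients.items()}
    cutoff = median(scores.values())
    groups = {pid: ("high" if s > cutoff else "low")
              for pid, s in scores.items()}
    return groups, scores


# Hypothetical coefficients and expression values, for illustration only.
coefficients = {"DEPTOR": -0.5, "CCNA1": 1.0}
patients = {
    "p1": {"DEPTOR": 2.0, "CCNA1": 1.0},
    "p2": {"DEPTOR": 0.0, "CCNA1": 2.0},
    "p3": {"DEPTOR": 4.0, "CCNA1": 0.0},
    "p4": {"DEPTOR": 1.0, "CCNA1": 3.0},
}
groups, scores = split_by_risk(patients, coefficients)
```

In practice the cutoff would be fixed on the training data and reused unchanged on the test data, so that the test patients do not influence the threshold.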

Genomics

Analysis of ensemble feature selection for correlated high-dimensional RNA-Seq cancer data

Discovery of diagnostic and prognostic molecular markers is an important and actively pursued field in cancer research. For complex diseases, this process is often performed using machine learning. The current study compares two approaches for the discovery of relevant variables: application of a single feature selection algorithm versus an ensemble of diverse algorithms. These approaches are used to identify variables that are relevant for discerning four cancer types using RNA-seq profiles from The Cancer Genome Atlas. The comparison is carried out in two directions: evaluating the predictive performance of models and monitoring the stability of selected variables. The most informative features are identified using four feature selection algorithms, namely U-test, ReliefF, and two variants of the MDFS algorithm. Discerning normal and tumor tissues is performed using the Random Forest algorithm. The highest stability of the feature set was obtained when U-test was used. Unfortunately, models built on feature sets obtained from the ensemble of feature selection algorithms were no better than models developed on feature sets obtained from individual algorithms. On the other hand, the feature selectors leading to the best classification results varied between data sets.

Genomics

Analysis of evolutionary origins of genomic loci harboring 59,732 candidate human-specific regulatory sequences identifies genetic divergence patterns during evolution of Great Apes

Our view of the universe of genomic regions harboring various types of candidate human-specific regulatory sequences (HSRS) has been markedly expanded in recent years. To infer the evolutionary origins of loci harboring HSRS, analyses of conservation patterns of 59,732 loci in Modern Human, Chimpanzee, Bonobo, Gorilla, Orangutan, Gibbon, and Rhesus genomes have been performed. Two major evolutionary pathways have been identified, comprising thousands of sequences that were either inherited from extinct common ancestors (ECAs) or created de novo in humans after the human/chimpanzee split. Thousands of HSRS appear to be inherited from ECAs yet bypassed the genomes of our closest evolutionary relatives, presumably due to incomplete lineage sorting and/or species-specific loss of regulatory DNA. The bypassing pattern is prominent for HSRS associated with development and functions of the human brain. Common genomic loci that may have contributed to speciation during the evolution of Great Apes comprise 248 insertion sites of the African Great Ape-specific retrovirus PtERV1 (45.9%; p = 1.03E-44) intersecting regions harboring 442 HSRS, which are enriched for HSRS associated with human-specific (HS) changes of gene expression in cerebral organoids. Among non-human primates (NHP), the most significant fractions of candidate HSRS associated with HS expression changes in both excitatory neurons (347 loci; 67%) and radial glia (683 loci; 72%) are highly conserved in the Gorilla genome. Modern Humans acquired unique combinations of regulatory sequences highly conserved in distinct species of six NHP separated by 30 million years of evolution. Concurrently, this unique mosaic of regulatory sequences inherited from ECAs was supplemented with 12,486 HSRS created de novo. These observations support the model of a complex, continuous speciation process during the evolution of Great Apes that is not likely to occur as an instantaneous event.

Genomics

Analysis of genetic differences between psychiatric disorders: Exploring pathways and cell-types/tissues involved and ability to differentiate the disorders by polygenic scores

Although displaying genetic correlations, psychiatric disorders are clinically defined as categorical entities, as they each have distinguishing clinical features and may involve different treatments. Identifying differential genetic variations between these disorders may reveal how the disorders differ biologically and help to guide more personalized treatment. Here we present a comprehensive analysis to identify genetic markers differentially associated with various psychiatric disorders/traits based on GWAS summary statistics, covering 18 psychiatric traits/disorders and 26 comparisons. We also conducted a comprehensive analysis to unravel the genes, pathways and SNP functional categories involved, and the cell types and tissues implicated. In addition, we assessed how well one could distinguish between psychiatric disorders by polygenic risk scores (PRS). SNP-based heritabilities (h2SNP) were significantly larger than zero for most comparisons. Based on current GWAS data, PRS have mostly modest power to distinguish between psychiatric disorders. For example, we estimated that the AUC for distinguishing schizophrenia from major depressive disorder (MDD), bipolar disorder (BPD) from MDD and schizophrenia from BPD were 0.694, 0.602 and 0.618 respectively, while the maximum AUC (based on h2SNP) were 0.763, 0.749 and 0.726 respectively. We also found that the traits in each studied pair differ in their genetic correlations with comorbid traits. For example, clinically defined MDD appeared to be more strongly genetically correlated with other psychiatric disorders and heart disease than non-clinically defined depression in UK Biobank. Our findings highlight genetic differences between psychiatric disorders and the mechanisms involved. PRS may aid differential diagnosis of selected psychiatric disorders in the future with larger GWAS samples.
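To make the AUC figures concrete: the AUC of a score for separating two groups equals the probability that a randomly chosen member of the first group scores higher than a randomly chosen member of the second, with ties counted as one half. The sketch below computes that quantity directly from two lists of polygenic scores. It is not the estimation procedure used in the study, which works from GWAS summary statistics; the function name and the toy scores are hypothetical.

```python
def pairwise_auc(scores_a, scores_b):
    """AUC as the fraction of (a, b) pairs in which a outranks b.

    Ties contribute 0.5. Both lists must be non-empty.
    """
    wins = 0.0
    for a in scores_a:
        for b in scores_b:
            if a > b:
                wins += 1.0
            elif a == b:
                wins += 0.5
    return wins / (len(scores_a) * len(scores_b))


# Invented scores for two hypothetical patient groups.
group_a = [0.9, 0.8, 0.4]
group_b = [0.5, 0.3, 0.2]
auc_value = pairwise_auc(group_a, group_b)
```

An AUC of 0.5 corresponds to no discrimination and 1.0 to perfect separation, which is why values such as 0.602 or 0.694 are described as modest.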

Genomics

Anchor points for genome alignment based on Filtered Spaced Word Matches

Alignment of large genomic sequences is a fundamental task in computational genome analysis. Most methods for genomic alignment use high-scoring local alignments as anchor points to reduce the search space of the alignment procedure. Speed and quality of these methods therefore depend on the underlying anchor points. Herein, we propose to use Filtered Spaced Word Matches to calculate anchor points for genome alignment. To evaluate this approach, we used these anchor points in the widely used alignment pipeline Mugsy. For distantly related sequence sets, we could substantially improve the quality of alignments produced by Mugsy.
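The idea of a filtered spaced-word match can be sketched in a few lines. A binary pattern marks match positions ("1") and don't-care positions ("0"); two windows form a spaced-word match when they agree at every match position, and the match is kept only if a score computed over the don't-care positions reaches a threshold. The sketch is a simplified illustration, not the published implementation: the +1/-1 scoring, the default pattern, the threshold and the function names are all assumptions.

```python
from collections import defaultdict


def spaced_words(seq, pattern):
    """Index every window of seq by its characters at the '1' positions."""
    care = [i for i, c in enumerate(pattern) if c == "1"]
    index = defaultdict(list)
    for pos in range(len(seq) - len(pattern) + 1):
        word = "".join(seq[pos + i] for i in care)
        index[word].append(pos)
    return index


def filtered_matches(seq1, seq2, pattern="1101", threshold=0):
    """Return (pos1, pos2, score) for spaced-word matches that pass the filter.

    Assumption: don't-care positions score +1 for a match and -1 for a
    mismatch, standing in for a proper substitution matrix.
    """
    dont_care = [i for i, c in enumerate(pattern) if c == "0"]
    idx1 = spaced_words(seq1, pattern)
    idx2 = spaced_words(seq2, pattern)
    anchors = []
    for word, positions1 in idx1.items():
        for p1 in positions1:
            for p2 in idx2.get(word, []):
                score = sum(1 if seq1[p1 + i] == seq2[p2 + i] else -1
                            for i in dont_care)
                if score >= threshold:
                    anchors.append((p1, p2, score))
    return sorted(anchors)


# Toy sequences: identical inputs match along the diagonal.
same = filtered_matches("ACGTAC", "ACGTAC")
# One substitution at a don't-care position lowers the score below 0.
strict = filtered_matches("ACGTAC", "ACTTAC")
lenient = filtered_matches("ACGTAC", "ACTTAC", threshold=-1)
```

The surviving (pos1, pos2) pairs are the candidate anchor points; an alignment pipeline would then chain them before aligning the regions in between.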

