Featured Research

Genomics

A draft genome assembly of southern bluefin tuna Thunnus maccoyii

Tuna are large pelagic fish whose populations are close to panmixia. In addition, they are threatened species, so it is important for the maintenance and monitoring of genetic diversity that genetic information at a genome level be obtained. Here we report the draft assembly of the southern bluefin tuna genome and the collection of genome-wide sequence data for five other tuna species. We sampled five tuna species of the genus Thunnus, the northern and southern bluefin, yellowfin, albacore, and bigeye, as well as the skipjack (Katsuwonus pelamis), a tuna-like species. Genome assembly was facilitated at k-mer = 25, while k-mer = 51 generated assembly artefacts. The estimated size of the southern bluefin tuna genome was 795 Mb. We assembled two southern bluefin tuna individuals independently using both paired-end and mate-pair sequence. This resulted in scaffolds with N50 > 174,000 bp and maximum scaffold lengths > 1.4 Mb. Our estimate of the size of the assembled genome was the scaffolded sequences in common to both assemblies, which amounted to 721 Mb of the 795 Mb of the southern bluefin tuna genome sequence. Using BLAST, 13,039 of 14,341 (91%) RefSeq mRNAs of the zebrafish Danio rerio matched the tuna assembly, indicating that most of a generic fish transcriptome was covered by the assembly.
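The N50 figure quoted above is a standard contiguity statistic: the length of the shortest scaffold in the smallest set of longest scaffolds that together cover at least half of the assembly. A minimal sketch of that calculation follows; the function name `n50` and the toy scaffold lengths are illustrative assumptions, not data from the tuna assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of scaffold (or contig) lengths.

    Scaffolds are taken from longest to shortest until the running total
    reaches half of the summed length; the length of the scaffold that
    crosses that threshold is the N50.
    """
    if not lengths:
        raise ValueError("need at least one scaffold length")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Hypothetical scaffold lengths (in bp), for illustration only.
toy_scaffolds = [80, 70, 50, 40, 30, 20, 10]
print(n50(toy_scaffolds))
```

With these toy lengths the total is 300, so the threshold of 150 is reached after the two longest scaffolds (80 + 70) and the N50 is 70.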

Genomics

A frame-based representation of genomic sequences for removing errors and rare variant detection in NGS data

We propose a frame-based representation of k-mers for detecting sequencing errors and rare variants in next-generation sequencing data obtained from populations of closely related genomes. Frames are sets of non-orthogonal basis functions, traditionally used in signal processing for noise removal. We define a frame for genomes and sequenced reads to consist of discrete spatial signals of every k-mer of a given size. We show that each k-mer in the sequenced data can be projected onto multiple frames and that these projections are maximized for spatial signals corresponding to the k-mer's substrings. Our proposed classifier, MultiRes, is trained on the projections of k-mers as features used for marking k-mers as erroneous or as true variations in the genome. We evaluate MultiRes on simulated and real viral population datasets and compare it to other error correction methods known in the literature. MultiRes makes 4 to 500 times fewer false-positive k-mer predictions than other methods, which is essential for accurate estimation of viral population diversity and de novo assembly of viral populations. It has high recall of the true k-mers, comparable to other error correction methods. MultiRes also has greater than 95% recall for detecting single nucleotide polymorphisms (SNPs) and yields fewer false-positive SNPs, while detecting a higher number of rare variants than other variant calling methods for viral populations. The software is freely available on GitHub.

Genomics

A framework to decipher the genetic architecture of combinations of complex diseases: applications in cardiovascular medicine

Genome-wide association studies (GWAS) have proven to be highly useful in revealing the genetic basis of complex diseases. At present, most GWAS are studies of a particular single disease diagnosis against controls. However, in practice, an individual is often affected by more than one condition/disorder. For example, patients with coronary artery disease (CAD) are often comorbid with diabetes mellitus (DM). Along a similar line, it is often clinically meaningful to study patients with one disease but without a comorbidity. For example, obese DM may have a different pathophysiology from non-obese DM. Here we developed a statistical framework to uncover susceptibility variants for comorbid disorders (or a disorder without comorbidity), using GWAS summary statistics only. In essence, we mimicked a case-control GWAS in which the cases are affected with comorbidities or a disease without a relevant comorbid condition (in either case, we may consider the cases as those affected by a specific subtype of disease, as characterized by the presence or absence of comorbid conditions). We extended our methodology to deal with continuous traits with clinically meaningful categories (e.g. lipids). In addition, we illustrated how the analytic framework may be extended to more than two traits. We verified the feasibility and validity of our method by applying it to simulated scenarios and four cardiometabolic (CM) traits. We also analyzed the genes, pathways, and cell types/tissues involved in CM disease subtypes. LD-score regression analysis revealed that some subtypes may indeed be biologically distinct, with low genetic correlations. Further Mendelian randomization analysis found differential causal effects of different subtypes on relevant complications. We believe the findings are of both scientific and clinical value, and the proposed method may open a new avenue to analyzing GWAS data.

Genomics

A generalized linear model for decomposing cis-regulatory, parent-of-origin, and maternal effects on allele-specific gene expression

Joint quantification of genetic and epigenetic effects on gene expression is important for understanding the establishment of complex gene regulation systems in living organisms. In particular, genomic imprinting and maternal effects play important roles in the developmental process of mammals and flowering plants. However, the influence of these effects on gene expression is difficult to quantify because they act simultaneously with cis-regulatory mutations. Here we propose a simple method to decompose cis-regulatory (i.e., allelic genotype, AG), genomic imprinting (i.e., parent-of-origin, PO), and maternal (i.e., maternal genotype, MG) effects on allele-specific gene expression using RNA-seq data obtained from reciprocal crosses. We evaluated the efficiency of the method using a simulated dataset and applied the method to whole-body Drosophila and mouse trophoblast stem cell (TSC) and liver RNA-seq data. Consistent with previous studies, we found little evidence of PO and MG effects in adult Drosophila samples. In contrast, we identified dozens and hundreds of mouse genes with significant PO and MG effects, respectively. Interestingly, a similar number of genes with a significant PO effect were detected in mouse TSCs and livers, whereas more genes with a significant MG effect were observed in livers. Further application of this method will clarify how these three effects influence gene expression levels in different tissues and developmental stages, and provide novel insight into the evolution of gene expression regulation.

Genomics

A genomic dominion with regulatory dependencies on human-specific single-nucleotide changes in Modern Humans

Gene set enrichment analyses of 8,405 genes linked with 35,074 human-specific (hs) regulatory single-nucleotide changes (SNCs) revealed the staggering breadth of significant associations with morphological structures, physiological processes, and pathological conditions of Modern Humans. Significant enrichment traits include more than 1,000 anatomically-distinct regions of the adult human brain, many different types of human cells and tissues, more than 200 common human disorders and more than 1,000 records of rare diseases. Thousands of genes connected with regulatory hsSNCs have been identified in this contribution, which represent essential genetic elements of the autosomal inheritance and survival of species phenotypes: a total of 1,494 genes linked with either autosomal dominant or recessive inheritance as well as 2,273 genes associated with premature death, embryonic lethality, as well as pre-, peri-, neo-, and post-natal lethality of both complete and incomplete penetrance. Therefore, thousands of heritable traits and critical genes impacting the offspring survival appear under the human-specific regulatory control in genomes of Modern Humans. These observations highlight the remarkable translational opportunities afforded by the discovery of genetic regulatory loci harboring hsSNCs that are fixed in humans, distinct from other primates, and located in differentially-accessible (DA) chromatin regions during human brain development.

Genomics

A guided network propagation approach to identify disease genes that combines prior and new information

A major challenge in biomedical data science is to identify the causal genes underlying complex genetic diseases. Despite the massive influx of genome sequencing data, identifying disease-relevant genes remains difficult as individuals with the same disease may share very few, if any, genetic variants. Protein-protein interaction networks provide a means to tackle this heterogeneity, as genes causing the same disease tend to be proximal within networks. Previously, network propagation approaches have spread signal across the network from either known disease genes or genes that are newly putatively implicated in the disease (e.g., found to be mutated in exome studies or linked via genome-wide association studies). Here we introduce a general framework that considers both sources of data within a network context. Specifically, we use prior knowledge of disease-associated genes to guide random walks initiated from genes that are newly identified as perhaps disease-relevant. In large-scale testing across 24 cancer types, we demonstrate that our approach for integrating both prior and new information not only better identifies cancer driver genes than using either source of information alone but also readily outperforms other state-of-the-art network-based approaches. To demonstrate the versatility of our approach, we also apply it to genome-wide association data to identify genes functionally relevant for several complex diseases. Overall, our work suggests that guided network propagation approaches that utilize both prior and new data are a powerful means to identify disease genes.
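To make the idea of network propagation concrete, here is a minimal sketch of a plain random walk with restart on a small undirected network, written in pure Python. It illustrates only the unguided kind of propagation that the abstract attributes to earlier approaches; it is not the authors' guided method, and the function name, the restart probability of 0.3, and the four-gene toy network are all assumptions made for illustration.

```python
def random_walk_with_restart(adjacency, seeds, restart=0.3,
                             tol=1e-10, max_iter=1000):
    """Propagate signal from seed nodes over an undirected network.

    adjacency: dict mapping each node to a list of its neighbours.
    seeds:     nodes where the walk starts and restarts (for example,
               newly implicated genes).
    Returns a dict of node -> stationary visiting probability.
    """
    seeds = set(seeds)
    if not seeds or not seeds <= set(adjacency):
        raise ValueError("seeds must be a non-empty subset of the nodes")
    nodes = list(adjacency)
    # Restart distribution: uniform over the seed nodes.
    start = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    scores = dict(start)
    for _ in range(max_iter):
        # With probability `restart` the walker jumps back to a seed.
        updated = {n: restart * start[n] for n in nodes}
        for n in nodes:
            # Isolated nodes keep their mass (treated as a self-loop).
            targets = adjacency[n] or [n]
            share = (1.0 - restart) * scores[n] / len(targets)
            for m in targets:
                updated[m] += share
        change = sum(abs(updated[n] - scores[n]) for n in nodes)
        scores = updated
        if change < tol:
            break
    return scores


# Hypothetical four-gene path network: A - B - C - D.
toy_network = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}
toy_scores = random_walk_with_restart(toy_network, seeds=["A"])
for gene in sorted(toy_scores, key=toy_scores.get, reverse=True):
    print(gene, round(toy_scores[gene], 4))
```

Genes closer to the seed receive higher scores, which is the proximity signal such methods rank on. A guided variant would additionally use prior disease genes to bias the walk, a step this sketch deliberately leaves out.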

Genomics

A machine learning approach to drug repositioning based on drug expression profiles: Applications to schizophrenia and depression/anxiety disorders

Development of new medications is a very lengthy and costly process. Finding novel indications for existing drugs, or drug repositioning, can serve as a useful strategy to shorten the development cycle. In this study, we present an approach to drug discovery or repositioning by predicting indications for a particular disease based on expression profiles of drugs, with a focus on applications in psychiatry. Drugs that are not originally indicated for the disease but have high predicted probabilities serve as good candidates for repurposing. This framework is widely applicable to any chemicals or drugs with expression profiles measured, even if the drug targets are unknown. It is also highly flexible, as virtually any supervised learning algorithm can be used. We applied this approach to identify repositioning opportunities for schizophrenia as well as depression and anxiety disorders. We applied various state-of-the-art machine learning (ML) approaches for prediction, including deep neural networks, support vector machines (SVM), elastic net, random forest and gradient boosted machines. The performance of the five approaches did not differ substantially, with SVM slightly outperforming the others. However, methods with lower predictive accuracy can still reveal literature-supported candidates with different mechanisms of action. As a further validation, we showed that the repositioning hits are enriched for psychiatric medications considered in clinical trials. Notably, many top repositioning hits are supported by previous preclinical or clinical studies. Finally, we propose that ML approaches may provide a new avenue to explore drug mechanisms via examining the variable importance of gene features.

Genomics

A model for the clustered distribution of SNPs in the human genome

Motivated by the non-random, clustered distribution of SNPs, we introduce a phenomenological model to account for the clustering properties of SNPs in the human genome. The phenomenological model is based on preferential mutation in close proximity to existing SNPs. With the HapMap SNP data, we empirically demonstrate that the preferential model describes the clustered distribution of SNPs better than the random model. Moreover, the model is applicable not only to autosomes but also to the X chromosome, although the X chromosome has different characteristics from autosomes. The analysis of the estimated parameters in the model can explain the pronounced population structure and the low genetic diversity of the X chromosome. In addition, correlation between the parameters reveals population-wise differences in the mutation probability. These results support the mutational non-independence hypothesis against random mutation.

Genomics

A multi-modal neural network for learning cis and trans regulation of stress response in yeast

Deciphering gene regulatory networks is a central problem in computational biology. Here, we explore the use of multi-modal neural networks to learn predictive models of gene expression that include cis and trans regulatory components. We learn models of stress response in the budding yeast Saccharomyces cerevisiae. Our models achieve high performance and substantially outperform other state-of-the-art methods such as boosting algorithms that use pre-defined cis-regulatory features. Our model learns several cis and trans regulators including well-known master stress response regulators. We use our models to perform in-silico TF knock-out experiments and demonstrate that in-silico predictions of target gene changes correlate with the results of the corresponding TF knockout microarray experiment.

Genomics

A quick guide for student-driven community genome annotation

High quality gene models are necessary to expand the molecular and genetic tools available for a target organism, but these are available for only a handful of model organisms that have undergone extensive curation and experimental validation over the course of many years. The majority of gene models present in biological databases today have been identified in draft genome assemblies using automated annotation pipelines that are frequently based on orthologs from distantly related model organisms. Manual curation is time consuming and often requires substantial expertise, but is instrumental in improving gene model structure and identification. Manual annotation may seem to be a daunting and cost-prohibitive task for small research communities but involving undergraduates in community genome annotation consortiums can be mutually beneficial for both education and improved genomic resources. We outline a workflow for efficient manual annotation driven by a team of primarily undergraduate annotators. This model can be scaled to large teams and includes quality control processes through incremental evaluation. Moreover, it gives students an opportunity to increase their understanding of genome biology and to participate in scientific research in collaboration with peers and senior researchers at multiple institutions.
