Jingyi Jessica Li
University of California, Los Angeles
Publication
Featured research published by Jingyi Jessica Li.
Nature | 2014
Mark Gerstein; Joel Rozowsky; Koon Kiu Yan; Daifeng Wang; Chao Cheng; James B. Brown; Carrie A. Davis; LaDeana W. Hillier; Cristina Sisu; Jingyi Jessica Li; Baikang Pei; Arif Harmanci; Michael O. Duff; Sarah Djebali; Roger P. Alexander; Burak H. Alver; Raymond K. Auerbach; Kimberly Bell; Peter J. Bickel; Max E. Boeck; Nathan Boley; Benjamin W. Booth; Lucy Cherbas; Peter Cherbas; Chao Di; Alexander Dobin; Jorg Drenkow; Brent Ewing; Gang Fang; Megan Fastuca
The transcriptome is the readout of the genome. Identifying common features in it across distant species can reveal fundamental principles. To this end, the ENCODE and modENCODE consortia have generated large amounts of matched RNA-sequencing data for human, worm and fly. Uniform processing and comprehensive annotation of these data allow comparison across metazoan phyla, extending beyond earlier within-phylum transcriptome comparisons and revealing ancient, conserved features. Specifically, we discover co-expression modules shared across animals, many of which are enriched in developmental genes. Moreover, we use expression patterns to align the stages in worm and fly development and find a novel pairing between worm embryo and fly pupae, in addition to the embryo-to-embryo and larvae-to-larvae pairings. Furthermore, we find that the extent of non-canonical, non-coding transcription is similar in each organism, per base pair. Finally, we find in all three organisms that the gene-expression levels, both coding and non-coding, can be quantitatively predicted from chromatin features at the promoter using a ‘universal model’ based on a single set of organism-independent parameters.
PeerJ | 2014
Jingyi Jessica Li; Peter J. Bickel; Mark D. Biggin
Large scale surveys in mammalian tissue culture cells suggest that the protein expressed at the median abundance is present at 8,000–16,000 molecules per cell and that differences in mRNA expression between genes explain only 10–40% of the differences in protein levels. We find, however, that these surveys have significantly underestimated protein abundances and the relative importance of transcription. Using individual measurements for 61 housekeeping proteins to rescale whole proteome data from Schwanhäusser et al. (2011), we find that the median protein detected is expressed at 170,000 molecules per cell and that our corrected protein abundance estimates show a higher correlation with mRNA abundances than do the uncorrected protein data. In addition, we estimated the impact of further errors in mRNA and protein abundances using direct experimental measurements of these errors. The resulting analysis suggests that mRNA levels explain at least 56% of the differences in protein abundance for the 4,212 genes detected by Schwanhäusser et al. (2011), though because one major source of error could not be estimated the true percent contribution should be higher. We also employed a second, independent strategy to determine the contribution of mRNA levels to protein expression. We show that the variance in translation rates directly measured by ribosome profiling is only 12% of that inferred by Schwanhäusser et al. (2011), and that the measured and inferred translation rates correlate poorly (R² = 0.13). Based on this, our second strategy suggests that mRNA levels explain ∼81% of the variance in protein levels. We also determined the percent contributions of transcription, RNA degradation, translation and protein degradation to the variance in protein abundances using both of our strategies. While the magnitudes of the two estimates vary, they both suggest that transcription plays a more important role than the earlier studies implied and translation a much smaller role.
Finally, the above estimates only apply to those genes whose mRNA and protein expression was detected. Based on a detailed analysis by Hebenstreit et al. (2012), we estimate that approximately 40% of genes in a given cell within a population express no mRNA. Since there can be no translation in the absence of mRNA, we argue that differences in translation rates can play no role in determining the expression levels for the ∼40% of genes that are non-expressed.
Nature | 2014
Alan P. Boyle; Carlos L. Araya; Cathleen M. Brdlik; Philip Cayting; Chao Cheng; Yong Cheng; Kathryn E. Gardner; LaDeana W. Hillier; J. Janette; Lixia Jiang; Dionna M. Kasper; Trupti Kawli; Pouya Kheradpour; Anshul Kundaje; Jingyi Jessica Li; Lijia Ma; Wei Niu; E. Jay Rehm; Joel Rozowsky; Matthew Slattery; Rebecca Spokony; Robert Terrell; Dionne Vafeados; Daifeng Wang; Peter Weisdepp; Yi-Chieh Wu; Dan Xie; Koon Kiu Yan; Elise A. Feingold; Peter J. Good
Despite the large evolutionary distances between metazoan species, they can show remarkable commonalities in their biology, and this has helped to establish fly and worm as model organisms for human biology. Although studies of individual elements and factors have explored similarities in gene regulation, a large-scale comparative analysis of basic principles of transcriptional regulatory features is lacking. Here we map the genome-wide binding locations of 165 human, 93 worm and 52 fly transcription regulatory factors, generating a total of 1,019 data sets from diverse cell types, developmental stages, or conditions in the three species, of which 498 (48.9%) are presented here for the first time. We find that structural properties of regulatory networks are remarkably conserved and that orthologous regulatory factor families recognize similar binding motifs in vivo and show some similar co-associations. Our results suggest that gene-regulatory properties previously observed for individual factors are general principles of metazoan regulation that are remarkably well-preserved despite extensive functional divergence of individual network connections. The comparative maps of regulatory circuitry provided here will drive an improved understanding of the regulatory underpinnings of model organism biology and how these relate to human biology, development and disease.
Proceedings of the National Academy of Sciences of the United States of America | 2012
William W. Fisher; Jingyi Jessica Li; Ann S. Hammonds; James B. Brown; Barret D. Pfeiffer; Richard Weiszmann; Stewart MacArthur; Sean Thomas; John A. Stamatoyannopoulos; Michael B. Eisen; Peter J. Bickel; Mark D. Biggin; Susan E. Celniker
In animals, each sequence-specific transcription factor typically binds to thousands of genomic regions in vivo. Our previous studies of 20 transcription factors show that most genomic regions bound at high levels in Drosophila blastoderm embryos are known or probable functional targets, but genomic regions occupied only at low levels have characteristics suggesting that most are not involved in the cis-regulation of transcription. Here we use transgenic reporter gene assays to directly test the transcriptional activity of 104 genomic regions bound at different levels by the 20 transcription factors. Fifteen genomic regions were selected based solely on the DNA occupancy level of the transcription factor Kruppel. Five of the six most highly bound regions drive blastoderm patterns of reporter transcription. In contrast, only one of the nine lowly bound regions drives transcription at this stage and four of them are not detectably active at any stage of embryogenesis. A larger set of 89 genomic regions chosen using criteria designed to identify functional cis-regulatory regions supports the same trend: genomic regions occupied at high levels by transcription factors in vivo drive patterned gene expression, whereas those occupied only at lower levels mostly do not. These results support studies that indicate that the high cellular concentrations of sequence-specific transcription factors drive extensive, low-occupancy, nonfunctional interactions within the accessible portions of the genome.
Proceedings of the National Academy of Sciences of the United States of America | 2011
Jingyi Jessica Li; Ci-Ren Jiang; James B. Brown; Haiyan Huang; Peter J. Bickel
Since the inception of next-generation mRNA sequencing (RNA-Seq) technology, various attempts have been made to utilize RNA-Seq data in assembling full-length mRNA isoforms de novo and estimating abundance of isoforms. However, for genes with more than a few exons, the problem tends to be challenging and often involves identifiability issues in statistical modeling. We have developed a statistical method called “sparse linear modeling of RNA-Seq data for isoform discovery and abundance estimation” (SLIDE) that takes exon boundaries and RNA-Seq data as input to discern the set of mRNA isoforms that are most likely to be present in an RNA-Seq sample. SLIDE is based on a linear model with a design matrix that models the sampling probability of RNA-Seq reads from different mRNA isoforms. To tackle the model unidentifiability issue, SLIDE uses a modified Lasso procedure for parameter estimation. Compared with deterministic isoform assembly algorithms (e.g., Cufflinks), SLIDE considers the stochastic aspects of RNA-Seq reads in exons from different isoforms and thus has increased power in detecting more novel isoforms. Another advantage of SLIDE is its flexibility of incorporating other transcriptomic data such as RACE, CAGE, and EST into its model to further increase isoform discovery accuracy. SLIDE can also work downstream of other RNA-Seq assembly algorithms to integrate newly discovered genes and exons. Besides isoform discovery, SLIDE sequentially uses the same linear model to estimate the abundance of discovered isoforms. Simulation and real data studies show that SLIDE performs as well as or better than major competitors in both isoform discovery and abundance estimation. The SLIDE software package is available at https://sites.google.com/site/jingyijli/SLIDE.zip.
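The abstract describes a linear model fitted with a modified Lasso but does not spell out the modification. As a rough illustration of the general idea only, the sketch below fits a plain non-negative Lasso by coordinate descent on a toy problem in which three exon-level observations must be explained by four candidate isoforms, so the unpenalized model is not identifiable. The function name `nonneg_lasso`, the penalty `lam`, and the toy matrix `X` are all hypothetical and are not taken from the SLIDE software.

```python
import numpy as np

def nonneg_lasso(X, y, lam, n_iter=200):
    """Minimise 0.5*||y - X b||^2 + lam*sum(b) subject to b >= 0
    using cyclic coordinate descent (a generic sketch, not SLIDE itself)."""
    n, p = X.shape
    b = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)   # squared norm of each column
    r = y - X @ b                   # current residual
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # inner product of column j with the residual that excludes column j
            rho = X[:, j] @ r + col_sq[j] * b[j]
            new_bj = max(0.0, (rho - lam) / col_sq[j])
            r += X[:, j] * (b[j] - new_bj)  # keep the residual in sync
            b[j] = new_bj
    return b

# Toy design (assumed for illustration): rows are exons, columns are
# candidate isoforms; 1 means the isoform contains the exon.
X = np.array([[1.0, 1.0, 1.0, 0.0],
              [1.0, 0.0, 1.0, 1.0],
              [1.0, 1.0, 0.0, 1.0]])
true_b = np.array([2.0, 1.0, 0.0, 0.0])  # only two isoforms are expressed
y = X @ true_b                           # noiseless exon-level signal

b_hat = nonneg_lasso(X, y, lam=0.1)
print(np.round(b_hat, 3))  # approximately [2.0, 0.95, 0.0, 0.0]
```

With four unknowns and three observations, many non-negative solutions fit `y` exactly; the L1 penalty selects the sparsest one here and shrinks the second coefficient slightly, which is the behaviour that motivates using a Lasso-type estimator for isoform discovery.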
Science | 2015
Jingyi Jessica Li; Mark D. Biggin
Transcription, not translation, chiefly determines protein abundance in mammals [Also see Research Article by Jovanovic et al.] Mammalian proteins are expressed at ∼10³ to 10⁸ molecules per cell (1). Differences between cell types, between normal and disease states, and between individuals are largely defined by changes in the abundance of proteins, which are in turn determined by rates of transcription, messenger RNA (mRNA) degradation, translation, and protein degradation. If the rates for one of these steps differ much more than the rates of the other three, that step would be dominant in defining the variation in protein expression. Over the past decade, system-wide studies have claimed that in animals, differences in translation rates predominate (2–5). On page 1112 of this issue, Jovanovic et al. (6), as well as recent studies by Battle et al. (7) and Li et al. (1), challenge this conclusion, suggesting that transcriptional control makes the larger contribution.
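The argument that one step dominates when its rates vary most can be made concrete with a variance decomposition. The sketch below assumes a simple steady-state two-stage model, which the text above does not state explicitly; the symbols $k_{\mathrm{tx}}$, $d_m$, $k_{\mathrm{tl}}$ and $d_p$ (transcription, mRNA degradation, translation and protein degradation rates) are notation introduced here for illustration.

```latex
% Assumed steady state: mRNA level M and protein level P per gene
M = \frac{k_{\mathrm{tx}}}{d_m}, \qquad
P = \frac{k_{\mathrm{tl}}\, M}{d_p} = \frac{k_{\mathrm{tx}}\, k_{\mathrm{tl}}}{d_m\, d_p}

% Taking logarithms turns the product into a sum
\log P = \log k_{\mathrm{tx}} - \log d_m + \log k_{\mathrm{tl}} - \log d_p

% If the four log-rates were uncorrelated across genes, the variance would split as
\operatorname{Var}(\log P)
  = \operatorname{Var}(\log k_{\mathrm{tx}})
  + \operatorname{Var}(\log d_m)
  + \operatorname{Var}(\log k_{\mathrm{tl}})
  + \operatorname{Var}(\log d_p)
```

Under these assumptions, the share of each step is its variance term divided by $\operatorname{Var}(\log P)$. When the rates are correlated, covariance terms (with signs following the signs in the log equation) must be added, and the shares are no longer uniquely defined.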
Biotechnology Progress | 2015
Zhenke Liu; Shujia Dai; Jonathan Bones; Somak Ray; Sangwon Cha; Barry L. Karger; Jingyi Jessica Li; Lee Wilson; Greg Hinckle; Anthony Rossomando
A goal in recombinant protein production using Chinese hamster ovary (CHO) cells is to achieve both high specific productivity and high cell density. Addition of glucose to the culture media is necessary to maintain both cell growth and viability. We varied the glucose concentration in the media from 5 to 16 g/L and found that although specific productivity of CHO‐DG44 cells increased with the glucose level, the integrated viable cell density decreased. To examine the biological basis of these results, we conducted a discovery proteomic study of CHO‐DG44 cells grown under batch conditions in normal (5 g/L) or high (15 g/L) glucose over 3, 6, and 9 days. Approximately 5,000 proteins were confidently identified against an mRNA‐based CHO‐DG44 specific proteome database, with 2,800 proteins quantified with at least two peptides. A self‐organizing map algorithm was used to deconvolute temporal expression profiles of quantitated proteins. Functional analysis of altered proteins suggested that differences in growth between the two glucose levels resulted from changes in crosstalk between glucose metabolism, recombinant protein expression, and cell death, providing an overall picture of the responses to high glucose environment. The high glucose environment may enhance recombinant dihydrofolate reductase in CHO cells by up‐regulating NCK1 and down‐regulating PRKRA, and may lower integrated viable cell density by activating mitochondrial‐ and endoplasmic reticulum‐mediated cell death pathways by up‐regulating HtrA2 and calpains. These proteins are suggested as potential targets for bioengineering to enhance recombinant protein production.
bioRxiv | 2017
Wei Vivian Li; Jingyi Jessica Li
The analysis of single-cell RNA-seq (scRNA-seq) data is complicated and biased by excess zero or near zero counts, the so-called dropouts due to the low amounts of mRNA sequenced within individual cells. We introduce scImpute, a statistical method to accurately and robustly impute the dropouts in scRNA-seq data. scImpute is shown as an effective tool to enhance the clustering of cell populations, improve the accuracy of differential expression analysis, and aid the study of gene expression dynamics in time series scRNA-seq experiments.
BMC Genomics | 2016
Yuting Ye; Jingyi Jessica Li
Background: The advent of next-generation RNA sequencing (RNA-seq) has greatly advanced transcriptomic studies, including system-wide identification and quantification of mRNA isoforms under various biological conditions. A number of computational methods have been developed to systematically identify mRNA isoforms in a high-throughput manner from RNA-seq data. However, a common drawback of these methods is that their identified mRNA isoforms contain a high percentage of false positives, especially for genes with complex splicing structures, e.g., many exons and exon junctions.
Results: We have developed a preselection method called “Non-negative Matrix Factorization Preselection” (NMFP) which is designed to improve the accuracy of computational methods in identifying mRNA isoforms from RNA-seq data. We demonstrated through simulation and real data studies that NMFP can effectively shrink the search space of isoform candidates and increase the accuracy of two mainstream computational methods, Cufflinks and SLIDE, in their identification of mRNA isoforms.
Conclusion: NMFP is a useful tool to preselect mRNA isoform candidates for downstream isoform discovery methods. It can greatly reduce the number of isoform candidates while maintaining a good coverage of unknown true isoforms. Adding NMFP as an upstream step, computational methods are expected to achieve better accuracy in identifying mRNA isoforms from RNA-seq data.
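The abstract names non-negative matrix factorization as the basis of the preselection step without describing how it is applied. For readers unfamiliar with the technique, the sketch below shows a generic NMF using the standard multiplicative update rules for the Frobenius-norm objective. It is not the NMFP procedure; the function `nmf`, the rank, and the toy count matrix `V` are assumptions made for illustration.

```python
import numpy as np

def nmf(V, rank, n_iter=500, eps=1e-9, seed=0):
    """Factor a non-negative matrix V (n x m) as W @ H with W, H >= 0,
    using multiplicative updates that reduce ||V - W H||_F."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, rank)) + eps
    H = rng.random((rank, m)) + eps
    for _ in range(n_iter):
        # Each update rescales entries by a non-negative ratio,
        # so W and H stay non-negative throughout.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Toy matrix (assumed): 4 rows (e.g. exon bins) by 3 columns (e.g. samples),
# generated from two hidden non-negative patterns.
W0 = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 1.0], [1.0, 1.0]])
H0 = np.array([[5.0, 1.0, 3.0], [1.0, 4.0, 2.0]])
V = W0 @ H0

W, H = nmf(V, rank=2)
rel_err = np.linalg.norm(V - W @ H) / np.linalg.norm(V)
print(W.shape, H.shape, rel_err)
```

The columns of `W` can be read as recurring non-negative patterns and `H` as their weights per column of `V`; a preselection method could in principle use such patterns to narrow a candidate set, though how NMFP does so is not described here.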
Journal of Computational Biology | 2012
Qinghui Gao; Christine Ho; Yingmin Jia; Jingyi Jessica Li; Haiyan Huang
Identifying a bicluster, or submatrix of a gene expression dataset wherein the genes express similar behavior over the columns, is useful for discovering novel functional gene interactions. In this article, we introduce a new algorithm for finding biClusters with Linear Patterns (CLiP). Instead of solely maximizing Pearson correlation, we introduce a fitness function that also considers the correlation of complementary genes and conditions. This eliminates the need for a priori determination of the bicluster size. We employ both greedy search and the genetic algorithm in optimization, incorporating resampling for more robust discovery. When applied to both real and simulation datasets, our results show that CLiP is superior to existing methods. In analyzing RNA-seq fly and worm time-course data from modENCODE, we uncover a set of similarly expressed genes suggesting maternal dependence. Supplementary Material is available online (at www.liebertonline.com/cmb).