Yongjin Park
Massachusetts Institute of Technology
Publication
Featured research published by Yongjin Park.
Nature Biotechnology | 2014
Michael J. Smanski; Swapnil Bhatia; Dehua Zhao; Yongjin Park; Lauren B.A. Woodruff; Georgia Giannoukos; Dawn Ciulla; Michele Busby; Johnathan Calderon; Robert Nicol; D. Benjamin Gordon; Douglas Densmore; Christopher A. Voigt
Large microbial gene clusters encode useful functions, including energy utilization and natural product biosynthesis, but genetic manipulation of such systems is slow, difficult and complicated by complex regulation. We exploit the modularity of a refactored Klebsiella oxytoca nitrogen fixation (nif) gene cluster (16 genes, 103 parts) to build genetic permutations that could not be achieved by starting from the wild-type cluster. Constraint-based combinatorial design and DNA assembly are used to build libraries of radically different cluster architectures by varying part choice, gene order, gene orientation and operon occupancy. We construct 84 variants of the nifUSVWZM operon, 145 variants of the nifHDKY operon, 155 variants of the nifHDKYENJ operon and 122 variants of the complete 16-gene pathway. The performance and behavior of these variants are characterized by nitrogenase assay and strand-specific RNA sequencing (RNA-seq), and the results are incorporated into subsequent design cycles. We have produced a fully synthetic cluster that recovers 57% of wild-type activity. Our approach allows the performance of genetic parts to be quantified simultaneously in hundreds of genetic contexts. This parallelized design-build-test-learn cycle, which can access previously unattainable regions of genetic space, should provide a useful, fast tool for genetic optimization and hypothesis testing.
Nature Biotechnology | 2015
Yongjin Park; Manolis Kellis
Computational modeling of DNA and RNA targets of regulatory proteins is improved by a deep-learning approach.
Nature Genetics | 2017
Barbara E. Stranger; Lori E. Brigham; Richard Hasz; Marcus Hunter; Christopher Johns; Mark C. Johnson; Gene Kopen; William F. Leinweber; John T. Lonsdale; Alisa McDonald; Bernadette Mestichelli; Kevin Myer; Brian Roe; Michael Salvatore; Saboor Shad; Jeffrey A. Thomas; Gary Walters; Michael Washington; Joseph Wheeler; Jason Bridge; Barbara A. Foster; Bryan M. Gillard; Ellen Karasik; Rachna Kumar; Mark Miklos; Michael T. Moser; Scott Jewell; Robert G. Montroy; Daniel C. Rohrer; Dana R. Valley
Genetic variants have been associated with myriad molecular phenotypes that provide new insight into the range of mechanisms underlying genetic traits and diseases. Identifying any particular genetic variant's cascade of effects, from molecule to individual, requires assaying multiple layers of molecular complexity. We introduce the Enhancing GTEx (eGTEx) project that extends the GTEx project to combine gene expression with additional intermediate molecular measurements on the same tissues to provide a resource for studying how genetic differences cascade through molecular phenotypes to impact human health.
Molecular Systems Biology | 2017
Thomas E. Gorochowski; Amin Espah Borujeni; Yongjin Park; Alec A. K. Nielsen; Jing Zhang; Bryan S. Der; D. Benjamin Gordon; Christopher A. Voigt
Genetic circuits implement computational operations within a cell. Debugging them is difficult because their function is defined by multiple states (e.g., combinations of inputs) that vary in time. Here, we develop RNA‐seq methods that enable the simultaneous measurement of: (i) the states of internal gates, (ii) part performance (promoters, insulators, terminators), and (iii) impact on host gene expression. This is applied to a three‐input one‐output circuit consisting of three sensors, five NOR/NOT gates, and 46 genetic parts. Transcription profiles are obtained for all eight combinations of inputs, from which biophysical models can extract part activities and the response functions of sensors and gates. Various unexpected failure modes are identified, including cryptic antisense promoters, terminator failure, and a sensor malfunction due to media‐induced changes in host gene expression. This can guide the selection of new parts to fix these problems, which we demonstrate by using a bidirectional terminator to disrupt observed antisense transcription. This work introduces RNA‐seq as a powerful method for circuit characterization and debugging that overcomes the limitations of fluorescent reporters and scales to large systems composed of many parts.
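As a rough illustration of what "all eight combinations of inputs" means for a three-input, one-output circuit built from NOR/NOT gates, the sketch below enumerates every input state of a small toy circuit. The wiring in `toy_circuit` is hypothetical and chosen only for illustration; it is not the circuit characterized in the paper, and the function names are assumptions.

```python
from itertools import product


def nor(*inputs: bool) -> bool:
    """NOR gate: true only when every input is false."""
    return not any(inputs)


def toy_circuit(a: bool, b: bool, c: bool) -> bool:
    """Hypothetical three-input, one-output circuit (not the paper's design)."""
    g1 = nor(a, b)      # NOR gate on inputs a and b
    g2 = nor(c)         # a single-input NOR behaves as a NOT gate
    return nor(g1, g2)  # output gate; equivalent to (a or b) and c


# One entry per input state: 2**3 = 8 combinations in total.
truth_table = {
    state: toy_circuit(*state) for state in product([False, True], repeat=3)
}

for state, output in truth_table.items():
    print(state, "->", output)
```

In an experiment of the kind described, each of these eight states would correspond to one transcription profile; here each state simply maps to a Boolean output.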
bioRxiv | 2017
Yongjin Park; Abhishek Sarkar; Kunal Bhutani; Manolis Kellis
Transcriptome-wide association studies (TWAS) have proven to be a powerful tool to identify genes associated with human diseases by aggregating cis-regulatory effects on gene expression. However, TWAS relies on building predictive models of gene expression, which are sensitive to the sample size and tissue on which they are trained. The Genotype-Tissue Expression (GTEx) Project has produced reference transcriptomes across 53 human tissues and cell types; however, the data is highly sparse, making it difficult to build polygenic models in relevant tissues for TWAS. Here, we propose fQTL, a multi-tissue, multivariate model for mapping expression quantitative trait loci and predicting gene expression. Our model decomposes eQTL effects into SNP-specific and tissue-specific components, pooling information across relevant tissues to effectively boost sample sizes. In simulation, we demonstrate that our multi-tissue approach outperforms single-tissue approaches in identifying causal eQTLs and tissues of action. Using our method, we fit polygenic models for 13,461 genes, characterized the tissue-specificity of the learned cis-eQTLs, and performed TWAS for Alzheimer’s disease and schizophrenia, identifying 107 and 382 associated genes, respectively.
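One way to picture a decomposition of eQTL effects into SNP-specific and tissue-specific components is as a sum of outer products, where each factor pairs a vector over SNPs with a vector over tissues. The sketch below assumes that form; the abstract does not spell out the exact parameterization, so `factored_effects` and the toy numbers are illustrative assumptions rather than the fQTL model itself.

```python
def factored_effects(snp_factors, tissue_factors):
    """Build a SNP-by-tissue effect matrix as a sum of rank-one outer products.

    snp_factors[k] holds per-SNP weights for factor k and tissue_factors[k]
    holds per-tissue weights for the same factor (assumed form, see lead-in).
    """
    n_snps = len(snp_factors[0])
    n_tissues = len(tissue_factors[0])
    effects = [[0.0] * n_tissues for _ in range(n_snps)]
    for snp_vec, tissue_vec in zip(snp_factors, tissue_factors):
        for i, snp_weight in enumerate(snp_vec):
            for j, tissue_weight in enumerate(tissue_vec):
                effects[i][j] += snp_weight * tissue_weight
    return effects


# Toy example: three SNPs, four tissues, one factor active in the first two tissues.
snp_factors = [[0.5, 0.0, -0.25]]
tissue_factors = [[1.0, 1.0, 0.0, 0.0]]
effects = factored_effects(snp_factors, tissue_factors)
```

Under this assumed form, tissues that share a factor share the same SNP weights, which is one way information could be pooled across related tissues.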
bioRxiv | 2018
Samuel S Kim; Chengzhen Dai; Farhad Hormozdiari; Bryce van de Geijn; Steven Gazal; Yongjin Park; Luke O'Connor; Tiffany Amariuta; Po-Ru Loh; Hilary Finucane; Soumya Raychaudhuri; Alkes L. Price
Recent studies have highlighted the role of gene networks in disease biology. To formally assess this, we constructed a broad set of pathway, network, and pathway+network annotations and applied stratified LD score regression to 42 independent diseases and complex traits (average N=323K) to identify enriched annotations. First, we constructed annotations from 18,119 biological pathways, including 100kb windows around each gene. We identified 156 pathway-trait pairs whose disease enrichment was statistically significant (FDR < 5%) after conditioning on all genes and on annotations from the baseline-LD model, a stringent step that greatly reduced the number of pathways detected; most of the significant pathway-trait pairs were previously unreported. Next, for each of four published gene networks, we constructed probabilistic annotations based on network connectivity using closeness centrality, a measure of how close a gene is to other genes in the network. For each gene network, the network connectivity annotation was strongly significantly enriched. Surprisingly, the enrichments were fully explained by excess overlap between network annotations and regulatory annotations from the baseline-LD model, validating the informativeness of the baseline-LD model and emphasizing the importance of accounting for regulatory annotations in gene network analyses. Finally, for each of the 156 enriched pathway-trait pairs, for each of the four gene networks, we constructed pathway+network annotations by annotating genes with high network connectivity to the input pathway. For each gene network, these pathway+network annotations were strongly significantly enriched for the corresponding traits. Once again, the enrichments were largely explained by the baseline-LD model. 
In conclusion, gene network connectivity is highly informative for disease architectures, but the information in gene networks may be subsumed by regulatory annotations, such that accounting for known annotations is critical to robust inference of biological mechanisms.
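For readers unfamiliar with closeness centrality, the sketch below computes one common definition on an unweighted, undirected toy network: the number of other reachable nodes divided by the sum of shortest-path distances to them. The gene names and edges are made up, and the paper's probabilistic annotations may use a different normalization or weighting.

```python
from collections import deque


def closeness_centrality(graph, source):
    """Closeness of `source`: reachable nodes / sum of shortest-path distances.

    `graph` maps each node to a list of neighbours (unweighted, undirected).
    """
    distance = {source: 0}
    queue = deque([source])
    while queue:  # breadth-first search gives shortest paths in hops
        node = queue.popleft()
        for neighbour in graph[node]:
            if neighbour not in distance:
                distance[neighbour] = distance[node] + 1
                queue.append(neighbour)
    total = sum(distance.values())
    reachable = len(distance) - 1
    return reachable / total if total > 0 else 0.0


# Hypothetical four-gene chain: geneA - geneB - geneC - geneD
toy_network = {
    "geneA": ["geneB"],
    "geneB": ["geneA", "geneC"],
    "geneC": ["geneB", "geneD"],
    "geneD": ["geneC"],
}
scores = {gene: closeness_centrality(toy_network, gene) for gene in toy_network}
```

In this toy chain the two interior genes score higher than the two end genes, matching the intuition that central genes are closer to the rest of the network.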
bioRxiv | 2017
Kunal Bhutani; Abhishek Sarkar; Yongjin Park; Manolis Kellis; Nicholas J. Schork
Transcriptome-wide association studies (TWAS) test for associations between imputed gene expression levels and phenotypes in GWAS cohorts using models of transcriptional regulation learned from reference transcriptomes. However, current methods for TWAS only use point estimates of imputed expression and ignore uncertainty in the prediction. We develop a novel two-stage Bayesian regression method which incorporates uncertainty in imputed gene expression and achieves higher power to detect TWAS genes than existing TWAS methods as well as standard methods based on missing value and measurement error theory. We apply our method to GTEx whole blood transcriptomes and GWAS cohorts for seven diseases from the Wellcome Trust Case Control Consortium and find 45 TWAS genes, of which 17 do not overlap previously reported case-control GWAS or differential expression associations. Surprisingly, we replicate only 2 of 40 previously reported TWAS genes after accounting for uncertainty in the prediction.
bioRxiv | 2017
Yongjin Park; Abhishek Sarkar; Liang He; Jose Davila-Velderrain; Philip L. De Jager; Manolis Kellis
Characterizing the intermediate phenotypes, such as gene expression, that mediate genetic effects on complex diseases is a fundamental problem in human genetics. Existing methods utilize genotypic data and summary statistics to identify putative disease genes, but cannot distinguish pleiotropy from causal mediation and are limited by overly strong assumptions about the data. To overcome these limitations, we develop Causal Multivariate Mediation within Extended Linkage disequilibrium (CaMMEL), a novel Bayesian inference framework to jointly model multiple mediated and unmediated effects relying only on summary statistics. We show in simulation that CaMMEL accurately distinguishes between mediating and pleiotropic genes, unlike existing methods. We applied CaMMEL to Alzheimer’s disease (AD) and found 206 causal genes in sub-threshold loci (p < 10⁻⁴). We prioritized 21 genes which mediate at least 5% of local genetic variance, disrupting innate immune pathways in AD.