Yongwook Choi | Researchain

Publication

Featured researches published by Yongwook Choi.

Bioinformatics | 2015

PROVEAN web server: a tool to predict the functional effect of amino acid substitutions and indels

Yongwook Choi; Agnes P. Chan

We present a web server that predicts the functional effect of single or multiple amino acid substitutions, insertions, and deletions using the prediction tool PROVEAN. The server provides rapid analysis of protein variants from any organism and also supports high-throughput analysis of human and mouse variants at both the genomic and protein levels. Availability and implementation: The web server is freely available and open to all users with no login requirements at http://provean.jcvi.org. Contact: [email protected]. Supplementary information: Supplementary data are available at Bioinformatics online.

Genome Biology | 2015

A novel method of consensus pan-chromosome assembly and large-scale comparative analysis reveal the highly flexible pan-genome of Acinetobacter baumannii

Agnes P. Chan; Granger Sutton; Jessica DePew; Radha Krishnakumar; Yongwook Choi; Xiao-Zhe Huang; Erin Beck; Derek M. Harkins; Maria Kim; Emil Lesho; Mikeljon P. Nikolich; Derrick E. Fouts

Background: Infections by pan-drug resistant Acinetobacter baumannii plague military and civilian healthcare systems. Previous A. baumannii pan-genomic studies used modest sample sizes of low diversity and comparisons to a single reference genome, limiting our understanding of gene order and content. A consensus representation of multiple genomes will provide a better framework for comparison. A large-scale comparative study will identify genomic determinants associated with their diversity and adaptation as a successful pathogen. Results: We determine draft-level genomic sequences of 50 diverse military isolates and conduct the largest bacterial pan-genome analysis of 249 genomes. The pan-genome of A. baumannii is open when the input genomes are normalized for diversity, with 1867 core proteins and a paralog-collapsed pan-genome size of 11,694 proteins. We develop a novel graph-based algorithm and use it to assemble the first consensus pan-chromosome, identifying both the order and orientation of core genes and flexible genomic regions. Comparative genome analyses demonstrate the existence of novel resistance islands and isolates with increased numbers of resistance island insertions over time, from single insertions in the 1950s to triple insertions in 2011. Gene clusters responsible for carbon utilization, siderophore production, and pilus assembly demonstrate frequent gain or loss among isolates. Conclusions: The highly variable and dynamic nature of the A. baumannii genome may be the result of its success in rapidly adapting to both abiotic and biotic environments through the gain and loss of gene clusters controlling fitness. Importantly, some archaic adaptation mechanisms appear to have reemerged among recent isolates.

IEEE Transactions on Information Theory | 2012

Compression of Graphical Structures: Fundamental Limits, Algorithms, and Experiments

Yongwook Choi; Wojciech Szpankowski

Information theory traditionally deals with “conventional data,” be it textual data, image, or video data. However, databases of various sorts have come into existence in recent years for storing “unconventional data” including biological data, social data, web data, topographical maps, and medical data. In compressing such data, one must consider two types of information: the information conveyed by the structure itself, and the information conveyed by the data labels implanted in the structure. In this paper, we attempt to address the former problem by studying information of graphical structures (i.e., unlabeled graphs). As the first step, we consider the Erdős-Rényi graphs $G(n,p)$ over $n$ vertices in which edges are added independently and randomly with probability $p$. We prove that the structural entropy of $G(n,p)$ is $\binom{n}{2}h(p) - \log n! + o(1) = \binom{n}{2}h(p) - n\log n + O(n)$, where $h(p) = -p\log p - (1-p)\log(1-p)$ is the entropy rate of a conventional memoryless binary source. Then, we propose a two-stage compression algorithm that asymptotically achieves the structural entropy up to the $n\log n$ term (i.e., the first two leading terms) of the structural entropy. Our algorithm runs either in time $O(n^2)$ in the worst case for any graph or in time $O(n+e)$ on average for graphs generated by $G(n,p)$, where $e$ is the average number of edges. To the best of our knowledge, this is the first provable (asymptotically) optimal graph compressor for Erdős-Rényi graph models. We use combinatorial and analytic techniques such as generating functions, Mellin transform, and poissonization to establish these findings. Our experiments confirm the theoretical results and also show the usefulness of our algorithm for some real-world graphs such as the Internet, biological networks, and social networks.
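To give a feel for the size of the saving, the two leading terms of the structural entropy can be evaluated numerically. The following is a minimal sketch, not the paper's compressor: the function names are hypothetical, logarithms are taken base 2 (bits), and the result is only the asymptotic estimate stated in the abstract, so it should not be read as exact for small n.

```python
import math


def binary_entropy(p: float) -> float:
    """h(p) = -p log2 p - (1 - p) log2 (1 - p), in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


def structural_entropy_estimate(n: int, p: float) -> float:
    """Leading terms C(n, 2) * h(p) - log2(n!) from the abstract, in bits."""
    # lgamma(n + 1) = ln(n!), converted here from nats to bits.
    log2_factorial = math.lgamma(n + 1) / math.log(2)
    return math.comb(n, 2) * binary_entropy(p) - log2_factorial


# Illustrative parameters (assumed, not taken from the paper).
n, p = 1000, 0.1
labeled_bits = math.comb(n, 2) * binary_entropy(p)
unlabeled_bits = structural_entropy_estimate(n, p)
print(f"labeled graph:   {labeled_bits:.0f} bits")
print(f"unlabeled graph: {unlabeled_bits:.0f} bits (estimate)")
```

The difference between the two printed values is log2(n!), which is the cost of the vertex labels that an unlabeled representation does not need to store.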

Proceedings of the ACM Conference on Bioinformatics, Computational Biology and Biomedicine | 2012

A fast computation of pairwise sequence alignment scores between a protein and a set of single-locus variants of another protein

Yongwook Choi

Recently we have developed a new algorithm, PROVEAN (Protein Variation Effect Analyzer), for predicting the functional effect of protein sequence variations, including single amino acid substitutions and small insertions and deletions [2]. The prediction is based on the change, caused by a given variation, in the similarity of the query sequence to a set of its related protein sequences. For this prediction, the algorithm is required to compute a semi-global pairwise sequence alignment score between the query sequence and each of the related sequences. Using dynamic programming, it takes O(n · m) time to compute the alignment score between the query sequence Q of length n and a related sequence S of length m. Thus, given l different variations in Q, a naive approach would take O(l · n · m) time to compute the alignment scores between each of the variant query sequences and S. In this paper, we present a new approach to efficiently compute the pairwise alignment scores for l variations, which takes O((n + l) · m) time when the length of variations is bounded by a constant. In this approach, we further utilize the solutions of overlapping subproblems, which are already used by the dynamic programming approach. Our algorithm has been used to build a new database of precomputed prediction scores for all possible single amino acid substitutions, single amino acid insertions, and deletions of up to 10 amino acids in about 91K human proteins (including isoforms), where l becomes very large, that is, l = O(n). The PROVEAN source code and web server are available at http://provean.jcvi.org.
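The O(l · n · m) baseline that the abstract improves on can be sketched as follows. This is an illustration of the naive approach only, not the paper's O((n + l) · m) algorithm and not PROVEAN's code. The function names, the toy scoring scheme (match +1, mismatch -1, linear gap -2), and the choice to leave end gaps unpenalized in both sequences are all assumptions made for the example.

```python
def semiglobal_score(q: str, s: str, match: int = 1,
                     mismatch: int = -1, gap: int = -2) -> int:
    """Semi-global alignment score in O(n * m) time.

    Assumption: gaps at either end of either sequence are free,
    so row 0 and column 0 are zero and the result is the best
    value found in the last row or the last column.
    """
    n, m = len(q), len(s)
    prev = [0] * (m + 1)          # row 0: leading gaps cost nothing
    best = prev[m]
    for i in range(1, n + 1):
        cur = [0] * (m + 1)       # column 0: leading gaps cost nothing
        for j in range(1, m + 1):
            sub = match if q[i - 1] == s[j - 1] else mismatch
            cur[j] = max(prev[j - 1] + sub,   # align q[i-1] with s[j-1]
                         prev[j] + gap,       # gap in s
                         cur[j - 1] + gap)    # gap in q
        best = max(best, cur[m])  # alignment may end in the last column
        prev = cur
    return max(best, max(prev))   # ... or anywhere in the last row


def naive_variant_scores(q: str, s: str, variants):
    """Re-run the full DP for each (position, residue) substitution."""
    scores = []
    for pos, residue in variants:
        variant = q[:pos] + residue + q[pos + 1:]
        scores.append(semiglobal_score(variant, s))
    return scores


print(semiglobal_score("CG", "ACGT"))
print(naive_variant_scores("ACGT", "ACGT", [(1, "T")]))
```

Each variant triggers a complete recomputation of the table, which is where the factor l comes from. The abstract's approach avoids this by reusing subproblem solutions across variants.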

International Symposium on Information Theory | 2009

Compression of graphical structures

Yongwook Choi; Wojciech Szpankowski

F. Brooks argues in [3] that there is “no theory that gives us a metric for information embodied in structure.” Shannon himself alluded to it fifty years earlier in his little-known 1953 paper [14]. Indeed, in the past information theory dealt mostly with “conventional data,” be it textual data, image, or video data. However, databases of various sorts have come into existence in recent years for storing “unconventional data” including biological data, web data, topographical maps, and medical data. In compressing such data structures, one must consider two types of information: the information conveyed by the structure itself, and the information conveyed by the data labels implanted in the structure. In this paper, we attempt to address the former problem by studying information of graphical structures (i.e., unlabeled graphs). In particular, we consider the Erdős-Rényi graphs $G(n,p)$ over $n$ vertices in which edges are added randomly with probability $p$. We prove that the structural entropy of $G(n,p)$ is $\binom{n}{2}h(p) - \log n! + o(1) = \binom{n}{2}h(p) - n\log n + O(n)$, where $h(p) = -p\log p - (1-p)\log(1-p)$ is the entropy rate of a conventional memoryless binary source. Then, we design a two-stage encoding that optimally compresses unlabeled graphs up to the first two leading terms of the structural entropy.
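The second equality in the entropy formula is not spelled out in the abstract. It appears to follow from Stirling's approximation, which the short derivation below fills in as a standard fact rather than as a step quoted from the paper.

```latex
\begin{align*}
\log n! &= n\log n - n\log e + O(\log n)
  && \text{(Stirling's approximation)} \\
\binom{n}{2}h(p) - \log n! + o(1)
  &= \binom{n}{2}h(p) - n\log n + n\log e - O(\log n) + o(1) \\
  &= \binom{n}{2}h(p) - n\log n + O(n).
\end{align*}
```

Read this way, the $n\log n$ term is the leading-order saving from discarding vertex labels, since $\log n!$ is the number of bits needed to specify one of the $n!$ labelings.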

IEEE Journal on Selected Areas in Communications | 2009

Energy-optimal distributed algorithms for minimum spanning trees

Yongwook Choi; Gopal Pandurangan; Maleq Khan; V.S.A. Kumar

Traditionally, the performance of distributed algorithms has been measured in terms of time and message complexity. Message complexity concerns the number of messages transmitted over all the edges during the course of the algorithm. However, in energy-constrained ad hoc wireless networks (e.g., sensor networks), energy is a critical factor in measuring the efficiency of a distributed algorithm. Transmitting a message between two nodes has an associated cost (energy), and moreover this cost can depend on the two nodes (e.g., the distance between them, among other things). Thus, in addition to the time and message complexity, it is important to consider energy complexity, which accounts for the total energy associated with the messages exchanged among the nodes in a distributed algorithm. This paper addresses the minimum spanning tree (MST) problem, a fundamental problem in distributed computing and communication networks. We study energy-efficient distributed algorithms for the Euclidean MST problem assuming random distribution of nodes. We show a non-trivial lower bound of Ω(log n) on the energy complexity of any distributed MST algorithm. We then give an energy-optimal distributed algorithm that constructs an optimal MST with energy complexity O(log n) on average and O(log n log log n) with high probability. This is an improvement over the previous best known bound on the average energy complexity of Ω(log² n). Our energy-optimal algorithm exploits a novel property of the giant component of sparse random geometric graphs. All of the above results assume that nodes do not know their geometric coordinates. If the nodes know their own coordinates, then we give an algorithm with O(1) energy complexity (which is the best possible) that gives an O(1) approximation to the MST.
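For readers unfamiliar with the underlying problem, the Euclidean MST on randomly placed nodes can be computed centrally as in the sketch below. This is only a reference construction using Prim's algorithm, not the distributed, energy-optimal algorithm of the paper. The function name is hypothetical, and the squared-distance cost printed at the end is an assumed stand-in for an energy model, which the abstract does not specify.

```python
import math
import random


def euclidean_mst(points):
    """Prim's algorithm in O(n^2); returns (parent, child, length) edges."""
    n = len(points)
    if n == 0:
        return []
    in_tree = [False] * n
    best = [math.inf] * n     # cheapest known connection to the tree
    parent = [-1] * n
    best[0] = 0.0
    edges = []
    for _ in range(n):
        # Pick the closest node that is not yet in the tree.
        u = min((i for i in range(n) if not in_tree[i]),
                key=best.__getitem__)
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((parent[u], u, best[u]))
        # Relax distances from the newly added node.
        for v in range(n):
            if not in_tree[v]:
                dist = math.dist(points[u], points[v])
                if dist < best[v]:
                    best[v] = dist
                    parent[v] = u
    return edges


random.seed(0)
nodes = [(random.random(), random.random()) for _ in range(50)]
tree = euclidean_mst(nodes)
# Assumed cost model for illustration only: energy ~ distance squared.
print(len(tree), sum(length ** 2 for _, _, length in tree))
```

A centralized computation like this needs every node's coordinates in one place. The paper's setting is harder because nodes must build the tree through message exchanges whose energy cost is what is being minimized.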

Plant and Cell Physiology | 2015

A Maize Database Resource that Captures Tissue-Specific and Subcellular-Localized Gene Expression, via Fluorescent Tags and Confocal Imaging (Maize Cell Genomics Database)

Vivek Krishnakumar; Yongwook Choi; Erin Beck; Qingyu Wu; Anding Luo; Anne W. Sylvester; David Jackson; Agnes P. Chan

Maize is a global crop and a powerful system among grain crops for genetic and genomic studies. However, the development of novel biological tools and resources to aid in the functional identification of gene sequences is greatly needed. Towards this goal, we have developed a collection of maize marker lines for studying native gene expression in specific cell types and subcellular compartments using fluorescent proteins (FPs). To catalog FP expression, we have developed a public repository, the Maize Cell Genomics (MCG) Database (http://maize.jcvi.org/cellgenomics), to organize a large data set of confocal images generated from the maize marker lines. To date, the collection represents major subcellular structures and also developmentally important progenitor cell populations. The resource is available to the research community, for example to study protein localization or interactions under various experimental conditions or mutant backgrounds. A subset of the marker lines can also be used to induce misexpression of target genes through a transactivation system. In the future, the image repository can be expanded to accept new image submissions from the research community and to perform customized large-scale computational image analysis. This community resource will provide a suite of new tools for gaining biological insights by following the dynamics of protein expression at the subcellular, cellular, and tissue levels.

Wound Repair and Regeneration | 2016

Electrochemical detection of Pseudomonas in wound exudate samples from patients with chronic wounds

Hunter J. Sismaet; Anirban Banerjee; Sean McNish; Yongwook Choi; Manolito Torralba; Sarah Lucas; Agnes P. Chan; Victoria K. Shanmugam; Edgar D. Goluch

In clinical practice, point‐of‐care diagnostic testing has progressed rapidly in the last decade. For the field of wound care, there is a compelling need to develop rapid alternatives for bacterial identification in the clinical setting, where it generally takes over 24 hours to receive a positive identification. Even new molecular and biochemical identification methods require an initial incubation period of several hours to obtain a sufficient number of cells prior to performing the analysis. Here we report the use of an inexpensive, disposable electrochemical sensor to detect pyocyanin, a unique, redox‐active quorum sensing molecule released by Pseudomonas aeruginosa, in wound fluid from patients with chronic wounds enrolled in the WE‐HEAL Study. By measuring the metabolite excreted by the cells, this electrochemical detection strategy eliminates sample preparation, takes less than a minute to complete, and requires only 7.5 μL of sample to complete the analysis. The electrochemical results were compared against 16S rRNA profiling using 454 pyrosequencing. Blind identification yielded 9 correct matches, 2 false negatives, and 3 false positives giving a sensitivity of 71% and specificity of 57% for detection of Pseudomonas. Ongoing enhancement and development of this approach with a view to develop a rapid point‐of‐care diagnostic tool is planned.
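The reported sensitivity and specificity can be checked against the counts in the abstract. The abstract gives 9 correct matches without splitting them into true positives and true negatives, so the split of 5 and 4 used below is an inference: it is the integer split that reproduces the reported 71% and 57%. The helper names are hypothetical.

```python
def sensitivity(tp: int, fn: int) -> float:
    """True positive rate: TP / (TP + FN)."""
    return tp / (tp + fn)


def specificity(tn: int, fp: int) -> float:
    """True negative rate: TN / (TN + FP)."""
    return tn / (tn + fp)


# From the abstract: 9 correct matches, 2 false negatives, 3 false positives.
fn, fp = 2, 3
# Inferred (not stated in the abstract): 9 correct = 5 TP + 4 TN.
tp, tn = 5, 4

print(f"sensitivity = {sensitivity(tp, fn):.0%}")
print(f"specificity = {specificity(tn, fp):.0%}")
```

Under this split the study would cover 14 samples in total, 7 positive and 7 negative for Pseudomonas by the 16S rRNA reference.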

Genome Biology and Evolution | 2017

Dominance and Sexual Dimorphism Pervade the Salix purpurea L. Transcriptome

Craig H. Carlson; Yongwook Choi; Agnes P. Chan; Michelle J. Serapiglia; Christopher D. Town; Lawrence B. Smart

The heritability of gene expression is critical in understanding heterosis and is dependent on allele-specific regulation by local and remote factors in the genome. We used RNA-Seq to test whether variation in gene expression among F1 and F2 intraspecific Salix purpurea progeny is attributable to cis- and trans-regulatory divergence. We assessed the mode of inheritance based on gene expression levels and allele-specific expression for F1 and F2 intraspecific progeny in two distinct tissue types: shoot tip and stem internode. In addition, we explored sexually dimorphic patterns of inheritance and regulatory divergence among F1 progeny individuals. We show that in S. purpurea intraspecific crosses, gene expression inheritance largely exhibits a maternal dominant pattern, regardless of tissue type or pedigree. A significantly greater number of cis- and trans-regulated genes coincided with upregulation of the maternal parent allele in the progeny, irrespective of the magnitude, whereas the paternal allele was more highly expressed for genes showing cis × trans or compensatory regulation. Importantly, consistent with previous genetic mapping results for sex in shrub willow, we have delimited sex-biased gene expression to a 2 Mb pericentromeric region on S. purpurea chr15 and further refined the sex determination region. Altogether, our results offer insight into the inheritance of gene expression in S. purpurea as well as evidence of sexually dimorphic expression, which may have contributed to the evolution of dioecy in Salix.

International Symposium on Information Theory | 2007

Pattern Matching in Constrained Sequences

Yongwook Choi; Wojciech Szpankowski

Constrained sequences find applications in communication, magnetic recording, and biology. In this paper, we restrict our attention to the so-called (d, k) constrained binary sequences in which any run of zeros must be of length at least d and at most k, where 0 ≤ d < k. In some applications one needs to know the number of occurrences of a given pattern w in such sequences, for which we coin the term constrained pattern matching. For a given word w or a set of words W, we estimate the (conditional) probability of the number of occurrences of w in a (d, k) sequence generated by a memoryless source. As a by-product, we enumerate asymptotically the number of (d, k) sequences with exactly r occurrences of a given word w, and compute the Shannon entropy of (d, k) sequences with a given number of occurrences of w. Throughout this paper we use techniques of analytic information theory such as combinatorial calculus, generating functions, and complex asymptotics.
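For small lengths, the quantity being studied can be explored by brute force. The sketch below enumerates (d, k) sequences of length n and tallies how many contain exactly r occurrences of a word w. It is exponential in n and is not the analytic method of the paper. The function names are hypothetical, overlapping occurrences are counted, sequences are counted uniformly rather than weighted by a memoryless source, and the treatment of boundary runs (only the upper bound k applies to leading and trailing zeros) is an assumption, since the abstract does not settle it.

```python
from itertools import product


def is_dk_sequence(bits: str, d: int, k: int) -> bool:
    """Check the (d, k) run-length constraint on a binary string."""
    ones = [i for i, b in enumerate(bits) if b == "1"]
    if not ones:
        return len(bits) <= k
    # Zeros between consecutive ones: at least d and at most k.
    for left, right in zip(ones, ones[1:]):
        run = right - left - 1
        if run < d or run > k:
            return False
    # Assumption: leading and trailing runs only need to respect k.
    return ones[0] <= k and len(bits) - 1 - ones[-1] <= k


def count_occurrences(bits: str, w: str) -> int:
    """Number of (possibly overlapping) occurrences of w in bits."""
    return sum(bits.startswith(w, i)
               for i in range(len(bits) - len(w) + 1))


def occurrence_histogram(n: int, d: int, k: int, w: str) -> dict:
    """Map r -> number of length-n (d, k) sequences with r copies of w."""
    hist = {}
    for symbols in product("01", repeat=n):
        seq = "".join(symbols)
        if is_dk_sequence(seq, d, k):
            r = count_occurrences(seq, w)
            hist[r] = hist.get(r, 0) + 1
    return hist


print(occurrence_histogram(3, 1, 2, "01"))
```

For n = 3, d = 1, k = 2 the valid sequences under these assumptions are 001, 010, 100, and 101, of which three contain the word 01 once and one does not contain it.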
