Publication


Featured research published by Yungki Park.


Nature Methods | 2012

Flaws in evaluation schemes for pair-input computational predictions

Yungki Park; Edward M. Marcotte

To the Editor: Computational prediction methods that operate on pairs of objects by considering features of each (hereafter referred to as “pair-input methods”) have been crucial in many areas of biology and chemistry over the past decade. Among the most prominent examples are protein-protein interaction (PPI) [1,2], protein-drug interaction [3,4], protein-RNA interaction [5] and drug indication [6] prediction methods. A sampling of more than fifty published studies involving pair-input methods is provided in Supplementary Table 1. In this study we demonstrate that the paired nature of inputs has significant, though not yet widely perceived, implications for the validation of pair-input methods. Given the paired nature of inputs for pair-input methods, one can envision evaluating their predictive performance on different classes of test pairs. As an example, proteochemometrics modeling [3], a well-known computational methodology for predicting protein-drug interactions, takes a feature vector for a chemical and a feature vector for a protein receptor in order to predict the binding between the chemical and protein receptor [3]. In this case, a test pair may share either the chemical or protein component with some pairs in a training set; it may also share neither. We found that pair-input methods tend to perform much better for test pairs that share components with a training set than for those that do not. As a result, it is necessary to distinguish test pairs based on their component-level overlap when evaluating performance. A test set that is used to estimate predictive performance may be dominated by pairs that share components with a training set, yet such pairs may form only a minority of cases on the population level. In this case, a predictive performance estimated on the test set may be impressive, yet it is likely to fail to generalize to the population level.
Indeed, this component-level overlap issue for the validation of pair-input methods was recognized early by some researchers (e.g., by Vert, Yamanishi and others; see Supplementary Table 1). However, it has been overlooked by most researchers across biology and chemistry, and as a result cross-validations for pair-input methods usually did not distinguish test pairs based on the component-level overlap criterion. To illustrate the component-level overlap issue, we consider PPI prediction methods with the toy example of Fig. 1, in which the protein space is composed of 9 proteins and a training set consists of 4 positive and 4 negative protein pairs. This training set is used to train a PPI prediction method, which is in turn applied to the full set of 28 test pairs (Fig. 1). How well would the trained method perform on the 28 test pairs? To answer this, one usually performs a cross-validation on the training set. For example, a temporary training set is prepared by randomly picking some pairs (Fig. 1) while the rest serve as a temporary test set from which predictive accuracy can be measured. This cross-validated predictive performance is then implicitly assumed to hold for the full space of 28 test pairs.
Figure 1. Illustrating shortcomings of a typical cross-validation with a toy example of predicting protein-protein interactions. Here, the protein space contains 9 proteins and a training set consists of 4 interacting and 4 non-interacting protein pairs. The training ...
The paired nature of inputs leads to a natural partitioning of the 28 test pairs into 3 distinct classes (C1 – C3), as shown in Fig. 1: C1, test pairs sharing both proteins with the training set; C2, test pairs sharing only one protein with the training set; and C3, test pairs sharing neither protein with the training set.
To demonstrate that the predictive performance of pair-input methods differs significantly for distinct test classes, we performed computational experiments using large-scale yeast and human PPI data that mirror the toy example of Fig. 1 (Supplementary Methods). Supplementary Table 2 shows that, for seven PPI prediction methods (M1 – M7, chosen to be a representative set of algorithms, Supplementary Methods), the predictive performances for the three test classes differ significantly. The differences are not only statistically significant (Supplementary Table 3) but also numerically large in many cases. M1 – M4 are support vector machine (SVM)-based methods, M5 is based on the random forest algorithm, and M6 and M7 are heuristic methods. Thus, regardless of core predictive algorithms, significant differences for the three distinct test classes are consistently observed. These differences arise partly from the learning of differential representation of components among positive and negative training examples (Supplementary Discussion). In a typical cross-validation for pair-input methods, available data are randomly divided into a training set and a test set, without regard to the partitioning of test pairs into distinct classes. How representative would such randomly generated test sets be of full populations? To answer this question, we performed the typical cross-validation using the yeast and human PPI data of Supplementary Table 2. Not surprisingly, the C1 class accounted for more than 99% of each of the test sets generated for the typical cross-validations, and accordingly the cross-validated predictive performances closely match those for the C1 class (Supplementary Table 2). In contrast, within the full population (i.e., the set of possible human protein pairs), the C1 class represents only a minority of cases: 21,946 protein-coding human genes [7] imply 240,802,485 possible human protein pairs.
According to HIPPIE [8], a meta-database integrating 10 public PPI databases, the space of C1 type human protein pairs (i.e., those pairs formed by proteins that are represented among highly confident PPIs) accounts for only 19.2% of these cases, compared with 49.2% and 31.6%, respectively, for the C2 and C3 classes. Hence, the C1 class is far less frequent at the population level than for typical cross-validation test sets, and performance estimates obtained by a typical cross-validation should not be expected to generalize to the full population level. Given that these yeast and human PPI data sets have also been broadly analyzed by others, this conclusion is very likely to hold generally, at least for pair-input PPI prediction methods. In summary, computational predictions (whether pair-input or not [9,10]) that are tested by cross-validation on non-representative subsets should not be expected to generalize to the full test populations. A unique aspect of pair-input methods, as compared with methods operating on single objects, is that one additionally needs to take into account the paired nature of inputs. We have demonstrated that 1) the paired nature of inputs leads to a natural partitioning of test pairs into distinct classes, and 2) pair-input methods achieve significantly different predictive performances for distinct test classes. We note that if one is only interested in the population of C1 test pairs, then typical cross-validations employing randomly generated test sets may be just fine, although this limitation should then be noted. For general-purpose pair-input methods, however, it is imperative to distinguish distinct classes of test pairs, and we propose that predictive performances should be reported separately for each distinct test class. In the case of PPI prediction methods, three independent predictive performances should be reported as in Supplementary Table 2.
In the case of protein-drug interaction prediction methods, one should report four independent predictive performances, as either the protein or drug component of a test pair might each be found in training data.
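The C1/C2/C3 partitioning described in this letter can be illustrated with a minimal Python sketch. The protein names and the eight training pairs below are hypothetical and are not the pairs shown in Fig. 1; the function names are likewise only illustrative. The sketch assumes unordered pairs of distinct proteins, which is consistent with the counts quoted in the letter (9 proteins give 36 pairs, of which 28 remain after removing 8 training pairs).

```python
from itertools import combinations
from math import comb


def classify_pair(pair, training_proteins):
    """Return 'C1', 'C2' or 'C3' depending on how many members of the
    test pair also occur somewhere in the training set."""
    shared = sum(protein in training_proteins for protein in pair)
    return {2: "C1", 1: "C2", 0: "C3"}[shared]


def partition_test_pairs(proteins, training_pairs):
    """Split every non-training pair into the three overlap classes."""
    training_set = {frozenset(p) for p in training_pairs}
    training_proteins = {protein for p in training_pairs for protein in p}
    classes = {"C1": [], "C2": [], "C3": []}
    for pair in combinations(sorted(proteins), 2):
        if frozenset(pair) in training_set:
            continue  # training pairs are not test pairs
        classes[classify_pair(pair, training_proteins)].append(pair)
    return classes


# Hypothetical toy data: 9 proteins, 8 training pairs (assumed, not from Fig. 1).
proteins = [f"P{i}" for i in range(1, 10)]
training_pairs = [
    ("P1", "P2"), ("P3", "P4"), ("P5", "P6"), ("P1", "P3"),
    ("P2", "P4"), ("P1", "P5"), ("P3", "P6"), ("P2", "P6"),
]

classes = partition_test_pairs(proteins, training_pairs)
counts = {name: len(pairs) for name, pairs in classes.items()}
print(counts)  # {'C1': 7, 'C2': 18, 'C3': 3}, 28 test pairs in total

# The population size quoted in the letter follows from "n choose 2".
print(comb(21946, 2))  # 240802485
```

Reporting a performance measure separately for each key of `classes`, rather than once for a randomly drawn test set, is the practice the letter recommends. For protein-drug pairs the same idea would yield four classes, since the protein and the drug can each be seen or unseen independently.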


BMC Bioinformatics | 2007

Prediction of the burial status of transmembrane residues of helical membrane proteins

Yungki Park; Sikander Hayat; Volkhard Helms

Background: Helical membrane proteins (HMPs) play a crucial role in diverse cellular processes, yet it still remains extremely difficult to determine their structures by experimental techniques. Given this situation, it is highly desirable to develop sequence-based computational methods for predicting structural characteristics of HMPs. Results: We have developed TMX (TransMembrane eXposure), a novel method for predicting the burial status (i.e. buried in the protein structure vs. exposed to the membrane) of transmembrane (TM) residues of HMPs. TMX derives positional scores of TM residues based on their profiles and conservation indices. Then, a support vector classifier is used for predicting their burial status. Its prediction accuracy is 78.71% on a benchmark data set, representing considerable improvements over 68.67% and 71.06% of previously proposed methods. Importantly, unlike the previous methods, TMX automatically yields confidence scores for the predictions made. In addition, a feature selection incorporated in TMX reveals interesting insights into the structural organization of HMPs. Conclusion: A novel computational method, TMX, has been developed for predicting the burial status of TM residues of HMPs. Its prediction accuracy is much higher than that of previously proposed methods. It will be useful in elucidating structural characteristics of HMPs as an inexpensive, auxiliary tool. A web server for TMX is established at http://service.bioinformatik.uni-saarland.de/tmx and freely available to academic users, along with the data set used.


BMC Bioinformatics | 2009

Critical assessment of sequence-based protein-protein interaction prediction methods that do not require homologous protein sequences

Yungki Park

Background: Protein-protein interactions underlie many important biological processes. Computational prediction methods can nicely complement experimental approaches for identifying protein-protein interactions. Recently, a unique category of sequence-based prediction methods has been put forward - unique in the sense that it does not require homologous protein sequences. This enables it to be universally applicable to all protein sequences, unlike many previous sequence-based prediction methods. If effective as claimed, these new sequence-based, universally applicable prediction methods would have far-reaching utilities in many areas of biology research. Results: Upon close survey, I realized that many of these new methods were ill-tested. In addition, newer methods were often published without performance comparison with previous ones. Thus, it is not clear how good they are and whether there are significant performance differences among them. In this study, I have implemented and thoroughly tested 4 different methods on large-scale, non-redundant data sets. It reveals several important points. First, significant performance differences are noted among different methods. Second, data sets typically used for training prediction methods appear significantly biased, limiting the general applicability of prediction methods trained with them. Third, there is still ample room for further developments. In addition, my analysis illustrates the importance of complementary performance measures coupled with right-sized data sets for meaningful benchmark tests. Conclusions: The current study reveals the potentials and limits of the new category of sequence-based protein-protein interaction prediction methods, which in turn provides a firm ground for future endeavours in this important area of contemporary bioinformatics.


Bioinformatics | 2007

On the derivation of propensity scales for predicting exposed transmembrane residues of helical membrane proteins

Yungki Park; Volkhard Helms

Helical membrane proteins (HMPs) play a crucial role in diverse physiological processes. Given the difficulty in determining their structures by experimental techniques, it is desired to develop computational methods for predicting the burial status of transmembrane residues. Deriving a propensity scale for the 20 amino acids to be exposed to the lipid bilayer from known structures is central to developing such methods. A fundamental problem in this regard is what would be the optimal way of deriving propensity scales. Here, we show that this problem can be reformulated such that an optimal scale is straightforwardly obtained in an analytical fashion. The derived scale favorably compares with others in terms of both algorithmic optimality and practical prediction accuracy. It also allows interesting insights into the structural organization of HMPs. Furthermore, the presented approach can be applied to other bioinformatics problems of HMPs, too. All the data sets and programs used in the study and detailed primary results are available upon request.


Proteins | 2006

Assembly of transmembrane helices of simple polytopic membrane proteins from sequence conservation patterns

Yungki Park; Volkhard Helms

The transmembrane (TM) domains of most membrane proteins consist of helix bundles. The seemingly simple task of TM helix bundle assembly has turned out to be extremely difficult. This is true even for simple TM helix bundle proteins, i.e., those that have the simple form of compact TM helix bundles. Herein, we present a computational method that is capable of generating native‐like structural models for simple TM helix bundle proteins having modest numbers of TM helices based on sequence conservation patterns. Thus, the only requirement for our method is the presence of more than 30 homologous sequences for an accurate extraction of sequence conservation patterns. The prediction method first computes a number of representative well‐packed conformations for each pair of contacting TM helices, and then a library of tertiary folds is generated by overlaying overlapping TM helices of the representative conformations. This library is scored using sequence conservation patterns, and a subsequent clustering analysis yields five final models. Assuming that neighboring TM helices in the sequence contact each other (but not that TM helices A and G contact each other), the method produced structural models of Cα atom root‐mean‐square deviation (CA RMSD) of 3–5 Å from corresponding crystal structures for bacteriorhodopsin, halorhodopsin, sensory rhodopsin II, and rhodopsin. In blind predictions, this type of contact knowledge is not available. Mimicking this, predictions were made for the rotor of the V‐type Na+‐adenosine triphosphatase without such knowledge. The CA RMSD between the best model and its crystal structure is only 3.4 Å, and its contact accuracy reaches 55%. Furthermore, the model correctly identifies the binding pocket for sodium ion. These results demonstrate that the method can be readily applied to ab initio structure prediction of simple TM helix bundle proteins having modest numbers of TM helices. Proteins 2006.
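For readers unfamiliar with the CA RMSD figures quoted above, the following Python sketch shows the standard root-mean-square deviation over matched Cα coordinates. It assumes the model and the reference structure have already been superimposed and contain the same residues in the same order; the abstract does not describe the superposition procedure, so that step is omitted here. The function name `ca_rmsd` and the coordinates are illustrative only.

```python
import math


def ca_rmsd(model, reference):
    """Root-mean-square deviation between two equally long lists of
    (x, y, z) C-alpha coordinates, in the units of the input (e.g. angstroms).

    Assumes the two structures are already optimally superimposed.
    """
    if not model or len(model) != len(reference):
        raise ValueError("coordinate lists must be non-empty and equally long")
    squared_sum = sum(
        (a - b) ** 2
        for atom_m, atom_r in zip(model, reference)
        for a, b in zip(atom_m, atom_r)
    )
    return math.sqrt(squared_sum / len(model))


# Hypothetical coordinates: every atom of the model is displaced by 3 units in z.
model = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
reference = [(0.0, 0.0, 3.0), (1.0, 0.0, 3.0)]
print(ca_rmsd(model, reference))  # 3.0
```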


Bioinformatics | 2008

Prediction of the translocon-mediated membrane insertion free energies of protein sequences

Yungki Park; Volkhard Helms

Motivation: Helical membrane proteins (HMPs) play crucial roles in a variety of cellular processes. Unlike water-soluble proteins, HMPs need not only to fold but also get inserted into the membrane to be fully functional. This process of membrane insertion is mediated by the translocon complex. Thus, it is of great interest to develop computational methods for predicting the translocon-mediated membrane insertion free energies of protein sequences. Result: We have developed Membrane Insertion (MINS), a novel sequence-based computational method for predicting the membrane insertion free energies of protein sequences. A benchmark test gives a correlation coefficient of 0.74 between predicted and observed free energies for 357 known cases, which corresponds to a mean unsigned error of 0.41 kcal/mol. These results are significantly better than those obtained by traditional hydropathy analysis. Moreover, the ability of MINS to reasonably predict membrane insertion free energies of protein sequences allows for effective identification of transmembrane (TM) segments. Subsequently, MINS was applied to predict the membrane insertion free energies of 316 TM segments found in known structures. An in-depth analysis of the predicted free energies reveals a number of interesting findings about the biogenesis and structural stability of HMPs. Availability: A web server for MINS is available at http://service.bioinformatik.uni-saarland.de/mins
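The two benchmark measures quoted in this abstract, a correlation coefficient and a mean unsigned error, can be computed as in the sketch below. It assumes the correlation coefficient is Pearson's r, which the abstract does not state explicitly, and the free-energy values used are made up for illustration; they are not MINS predictions.

```python
import math


def pearson_r(predicted, observed):
    """Pearson correlation coefficient between two equally long sequences.

    Raises ZeroDivisionError if either sequence has zero variance.
    """
    n = len(predicted)
    if n != len(observed) or n < 2:
        raise ValueError("need two equally long sequences with at least 2 values")
    mean_p = sum(predicted) / n
    mean_o = sum(observed) / n
    cov = sum((p - mean_p) * (o - mean_o) for p, o in zip(predicted, observed))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_o = sum((o - mean_o) ** 2 for o in observed)
    return cov / math.sqrt(var_p * var_o)


def mean_unsigned_error(predicted, observed):
    """Average absolute difference, in the units of the inputs (e.g. kcal/mol)."""
    if not predicted or len(predicted) != len(observed):
        raise ValueError("need two non-empty, equally long sequences")
    return sum(abs(p - o) for p, o in zip(predicted, observed)) / len(predicted)


# Hypothetical predicted vs. observed insertion free energies (kcal/mol).
predicted = [-0.5, 0.2, 1.1, 2.0]
observed = [-0.3, 0.0, 1.5, 1.8]
print(pearson_r(predicted, observed))
print(mean_unsigned_error(predicted, observed))
```

The two measures are complementary: the correlation captures how well the ranking and trend are reproduced, while the mean unsigned error reports the typical size of the deviation in energy units.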


Journal of Bioinformatics and Computational Biology | 2011

Prediction of the exposure status of transmembrane beta barrel residues from protein sequence.

Sikander Hayat; Peter Walter; Yungki Park; Volkhard Helms

We present BTMX (Beta barrel TransMembrane eXposure), a computational method to predict the exposure status (i.e. exposed to the bilayer or hidden in the protein structure) of transmembrane residues in transmembrane beta barrel proteins (TMBs). BTMX predicts the exposure status of known TM residues with an accuracy of 84.2% over 2,225 residues and provides a confidence score for all predictions. Predictions made are in concert with the fact that hydrophobic residues tend to be more exposed to the bilayer. The biological relevance of the input parameters is also discussed. The highest prediction accuracy is obtained when a sliding window comprising three residues with similar C(α)-C(β) vector orientations is employed. The prediction accuracy of the BTMX method on a separate unseen non-redundant test dataset is 78.1%. By employing out-pointing residues that are exposed to the bilayer, we have identified various physico-chemical properties that show statistically significant differences between the beta strands located at the oligomeric interfaces compared to the non-oligomeric strands. The BTMX web server generates colored, annotated snake-plots as part of the prediction results and is available under the BTMX tab at http://service.bioinformatik.uni-saarland.de/tmx-site/. Exposure status prediction of TMB residues may be useful in 3D structure prediction of TMBs.


Nucleic Acids Research | 2017

Homo-trimerization is essential for the transcription factor function of Myrf for oligodendrocyte differentiation

Dongkyeong Kim; Jin-ok Choi; Chuandong Fan; Randall S. Shearer; Mohamed Sharif; Patrick Busch; Yungki Park

Myrf is a key transcription factor for oligodendrocyte differentiation and central nervous system myelination. We and others have previously shown that Myrf is generated as a membrane protein in the endoplasmic reticulum (ER), and that it undergoes auto-processing to release its N-terminal fragment from the ER, which enters the nucleus to work as a transcription factor. These previous studies allow a glimpse into the unusual complexity behind the biogenesis and function of the transcription factor domain of Myrf. Here, we report that Myrf N-terminal fragments assemble into stable homo-trimers before ER release. Consequently, Myrf N-terminal fragments are released from the ER only as homo-trimers. Our re-analysis of a previous genetic screening result in Caenorhabditis elegans shows that homo-trimerization is essential for the biological functions of Myrf N-terminal fragment, and that the region adjacent to the DNA-binding domain is pivotal to its homo-trimerization. Further, our computational analysis uncovered a novel homo-trimeric DNA motif that mediates the homo-trimeric DNA binding of Myrf N-terminal fragments. Importantly, we found that homo-trimerization defines the DNA binding specificity of Myrf N-terminal fragments. In sum, our study elucidates the molecular mechanism governing the biogenesis and function of Myrf N-terminal fragments and its physiological significance.


Bioinformatics | 2008

MINS2: Revisiting the molecular code for transmembrane-helix recognition by the Sec61 translocon

Yungki Park; Volkhard Helms

To be fully functional, membrane proteins should not only fold, but also get inserted into the membrane, which is mediated by the Sec61 translocon. Recent experimental studies have attempted to elucidate how the Sec61 translocon accomplishes this delicate task by measuring the translocon-mediated membrane insertion free energies of 357 systematically designed peptides. On the basis of this data set, we have developed MINS2, a novel sequence-based computational method for predicting the membrane insertion free energies of protein sequences. A benchmark analysis of MINS2 shows that MINS2 significantly outperforms previously proposed methods. Importantly, the application of MINS2 to known membrane protein structures shows that a better prediction of membrane insertion free energies does not lead to a better prediction of transmembrane segments of polytopic membrane proteins. Availability: A web server for MINS2 is publicly available at http://service.bioinformatik.uni-saarland.de/mins. Supplementary information: Supplementary data are available at Bioinformatics online.


Computational Biology and Chemistry | 2011

Research Article: Statistical analysis and exposure status classification of transmembrane beta barrel residues

Sikander Hayat; Yungki Park; Volkhard Helms

Several computational methods exist for the identification of transmembrane beta barrel proteins (TMBs) from sequence. Some of these methods also provide the transmembrane (TM) boundaries of the putative TMBs. The aim of this study is (1) to derive the propensities of the TM residues to be exposed to the lipid bilayer and (2) to predict the exposure status (i.e. exposed to the bilayer or hidden in protein structure) of TMB residues. Three novel propensity scales, namely BTMC, BTMI and HTMI, were derived for the TMB residues at the hydrophobic core region of the outer membrane (OM), the lipid-water interface regions of the OM, and for the helical membrane protein (HMP) residues at the lipid-water interface regions of the inner membrane (IM), respectively. Separate propensity scales were derived for monomeric and functionally oligomeric TMBs. The derived propensities reflect differing physico-chemical properties of the respective membrane bilayer regions and were employed in a computational method for the prediction of the exposure status of TMB residues. Based on these propensities, the conservation indices and the frequency profile of the residues, the transmembrane residues were classified into buried/exposed with an accuracy of 77.91% and 80.42% for the residues at the membrane core and the interface regions, respectively. The correlation of the derived scales with different physico-chemical properties obtained from the AAIndex database is also discussed. Knowledge about the residue propensities and burial status will be useful in annotating putative TMBs with unknown structure.

Collaboration


Dive into Yungki Park's collaboration.

Top Co-Authors

Edward M. Marcotte

University of Texas at Austin

Sikander Hayat

Memorial Sloan Kettering Cancer Center

Zhihua Li

University of Texas at Austin
