Byung-Hoon Park | Researchain

Publication

Featured research published by Byung-Hoon Park.

Bioinformatics | 2008

From pull-down data to protein interaction networks and complexes with biological relevance

Bing Zhang; Byung-Hoon Park; Tatiana V. Karpinets; Nagiza F. Samatova

MOTIVATION: Recent improvements in high-throughput Mass Spectrometry (MS) technology have expedited genome-wide discovery of protein-protein interactions by making it possible to detect protein complexes in a physiological setting. Computational inference of protein interaction networks and protein complexes from MS data is challenging. Advances are required in developing robust and seamlessly integrated procedures for assessing protein-protein interaction affinities, representing protein interaction networks mathematically, discovering protein complexes, and evaluating their biological relevance. RESULTS: A multi-step but easy-to-follow framework for identifying protein complexes from MS pull-down data is introduced. It assesses the interaction affinity between two proteins based on the similarity of their co-purification patterns derived from MS data. It constructs a protein interaction network by adopting a knowledge-guided threshold selection method. Based on the network, it identifies protein complexes and infers their core components using a graph-theoretical approach. It deploys a statistical evaluation procedure to assess the biological relevance of each complex found. On Saccharomyces cerevisiae pull-down data, the framework outperformed other, more complicated schemes by at least 10% in F1-measure and identified 610 protein complexes with high functional homogeneity based on enrichment in Gene Ontology (GO) annotation. Manual examination of the complexes suggested hypotheses about the causes of false identifications. Namely, co-purification of different protein complexes mediated by a common non-protein molecule, such as DNA, might be a source of false positives. Protein identification bias in pull-down technology, such as the hydrophilic bias, could result in false negatives.
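
The first two steps of the framework (scoring pairs of proteins by the similarity of their co-purification patterns, then thresholding the scores into a network) can be illustrated with a minimal sketch. The abstract does not name the similarity measure, so Jaccard similarity is used here purely as a stand-in, and the threshold is a plain parameter, not the knowledge-guided selection the paper describes. The function names and the input layout are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two sets (a stand-in, not the paper's affinity score)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def build_network(purifications, threshold):
    """Return {(protein_p, protein_q): score} for pairs scoring at or above threshold.

    purifications: {protein: set of pull-down experiment ids in which it was detected}
    (a hypothetical input layout chosen for this sketch).
    """
    edges = {}
    for p, q in combinations(sorted(purifications), 2):
        score = jaccard(purifications[p], purifications[q])
        if score >= threshold:
            edges[(p, q)] = score
    return edges


# Toy data: A and B share two of four experiments, C shares none.
toy = {"A": {1, 2, 3}, "B": {2, 3, 4}, "C": {9}}
network = build_network(toy, threshold=0.5)
```

The resulting edge dictionary is the kind of input a graph-theoretical complex-finding step could consume.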

International Conference on Parallel Processing | 2009

CIFTS: A Coordinated Infrastructure for Fault-Tolerant Systems

Rinku Gupta; Peter H. Beckman; Byung-Hoon Park; Ewing L. Lusk; Paul Hargrove; Al Geist; Dhabaleswar K. Panda; Andrew Lumsdaine; Jack J. Dongarra

Considerable work has been done on providing fault tolerance capabilities for different software components on large-scale high-end computing systems. Thus far, however, these fault-tolerant components have worked insularly and independently and information about faults is rarely shared. Such lack of system-wide fault tolerance is emerging as one of the biggest problems on leadership-class systems. In this paper, we propose a coordinated infrastructure, named CIFTS, that enables system software components to share fault information with each other and adapt to faults in a holistic manner. Central to the CIFTS infrastructure is a Fault Tolerance Backplane (FTB) that enables fault notification and awareness throughout the software stack, including fault-aware libraries, middleware, and applications. We present details of the CIFTS infrastructure and the interface specification that has allowed various software programs, including MPICH2, MVAPICH, Open MPI, and PVFS, to plug into the CIFTS infrastructure. Further, through a detailed evaluation we demonstrate the nonintrusive low-overhead capability of CIFTS that lets applications run with minimal performance degradation.

Dependable Systems and Networks | 2009

System log pre-processing to improve failure prediction

Ziming Zheng; Zhiling Lan; Byung-Hoon Park; Al Geist

Log preprocessing, a process applied to the raw log before a predictive method is used, is of paramount importance to failure prediction and diagnosis. While existing filtering methods have demonstrated good compression rates, they fail to preserve important failure patterns that are crucial for failure analysis. To address the problem, in this paper we present a log preprocessing method. It consists of three integrated steps: (1) event categorization to uniformly classify system events and identify fatal events; (2) event filtering to remove temporally and spatially redundant records, while also preserving necessary failure patterns for failure analysis; (3) causality-related filtering to combine correlated events for filtering through Apriori association rule mining. We demonstrate the effectiveness of our preprocessing method by using real failure logs collected from the Cray XT4 at ORNL and the Blue Gene/L system at SDSC. Experiments show that our method can preserve more failure patterns for failure analysis, thereby improving failure prediction by up to 174%.
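
Step (2) above, removing temporally and spatially redundant records, can be sketched roughly as follows. This is only an illustration under stated assumptions: the `Event` fields, the 300-second window, and the two-pass structure are hypothetical choices, and the sketch does not attempt the pattern-preserving logic that the paper adds on top of plain filtering.

```python
from typing import NamedTuple


class Event(NamedTuple):
    timestamp: float  # seconds since some epoch
    location: str     # e.g. a node identifier
    category: str     # label produced by an earlier categorization step


def _dedupe(events, key, window):
    """Keep an event only if no event with the same key was seen within `window` seconds."""
    last_seen = {}
    kept = []
    for ev in events:
        k = key(ev)
        prev = last_seen.get(k)
        if prev is None or ev.timestamp - prev > window:
            kept.append(ev)
        last_seen[k] = ev.timestamp  # a continuing burst keeps extending the window
    return kept


def filter_redundant(events, window=300.0):
    """Temporal pass (same category and location), then spatial pass (same category)."""
    ordered = sorted(events, key=lambda e: e.timestamp)
    temporal = _dedupe(ordered, lambda e: (e.category, e.location), window)
    return _dedupe(temporal, lambda e: e.category, window)


log = [
    Event(0, "n1", "ECC"), Event(10, "n1", "ECC"), Event(20, "n2", "ECC"),
    Event(30, "n3", "FS"), Event(1000, "n1", "ECC"),
]
filtered = filter_redundant(log)
```

With this toy log, the burst of `ECC` events in the first 20 seconds collapses to its first record, while the unrelated `FS` event and the much later `ECC` event survive.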

Molecular Systems Biology | 2009

Network-assisted protein identification and data interpretation in shotgun proteomics

Jing Li; Lisa J. Zimmerman; Byung-Hoon Park; David L. Tabb; Daniel C. Liebler; Bing Zhang

Protein assembly and biological interpretation of the assembled protein lists are critical steps in shotgun proteomics data analysis. Although most biological functions arise from interactions among proteins, current protein assembly pipelines treat proteins as independent entities. Usually, only individual proteins with strong experimental evidence, that is, confident proteins, are reported, whereas many possible proteins of biological interest are eliminated. We have developed a clique‐enrichment approach (CEA) to rescue eliminated proteins by incorporating the relationship among proteins as embedded in a protein interaction network. In several data sets tested, CEA increased protein identification by 8–23% with an estimated accuracy of 85%. Rescued proteins were supported by existing literature or transcriptome profiling studies at similar levels as confident proteins and at a significantly higher level than abandoned ones. Applying CEA to a breast cancer data set rescued proteins coded by well‐known breast cancer genes. In addition, CEA generated a network view of the proteins and helped show the modular organization of proteins that may underpin the molecular mechanisms of the disease.

International Conference on Tools with Artificial Intelligence | 2006

Multi-Criterion Active Learning in Conditional Random Fields

Christopher T. Symons; Nagiza F. Samatova; Ramya Krishnamurthy; Byung-Hoon Park; Tarik Umar; David Buttler; Terence Critchlow; David Hysom

Conditional random fields (CRFs), which are popular supervised learning models for many natural language processing (NLP) tasks, typically require a large collection of labeled data for training. In practice, however, manual annotation of text documents is quite costly. Furthermore, even large labeled training sets can have arbitrarily limited performance peaks if they are not chosen with care. This paper considers the use of multi-criterion active learning for identification of a small but sufficient set of text samples for training CRFs. Our empirical results demonstrate that our method is capable of reducing the manual annotation costs, while also limiting the retraining costs that are often associated with active learning. In addition, we show that the generalization performance of CRFs can be enhanced through judicious selection of training examples.

Computational Statistics & Data Analysis | 2007

Sampling streaming data with replacement

Byung-Hoon Park; George Ostrouchov; Nagiza F. Samatova

Simple random sampling is a widely accepted basis for estimation from a population. When data arrive as a stream, the total population size grows continuously and only one pass through the data is possible. Reservoir sampling is a method of maintaining a fixed-size random sample from streaming data. Reservoir sampling without replacement has been extensively studied, and several algorithms with sub-linear time complexity exist. Although reservoir sampling with replacement has been mentioned previously by some authors, it has been studied very little and only linear algorithms exist. A with-replacement reservoir sampling algorithm of sub-linear time complexity is introduced. A thorough complexity analysis of several approaches to the with-replacement reservoir sampling problem is also provided.
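
For orientation, the straightforward linear-time way to sample a stream with replacement is to run k independent size-one reservoirs side by side. The sketch below shows that baseline only; it is not the sub-linear algorithm introduced in the paper, and the function name and signature are illustrative.

```python
import random


def reservoir_sample_with_replacement(stream, k, rng=None):
    """One-pass with-replacement sample of size k using k independent size-1 reservoirs.

    Every arriving item costs O(k) work, which is the linear baseline.
    """
    rng = rng or random.Random()
    sample = []
    for n, item in enumerate(stream, start=1):
        if n == 1:
            # The first item fills every slot.
            sample = [item] * k
            continue
        for slot in range(k):
            # Each slot independently switches to the n-th item with probability 1/n,
            # which keeps that slot uniform over the n items seen so far.
            if rng.random() < 1.0 / n:
                sample[slot] = item
    return sample


draws = reservoir_sample_with_replacement(range(1000), k=5, rng=random.Random(0))
```

Because the slots are updated independently, the same stream element may occupy several slots at once, which is exactly what sampling with replacement requires.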

Journal of Physics: Conference Series | 2008

Coupling graph perturbation theory with scalable parallel algorithms for large-scale enumeration of maximal cliques in biological graphs

Nagiza F. Samatova; Matthew C. Schmidt; William Hendrix; Paul Breimyer; Kevin Thomas; Byung-Hoon Park

Data-driven construction of predictive models for biological systems faces challenges from data intensity, uncertainty, and computational complexity. Data-driven model inference is often considered a combinatorial graph problem where an enumeration of all feasible models is sought. The data-intensive and the NP-hard nature of such problems, however, challenges existing methods to meet the required scale of data size and uncertainty, even on modern supercomputers. Maximal clique enumeration (MCE) in a graph derived from such biological data is often a rate-limiting step in detecting protein complexes in protein interaction data, finding clusters of co-expressed genes in microarray data, or identifying clusters of orthologous genes in protein sequence data. We report two key advances that address this challenge. We designed and implemented the first (to the best of our knowledge) parallel MCE algorithm that scales linearly on thousands of processors running MCE on real-world biological networks with thousands and hundreds of thousands of vertices. In addition, we proposed and developed the Graph Perturbation Theory (GPT) that establishes a foundation for efficiently solving the MCE problem in perturbed graphs, which model the uncertainty in the data. GPT formulates necessary and sufficient conditions for detecting the differences between the sets of maximal cliques in the original and perturbed graphs and reduces the enumeration time by more than 80% compared to complete recomputation.
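
As background for the MCE problem described above, here is a minimal serial sketch of the textbook Bron-Kerbosch algorithm with pivoting. It is neither the parallel algorithm nor the perturbation-based method reported in the paper; the adjacency-dictionary input format is an assumption made for this example.

```python
def maximal_cliques(adj):
    """Yield each maximal clique of an undirected graph as a frozenset.

    adj: {vertex: set of neighboring vertices}, symmetric and without self-loops.
    """
    def expand(r, p, x):
        # r: current clique, p: candidates that can extend it, x: already-explored vertices.
        if not p and not x:
            yield frozenset(r)
            return
        # Pivot on the vertex covering the most candidates to prune branches.
        pivot = max(p | x, key=lambda v: len(adj[v] & p))
        for v in list(p - adj[pivot]):
            yield from expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    yield from expand(set(), set(adj), set())


# A triangle a-b-c with a pendant edge c-d.
graph = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
cliques = set(maximal_cliques(graph))
```

On the toy graph the two maximal cliques are the triangle and the pendant edge. The number of maximal cliques can grow exponentially with graph size, which is why enumeration becomes a rate-limiting step on large biological networks.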

ACS/IEEE International Conference on Computer Systems and Applications | 2008

Rapid and robust ranking of text documents in a dynamically changing corpus

Byung-Hoon Park; Nagiza F. Samatova; Rajesh Munavalli; Ramya Krishnamurthy; Houssain Kettani; Al Geist

Ranking documents in a selected corpus plays an important role in information retrieval systems. Despite notable advances in this direction, maintaining an up-to-date ordering among documents in the domains of interest is a challenging task when text documents accumulate continuously. Conventional approaches can produce an ordering that is only valid within a given corpus. Thus, with such approaches, the ordering must be completely redone as documents are added to or deleted from the corpus. In this paper, we introduce a corpus-independent framework for rapid ordering of documents in a dynamically changing corpus. As in many practical approaches, our framework suggests utilizing a similarity measure in some metric space indicating the degree of relevance of a document to the domain of interest. However, unlike in corpus-dependent approaches, the relevance score of a document remains valid when changes are introduced into the corpus (insertion of new documents, for example), thus allowing a rapid ordering within the corpus. This paper particularly details a statistical approach to compute such relevance scores.
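
The practical benefit of a corpus-independent score can be shown with a small sketch: if each document's score depends only on the document and a fixed domain profile, adding or removing a document never invalidates the other scores, so the ordering can be maintained incrementally. Cosine similarity against a fixed term-weight profile is used here only as a stand-in for the statistical relevance score the paper develops; the `Ranking` class and all names are hypothetical.

```python
import bisect
import math
from collections import Counter


def relevance(text, reference):
    """Cosine similarity between a document's term counts and a fixed reference profile."""
    doc = Counter(text.lower().split())
    dot = sum(count * reference.get(term, 0.0) for term, count in doc.items())
    doc_norm = math.sqrt(sum(c * c for c in doc.values()))
    ref_norm = math.sqrt(sum(w * w for w in reference.values()))
    norm = doc_norm * ref_norm
    return dot / norm if norm else 0.0


class Ranking:
    """Keeps documents ordered by a score that does not depend on the rest of the corpus."""

    def __init__(self, reference):
        self.reference = reference
        self._entries = []  # sorted list of (-score, doc_id); most relevant first

    def add(self, doc_id, text):
        # Only the new document is scored; existing entries are untouched.
        bisect.insort(self._entries, (-relevance(text, self.reference), doc_id))

    def remove(self, doc_id):
        self._entries = [e for e in self._entries if e[1] != doc_id]

    def ordered_ids(self):
        return [doc_id for _, doc_id in self._entries]


ranking = Ranking({"protein": 1.0, "complex": 1.0})
ranking.add("d1", "protein complex network")
ranking.add("d2", "weather report today")
ranking.add("d3", "protein complex")
```

A corpus-dependent weighting such as TF-IDF would not allow this, because every insertion changes the document frequencies and therefore every existing score.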

International Joint Conferences on Bioinformatics, Systems Biology and Intelligent Computing | 2009

Network Approaches for Shotgun Proteomics Data Analysis

Bing Zhang; Jing Li; David L. Tabb; Byung-Hoon Park

Shotgun proteomics has emerged as a powerful technology for protein identification with remarkable applications in discovering disease biomarkers. Protein assembly and biological interpretation of the assembled protein lists are critical steps in shotgun proteomics data analysis. Although most biological functions arise from interactions among proteins, current protein assembly pipelines treat proteins as independent entities. Usually, only individual proteins with strong experimental evidence (confident proteins) are reported, while many possible proteins of potential biological interest are eliminated. In biomarker studies, this conservative assembly may prevent us from identifying important biomarker candidates. In this study, we have developed a protein interaction network-assisted complex-enrichment approach (CEA) to improve protein identification by taking into consideration the functional relationship among proteins as embedded in protein interaction networks. CEA is based on the assumption that an eliminated protein is more likely to be present in the original sample if it is a member of a complex for which other members have been confidently identified in the same sample. Using a mouse organ data set and a mouse breast cancer data set, we show that CEA significantly improves protein identification and biological interpretation in shotgun proteomics data. First, we demonstrated the accuracy of CEA through cross-validation studies. CEA achieved an accuracy of 0.90 with a sensitivity of 0.45 in the mouse organ data set. Secondly, applying CEA on the eliminated proteins rescued 171, 156 and 181 proteins in the brain, placenta, and lung samples respectively, corresponding to 12%, 11%, and 10% increases in protein identifications in each organ proteome. Rescued proteins were supported by existing literature or transcriptome profiling studies at similar levels as the confidently identified proteins and at a significantly higher level than the abandoned ones. 
Finally, in the mouse breast cancer data set, CEA increased protein identification by 8% and 23% in the tumor and normal tissues, respectively. Among the 95 rescued proteins in the tumor tissue, 95% and 33% had been reported in cancer- and breast cancer-related publications, respectively, including products of some well-known breast cancer genes such as Ctnnb1 and Top1. Moreover, CEA makes it possible to compare proteomes at a network level. Comparison of the normal and tumor tissue-specific sub-networks identified some important processes involved in tumor biogenesis and progression, such as “apoptosis”, “cell adhesion”, and “Wnt receptor signaling pathway”. In conclusion, CEA is an accurate approach that can be easily incorporated into routine shotgun proteomics protein assembly pipelines to improve protein identification. In addition, CEA generates a network view of the proteins and helps reveal the modular organization of proteins that may underpin the molecular mechanisms of the disease.
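
The assumption behind CEA (an eliminated protein is more plausible if it belongs to a complex whose other members were confidently identified) can be sketched as an enrichment test. The abstract does not specify the statistic, so a one-sided hypergeometric test is assumed here; the function names, the input layout, and the 0.05 cutoff are likewise hypothetical.

```python
from math import comb


def hypergeom_sf(k, N, K, n):
    """P(X >= k) when n items are drawn without replacement from N, of which K are successes."""
    total = comb(N, n)
    upper = min(K, n)
    return sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1)) / total


def rescue(complexes, confident, eliminated, universe_size, alpha=0.05):
    """Return eliminated proteins that sit in complexes enriched for confident proteins."""
    rescued = set()
    for members in complexes:
        hits = len(members & confident)
        if hits == 0:
            continue
        # Probability of seeing at least `hits` confident members by chance.
        p_value = hypergeom_sf(hits, universe_size, len(confident), len(members))
        if p_value < alpha:
            rescued |= members & eliminated
    return rescued


confident = {f"c{i}" for i in range(1, 11)}  # 10 confidently identified proteins
eliminated = {"e1", "e2"}
complexes = [{"c1", "c2", "c3", "e1"}, {"c4", "e2", "x1", "x2"}]
rescued = rescue(complexes, confident, eliminated, universe_size=100)
```

In this toy example only `e1` is rescued: three of the four members of its complex are confident, which is unlikely by chance, whereas a single confident member in the second complex is not.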

Bioinformatics and Biomedicine | 2007

Multi-stage Framework to Infer Protein Functional Modules from Mass Spectrometry Pull-Down Data with Assessment of Biological Relevance

Byung-Hoon Park; Bing Zhang; Tatiana V. Karpinets; Nagiza F. Samatova

Protein functional modules are fundamental units in protein interaction networks. High-throughput Mass Spectrometry (MS) technology has become valuable for discovery of protein functional modules. Yet, their computational inference from MS pull-down data and the evaluation of their biological significance are still challenging. This paper introduces an integrated multi-step framework for (1) assessing protein-protein interaction affinities, (2) constructing a genome-wide protein association map, (3) finding putative protein functional modules, and (4) evaluating their biological relevance. The protein affinity score utilizes the co-purification pattern of two proteins and adopts an information-theoretic approach to build the protein affinity map. Putative protein modules are then derived using a graph-theoretical approach. A two-stage statistical procedure assesses the biological relevance of identified modules. On Saccharomyces cerevisiae pull-down data (Nature, vol. 415, pp. 141-7, 2002), the scoring scheme outperformed other methods by at least 10% in F1-measure, and statistical tests identified 489 protein modules enriched in all three general GO categories with p-values less than 0.05.
