Hyunjung Shin
Ajou University
Publication
Featured research published by Hyunjung Shin.
European Conference on Computational Biology | 2005
Koji Tsuda; Hyunjung Shin; Bernhard Schölkopf
MOTIVATION Support vector machines (SVMs) have been successfully used to classify proteins into functional categories. Recently, to integrate multiple data sources, a semidefinite programming (SDP) based SVM method was introduced. In SDP/SVM, multiple kernel matrices corresponding to each of the data sources are combined with weights obtained by solving an SDP. However, when trying to apply SDP/SVM to large problems, the computational cost can become prohibitive, since both converting the data to a kernel matrix for the SVM and solving the SDP are time and memory demanding. Another application-specific drawback arises when some of the data sources are protein networks. A common method of converting the network to a kernel matrix is the diffusion kernel method, which has time complexity of O(n^3) and produces a dense matrix of size n × n. RESULTS We propose an efficient method of protein classification using multiple protein networks. Available protein networks, such as a physical interaction network or a metabolic network, can be directly incorporated. Vectorial data can also be incorporated after conversion into a network by means of neighbor point connection. Similar to the SDP/SVM method, the combination weights are obtained by convex optimization. Due to the sparsity of network edges, the computation time is nearly linear in the number of edges of the combined network. Additionally, the combination weights provide information useful for discarding noisy or irrelevant networks. Experiments on function prediction of 3588 yeast proteins show promising results: the computation time is enormously reduced, while the accuracy is still comparable to the SDP/SVM method. AVAILABILITY Software and data will be available on request.
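The two data-handling steps in this abstract, converting vectorial data into a network by neighbor point connection and combining several sparse networks with weights, can be illustrated with a minimal sketch. The function names `knn_graph` and `combine_networks`, the toy data, and the hand-picked weights are all assumptions for illustration; in the paper the weights come from convex optimization, which is not reproduced here.

```python
from math import dist


def knn_graph(points, k):
    """Connect each point to its k nearest neighbors; return undirected edges."""
    edges = {}
    for i, p in enumerate(points):
        # Rank all other points by Euclidean distance to p.
        others = sorted((j for j in range(len(points)) if j != i),
                        key=lambda j: dist(p, points[j]))
        for j in others[:k]:
            # Store each undirected edge once, keyed by (smaller, larger) index.
            edges[(min(i, j), max(i, j))] = 1.0
    return edges


def combine_networks(networks, weights):
    """Weighted sum of sparse edge dictionaries (only existing edges are stored)."""
    combined = {}
    for net, w in zip(networks, weights):
        for edge, value in net.items():
            combined[edge] = combined.get(edge, 0.0) + w * value
    return combined


# Toy vectorial data (assumption) turned into a network with k = 1.
points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
vector_net = knn_graph(points, k=1)

# Hypothetical second network, e.g. a tiny interaction network.
interaction_net = {(0, 1): 1.0, (1, 2): 1.0}

# Weights fixed by hand here; the paper obtains them by convex optimization.
combined = combine_networks([vector_net, interaction_net], [0.5, 0.5])
```

Because only existing edges are stored, the cost of the combination step grows with the number of edges rather than with n × n, which is the sparsity argument the abstract makes.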
Expert Systems With Applications | 2006
Hyunjung Shin; Sungzoon Cho
The Support Vector Machine (SVM) employs the Structural Risk Minimization (SRM) principle to generalize better than conventional machine learning methods that employ the traditional Empirical Risk Minimization (ERM) principle. When applying SVM to response modeling in direct marketing, however, one has to deal with three practical difficulties: large training data, class imbalance, and scoring from binary SVM output. For the first difficulty, we propose a way to alleviate or solve it through a novel informative sampling. For the latter two difficulties, we provide guidelines within the SVM framework so that one can readily use the paper as a quick reference for SVM response modeling: use of different costs for different classes and use of distance to the decision boundary, respectively. This paper also provides various evaluation measures for response models in terms of accuracies, lift chart analysis, and computational efficiency.
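As a rough illustration of the lift chart analysis mentioned above, the sketch below ranks customers by a score (for example, a signed distance to the decision boundary) and compares the response rate in the top-scored fraction with the overall response rate. The function name `cumulative_lift` and the toy scores and labels are assumptions, not data from the paper.

```python
def cumulative_lift(scores, labels, fraction):
    """Response rate in the top-scored fraction divided by the overall response rate."""
    # Rank customers from highest to lowest score.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    n_top = max(1, int(len(ranked) * fraction))
    top_rate = sum(label for _, label in ranked[:n_top]) / n_top
    base_rate = sum(labels) / len(labels)
    return top_rate / base_rate


# Hypothetical scores (higher means more likely to respond) and true responses.
scores = [2.1, 1.4, 0.3, -0.2, -0.8, -1.1, -1.5, -2.0, -2.2, -3.0]
labels = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]

lift_top_20 = cumulative_lift(scores, labels, 0.2)
lift_all = cumulative_lift(scores, labels, 1.0)
```

A lift above 1 for a small fraction means that mailing only the top-scored customers reaches responders more efficiently than mailing at random; at a fraction of 1.0 the lift is 1 by construction.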
Neural Computation | 2007
Hyunjung Shin; Sungzoon Cho
The support vector machine (SVM) has been spotlighted in the machine learning community because of its theoretical soundness and practical performance. When applied to a large data set, however, it requires a large memory and a long time for training. To cope with the practical difficulty, we propose a pattern selection algorithm based on neighborhood properties. The idea is to select only the patterns that are likely to be located near the decision boundary. Those patterns are expected to be more informative than the randomly selected patterns. The experimental results provide promising evidence that it is possible to successfully employ the proposed algorithm ahead of SVM training.
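The idea of keeping only patterns near the decision boundary can be sketched with a simple neighborhood test: a pattern is kept when at least one of its k nearest neighbors carries a different class label. This is a simplified stand-in under that assumption, not the published algorithm; the function name `select_boundary_patterns` and the toy data are hypothetical.

```python
from math import dist


def select_boundary_patterns(points, labels, k):
    """Return indices of patterns whose k nearest neighbors include another class."""
    selected = []
    for i, p in enumerate(points):
        # Find the k nearest neighbors of pattern i by Euclidean distance.
        neighbors = sorted((j for j in range(len(points)) if j != i),
                           key=lambda j: dist(p, points[j]))[:k]
        # Mixed labels in the neighborhood suggest the pattern is near the boundary.
        if any(labels[j] != labels[i] for j in neighbors):
            selected.append(i)
    return selected


# Toy one-dimensional data: class 0 on the left, class 1 on the right.
points = [(0.0,), (1.0,), (2.0,), (3.0,), (4.0,), (5.0,)]
labels = [0, 0, 0, 1, 1, 1]

selected = select_boundary_patterns(points, labels, k=2)
```

On this toy set only the two patterns adjacent to the class change are kept, so an SVM trained afterwards would see a third of the original data.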
Bioinformatics | 2007
Hyunjung Shin; Andreas Martin Lisewski; Olivier Lichtarge
MOTIVATION Predicting protein function is a central problem in bioinformatics, and many approaches use partially or fully automated methods based on various combinations of sequence, structure and other information on proteins or genes. Such information establishes relationships between proteins that can be modelled most naturally as edges in graphs. A priori, however, it is often unclear which edges from which graph may contribute most to accurate predictions. For that reason, one established strategy is to integrate all available sources, or graphs as in graph integration, in the hope that the positive signals will add to each other. However, in the problem of functional prediction, noise, i.e. the presence of inaccurate or false edges, can still be large enough that integration alone has little effect on prediction accuracy. In order to reduce noise levels and to improve integration efficiency, we present here a recent method in graph-based learning, graph sharpening, which provides a theoretically firm yet intuitive and practical approach for disconnecting undesirable edges from protein similarity graphs. This approach has several attractive features: it is quick, scalable in the number of proteins, robust with respect to errors and tolerant of very diverse types of protein similarity measures. RESULTS We tested the classification accuracy in a test set of 599 proteins with remote sequence homology spread over 20 Gene Ontology (GO) functional classes. When compared to integration alone, graph sharpening plus integration of four vastly different molecular similarity measures improved the overall classification by nearly 30% [0.17 average increase in the area under the ROC curve (AUC)]. Moreover, and partially through the increased sparsity of the graphs induced by sharpening, this gain in accuracy came at negligible computational cost: sharpening and integration took on average 4.66 (±4.44) CPU seconds. AVAILABILITY Software and Supplementary data will be available on http://mammoth.bcm.tmc.edu/
Pattern Recognition Letters | 2005
Hyunjung Shin; Sungzoon Cho
If the training pattern set is large, training a support vector machine (SVM) takes a large amount of memory and a long time. Recently, we proposed the neighborhood property based pattern selection algorithm (NPPS), which selects only the patterns that are likely to be near the decision boundary ahead of SVM training (Proc. of the 7th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), Lecture Notes in Artificial Intelligence (LNAI 2637), Seoul, Korea, pp. 376-387). NPPS tries to identify those patterns that are likely to become support vectors in feature space. Preliminary reports show its effectiveness: SVM training time was reduced by two orders of magnitude with almost no loss in accuracy for various datasets. It has to be noted, however, that the decision boundary of the SVM and the support vectors are all defined in feature space, while NPPS as described above operates in input space. If the neighborhood relation in input space is not preserved in feature space, NPPS may not always be effective. In this paper, we show that the neighborhood relation is invariant under the input to feature space mapping. The result assures that the patterns selected by NPPS in input space are likely to be located near the decision boundary in feature space.
International Conference of the IEEE Engineering in Medicine and Biology Society | 2008
Muhammad Umer Khan; Jong Pill Choi; Hyunjung Shin; Minkoo Kim
Data analysis systems intended to assist a physician should be accurate, human interpretable and balanced, with a degree of confidence associated with the final decision. In cancer prognosis, such systems estimate recurrence of disease and predict survival of the patient, resulting in improved patient management. To develop such a prognostic system, this paper investigates a hybrid scheme based on fuzzy decision trees as an efficient alternative to crisp classifiers that are applied independently. Experiments were performed using different combinations of the number of decision tree rules, types of fuzzy membership functions and inference techniques. For this purpose, the SEER breast cancer data set (1973-2003), the most comprehensive source of information on cancer incidence in the United States, is considered. Performance comparisons suggest that, for cancer prognosis, hybrid fuzzy decision tree classification is more robust and balanced than independently applied crisp classification; moreover, it has the potential to adapt for significant performance enhancement.
Journal of the American Medical Informatics Association | 2014
Dokyoon Kim; Je-Gun Joung; Kyung-Ah Sohn; Hyunjung Shin; Yu Rang Park; Marylyn D. Ritchie; Ju Han Kim
Objective Cancer can involve gene dysregulation via multiple mechanisms, so no single level of genomic data fully elucidates tumor behavior due to the presence of numerous genomic variations within or between levels in a biological system. We have previously proposed a graph-based integration approach that combines multi-omics data including copy number alteration, methylation, miRNA, and gene expression data for predicting clinical outcome in cancer. However, genomic features likely interact with other genomic features in complex signaling or regulatory networks, since cancer is caused by alterations in pathways or complete processes. Methods Here we propose a new graph-based framework for integrating multi-omics data and genomic knowledge to improve power in predicting clinical outcomes and elucidate interplay between different levels. To highlight the validity of our proposed framework, we used an ovarian cancer dataset from The Cancer Genome Atlas for predicting stage, grade, and survival outcomes. Results Integrating multi-omics data with genomic knowledge to construct pre-defined features resulted in higher performance in clinical outcome prediction and higher stability. For the grade outcome, the model with gene expression data produced an area under the receiver operating characteristic curve (AUC) of 0.7866. However, models of the integration with pathway, Gene Ontology, chromosomal gene set, and motif gene set consistently outperformed the model with genomic data only, attaining AUCs of 0.7873, 0.8433, 0.8254, and 0.8179, respectively. Conclusions Integrating multi-omics data and genomic knowledge to improve understanding of molecular pathogenesis and underlying biology in cancer should improve diagnostic and prognostic indicators and the effectiveness of therapies.
Expert Systems With Applications | 2012
Hyunjung Shin; Hayoung Park; Junwoo Lee; Won Chul Jhee
We propose a scoring model that detects outpatient clinics with abusive utilization patterns based on profiling information extracted from electronic insurance claims. The model consists of (1) scoring to quantify the degree of abusiveness and (2) segmentation to categorize the problematic providers with similar utilization patterns. We performed the modeling for 3705 Korean internal medicine clinics. We applied data from practitioner claims submitted to the National Health Insurance Corporation for outpatient care during the 3rd quarter of 2007 and used 4th quarter data to validate the model. We considered the Health Insurance Review and Assessment Services decisions on interventions to be accurate for model validation. We compared the conditional probability distributions of the composite degree of anomaly (CDA) score formulated for intervention and non-intervention groups. To assess the validity of the model, we examined confusion matrices by intervention history and group as defined by the CDA score. The CDA aggregated 38 indicators of abusiveness for individual clinics, which were grouped based on the CDAs, and we used the decision tree to further segment them into homogeneous clusters based on their utilization patterns. The validation indicated that the proposed model was largely consistent with the manual detection techniques currently used to identify potential abusers. The proposed model, which can be used to automate abuse detection, is flexible and easy to update. It may present an opportunity to fight escalating healthcare costs in the era of increasing availability of electronic healthcare information.
Decision Support Systems | 2013
Hyunjung Shin; Tianya Hou; Kanghee Park; Chan-Kyoo Park; Sunghee Choi
Oil price prediction has long been an important determinant in the management of most sectors of industry across the world, and has therefore consistently required detailed research. However, existing approaches to oil price prediction have sometimes made it rather difficult to implement the complex interconnected relationship between the price of oil and other global/domestic economic factors. This has been complicated by the influence of the irregular impact caused by the economic factors that affect the oil price. Recently, a machine learning algorithm, known as semi-supervised learning (SSL) has emerged, whose strength is the ease it can bring to the network representation of entities and the explicitness of inference which is expressed through relations between different entities. Since an awareness of the network representation of complicated relations between economic factors including the oil price is natural in SSL, this method allows the effects of the impact of economic factors on the oil price to be assessed with improved accuracy. SSL has so far been exploited in dealing with the non time-series types of entity, but not for the time-series types. Therefore, the proposed study is to exploit the method of representing the network between these time-series entities, and to then employ SSL to forecast the upward and downward movement of oil prices. The proposed SSL approach will be tested using one-month-ahead monthly crude oil price predictions between January 1992 and June 2008.
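The abstract does not say which semi-supervised learning algorithm is used, so the following is only a generic sketch of graph-based label propagation: known up/down movements are clamped at +1 or -1, and each unlabeled node repeatedly takes the weighted mean of its neighbors' scores. The function name `propagate_labels`, the four-node chain, and its edge weights are assumptions for illustration.

```python
def propagate_labels(weights, labels, iterations=100):
    """Clamp known labels; set each unknown score to the weighted mean of its neighbors."""
    # Unknown nodes (None) start at 0.0; known nodes keep +1 (up) or -1 (down).
    scores = [0.0 if y is None else float(y) for y in labels]
    for _ in range(iterations):
        for i, y in enumerate(labels):
            if y is not None:
                continue  # labeled nodes stay fixed
            total = sum(weights[i])
            if total > 0:
                scores[i] = sum(w * s for w, s in zip(weights[i], scores)) / total
    return scores


# Hypothetical similarity network between four entities, arranged as a chain.
weights = [
    [0.0, 1.0, 0.0, 0.0],
    [1.0, 0.0, 1.0, 0.0],
    [0.0, 1.0, 0.0, 1.0],
    [0.0, 0.0, 1.0, 0.0],
]
labels = [1, None, None, -1]  # node 0 moved up, node 3 moved down

scores = propagate_labels(weights, labels)
predicted_up = [s > 0 for s in scores]
```

The sign of each final score gives the predicted direction, and its magnitude reflects how strongly the labeled neighbors agree.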
ACM Symposium on Applied Computing | 2010
Irfan Ahmed; Kyung-suk Lhee; Hyunjung Shin; Manpyo Hong
This paper proposes two techniques to reduce the classification time of content-based file type identification. The first is a feature selection technique, which uses a subset of frequently occurring byte patterns to build the representative model of a file type and to classify files. The second is a content sampling technique, which uses a subset of the file content to obtain its byte-frequency distribution. Our initial experiments show that the proposed approaches are promising even when only simple 1-gram features are used for classification.
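Both techniques can be illustrated with a minimal sketch: a 1-gram byte-frequency distribution computed from a sample of the content, followed by keeping only the most frequent byte values as features. How the paper chooses its sample is not stated in the abstract, so taking the leading bytes is an assumption, as are the function names `byte_frequency` and `top_features` and the toy data.

```python
from collections import Counter


def byte_frequency(data, sample_size=None):
    """Normalized 1-gram byte frequencies, optionally computed on a leading sample."""
    # Assumption: the sample is simply the first sample_size bytes of the content.
    sample = data if sample_size is None else data[:sample_size]
    total = len(sample)
    if total == 0:
        return {}
    counts = Counter(sample)  # iterating over bytes yields integer byte values
    return {byte: count / total for byte, count in counts.items()}


def top_features(distribution, n):
    """Keep the n most frequent byte values, breaking ties by byte value."""
    ranked = sorted(distribution.items(), key=lambda item: (-item[1], item[0]))
    return dict(ranked[:n])


# Toy content: the leading 8 bytes differ in character from the rest.
data = b"aaaabbbc" + b"z" * 8

full = byte_frequency(data)
sampled = byte_frequency(data, sample_size=8)
features = top_features(sampled, 2)
```

Sampling reduces the number of bytes read per file, and feature selection reduces the number of values compared per file, which are the two sources of speed-up the abstract describes.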