Jong Youl Choi
Oak Ridge National Laboratory
Publication
Featured research published by Jong Youl Choi.
BMC Bioinformatics | 2010
Judy Qiu; Jaliya Ekanayake; Thilina Gunarathne; Jong Youl Choi; Seung-Hee Bae; Hui Li; Bingjing Zhang; Tak-Lon Wu; Yang Ruan; Saliya Ekanayake; Adam Hughes; Geoffrey C. Fox
Background: Clouds and MapReduce have shown themselves to be a broadly useful approach to scientific computing, especially for parallel data-intensive applications. However, they have limited applicability to some areas such as data mining, because MapReduce performs poorly on problems with the iterative structure present in the linear algebra that underlies much data analysis. Such problems can be run efficiently on clusters using MPI, leading to a hybrid cloud-and-cluster environment. This motivates the design and implementation of an open-source iterative MapReduce system, Twister.
Results: Comparisons of Amazon, Azure, and traditional Linux and Windows environments on common applications have shown encouraging performance and usability in several important non-iterative cases. These are linked to MPI applications for the final stages of the data analysis. Further, we have released the open-source Twister iterative MapReduce system and benchmarked it against basic MapReduce (Hadoop) and MPI in information retrieval and life sciences applications.
Conclusions: The hybrid cloud (MapReduce) and cluster (MPI) approach offers an attractive production environment, while Twister promises a uniform programming environment for many life sciences applications.
Methods: We used the commercial clouds Amazon and Azure and the NSF resource FutureGrid to perform detailed comparisons and evaluations of different approaches to data-intensive computing. Several applications were developed in MPI, MapReduce, and Twister in these different environments.
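The iterative structure the abstract emphasizes can be made concrete with a small sketch. The Python below is illustrative only (it is not Twister's API): a map/reduce pair for one-dimensional k-means is driven by an outer loop, which is the pattern Twister accelerates by caching static data across iterations instead of re-reading it as plain MapReduce does.

```python
# Toy iterative MapReduce: a map/reduce pair repeated by a driver loop.
import random

def map_phase(points, centroids):
    # Emit (nearest-centroid-index, point) pairs.
    for p in points:
        idx = min(range(len(centroids)), key=lambda i: (p - centroids[i]) ** 2)
        yield idx, p

def reduce_phase(pairs, centroids):
    # Average the points assigned to each centroid; keep empty centroids unchanged.
    k = len(centroids)
    sums, counts = [0.0] * k, [0] * k
    for idx, p in pairs:
        sums[idx] += p
        counts[idx] += 1
    return [sums[i] / counts[i] if counts[i] else centroids[i] for i in range(k)]

points = [random.gauss(mu, 0.5) for mu in (0.0, 5.0, 10.0) for _ in range(100)]
centroids = random.sample(points, 3)
for _ in range(50):                               # driver loop: the "iterative" part
    new = reduce_phase(map_phase(points, centroids), centroids)
    if max(abs(a - b) for a, b in zip(new, centroids)) < 1e-6:
        break
    centroids = new
print(sorted(round(c, 2) for c in centroids))
```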
Concurrency and Computation: Practice and Experience | 2011
Thilina Gunarathne; Tak Lon Wu; Jong Youl Choi; Seung-Hee Bae; Judy Qiu
Cloud computing offers exciting new approaches for scientific computing that leverage major commercial players' hardware and software investments in large-scale data centers. Loosely coupled problems are very important in many scientific fields, and with the ongoing move towards data-intensive computing, they are on the rise. There exist several different approaches to leveraging clouds and cloud-oriented data processing frameworks to perform pleasingly parallel (also called embarrassingly parallel) computations. In this paper, we present three pleasingly parallel biomedical applications: (i) assembly of genome fragments; (ii) sequence alignment and similarity search; and (iii) dimension reduction in the analysis of chemical structures. These are implemented using the cloud infrastructure service-based utility computing models of Amazon Web Services (Amazon.com, Inc., Seattle, WA, USA) and Microsoft Windows Azure (Microsoft Corp., Redmond, WA, USA), as well as the MapReduce-based data processing frameworks Apache Hadoop (Apache Software Foundation) and Microsoft DryadLINQ. We review and compare each of these frameworks, performing a comparative study based on performance, cost, and usability. High-latency, eventually consistent, cloud infrastructure service-based frameworks that rely on off-the-node cloud storage were able to exhibit performance efficiencies and scalability comparable to the MapReduce-based frameworks with local disk-based storage for the applications considered. We also analyze variations in cost among the different platform choices (e.g., Elastic Compute Cloud instance types), highlighting the importance of selecting an appropriate platform based on the nature of the computation.
Concurrency and Computation: Practice and Experience | 2014
Qing Liu; Jeremy Logan; Yuan Tian; Hasan Abbasi; Norbert Podhorszki; Jong Youl Choi; Scott Klasky; Roselyne Tchoua; Jay F. Lofstead; Ron A. Oldfield; Manish Parashar; Nagiza F. Samatova; Karsten Schwan; Arie Shoshani; Matthew Wolf; Kesheng Wu; Weikuan Yu
Applications running on leadership platforms are increasingly bottlenecked by storage input/output (I/O). In an effort to combat the growing disparity between I/O throughput and compute capability, we created the Adaptable IO System (ADIOS) in 2005. Focusing on putting users first with a service-oriented architecture, we combined cutting-edge research into new I/O techniques with a design effort to create near-optimal I/O methods. As a result, ADIOS provides the highest level of synchronous I/O performance for a number of mission-critical applications at various Department of Energy Leadership Computing Facilities. Meanwhile, ADIOS is leading the push for next-generation techniques, including staging and data processing pipelines. In this paper, we describe the startling observations we have made in the last half decade of I/O research and development, and elaborate on the lessons we have learned along this journey. We also detail some of the challenges that remain as we look toward the coming exascale era.
high performance distributed computing | 2010
Seung-Hee Bae; Jong Youl Choi; Judy Qiu; Geoffrey C. Fox
The recent explosion of publicly available biological gene sequences and chemical compounds offers an unprecedented opportunity for data mining. To make data analysis feasible for such vast, high-dimensional scientific data, we apply high-performance dimension reduction algorithms, which facilitate the investigation of unknown structures through three-dimensional visualization. Among the known dimension reduction algorithms, we utilize multidimensional scaling (MDS) and generative topographic mapping (GTM) to configure the given high-dimensional data into the target dimension. However, both algorithms require large amounts of physical memory as well as computational resources. We therefore propose an interpolation approach that utilizes the mapping of only a subset of the given data. This approach effectively reduces computational complexity; with a minor trade-off in approximation quality, interpolation makes it possible to process millions of data points with modest computation and memory requirements. Because huge amounts of data are involved, we also describe how to parallelize the proposed interpolation algorithms. Evaluating interpolated MDS by the STRESS criterion requires symmetric all-pairwise computation with only a subset of the required data per process, so we also propose a simple but efficient parallel mechanism for this symmetric all-pairwise computation when only a subset of data is available to each process. Our experimental results illustrate that the quality of the interpolated mappings is comparable to that of the original algorithms. In terms of parallel performance, the interpolation methods parallelize well with high efficiency. With the proposed interpolation method, we construct a configuration of two million out-of-sample data points in the target dimension, and the number of out-of-sample points can be increased further.
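To make the out-of-sample interpolation idea concrete, here is a minimal, hypothetical sketch rather than the paper's actual parallel implementation: the raw STRESS criterion used for evaluation, and placement of one new point against k already-mapped in-sample anchors by local gradient descent. The function names, the choice of k, and the gradient step size are illustrative assumptions.

```python
import numpy as np

def stress(delta, X):
    # Raw STRESS: squared mismatch between target dissimilarities delta (n x n)
    # and Euclidean distances in the embedding X (n x dim), over i < j pairs.
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    iu = np.triu_indices(len(X), k=1)
    return float(((d[iu] - delta[iu]) ** 2).sum())

def interpolate_point(deltas_to_sample, X_sample, k=10, steps=200, lr=0.05):
    # Place one out-of-sample point given its dissimilarities to the in-sample
    # points (1-D array deltas_to_sample) and their fixed embedding X_sample.
    idx = np.argsort(deltas_to_sample)[:k]        # k nearest in-sample anchors
    anchors, targets = X_sample[idx], deltas_to_sample[idx]
    x = anchors.mean(axis=0)                      # start from the anchor centroid
    for _ in range(steps):                        # gradient descent on local stress
        diff = x - anchors
        dist = np.linalg.norm(diff, axis=1) + 1e-12
        grad = ((dist - targets) / dist)[:, None] * diff
        x = x - lr * grad.sum(axis=0)
    return x
```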
international conference on cloud computing | 2009
Geoffrey C. Fox; Xiaohong Qiu; Scott Beason; Jong Youl Choi; Jaliya Ekanayake; Thilina Gunarathne; Mina Rho; Haixu Tang; Neil Devadasan; Gilbert C. Liu
Many areas of science are seeing a data deluge coming from new instruments, myriads of sensors, and exponential growth in electronic records. We take two examples: one the analysis of gene sequence data (35,339 Alu sequences), the other a study of medical information (over 100,000 patient records) in Indianapolis and its relationship to Geographic Information System and Census data available for 635 Census Blocks in Indianapolis. We look at initial processing (such as Smith-Waterman dissimilarities), clustering (using robust deterministic annealing), and multidimensional scaling to map high-dimensional data to 3D for convenient visualization. We show how such processing pipelines can be implemented using either cloud technologies or MPI, and we compare the two. This study illustrates the challenges of integrating data exploration tools with a variety of different architectural requirements and natural programming models. We present preliminary results for an end-to-end study of two complete applications.
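The first pipeline stage, pairwise Smith-Waterman dissimilarities, can be sketched in a few lines. The scoring parameters and the normalization from alignment score to dissimilarity below are illustrative assumptions, not the exact settings used in the paper.

```python
# Toy Smith-Waterman local alignment score; one such score is computed for
# every pair of sequences before clustering and dimension reduction.
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            best = max(best, H[i][j])
    return best

def dissimilarity(a, b):
    # One common normalization: 1 - score / min(self-alignment scores).
    return 1.0 - smith_waterman(a, b) / min(smith_waterman(a, a), smith_waterman(b, b))

print(dissimilarity("GATTACA", "GATCACA"))
```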
grid computing | 2010
Jong Youl Choi; Seung-Hee Bae; Xiaohong Qiu; Geoffrey C. Fox
Large high-dimensional datasets are of growing importance in many fields, and it is useful to be able to visualize them, both for understanding the results of data mining approaches and simply for browsing them in a way that distance between points in the visualization (2D or 3D) space tracks distance in the original high-dimensional space. Dimension reduction is a well-understood approach but can be very time- and memory-intensive for large problems. Here we report on parallel algorithms for Scaling by MAjorizing a COmplicated Function (SMACOF), which solves the multidimensional scaling (MDS) problem, and for Generative Topographic Mapping (GTM). The former is particularly time-consuming, with complexity that grows as the square of the dataset size, but it has the advantage of not requiring explicit vectors for the dataset points, only measurements of inter-point dissimilarities. We compare SMACOF and GTM on a subset of the NIH PubChem database, which uses binary vectors of length 166 bits. We find good parallel performance for both GTM and SMACOF and a strong correlation between the dimension-reduced PubChem data produced by the two methods.
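As a reference point for what is being parallelized, a serial, unweighted SMACOF iteration (the Guttman transform) is sketched below. This is a minimal illustration under the assumption of uniform weights; the paper's implementation distributes the N x N dissimilarity matrix and these matrix products across MPI processes.

```python
import numpy as np

def smacof(delta, dim=3, iters=300, tol=1e-6, seed=0):
    # delta: symmetric (n, n) dissimilarity matrix with zero diagonal.
    n = len(delta)
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, dim))
    prev = np.inf
    for _ in range(iters):
        D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
        np.fill_diagonal(D, 1.0)                  # avoid divide-by-zero below
        D[D == 0] = 1e-12                         # guard against coincident points
        B = -delta / D
        np.fill_diagonal(B, 0.0)
        np.fill_diagonal(B, -B.sum(axis=1))
        X = (B @ X) / n                           # Guttman transform
        iu = np.triu_indices(n, k=1)
        stress = float(((D[iu] - delta[iu]) ** 2).sum())
        if prev - stress < tol:                   # stop when STRESS stops improving
            break
        prev = stress
    return X
```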
high performance distributed computing | 2017
Bing Xie; Yezhou Huang; Jeffrey S. Chase; Jong Youl Choi; Scott Klasky; Jay F. Lofstead; Sarp Oral
In this paper, we develop a model for predicting the output performance of supercomputer file systems under production load. Our target environment is Titan, the 3rd fastest supercomputer in the world, and its Lustre-based multi-stage write path. We observe from Titan that although output performance is highly variable at small time scales, the mean performance is stable and consistent over typical application run times. Moreover, we find that output performance is non-linearly related to its correlated parameters due to interference and saturation on individual stages of the path. These observations enable us to build a predictive model of expected write times for output patterns and I/O configurations, using feature transformations to capture the non-linear relationships. We identify candidate features based on the structure of the Lustre/Titan write path and use feature transformation functions to produce a model space with 135,000 candidate models. By searching for the minimal mean squared error in this space, we identify a good model and show that it is effective.
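The model-search step can be illustrated with a small, hedged sketch: enumerate combinations of per-feature transformations, fit each candidate by ordinary least squares, and keep the combination with the lowest mean squared error. The feature names and the transformation set below are placeholders, not the actual candidates that produce the paper's 135,000-model space.

```python
import itertools
import numpy as np

TRANSFORMS = {
    "identity": lambda v: v,
    "log":      np.log1p,
    "sqrt":     np.sqrt,
    "square":   lambda v: v ** 2,
}

def search_models(features, y):
    # features: dict mapping feature name -> 1-D numpy array of observations;
    # y: observed write times. Returns the lowest MSE and the chosen transforms.
    names = list(features)
    best_mse, best_choice = np.inf, None
    for combo in itertools.product(TRANSFORMS, repeat=len(names)):
        cols = [TRANSFORMS[t](features[n]) for n, t in zip(names, combo)]
        X = np.column_stack(cols + [np.ones(len(y))])      # design matrix + intercept
        coef, *_ = np.linalg.lstsq(X, y, rcond=None)
        mse = float(((X @ coef - y) ** 2).mean())
        if mse < best_mse:
            best_mse, best_choice = mse, dict(zip(names, combo))
    return best_mse, best_choice
```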
collaboration technologies and systems | 2008
Marlon E. Pierce; Geoffrey C. Fox; Joshua J. Rosen; Siddharth Maini; Jong Youl Choi
Web-based social networks, online personal profiles, keyword tagging, and online bookmarking are staples of Web 2.0-style applications. In this paper we report our investigation and implementation of these capabilities as a means for creating communities of like-minded faculty and researchers, particularly at minority-serving institutions. Our motivating problem is to provide outreach tools that broaden the participation of these groups in funded research activities, particularly in cyberinfrastructure and e-science. In this paper, we discuss the system design, implementation, social network seeding, and portal capabilities. Underlying our system, and folksonomy systems generally, is a graph-based data model that links external URLs, system users, and descriptive tags. We conclude with a survey of the applicability of clustering and other data mining techniques to these folksonomy graphs.
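The graph-based data model can be pictured as a tripartite graph in which users, URLs, and tags are nodes and each bookmarking event adds edges among them. The snippet below (using networkx, with hypothetical entity names) is a toy illustration of that model, not the portal's actual storage layer.

```python
import networkx as nx

G = nx.Graph()

def bookmark(user, url, tags):
    # One tagging event connects the user to the URL and both to every tag.
    G.add_node(user, kind="user")
    G.add_node(url, kind="url")
    G.add_edge(user, url)
    for t in tags:
        G.add_node(t, kind="tag")
        G.add_edge(user, t)
        G.add_edge(url, t)

bookmark("alice", "https://example.org/cyberinfrastructure", ["e-science", "grid"])
bookmark("bob",   "https://example.org/cyberinfrastructure", ["e-science"])

# Simple community signal: users who bookmarked the same URL.
shared = [u for u in G.neighbors("https://example.org/cyberinfrastructure")
          if G.nodes[u]["kind"] == "user"]
print(shared)   # ['alice', 'bob']
```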
international conference on conceptual structures | 2010
Jong Youl Choi; Judy Qiu; Marlon E. Pierce; Geoffrey C. Fox
Generative Topographic Mapping (GTM) is an important technique for dimension reduction that has been successfully applied in many fields. However, the usual Expectation-Maximization (EM) approach to GTM can easily get stuck in local minima, so we introduce a Deterministic Annealing (DA) approach to GTM that is more robust and less sensitive to initial conditions, removing the need to try many initial values to find good solutions. DA has been very successful in clustering, hidden Markov models, and multidimensional scaling, but it typically uses a fixed cooling scheme to control the temperature of the system. We propose a new cooling scheme that can adaptively adjust the choice of temperature in the middle of the process to find better solutions. Our experimental measurements suggest that deterministic annealing improves the quality of GTM solutions.
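To show the annealing structure being discussed, here is a minimal deterministic-annealing loop, demonstrated on soft clustering rather than GTM (the GTM-specific E and M steps are omitted for brevity). The fixed exponential cooling factor marks exactly the knob the paper replaces with an adaptive schedule; all parameter values are illustrative assumptions.

```python
import numpy as np

def da_cluster(X, k=3, T0=10.0, T_min=1e-3, cooling=0.9, seed=0):
    rng = np.random.default_rng(seed)
    C = X[rng.choice(len(X), size=k, replace=False)]    # initial centroids
    T = T0
    while T > T_min:
        d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=-1)
        d2 -= d2.min(axis=1, keepdims=True)             # stabilize the exponentials
        R = np.exp(-d2 / T)                             # soft assignments at temperature T
        R /= R.sum(axis=1, keepdims=True)
        C = (R.T @ X) / (R.sum(axis=0)[:, None] + 1e-12)  # weighted centroid update
        T *= cooling                # fixed cooling step; the paper adapts this instead
    return C
```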
Proceedings of the First Workshop on In Situ Infrastructures for Enabling Extreme-Scale Analysis and Visualization | 2015
James Kress; Scott Klasky; Norbert Podhorszki; Jong Youl Choi; Hank Childs; David Pugmire
In this position paper, we argue that the loosely coupled in situ processing paradigm will play an important role in high performance computing for the foreseeable future. Loosely coupled in situ is an enabling technique that addresses many of the current issues with tightly coupled in situ, including ease of integration, usability, and fault tolerance. We survey the prominent positives and negatives of both tightly coupled and loosely coupled in situ and present our recommendation as to why loosely coupled in situ is an enabling technique that is here to stay. We then report on some recent experiences with loosely coupled in situ processing, in an effort to explore each of the discussed factors in a real-world environment.