Sukhyun Song
University of Maryland, College Park
Publication
Featured research published by Sukhyun Song.
international parallel and distributed processing symposium | 2009
Sukhyun Song; Kyung Dong Ryu; Dilma Da Silva
With the advent of cloud computing, massive and automated system management has become more important for successful and economical operation of computing resources. However, traditional monolithic system management solutions are designed to scale to only hundreds or thousands of systems at most. In this paper, we present Blue Eyes, a new system management solution to handle hundreds of thousands of systems. Blue Eyes enables highly scalable and reliable system management with a multi-server scale-out architecture. In particular, we structure the management servers into a hierarchical tree to achieve scalability, and management information is replicated into secondary servers to provide reliability and high availability. In addition, Blue Eyes is designed to extend the existing single-server implementation without significantly restructuring the code base. Several experimental results with the prototype have demonstrated that Blue Eyes can reliably handle typical management tasks for a large number of endpoints with dynamic load-balancing across the servers, near-linear performance gain with server additions, and an acceptable network overhead.
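To make the hierarchical arrangement above concrete, here is a minimal sketch of a tree of management servers in which each new endpoint is routed to the least-loaded leaf. It is not the Blue Eyes implementation: the `Server` class, the `assign` policy, and the server names are all hypothetical, chosen only to illustrate how a tree can spread endpoints across servers.

```python
class Server:
    """Hypothetical management server; leaves hold endpoints, inner nodes route."""

    def __init__(self, name, children=None):
        self.name = name
        self.children = children or []
        self.endpoints = []  # only populated on leaf servers

    def load(self):
        """Number of endpoints managed in this subtree."""
        if not self.children:
            return len(self.endpoints)
        return sum(child.load() for child in self.children)

    def assign(self, endpoint):
        """Walk down to the least-loaded leaf and register the endpoint there."""
        node = self
        while node.children:
            node = min(node.children, key=lambda child: child.load())
        node.endpoints.append(endpoint)
        return node.name


# Toy two-level tree with four leaf servers (names are illustrative only).
root = Server("root", [
    Server("mid-0", [Server("leaf-0"), Server("leaf-1")]),
    Server("mid-1", [Server("leaf-2"), Server("leaf-3")]),
])
for i in range(8):
    root.assign(f"endpoint-{i}")
```

In this toy run the eight endpoints end up evenly spread, two per leaf. Replication to secondary servers, which the abstract also mentions, is deliberately left out of the sketch.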
international conference on computer communications | 2011
Sukhyun Song; Peter J. Keleher; Bobby Bhattacharjee; Alan Sussman
The distributed nature of modern computing makes end-to-end prediction of network bandwidth increasingly important. Our work is inspired by prior work that treats the Internet and bandwidth as an approximate tree metric space. This paper presents a decentralized, accurate, and low cost system that predicts pairwise bandwidth between hosts. We describe an algorithm to construct a distributed tree that embeds bandwidth measurements. The correctness of the algorithm is provable when driven by precise measurements. We then describe three novel heuristics that achieve high accuracy for predicting bandwidth even with imprecise input data. Simulation experiments with a real-world dataset confirm that our approach shows high accuracy with low cost.
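As a rough illustration of the tree-metric idea (not the paper's distributed algorithm), the sketch below places hosts at the leaves of an edge-weighted tree and predicts the distance between two hosts as the sum of edge weights on the path connecting them. The conversion from distance to bandwidth (`SCALE / distance`), the constant `SCALE`, and every function name are assumptions made for this example.

```python
from collections import deque

# Assumed constant for turning a tree distance into a bandwidth estimate.
SCALE = 1000.0


def make_tree(edges):
    """Build an adjacency map from (u, v, weight) triples."""
    adj = {}
    for u, v, w in edges:
        adj.setdefault(u, []).append((v, w))
        adj.setdefault(v, []).append((u, w))
    return adj


def tree_distance(adj, src, dst):
    """Sum of edge weights on the unique path from src to dst in a tree."""
    dist = {src: 0.0}
    pending = deque([src])
    while pending:
        node = pending.popleft()
        if node == dst:
            return dist[node]
        for nbr, w in adj[node]:
            if nbr not in dist:
                dist[nbr] = dist[node] + w
                pending.append(nbr)
    raise ValueError("dst is not reachable from src")


def predict_bandwidth(adj, a, b):
    """Assumed transform: predicted bandwidth is SCALE divided by tree distance."""
    if a == b:
        raise ValueError("prediction is only defined for two distinct hosts")
    return SCALE / tree_distance(adj, a, b)


# Toy tree: hosts h1, h2, h3 hang off inner nodes x and y.
toy = make_tree([("h1", "x", 2.0), ("h2", "x", 3.0),
                 ("x", "y", 5.0), ("h3", "y", 10.0)])
```

With this toy tree, `predict_bandwidth(toy, "h1", "h2")` divides `SCALE` by the path length 2.0 + 3.0. The appeal of such an embedding is that n hosts need only a tree's worth of edge weights rather than all n(n-1)/2 pairwise measurements.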
acm sigplan symposium on principles and practice of parallel programming | 2014
Sukhyun Song; Jeffrey K. Hollingsworth
This paper presents a method to design and auto-tune a new parallel 3-D FFT code using the non-blocking MPI all-to-all operation. We achieve high performance by optimizing computation-communication overlap. Our code performs fully asynchronous communication without any support from special hardware. We also improve cache performance through loop tiling. To cope with the complex trade-off regarding our optimization techniques, we parameterize our code and auto-tune the parameters efficiently in a large parameter space. Experimental results from two systems confirm that our code achieves a speedup of up to 1.76x over the FFTW library.
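Loop tiling, one of the optimizations named above, restructures nested loops so that work proceeds one small block at a time. The sketch below shows only the loop structure, using a matrix transpose as a stand-in kernel; it is not taken from the paper's FFT code, and pure Python will not show the cache benefit that a compiled kernel would. The function name and the `tile` parameter are hypothetical.

```python
def transpose_tiled(matrix, tile=2):
    """Transpose a rectangular list-of-lists matrix one tile at a time."""
    rows, cols = len(matrix), len(matrix[0])
    out = [[None] * rows for _ in range(cols)]
    for i0 in range(0, rows, tile):        # tile origin along rows
        for j0 in range(0, cols, tile):    # tile origin along columns
            # Visit every element inside the current tile; min() handles
            # partial tiles at the matrix edges.
            for i in range(i0, min(i0 + tile, rows)):
                for j in range(j0, min(j0 + tile, cols)):
                    out[j][i] = matrix[i][j]
    return out


result = transpose_tiled([[1, 2, 3], [4, 5, 6]])
```

The tile size is the kind of knob an auto-tuner would search over, since the best value depends on the cache sizes of the target machine.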
international conference on conceptual structures | 2015
Abhinav Sarje; Sukhyun Song; Douglas W. Jacobsen; Kevin A. Huck; Jeffrey K. Hollingsworth; Allen D. Malony; Samuel Williams; Leonid Oliker
This paper addresses two key parallelization challenges in the unstructured mesh-based ocean modeling code MPAS-Ocean, which uses a mesh based on Voronoi tessellations: (1) load imbalance across processes, and (2) unstructured data access patterns that inhibit intra- and inter-node performance. Our work analyzes the load imbalance due to naive partitioning of the mesh, and develops methods to generate mesh partitioning with better load balance and reduced communication. Furthermore, we present methods that minimize both inter- and intra-node data movement and maximize data reuse. Our techniques include predictive ordering of data elements for higher cache efficiency, as well as communication reduction approaches. We present detailed performance data when running on thousands of cores using the Cray XC30 supercomputer and show that our optimization strategies can exceed the original performance by over 2×. Additionally, many of these solutions can be broadly applied to a wide variety of unstructured grid-based computations.
international conference on distributed computing systems | 2011
Sukhyun Song; Peter J. Keleher; Alan Sussman
Data-intensive distributed applications can increase their performance by running on a cluster of hosts connected via high-bandwidth interconnections. However, there is no effective method to find such a bandwidth-constrained cluster in a decentralized fashion. Our work is inspired by prior work that treats Internet bandwidth as an approximate tree metric space. This paper presents a decentralized, accurate, and efficient method to find a cluster of Internet hosts, given the desired cluster size and minimum interconnection bandwidth. We describe a centralized polynomial time algorithm for a tree metric space, along with a proof of correctness. We then provide a decentralized version of the algorithm. Simulation experiments with two real-world datasets confirm that our clustering approach achieves high accuracy and scalability. We also discuss the costs of decentralization and how the treeness of the dataset affects clustering accuracy.
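The "treeness" of a dataset mentioned above can be quantified in several ways. One standard test is the four-point condition: a metric is a tree metric exactly when, for every four points, the two largest of the three pairwise-sum combinations are equal. The sketch below measures how far a distance matrix is from satisfying it. Whether the paper uses this particular measure is not stated in the abstract, and the function names are hypothetical.

```python
from itertools import combinations


def four_point_violation(d, w, x, y, z):
    """Gap between the two largest pair sums; 0 for a perfect tree metric."""
    sums = sorted([d[w][x] + d[y][z],
                   d[w][y] + d[x][z],
                   d[w][z] + d[x][y]])
    return sums[2] - sums[1]


def max_violation(d):
    """Largest four-point violation over all quadruples of points."""
    quads = combinations(range(len(d)), 4)
    return max((four_point_violation(d, *q) for q in quads), default=0)


# Distances between four leaves of a small tree: leaves 0 and 1 share one
# inner node, leaves 2 and 3 share another, and the inner nodes are linked.
tree_like = [[0, 3, 9, 10],
             [3, 0, 10, 11],
             [9, 10, 0, 7],
             [10, 11, 7, 0]]

# Shortest-path distances on a 4-cycle with unit edges, which is not a tree.
cycle = [[0, 1, 2, 1],
         [1, 0, 1, 2],
         [2, 1, 0, 1],
         [1, 2, 1, 0]]
```

Here `max_violation(tree_like)` is 0 while `max_violation(cycle)` is 2, so a larger value indicates data that fits a tree less well.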
Proceedings of the 5th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems | 2014
Sukhyun Song; Jeffrey K. Hollingsworth
This paper describes a new method for scalable high-performance parallel 3-D FFT. We use a 2-D decomposition of 3-D arrays to increase scaling to a large number of cores. In order to achieve high performance, we use non-blocking MPI all-to-all operations and exploit computation-communication overlap. We also auto-tune our 3-D FFT code efficiently in a large parameter space and cope with the complex trade-off in optimizing our code in various system environments. According to experimental results with up to 32,768 cores, our method computes parallel 3-D FFT faster than the FFTW library by up to 1.83×.
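The scaling benefit of a 2-D decomposition comes from simple counting: splitting an N×N×N array along one axis can keep at most N processes busy, while splitting along two axes allows up to N² of them. The sketch below computes which sub-array a given rank would own under such a decomposition. The row-major rank-to-grid mapping, the choice of which axis stays whole, and the function names are assumptions for illustration, not details from the paper.

```python
def block_range(n, parts, index):
    """Half-open range [start, stop) of block `index` when n items are split
    into `parts` nearly equal blocks (earlier blocks take the remainder)."""
    base, extra = divmod(n, parts)
    start = index * base + min(index, extra)
    stop = start + base + (1 if index < extra else 0)
    return start, stop


def pencil_for_rank(rank, shape, grid):
    """Index ranges owned by `rank` under a 2-D decomposition of a 3-D array.

    shape = (nx, ny, nz); grid = (p_rows, p_cols). The x axis stays whole,
    y is split across process rows, and z across process columns.
    """
    nx, ny, nz = shape
    p_rows, p_cols = grid
    row, col = divmod(rank, p_cols)  # assumed row-major rank layout
    return (0, nx), block_range(ny, p_rows, row), block_range(nz, p_cols, col)


owned = pencil_for_rank(5, shape=(8, 8, 8), grid=(2, 4))
```

Because each rank holds the full x extent, 1-D FFTs along x need no communication; transforming along the other two axes is what requires the all-to-all exchanges described above.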
international parallel and distributed processing symposium | 2008
Kyung Dong Ryu; David Daly; Mary R. Seminara; Sukhyun Song; Paul G. Crumley
System management solutions are designed to scale to thousands or more machines and networked devices. However, it is challenging to test and verify the proper operation and scalability of management software given the limited resources of a testing lab. We have developed a method called agent multiplication, in which one physical testing machine is used to represent hundreds of client machines. This provides the necessary client load to test the performance and scalability of the management software and server within limited resources. In addition, our approach guarantees that the test environment remains consistent between test runs, ensuring that test results can be meaningfully compared. We used agent multiplication to test and verify the operation of a server managing 4,000 systems, exercising the server functions with only 8 test machines. By applying this test environment to an early version of a real enterprise system management solution, we were able to uncover critical bugs, resolve race conditions, and examine and adjust thread prioritization levels for improved performance.
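The figures above imply 4,000 / 8 = 500 simulated clients per test machine. The abstract does not say how the agents are multiplied, so the sketch below is only one hypothetical way to picture it: many lightweight agents, each with its own identity, run as threads on one machine and report to a server stand-in. The thread-per-agent design, the in-process queue, and all names are assumptions.

```python
import queue
import threading

TOTAL_SYSTEMS = 4000   # from the abstract
TEST_MACHINES = 8      # from the abstract
AGENTS_PER_MACHINE = TOTAL_SYSTEMS // TEST_MACHINES


def run_simulated_agents(count, server_inbox):
    """Start `count` agents on this machine; each reports once to the server."""

    def agent(agent_id):
        # A real agent would speak the management protocol over the network;
        # here it just posts a status message to an in-process queue.
        server_inbox.put({"agent": agent_id, "status": "ok"})

    threads = [threading.Thread(target=agent, args=(f"agent-{i}",))
               for i in range(count)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


inbox = queue.Queue()
run_simulated_agents(AGENTS_PER_MACHINE, inbox)
reports = [inbox.get() for _ in range(inbox.qsize())]
```

From the server's point of view, the single machine now looks like 500 distinct clients, which is the load-generation effect the abstract describes.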
Journal of Computational Science | 2016
Sukhyun Song; Jeffrey K. Hollingsworth
Parallel 3-D FFT is widely used in scientific applications, so it is important to achieve high performance on large-scale systems with many thousands of computing cores. This paper describes a new method for scalable high-performance parallel 3-D FFT. We use a 2-D decomposition of 3-D arrays to increase scaling to a large number of cores. In order to achieve high performance, we use non-blocking MPI all-to-all operations and exploit computation-communication overlap. We also auto-tune our 3-D FFT code efficiently in a large parameter space and cope with the complex trade-off in optimizing our code in various system environments. According to experimental results from two systems, our method computes parallel 3-D FFT significantly faster than three existing libraries, and scales well to at least 32,768 compute cores.
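Auto-tuning of the kind described above can be pictured as a search over a parameter space for the setting with the lowest measured cost. The sketch below uses exhaustive grid search, which is the simplest baseline and not the efficient search the paper refers to. The parameter names (`tile`, `overlap_chunks`) and the synthetic cost function, which stands in for a timed run of the real code, are hypothetical.

```python
import itertools


def grid_search(space, cost):
    """Return (best_params, best_cost) after trying every combination."""
    names = sorted(space)
    best_params, best_cost = None, float("inf")
    for values in itertools.product(*(space[name] for name in names)):
        params = dict(zip(names, values))
        measured = cost(params)
        if measured < best_cost:
            best_params, best_cost = params, measured
    return best_params, best_cost


# Hypothetical tunable parameters for an FFT-like kernel.
space = {"tile": [8, 16, 32, 64], "overlap_chunks": [1, 2, 4, 8]}


def synthetic_cost(params):
    """Stand-in for a timed run; lowest at tile=32, overlap_chunks=4."""
    return abs(params["tile"] - 32) + 3 * abs(params["overlap_chunks"] - 4)


best, best_cost = grid_search(space, synthetic_cost)
```

Grid search costs one run per combination, which grows multiplicatively with each added parameter; that growth is why large parameter spaces call for a smarter search strategy.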
ieee international symposium on parallel & distributed processing, workshops and phd forum | 2011
Sukhyun Song
My PhD research addresses how to exploit network bandwidth information and increase the performance of data-intensive wide-area distributed applications. The goal is to solve four specific problems: i) design a decentralized algorithm for network bandwidth prediction, ii) design a decentralized algorithm to find bandwidth-constrained clusters, iii) design a decentralized algorithm to find bandwidth-constrained centroids, and iv) develop a wide-area MapReduce system with optimized data locality as an application of the three algorithms for bandwidth prediction and node search.
Archive | 2014
Sukhyun Song; Jeffrey K. Hollingsworth