Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Sang Woo Jun is active.

Publication


Featured research published by Sang Woo Jun.


International Symposium on Computer Architecture | 2015

BlueDBM: an appliance for big data analytics

Sang Woo Jun; Ming Liu; Sungjin Lee; Jamey Hicks; John Ankcorn; Myron King; Shuotao Xu; Arvind

Complex data queries, because of their need for random accesses, have proven to be slow unless all the data can be accommodated in DRAM. There are many domains, such as genomics, geological data, and daily Twitter feeds, where the datasets of interest are 5 TB to 20 TB. For such a dataset, one would need a cluster with 100 servers, each with 128 GB to 256 GB of DRAM, to accommodate all the data in DRAM. On the other hand, such datasets could be stored easily in the flash memory of a rack-sized cluster. Flash storage has much better random access performance than hard disks, which makes it desirable for analytics workloads. In this paper we present BlueDBM, a new system architecture which has flash-based storage with in-store processing capability and a low-latency, high-throughput inter-controller network. We show that BlueDBM outperforms a flash-based system without these features by a factor of 10 for some important applications. While the performance of a RAM-cloud system falls sharply even if only 5% to 10% of the references are to secondary storage, this sharp performance degradation is not an issue in BlueDBM. BlueDBM presents an attractive point in the cost-performance trade-off for Big Data analytics.
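
The cluster-sizing argument above is a back-of-envelope calculation; the sketch below redoes it explicitly. The 20 TB dataset and 256 GB of DRAM per server come from the abstract, while the 1 TB of flash per node is an assumed figure used only for illustration.

```python
# Back-of-envelope sizing for the DRAM-vs-flash argument above.
# Dataset size and DRAM capacity follow the abstract; the per-node flash
# capacity is an assumption used only for illustration.
import math

TB = 1024  # GB per TB

def servers_needed(dataset_gb: float, capacity_per_server_gb: float) -> int:
    """Smallest number of servers whose combined capacity holds the dataset."""
    return math.ceil(dataset_gb / capacity_per_server_gb)

dataset_gb = 20 * TB             # 20 TB, the upper end of the range cited above
dram_per_server_gb = 256         # 256 GB of DRAM per server
flash_per_server_gb = 1 * TB     # assumed 1 TB of flash per node

print("all-DRAM cluster:", servers_needed(dataset_gb, dram_per_server_gb), "servers")
print("flash-based rack:", servers_needed(dataset_gb, flash_per_server_gb), "servers")
```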


Parallel Computing Technologies | 2009

Visualizing Potential Deadlocks in Multithreaded Programs

Byung-Chul Kim; Sang Woo Jun; Dae Joon Hwang; Yong-Kee Jun

It is important to analyze and identify potential deadlocks resident in multithreaded programs from a successful deadlock-free execution, because the nondeterministic nature of such programs may hide the errors during testing. Visualizing the runtime behavior of locking operations makes it possible to debug such errors effectively, because it provides an intuitive understanding of the different feasible executions caused by nondeterminism. However, with previous visualization techniques it is hard to capture alternate orders imposed by locks, because they represent only a partial order over locking operations. This paper presents a novel graph, called the lock-causality graph, which represents alternate orders over locking operations. A visualization tool implements the graph, and its power is demonstrated on the classical dining-philosophers problem written in Java. The experimental results show that the graph provides a simple but powerful representation of potential deadlocks in an execution instance that did not deadlock.
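
The lock-causality graph itself is defined in the paper; as a simpler point of reference, the sketch below applies the classic lock-order-graph idea to a deadlock-free trace of two philosophers and reports a potential deadlock when the nesting order of locks forms a cycle. The trace format and function names are illustrative assumptions, not the paper's tool.

```python
# Illustrative (not the paper's lock-causality graph): detect a potential
# deadlock from a deadlock-free trace by finding a cycle in the lock-order
# graph, i.e. an edge a -> b whenever lock b is acquired while a is held.
from collections import defaultdict

def lock_order_graph(trace):
    """trace: list of (thread_id, op, lock) with op in {'acquire', 'release'}."""
    held = defaultdict(list)           # locks currently held, per thread
    edges = defaultdict(set)           # lock-order edges
    for tid, op, lock in trace:
        if op == "acquire":
            for outer in held[tid]:
                edges[outer].add(lock)
            held[tid].append(lock)
        else:
            held[tid].remove(lock)
    return edges

def has_cycle(edges):
    """Standard DFS cycle check over the lock-order graph."""
    WHITE, GREY, BLACK = 0, 1, 2
    color = defaultdict(int)
    def dfs(u):
        color[u] = GREY
        for v in edges[u]:
            if color[v] == GREY or (color[v] == WHITE and dfs(v)):
                return True
        color[u] = BLACK
        return False
    return any(color[u] == WHITE and dfs(u) for u in list(edges))

# Two philosophers picking up forks in opposite orders: the run below never
# deadlocks, but the cyclic lock order reveals that it could.
trace = [
    ("P1", "acquire", "forkA"), ("P1", "acquire", "forkB"),
    ("P1", "release", "forkB"), ("P1", "release", "forkA"),
    ("P2", "acquire", "forkB"), ("P2", "acquire", "forkA"),
    ("P2", "release", "forkA"), ("P2", "release", "forkB"),
]
print("potential deadlock:", has_cycle(lock_order_graph(trace)))  # True
```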


Field-Programmable Technology | 2012

ZIP-IO: Architecture for application-specific compression of Big Data

Sang Woo Jun; Kermin Fleming; Michael Adler; Joel S. Emer

We have entered the “Big Data” age: scaling of networks and sensors has led to exponentially increasing amounts of data. Compression is an effective way to deal with many of these large data sets, and application-specific compression algorithms have become popular in problems with large working sets. Unfortunately, these compression algorithms are often computationally difficult and can result in application-level slow-down when implemented in software. To address this issue, we investigate ZIP-IO, a framework for FPGA-accelerated compression. Using this system we demonstrate that an unmodified industrial software workload can be accelerated 3× while simultaneously achieving more than 1000× compression in its data set.
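
As one concrete, hypothetical example of application-specific compression (not ZIP-IO's algorithm), monotone data such as timestamps or sorted IDs compresses far better when delta-encoded before a generic codec; exploiting this kind of structure is what makes such algorithms both effective and computationally heavy enough to be worth offloading.

```python
# Illustrative application-specific compression (not ZIP-IO's algorithm):
# delta-encode monotone values before handing them to a generic codec.
import zlib

values = list(range(0, 1_000_000, 7))                 # monotone data, e.g. timestamps
raw = b"".join(v.to_bytes(8, "little") for v in values)

deltas = [values[0]] + [b - a for a, b in zip(values, values[1:])]
delta_bytes = b"".join(d.to_bytes(8, "little") for d in deltas)

print("generic zlib on raw data: %5.1fx" % (len(raw) / len(zlib.compress(raw))))
print("delta-encode, then zlib:  %5.1fx" % (len(raw) / len(zlib.compress(delta_bytes))))
```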


Parallel Computing Technologies | 2009

A Tool for Detecting First Races in OpenMP Programs

Mun-Hye Kang; Ok-Kyoon Ha; Sang Woo Jun; Yong-Kee Jun

First race detection is especially important for effective debugging, because removing such races may make other affected races disappear. Previous tools cannot guarantee that the detected races are the first races to occur. We present a new tool that detects first races in programs with nested parallelism using a two-pass on-the-fly technique. To show its accuracy, we empirically compare our tool with previous tools on a set of synthetic programs with OpenMP directives.


Very Large Data Bases | 2016

BlueCache: a scalable distributed flash-based key-value store

Shuotao Xu; Sungjin Lee; Sang Woo Jun; Ming Liu; Jamey Hicks; Arvind

A key-value store (KVS), such as memcached or Redis, is widely used as a caching layer to augment the slower persistent backend storage in data centers. A DRAM-based KVS provides fast key-value access, but its scalability is limited by the cost, power, and space needed by the machine cluster to support a large amount of DRAM. This paper offers a 10X to 100X cheaper solution based on flash storage and hardware accelerators. In BlueCache, key-value pairs are stored in flash storage and all KVS operations, including the flash controller, are implemented directly in hardware. Furthermore, BlueCache includes a fast interconnect between flash controllers to provide a scalable solution. We show that BlueCache has 4.18X higher throughput and consumes 25X less power than a flash-backed KVS software implementation on x86 servers. We further show that BlueCache can outperform a DRAM-based KVS when the latter has more than 7.4% misses for a read-intensive application. BlueCache is an attractive solution for both rack-level appliances and data-center-scale key-value caches.
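
The 7.4% figure above comes from the paper's measurements; the sketch below shows the simple average-latency model behind that kind of break-even point, using purely assumed latencies for the DRAM-based KVS, the hardware flash KVS, and the backend store.

```python
# Sketch of the average-latency model behind the miss-rate argument above.
# All latency numbers are illustrative assumptions, not measured values.

dram_kvs_hit_us  = 100.0     # assumed DRAM-based KVS access (network + software)
flash_kvs_get_us = 200.0     # assumed hardware flash KVS access
backend_miss_us  = 10_000.0  # assumed cost of a miss served by the backend store

def avg_latency(hit_us: float, miss_us: float, miss_rate: float) -> float:
    """Average access latency for a cache with the given hit/miss costs."""
    return (1.0 - miss_rate) * hit_us + miss_rate * miss_us

# Break-even miss rate: the point where the DRAM cache's misses to the slow
# backend erase its latency advantage over a flash KVS assumed large enough
# to hold the whole working set (i.e. effectively no misses).
break_even = (flash_kvs_get_us - dram_kvs_hit_us) / (backend_miss_us - dram_kvs_hit_us)
print("break-even miss rate: %.1f%%" % (100 * break_even))

# Average DRAM-KVS latency at a few miss rates, versus the flat flash latency.
for miss_rate in (0.01, 0.05, 0.10):
    print(miss_rate, avg_latency(dram_kvs_hit_us, backend_miss_us, miss_rate))
```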


Design, Automation, and Test in Europe | 2016

minFlash: A minimalistic clustered flash array

Ming Liu; Sang Woo Jun; Sungjin Lee; Jamey Hicks; Arvind

NAND flash is seeing increasing adoption in the data center because of its orders-of-magnitude lower latency and higher bandwidth compared to hard disks. However, flash performance is often degraded by (i) an inefficient storage I/O stack that hides flash characteristics under the Flash Translation Layer (FTL), and (ii) long-latency network protocols for distributed storage. In this paper, we propose a minimalistic clustered flash array (minFlash). First, minFlash exposes a simple, stable, error-free, shared-memory flash interface that enables the host to perform cross-layer flash management optimizations in file systems, databases, and other user applications. Second, minFlash uses a controller-to-controller network to connect multiple flash drives with very little overhead. We envision minFlash being used within a rack cluster of servers to provide fast, scalable distributed flash storage. We show through benchmarks that minFlash can access both local and remote flash devices with negligible latency overhead, and that it can expose near the theoretical maximum performance of the NAND chips in a distributed setting.
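
Below is a hypothetical sketch of the kind of thin page-level interface described above (names, page size, and semantics are assumptions, not minFlash's actual API): the device exposes raw pages and erase blocks, and cross-layer management decisions are left to host software such as the file system or database.

```python
# Hypothetical page-level flash interface (illustrative assumption, not
# minFlash's API). The host sees raw pages and erase blocks and performs its
# own allocation and wear decisions instead of relying on an FTL.

PAGE_SIZE = 8192          # assumed NAND page size in bytes
PAGES_PER_BLOCK = 256     # assumed pages per erase block

class FlashDevice:
    """In-memory stand-in for one (possibly remote) flash drive in the cluster."""
    def __init__(self, num_blocks: int):
        self.blocks = [[None] * PAGES_PER_BLOCK for _ in range(num_blocks)]

    def read_page(self, block: int, page: int) -> bytes | None:
        return self.blocks[block][page]

    def write_page(self, block: int, page: int, data: bytes) -> None:
        # NAND pages cannot be overwritten in place; the block must be erased first.
        assert self.blocks[block][page] is None, "page must be erased before rewrite"
        self.blocks[block][page] = data[:PAGE_SIZE]

    def erase_block(self, block: int) -> None:
        self.blocks[block] = [None] * PAGES_PER_BLOCK

dev = FlashDevice(num_blocks=4)
dev.write_page(0, 0, b"hello")
print(dev.read_page(0, 0))
```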


Field-Programmable Custom Computing Machines | 2017

Terabyte Sort on FPGA-Accelerated Flash Storage

Sang Woo Jun; Shuotao Xu; Arvind

Sorting is one of the most fundamental and useful applications in computer science, and continues to be an important tool in analyzing large datasets. An important and challenging subclass of sorting problems involves sorting terabyte-scale datasets with hundreds of billions of records. The conventional method of sorting such large amounts of data is to distribute the data and computation over a cluster of machines. Such solutions can be fast but are often expensive and power-hungry. In this paper, we propose a solution based on flash storage connected to a collection of FPGA-based sorting accelerators that perform large-scale merge-sort in storage. The accelerators include highly efficient sorting networks and merge trees that use bitonic sorting to emit multiple sorted values every cycle. We show that by appropriate use of accelerators we can remove all the computation bottlenecks, so that the end-to-end sorting performance is limited only by the flash storage bandwidth. We demonstrate that our flash-based system matches the performance of existing distributed-cluster solutions of much larger scale. More importantly, our prototype is able to show almost twice the power efficiency compared to the existing JouleSort record holder. An optimized system with less wasteful components is projected to be four times more efficient compared to the current record holder, sorting over 200,000 records per joule of energy.
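
The small software model of a bitonic sorting network below illustrates why this style of sorting maps well to hardware: the compare-exchange pattern is fixed and data-independent, so stages can be pipelined to emit several sorted values per cycle. It is a sketch of the general technique, not the paper's accelerator.

```python
# Software model of a bitonic sorting network (illustrative only; the paper's
# accelerator implements this kind of network in hardware). The sequence of
# compare-exchange operations does not depend on the data, which is what makes
# it attractive for a pipelined FPGA implementation.

def bitonic_sort(values, ascending=True):
    """Sort a list whose length is assumed to be a power of two."""
    n = len(values)
    if n <= 1:
        return list(values)
    half = n // 2
    first = bitonic_sort(values[:half], True)     # ascending half
    second = bitonic_sort(values[half:], False)   # descending half -> bitonic sequence
    return bitonic_merge(first + second, ascending)

def bitonic_merge(values, ascending):
    n = len(values)
    if n <= 1:
        return list(values)
    half = n // 2
    values = list(values)
    for i in range(half):                         # one data-independent compare-exchange stage
        if (values[i] > values[i + half]) == ascending:
            values[i], values[i + half] = values[i + half], values[i]
    return (bitonic_merge(values[:half], ascending)
            + bitonic_merge(values[half:], ascending))

print(bitonic_sort([7, 3, 15, 1, 9, 12, 0, 5]))   # [0, 1, 3, 5, 7, 9, 12, 15]
```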


International Symposium on Computer Architecture | 2018

GraFBoost: using accelerated flash storage for external graph analytics

Sang Woo Jun; Andy Wright; Sizhuo Zhang; Shuotao Xu; Arvind

We describe GraFBoost, a flash-based architecture with hardware acceleration for external analytics of multi-terabyte graphs. We compare the performance of GraFBoost with 1 GB of DRAM against various state-of-the-art graph analytics software, including FlashGraph, running on a 32-thread Xeon server with 128 GB of DRAM. We demonstrate that despite the relatively small amount of DRAM, GraFBoost achieves high performance with very large graphs that no other system can handle, and rivals the performance of the fastest software platforms on graph sizes that existing platforms can handle. Unlike in-memory and semi-external systems, GraFBoost uses a constant amount of memory for all problems, and its performance decreases very slowly as graph sizes increase, allowing GraFBoost to scale to much larger problems than is possible with existing systems while using far fewer resources on a single-node system. The key component of GraFBoost is the sort-reduce accelerator, which implements a novel method to sequentialize fine-grained random accesses to flash storage. The sort-reduce accelerator logs random update requests and then uses hardware-accelerated external sorting with interleaved reduction functions. GraFBoost also lazily stores newly updated vertex values generated in each superstep of the algorithm alongside the old vertex values to further reduce I/O traffic. We evaluate the performance of GraFBoost for PageRank, breadth-first search, and betweenness centrality on our FPGA-based prototype (Xilinx VC707 with 1 GB DRAM and 1 TB flash) and compare it to other graph processing systems, including a pure software implementation of GraFBoost.
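
The sort-reduce description above translates naturally into a short software sketch, shown below under the assumption that updates are (vertex, value) pairs combined with a commutative, associative reduction such as addition; it illustrates the idea only, not the hardware design.

```python
# Software sketch of the sort-reduce idea described above: instead of issuing
# fine-grained random updates to storage, log them, sort by destination vertex,
# and apply the reduction while merging, so flash sees only large sequential
# reads and writes. Illustrative only, not the hardware implementation.
from operator import add

def sort_reduce(update_log, reduce_fn=add):
    """update_log: iterable of (vertex_id, value) pairs produced in one superstep."""
    reduced = []
    for vertex, value in sorted(update_log, key=lambda kv: kv[0]):
        if reduced and reduced[-1][0] == vertex:
            # Interleave the reduction with the merge instead of materializing
            # every duplicate key in the sorted stream.
            reduced[-1] = (vertex, reduce_fn(reduced[-1][1], value))
        else:
            reduced.append((vertex, value))
    return reduced   # one sequential stream of (vertex, reduced value) pairs

updates = [(3, 0.1), (1, 0.5), (3, 0.2), (7, 0.4), (1, 0.25)]
print(sort_reduce(updates))   # 1 -> 0.75, 3 -> 0.3, 7 -> 0.4 (up to float rounding)
```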


IEEE High Performance Extreme Computing Conference | 2016

In-storage embedded accelerator for sparse pattern processing

Sang Woo Jun; Huy T. Nguyen; Vijay Gadepally; Arvind

We present a novel architecture for sparse pattern processing, using flash storage with embedded accelerators. Sparse pattern processing on large data sets is the essence of applications such as document search, natural language processing, bioinformatics, subgraph matching, machine learning, and graph processing. One slice of our prototype accelerator is capable of handling up to 1TB of data, and experiments show that it can outperform C/C++ software solutions on a 16-core system at a fraction of the power and cost; an optimized version of the accelerator can match the performance of a 48-core server.


ACM Transactions on Computer Systems | 2016

BlueDBM: Distributed Flash Storage for Big Data Analytics

Sang Woo Jun; Ming Liu; Sungjin Lee; Jamey Hicks; John Ankcorn; Myron King; Shuotao Xu; Arvind

Complex data queries, because of their need for random accesses, have proven to be slow unless all the data can be accommodated in DRAM. There are many domains, such as genomics, geological data, and daily Twitter feeds, where the datasets of interest are 5TB to 20TB. For such a dataset, one would need a cluster with 100 servers, each with 128GB to 256GB of DRAM, to accommodate all the data in DRAM. On the other hand, such datasets could be stored easily in the flash memory of a rack-sized cluster. Flash storage has much better random access performance than hard disks, which makes it desirable for analytics workloads. However, currently available off-the-shelf flash storage packaged as SSDs does not make effective use of flash storage because it incurs a great amount of additional overhead during flash device management and network access. In this article, we present BlueDBM, a new system architecture that has flash-based storage with in-store processing capability and a low-latency high-throughput intercontroller network between storage devices. We show that BlueDBM outperforms a flash-based system without these features by a factor of 10 for some important applications. While the performance of a DRAM-centric system falls sharply even if only 5% to 10% of the references are to secondary storage, this sharp performance degradation is not an issue in BlueDBM. BlueDBM presents an attractive point in the cost/performance tradeoff for Big Data analytics.

Collaboration


Dive into Sang Woo Jun's collaborations.

Top Co-Authors

Arvind
Massachusetts Institute of Technology

Shuotao Xu
Massachusetts Institute of Technology

Ming Liu
Massachusetts Institute of Technology

Sungjin Lee
Massachusetts Institute of Technology

Jamey Hicks
Massachusetts Institute of Technology

Andy Wright
Massachusetts Institute of Technology

Myron King
Massachusetts Institute of Technology

Sizhuo Zhang
Massachusetts Institute of Technology

Jihong Kim
Seoul National University