Is this you? Create Your Porfile

Keun Sup Shim

Massachusetts Institute of Technology

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Keun Sup Shim is active.

Explore More

Publication

Featured researches published by Keun Sup Shim.

international symposium on performance analysis of systems and software | 2011

Scalable, accurate multicore simulation in the 1000-core era

Mieszko Lis; Pengju Ren; Myong Hyon Cho; Keun Sup Shim; Christopher W. Fletcher; Omer Khan; Srinivas Devadas

We present HORNET, a parallel, highly configurable, cycle-level multicore simulator based on an ingress-queued worm-hole router NoC architecture. The parallel simulation engine offers cycle-accurate as well as periodic synchronization; while preserving functional accuracy, this permits tradeoffs between perfect timing accuracy and high speed with very good accuracy. When run on 6 separate physical cores on a single die, speedups can exceed a factor of over 5, and when run on a two-die 12-core system with 2-way hyperthreading, speedups exceed 11 ×. Most hardware parameters are configurable, including memory hierarchy, interconnect geometry, bandwidth, crossbar dimensions, and parameters driving power and thermal effects. A highly parametrized table-based NoC design allows a variety of routing and virtual channel allocation algorithms out of the box, ranging from simple DOR routing to complex Valiant, ROMM, or PROM schemes, BSOR, and adaptive routing. HORNET can run in network-only mode using synthetic traffic or traces, directly emulate a MIPS-based multicore, or function as the memory subsystem for native applications executed under the Pin instrumentation tool. HORNET is freely available under the open-source MIT license at http://csg.csail.mit.edu/hornet/.

IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems | 2012

HORNET: A Cycle-Level Multicore Simulator

Pengju Ren; Mieszko Lis; Myong Hyon Cho; Keun Sup Shim; Christopher W. Fletcher; Omer Khan; Nanning Zheng; Srinivas Devadas

We present hornet, a parallel, highly configurable, cycle-level multicore simulator based on an ingress-queued wormhole router network-on-chip (NoC) architecture. The parallel simulation engine offers cycle-accurate as well as periodic synchronization; while preserving functional accuracy, this permits tradeoffs between perfect timing accuracy and high speed with very good accuracy. When run on six separate physical cores on a single die, speedups can exceed a factor of over 5, and when run on a two-die 12-core system with 2-way hyperthreading, speedups exceed 12×. Most hardware parameters are configurable, including memory hierarchy, interconnect geometry, bandwidth, crossbar dimensions, parameters driving power, and thermal effects. A highly parametrized table-based NoC design allows a variety of routing and virtual channel allocation algorithms out of the box, ranging from simple dimension-ordered routing to complex Valiant, ROMM, O1Turn or PROM schemes, BSOR, and adaptive routing. Hornet can run in network-only mode using synthetic traffic or traces, or directly emulate a MIPS-based multicore. Hornet is freely available under the open-source MIT license at http://csg.csail.mit.edu/hornet/.

international conference on parallel architectures and compilation techniques | 2009

Oblivious Routing in On-Chip Bandwidth-Adaptive Networks

Myong Hyon Cho; Mieszko Lis; Keun Sup Shim; Michel A. Kinsy; Tina Wen; Srinivas Devadas

Oblivious routing can be implemented on simple router hardware, but network performance suffers when routes become congested. Adaptive routing attempts to avoid hot spots by re-routing flows, but requires more complex hardware to determine and configure new routing paths. We propose onchip bandwidth-adaptive networks to mitigate the performance problems of oblivious routing and the complexity issues of adaptive routing. In a bandwidth-adaptive network, the bisection bandwidth of network can adapt to changing network conditions. We describe one implementation of a bandwidth-adaptive network in the form of a two-dimensional mesh with adaptive bidirectional links, where the bandwidth of the link in one direction can be increased at the expense of the other direction. Efficient local intelligence is used to reconfigure each link, and this reconfiguration can be done very rapidly in response to changing traffic demands. We compare the hardware designs of a unidirectional and bidirectional link and evaluate the performance gains provided by a bandwidth-adaptive network in comparison to a conventional network under uniform and bursty traffic when oblivious routing is used.

networks on chips | 2009

Static virtual channel allocation in oblivious routing

Keun Sup Shim; Myong Hyon Cho; Michel A. Kinsy; Tina Wen; Mieszko Lis; G. Edward Suh; Srinivas Devadas

Most virtual channel routers have multiple virtual channels to mitigate the effects of head-of-line blocking. When there are more flows than virtual channels at a link, packets or flows must compete for channels, either in a dynamic way at each link or by static assignment computed before transmission starts. In this paper, we present methods that statically allocate channels to flows at each link when oblivious routing is used, and ensure deadlock freedom for arbitrary minimal routes when two or more virtual channels are available. We then experimentally explore the performance trade-offs of static and dynamic virtual channel allocation for various oblivious routing methods, including DOR, ROMM, Valiant and a novel bandwidth-sensitive oblivious routing scheme (BSORM). Through judicious separation of flows, static allocation schemes often exceed the performance of dynamic allocation schemes.

international conference on computer design | 2011

Memory coherence in the age of multicores

Mieszko Lis; Keun Sup Shim; Myong Hyon Cho; Srinivas Devadas

As we enter an era of exascale multicores, the question of efficiently supporting a shared memory model has become of paramount importance. On the one hand, programmers demand the convenience of coherent shared memory; on the other, growing core counts place higher demands on the memory subsystem and increasing on-chip distances mean that interconnect delays are becoming a significant part of memory access latencies. In this article, we first review the traditional techniques for providing a shared memory abstraction at the hardware level in multicore systems. We describe two new schemes that guarantee coherent shared memory without the complexity and overheads of a cache coherence protocol, namely execution migration and library cache coherence. We compare these approaches using an analytical model based on average memory latency, and give intuition for the strengths and weaknesses of each. Finally, we describe hybrid schemes that combine the strengths of different schemes.

networks on chips | 2011

Deadlock-free fine-grained thread migration

Myong Hyon Cho; Keun Sup Shim; Mieszko Lis; Omer Khan; Srinivas Devadas

Several recent studies have proposed fine-grained, hardware-level thread migration in multicores as a solution to power, reliability, and memory coherence problems. The need for fast thread migration has been well documented, however, a fast, deadlock-free migration protocol is sorely lacking: existing solutions either deadlock or are too slow and cumbersome to ensure performance with frequent, fine-grained thread migrations. In this study, we introduce the Exclusive Native Context (ENC) protocol, a general, provably deadlock-free migration protocol for instruction-level thread migration architectures. Simple to implement, ENC does not require additional hardware beyond common migration-based architectures. Our evaluation using synthetic migrations and the SPLASH-2 application suite shows that ENC offers performance within 11.7% of an idealized deadlock-free migration protocol with infinite resources.

network on chip architectures | 2009

Path-based, randomized, oblivious, minimal routing

Myong Hyon Cho; Mieszko Lis; Keun Sup Shim; Michel A. Kinsy; Srinivas Devadas

Path-based, Randomized, Oblivious, Minimal routing (PROM) is a family of oblivious, minimal, path-diverse routing algorithms especially suitable for Network-on-Chip applications with n×n mesh geometry. Rather than choosing among all possible paths at the source node, PROM algorithms achieve the same effect progressively through efficient, local randomized decisions at each hop. Routing is deadlock-free in all PROM algorithms when the routers have at least two virtual channels. While the approach we present can be viewed as a generalization of both ROMM and O1TURN routing, it combines the low hardware cost of O1TURN with the routing diversity offered by the most complex n-phase ROMM schemes. As all PROM algorithms employ the same hardware, a wide range of routing behaviors, from O1TURN-equivalent to uniformly path-diverse, can be effected by adjusting just one parameter, even while the network is live and continues to forward packets. Detailed simulation on a set of benchmarks indicates that, on equivalent hardware, the performance of PROM algorithms compares favorably to existing oblivious routing algorithms, including dimension-ordered routing, two-phase ROMM, and O1TURN.

IEEE Computer Architecture Letters | 2014

Thread Migration Prediction for Distributed Shared Caches

Keun Sup Shim; Mieszko Lis; Omer Khan; Srinivas Devadas

Chip-multiprocessors (CMPs) have become the mainstream parallel architecture in recent years; for scalability reasons, designs with high core counts tend towards tiled CMPs with physically distributed shared caches. This naturally leads to a Non-Uniform Cache Access (NUCA) design, where on-chip access latencies depend on the physical distances between requesting cores and home cores where the data is cached. Improving data locality is thus key to performance, and several studies have addressed this problem using data replication and data migration. In this paper, we consider another mechanism, hardware-level thread migration. This approach, we argue, can better exploit shared data locality for NUCA designs by effectively replacing multiple round-trip remote cache accesses with a smaller number of migrations. High migration costs, however, make it crucial to use thread migrations judiciously; we therefore propose a novel, on-line prediction scheme which decides whether to perform a remote access (as in traditional NUCA designs) or to perform a thread migration at the instruction level. For a set of parallel benchmarks, our thread migration predictor improves the performance by 24% on average over the shared-NUCA design that only uses remote accesses.

Parallel and distributed computing and systems | 2011

DIRECTORYLESS SHARED MEMORY COHERENCE USING EXECUTION MIGRATION

Mieszko Lis; Keun Sup Shim; Myong Hyon Cho; Omer Khan; Srinivas Devadas

We introduce the concept of deadlock-free migration-based coherent shared memory to the NUCA family of architectures. Migration-based architectures move threads among cores to guarantee sequential semantics in large multicores. Using a execution migration (EM) architecture, we achieve performance comparable to directory-based architectures without using directories: avoiding automatic data replication significantly reduces cache miss rates, while a fast network-level thread migration scheme takes advantage of shared data locality to reduce remote cache accesses that limit traditional NUCA performance. EM area and energy consumption are very competitive, and, on the average, it outperforms a directory-based MOESI baseline by 1.3 and a traditional S-NUCA design by 1.2 . We argue that with EM scaling performance has much lower cost and design complexity than in directorybased coherence and traditional NUCA architectures: by merely scaling network bandwidth from 256 to 512 bit flits, the performance of our architecture improves by an additional 13%, while the baselines show negligible improvement.

IEEE Transactions on Computers | 2013

Optimal and Heuristic Application-Aware Oblivious Routing

Michel A. Kinsy; Myong Hyon Cho; Keun Sup Shim; Mieszko Lis; Gookwon Edward Suh; Srinivas Devadas

Conventional oblivious routing algorithms do not take into account resource requirements (e.g., bandwidth, latency) of various flows in a given application. As they are not aware of flow demands that are specific to the application, network resources can be poorly utilized and cause serious local congestion. Also, flows, or packets, may share virtual channels in an undetermined way; the effects of head-of-line blocking may result in throughput degradation. In this paper, we present a framework for application-aware routing that assures deadlock freedom under one or more virtual channels by forcing routes to conform to an acyclic channel dependence graph. In addition, we present methods to statically and efficiently allocate virtual channels to flows or packets, under oblivious routing, when there are two or more virtual channels per link. Using the application-aware routing framework, we develop and evaluate a bandwidth-sensitive oblivious routing scheme that statically determines routes considering an applications communication characteristics. Given bandwidth estimates for flows, we present a mixed integer-linear programming (MILP) approach and a heuristic approach for producing deadlock-free routes that minimize maximum channel load. Our framework can be used to produce application-aware routes that target the minimization of latency, number of flows through a link, bandwidth, or any combination thereof. Our results show that it is possible to achieve better performance than traditional deterministic and oblivious routing schemes on popular synthetic benchmarks using our bandwidth-sensitive approach. We also show that, when oblivious routing is used and there are more flows than virtual channels per link, the static assignment of virtual channels to flows can help mitigate the effects of head-of-line blocking, which may impede packets that are dynamically competing for virtual channels. We experimentally explore the performance tradeoffs of static and dynamic virtual channel allocation on bandwidth-sensitive and traditional oblivious routing methods.

Explore More