Publication


Featured research published by Jee Ho Ryoo.


IEEE International Conference on High Performance Computing, Data, and Analytics | 2012

Containment domains: a scalable, efficient, and flexible resilience scheme for exascale systems

Jinsuk Chung; Ikhwan Lee; Michael B. Sullivan; Jee Ho Ryoo; Dong Wan Kim; Doe Hyun Yoon; Larry Kaplan; Mattan Erez

This paper describes and evaluates a scalable and efficient resilience scheme based on the concept of containment domains. Containment domains are a programming construct that enables applications to express resilience needs and to interact with the system to tune and specialize error detection, state preservation and restoration, and recovery schemes. Containment domains have weak transactional semantics and are nested to take advantage of the machine and application hierarchies and to enable hierarchical state preservation, restoration, and recovery. We evaluate the scalability and efficiency of containment domains using generalized trace-driven simulation and analytical modeling, and show that containment domains are superior to both checkpoint-restart and redundant-execution approaches.
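The preserve/detect/restore/re-execute cycle at the heart of a containment domain can be illustrated with a small sketch (a hypothetical API for illustration, not the authors' implementation; nesting would compose such domains so an inner fault recovers locally without rolling back the outer domain):

```python
import copy

class ContainmentDomain:
    """Toy model of one containment domain: preserve state on entry,
    check for errors after the body runs, and restore + re-execute on
    failure. A real implementation nests these hierarchically."""

    def __init__(self, state, max_retries=3):
        self.state = state            # mutable application state
        self.max_retries = max_retries

    def run(self, body, is_valid):
        for attempt in range(self.max_retries + 1):
            preserved = copy.deepcopy(self.state)   # state preservation
            body(self.state, attempt)               # execute the domain body
            if is_valid(self.state):                # error detection
                return self.state                   # success: commit
            self.state.clear()
            self.state.update(preserved)            # restoration
        raise RuntimeError("unrecoverable: retries exhausted")

# A body with a transient fault that corrupts state only on attempt 0.
def faulty_body(state, attempt):
    state["x"] = 42 if attempt > 0 else -1

cd = ContainmentDomain({"x": 0})
result = cd.run(faulty_body, is_valid=lambda s: s.get("x") == 42)
# the corrupted first attempt is rolled back; the retry succeeds
```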


IEEE International Conference on High Performance Computing, Data, and Analytics | 2015

Data partitioning strategies for graph workloads on heterogeneous clusters

Michael LeBeane; Shuang Song; Reena Panda; Jee Ho Ryoo; Lizy Kurian John

Large-scale graph analytics is an important class of problems in the modern data center. However, while data centers are trending towards a large number of heterogeneous processing nodes, graph analytics frameworks still operate under the assumption of uniform compute resources. In this paper, we develop heterogeneity-aware data ingress strategies for graph analytics workloads using the popular PowerGraph framework. We illustrate how simple estimates of relative node computational throughput can guide heterogeneity-aware data partitioning algorithms to provide balanced graph cutting decisions. Our work enhances five online data ingress strategies from a variety of sources to optimize application execution for throughput differences in heterogeneous data centers. The proposed partitioning algorithms improve the runtime of several popular machine learning and data mining applications by as much as 65% and on average by 32% as compared to the default, balanced partitioning approaches.
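The core idea, giving each node a share of the graph proportional to its estimated throughput, can be sketched as follows (a simplification for illustration; PowerGraph's actual ingress strategies also balance edge cuts and replication):

```python
def proportional_shares(total_vertices, throughputs):
    """Split a vertex count across heterogeneous nodes in proportion to
    each node's estimated computational throughput, using
    largest-remainder rounding so the shares sum exactly."""
    total = sum(throughputs)
    raw = [total_vertices * t / total for t in throughputs]
    shares = [int(r) for r in raw]
    # hand the rounding remainder to the largest fractional parts
    leftover = total_vertices - sum(shares)
    order = sorted(range(len(raw)), key=lambda i: raw[i] - shares[i],
                   reverse=True)
    for i in order[:leftover]:
        shares[i] += 1
    return shares

# A 4-node cluster where one node is twice as fast as the others:
shares = proportional_shares(1000, [2.0, 1.0, 1.0, 1.0])
# -> [400, 200, 200, 200]
```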


Symposium on Computer Architecture and High Performance Computing | 2015

Performance Characterization of Modern Databases on Out-of-Order CPUs

Reena Panda; Christopher Erb; Michael LeBeane; Jee Ho Ryoo; Lizy Kurian John

The big data revolution has created an unprecedented demand for intelligent data management solutions on a large scale. While data management has traditionally been a synonym for relational data processing, in recent years a new group of systems, popularly known as NoSQL databases, has emerged as a competitive alternative. There is a pressing need to gain greater understanding of the characteristics of modern databases in order to architect computers targeted at them. In this paper, we investigate four popular NoSQL/SQL-style databases and evaluate their hardware performance on modern computer systems. Based on data collected from real hardware, we evaluate how efficiently modern databases utilize the underlying systems and make several recommendations to improve their performance efficiency. We observe that the performance of modern databases is severely limited by poor cache/memory performance. Nonetheless, we demonstrate that dynamic execution techniques are still effective in hiding a significant fraction of the stalls, thereby improving performance. We further show that NoSQL databases suffer from greater performance inefficiencies than their SQL counterparts. SQL databases outperform NoSQL databases for most operations and are beaten by NoSQL databases only in a few cases. NoSQL databases provide a promising competitive alternative to SQL-style databases; however, they have yet to be optimized to fully reach the performance of contemporary SQL systems. We also show that significant diversity exists among different database implementations, and big-data benchmark designers can leverage our analysis to incorporate representative workloads that encapsulate the full spectrum of data-serving applications. In this paper, we also compare data-serving applications with other popular benchmarks such as SPEC CPU2006 and SPECjbb2005.


International Symposium on Low Power Electronics and Design | 2015

PowerTrain: A learning-based calibration of McPAT power models

Wooseok Lee; Youngchun Kim; Jee Ho Ryoo; Dam Sunwoo; Andreas Gerstlauer; Lizy Kurian John

As research on improving energy efficiency becomes prevalent, the need for a tool that accurately estimates power is increasing. Among the various tools proposed, McPAT has gained some popularity due to its easy-to-use analytical power models. However, McPAT's predictions have several limitations. Although under- and over-estimated power from unmodeled and mis-modeled parts can offset each other, each block still carries error. Moreover, the lack of awareness of implementation details exacerbates the prediction inaccuracies. To alleviate this problem, we propose a new methodology that trains McPAT towards precise processor power prediction using power measurements from real hardware. This calibration fits McPAT's power estimates to the target processor's power. Once the power consumption of each block is adjusted to best match that of the target processor, the trained McPAT delivers more precise power estimation. We calibrated the outputs of McPAT against a Cortex-A15 within a Samsung Exynos 5422 SoC. We observe that our methodology successfully reduces the errors, particularly for workloads with fluctuating power behavior. The results show that the mean percentage error and the mean absolute percentage error of the calibrated power against real hardware are 2.04 percent and 4.37 percent, respectively.
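The calibration idea, fitting per-block scaling factors so that the sum of scaled McPAT block estimates matches measured power, can be sketched as an ordinary least-squares problem (a simplification of the paper's learning-based approach; the two-block split and all numbers here are illustrative):

```python
def calibrate_two_blocks(estimates, measured):
    """Fit scaling factors (a, b) minimizing
    sum((a*core_i + b*cache_i - measured_i)^2)
    by solving the 2x2 normal equations directly."""
    s11 = sum(c * c for c, _ in estimates)
    s12 = sum(c * k for c, k in estimates)
    s22 = sum(k * k for _, k in estimates)
    r1 = sum(c * m for (c, _), m in zip(estimates, measured))
    r2 = sum(k * m for (_, k), m in zip(estimates, measured))
    det = s11 * s22 - s12 * s12
    a = (s22 * r1 - s12 * r2) / det
    b = (s11 * r2 - s12 * r1) / det
    return a, b

# McPAT-style (core, cache) estimates per workload; measured totals are
# synthesized here with true factors a=1.2, b=0.8 for demonstration.
est = [(1.0, 2.0), (2.0, 1.0), (3.0, 3.0)]
meas = [1.2 * c + 0.8 * k for c, k in est]
a, b = calibrate_two_blocks(est, meas)
# with noiseless data the fit recovers the true factors exactly
```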


International Conference on Parallel Processing | 2013

Flow Migration on Multicore Network Processors: Load Balancing While Minimizing Packet Reordering

Muhammad Iqbal; Jim Holt; Jee Ho Ryoo; Lizy Kurian John; Gustavo de Veciana

With ever-increasing network traffic rates, multicore architectures for network processors have successfully provided performance improvements through high parallelism. However, naively allocating network traffic to multiple cores without considering diverse applications and flow locality leads to packet reordering, load imbalance, and inefficient cache usage. These issues degrade the performance of latency-sensitive network processors by dropping packets or delivering packets out of order. In this paper, we propose a packet scheduling scheme that considers multiple dimensions of locality to improve the throughput of a network processor while minimizing out-of-order packets. Our scheduling policy maintains packet order by preserving flow locality, minimizes the migration of flows from one core to another by identifying aggressive flows, and partitions the cores among multiple services to gain instruction cache locality. The scheduler uses a novel low-cost two-level caching scheme to identify the top aggressive flows. Our lightweight hardware implementation shows a 60% improvement in the number of packets dropped and an 80% improvement in out-of-order packet deliveries over previously proposed techniques.
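The two-level caching idea for spotting aggressive (heavy-hitter) flows can be sketched as a bounded counter table that promotes a flow into a small exact cache once its packet count crosses a threshold (the thresholds, sizes, and eviction policy here are illustrative assumptions, not the paper's hardware parameters):

```python
class TwoLevelFlowTracker:
    """Level 1: small exact cache of identified aggressive flows.
    Level 2: bounded per-flow counter table for everything else;
    a flow is promoted to level 1 when its count crosses the threshold."""

    def __init__(self, threshold=3, l2_capacity=8):
        self.hot = set()      # level-1: aggressive flows
        self.counts = {}      # level-2: approximate counters
        self.threshold = threshold
        self.l2_capacity = l2_capacity

    def observe(self, flow_id):
        if flow_id in self.hot:
            return True                       # already aggressive
        self.counts[flow_id] = self.counts.get(flow_id, 0) + 1
        if self.counts[flow_id] >= self.threshold:
            self.hot.add(flow_id)             # promote to level 1
            del self.counts[flow_id]
            return True
        if len(self.counts) > self.l2_capacity:
            # evict the coldest counter to keep level 2 bounded
            victim = min(self.counts, key=self.counts.get)
            del self.counts[victim]
        return False

tracker = TwoLevelFlowTracker()
for pkt_flow in ["a", "b", "a", "c", "a", "b"]:
    tracker.observe(pkt_flow)
# flow "a" reached 3 packets and is now classified as aggressive
```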


International Symposium on Computer Architecture | 2017

Rethinking TLB Designs in Virtualized Environments: A Very Large Part-of-Memory TLB

Jee Ho Ryoo; Nagendra Gulur; Shuang Song; Lizy Kurian John

With the increasing deployment of virtual machines for cloud services and server applications, memory address translation overheads in virtualized environments have received great attention. In the radix-4 type of page tables used in x86 architectures, a TLB miss necessitates up to 24 memory references for a single guest-to-host translation. While dedicated page walk caches and other recent enhancements eliminate many of these memory references, our measurements on Intel Skylake processors indicate that many programs in virtualized mode of execution still spend hundreds of cycles on translations that do not hit in the TLBs. This paper presents an innovative scheme to reduce the cost of address translations by using a very large Translation Lookaside Buffer that is part of memory, the POM-TLB. In the POM-TLB, only one access is required instead of up to 24 accesses required in commonly used 2D walks with radix-4 type page tables. Even if many of the 24 accesses hit in the page walk caches, the aggregated cost of the many hits plus the overhead of occasional misses from page walk caches still exceeds the cost of one access to the POM-TLB. Since the POM-TLB is part of the memory space, TLB entries (as opposed to multiple page table entries) can be cached in large L2 and L3 data caches, yielding significant benefits. Through detailed evaluation running SPEC, PARSEC, and graph workloads, we demonstrate that the proposed POM-TLB improves performance by approximately 10% on average. The improvement is more than 16% for 5 of the benchmarks. It is further seen that a POM-TLB of 16MB size can eliminate nearly all TLB misses in 8-core systems.
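A back-of-the-envelope model makes the cost argument above concrete (the cycle counts and hit rates here are illustrative assumptions, not the paper's measurements):

```python
def walk_cost(accesses, pwc_hit_rate, pwc_hit_cycles, mem_cycles):
    """Expected cost of a 2D radix-4 page walk: each of up to 24
    references either hits the page walk cache or goes to memory."""
    return accesses * (pwc_hit_rate * pwc_hit_cycles
                       + (1 - pwc_hit_rate) * mem_cycles)

def pom_tlb_cost(data_cache_hit_rate, cache_cycles, mem_cycles):
    """The POM-TLB needs a single access, and since its entries live in
    the memory space they may hit in L2/L3 like ordinary data."""
    return (data_cache_hit_rate * cache_cycles
            + (1 - data_cache_hit_rate) * mem_cycles)

# Assumed parameters: 90% page-walk-cache hits at 5 cycles, 200-cycle
# memory access, 70% chance the POM-TLB entry is in a 30-cycle data cache.
two_d = walk_cost(24, pwc_hit_rate=0.9, pwc_hit_cycles=5, mem_cycles=200)
pom = pom_tlb_cost(data_cache_hit_rate=0.7, cache_cycles=30, mem_cycles=200)
# even with mostly-hitting page walk caches, 24 cheap hits plus
# occasional misses outweigh one (often cached) POM-TLB access
```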


International Conference on Parallel Architectures and Compilation Techniques | 2016

POSTER: SILC-FM: Subblocked InterLeaved Cache-Like Flat Memory Organization

Jee Ho Ryoo; Mitesh R. Meswani; Reena Panda; Lizy Kurian John

With current DRAM technology reaching its limits, emerging heterogeneous memory systems have become attractive to keep memory performance scaling. This paper argues for using a small, fast memory closer to the processor as part of a flat address space where the memory system is composed of two or more memory types. OS-transparent management of such memory has been proposed in prior works such as CAMEO and Part of Memory (PoM). Data migration is typically handled either at coarse granularity with high bandwidth overheads (as in PoM) or at fine granularity with low hit rate (as in CAMEO). Prior work uses restricted address mapping from only congruence groups in order to simplify the mapping: at any time, only one page (block) from a congruence group is resident in the fast memory. In this paper, we present a flat address space organization called SILC-FM that uses large granularity but allows subblocks from two pages to coexist in an interleaved fashion in fast memory. Data movement is done at subblock granularity, avoiding fetches of useless subblocks and consuming less bandwidth than migrating the entire large block. SILC-FM achieves more spatial locality hits than CAMEO and PoM due to page-level operation and block interleaving, respectively. The interleaved subblock placement improves performance by 55% on average over a static placement scheme without data migration. We also selectively lock hot blocks to prevent them from being involved in the hardware swapping operations. Additional features, namely locking, associativity, and bandwidth balancing, improve performance by 11%, 8%, and 8%, respectively, resulting in a total of 82% performance improvement over the no-migration static placement scheme. Compared to the best state-of-the-art scheme, SILC-FM achieves a performance improvement of 36% with 13% energy savings.
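The interleaved placement can be pictured as a fast-memory set whose subblock slots are individually owned by one of two large blocks (a toy model for illustration only; it omits SILC-FM's swapping, locking, and metadata mechanisms):

```python
class InterleavedSet:
    """One fast-memory set in a SILC-FM-like design: subblocks from up
    to two large blocks coexist, tracked by a per-subblock owner map,
    so hot subblocks of both blocks are served from fast memory."""

    def __init__(self, num_subblocks=8):
        self.owner = [None] * num_subblocks  # which block each slot holds

    def install(self, block_id, subblock):
        # fetch only this subblock, not the whole large block
        self.owner[subblock] = block_id

    def lookup(self, block_id, subblock):
        # hit only if this slot currently holds this block's subblock
        return self.owner[subblock] == block_id

s = InterleavedSet()
# hot subblocks of blocks A and B interleave within the same set
for sb in (0, 2, 4):
    s.install("A", sb)
for sb in (1, 3):
    s.install("B", sb)
```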


International Conference on Parallel Processing | 2016

Proxy-Guided Load Balancing of Graph Processing Workloads on Heterogeneous Clusters

Shuang Song; Meng Li; Xinnian Zheng; Michael LeBeane; Jee Ho Ryoo; Reena Panda; Andreas Gerstlauer; Lizy Kurian John

Big data decision-making techniques take advantage of large-scale data to extract important insights from them. One of the most important classes of such techniques falls in the domain of graph applications, where data segments and their inherent relationships are represented as vertices and edges. Efficiently processing large-scale graphs involves many subtle tradeoffs and is still regarded as an open-ended problem. Furthermore, as modern data centers move towards increased heterogeneity, the traditional assumption of homogeneous environments in current graph processing frameworks is no longer valid. Prior work estimates the graph processing power of heterogeneous machines by simply reading hardware configurations, which leads to suboptimal load balancing. In this paper, we propose a profiling methodology that leverages synthetic graphs to capture a node's computational capability and to guide graph partitioning in heterogeneous environments with minimal overheads. We show that by sampling the execution of applications on synthetic graphs following a power-law distribution, the computing capabilities of heterogeneous clusters can be captured accurately (<10% error). Our proxy-guided graph processing system results in a maximum speedup of 1.84x and 1.45x over a default system and prior work, respectively. On average, it achieves 17.9% performance improvement and 14.6% energy reduction as compared to prior heterogeneity-aware work.


Symposium on Computer Architecture and High Performance Computing | 2015

i-MIRROR: A Software Managed Die-Stacked DRAM-Based Memory Subsystem

Jee Ho Ryoo; Karthik Ganesan; Yao-Min Chen; Lizy Kurian John

This paper presents an operating-system-managed die-stacked DRAM called i-MIRROR that mirrors high-locality pages from off-chip DRAM. Jointly reducing cache tag area, reducing transfer bandwidth, and improving hit latency while using die-stacked DRAM as a hardware cache is extremely challenging. In this paper, we show that performance and energy efficiency can be obtained by software management of die-stacked DRAM, which eliminates the need for tags, the source of the aforementioned problems. In the proposed scheme, the operating system loads pages from disk to die-stacked DRAM on a page fault at the same time as they are loaded to off-chip DRAM. Our scheme maintains the pages in off-chip and die-stacked DRAM in a synchronized, mirrored state by exploiting this parallel loading capability. This eliminates the need for physical page movement to the slower off-chip DRAM upon eviction from die-stacked DRAM. Requests for pages that are evicted from die-stacked DRAM are simply serviced by the slower off-chip DRAM, preventing frequent movements of large pages and thrashing between conflicting pages. The operating system periodically monitors the usage of the pages in off-chip DRAM and promotes high-locality pages to die-stacked DRAM. Our evaluations show that the proposed hardware-assisted, software-managed i-MIRROR scheme achieves an IPC improvement of 13% while consuming 6% less energy than prior state-of-the-art die-stacked caching schemes, and a 79% improvement in IPC and 72% energy savings over systems without die-stacked DRAM support.
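The mirroring policy described above can be sketched in a few lines (a toy model under assumed names and epoch-based promotion; the real design involves OS page tables and hardware assists):

```python
class MirroredDram:
    """Toy i-MIRROR model: every faulted page is resident in off-chip
    DRAM; a small die-stacked DRAM mirrors the hottest pages, so
    evicting a mirror needs no copy-back (the off-chip copy exists)."""

    def __init__(self, stacked_capacity=2):
        self.off_chip = set()
        self.stacked = set()
        self.access_count = {}
        self.capacity = stacked_capacity

    def fault_in(self, page):
        # page fault: page is loaded to off-chip DRAM (and, for hot
        # pages, mirrored to die-stacked DRAM at the next epoch)
        self.off_chip.add(page)
        self.access_count[page] = 0

    def access(self, page):
        self.access_count[page] += 1
        return "stacked" if page in self.stacked else "off_chip"

    def promote_epoch(self):
        # periodic monitoring: mirror the hottest pages; dropping a
        # mirror copy is free because off-chip DRAM stays in sync
        hottest = sorted(self.off_chip, key=self.access_count.get,
                         reverse=True)
        self.stacked = set(hottest[:self.capacity])

m = MirroredDram()
for p in ("a", "b", "c"):
    m.fault_in(p)
for p in ("a", "a", "a", "b", "b", "c"):
    m.access(p)
m.promote_epoch()   # "a" and "b" are now served from die-stacked DRAM
```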


Symposium on Computer Architecture and High Performance Computing | 2015

Watt Watcher: Fine-Grained Power Estimation for Emerging Workloads

Michael LeBeane; Jee Ho Ryoo; Reena Panda; Lizy Kurian John

Extensive research has focused on estimating power to guide advances in power management schemes and to analyze thermal hot spots and voltage noise. However, simulated power models are slow and struggle with deep software stacks, while direct measurements are typically coarse-grained. This paper introduces Watt Watcher, a multicore power measurement framework that offers fine-grained functional unit breakdowns. Watt Watcher operates by passing event counts and a hardware descriptor file into configurable back-end power models based on McPAT. Researchers and vendors can add other processors to our tool by mapping to the Watt Watcher interface. We show that Watt Watcher, when calibrated, has a MAPE (mean absolute percentage error) of 2.67% aggregated over all benchmarks when compared to measured power consumption on SPEC CPU2006 and multithreaded PARSEC benchmarks across three different machines of various form factors and manufacturing processes. We present two use cases showing how Watt Watcher can derive insights that are difficult to obtain through other measurement infrastructures. Additionally, we illustrate how Watt Watcher can be used to provide insights into challenging big data and cloud workloads on a server CPU. Through the use of Watt Watcher, it is possible to obtain a detailed power breakdown on real hardware without vendor proprietary models or hardware instrumentation.
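The event-counts-to-power flow can be sketched as a weighted sum per functional unit (the unit names and per-event energy costs below are assumed values for illustration, not Watt Watcher's descriptor format or calibrated McPAT parameters):

```python
def power_breakdown(event_counts, energy_per_event, interval_s):
    """Estimate per-functional-unit power (watts) from hardware event
    counts and a descriptor of per-event energy costs (joules/event):
    P_unit = count * energy / interval."""
    return {unit: event_counts.get(unit, 0) * energy_per_event[unit] / interval_s
            for unit in energy_per_event}

# Illustrative counts over a 1-second measurement interval.
counts = {"alu_ops": 4_000_000_000, "l2_accesses": 50_000_000}
energy = {"alu_ops": 1e-9, "l2_accesses": 2e-8}   # assumed J per event
breakdown = power_breakdown(counts, energy, interval_s=1.0)
# -> {"alu_ops": 4.0, "l2_accesses": 1.0} watts
```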

Collaboration


Dive into Jee Ho Ryoo's collaborations.

Top Co-Authors

Lizy Kurian John (University of Texas at Austin)
Michael LeBeane (University of Texas at Austin)
Reena Panda (University of Texas at Austin)
Shuang Song (University of Texas at Austin)
Andreas Gerstlauer (University of Texas at Austin)
Muhammad Iqbal (University of Texas at Austin)
Jim Holt (Freescale Semiconductor)
Xinnian Zheng (University of Texas at Austin)