Yuho Jin | Researchain

Publication

Featured researches published by Yuho Jin.

networks on chips | 2009

Recursive partitioning multicast: A bandwidth-efficient routing for Networks-on-Chip

Lei Wang; Yuho Jin; Hyungjun Kim; Eun Jung Kim

Chip Multi-processor (CMP) architectures have become mainstream for designing processors. With a large number of cores, Networks-on-Chip (NOCs) provide a scalable communication method for CMP architectures. NOCs must be carefully designed to meet constraints of power consumption and area, and provide ultra low latencies. Existing NOCs mostly use Dimension Order Routing (DOR) to determine the route taken by a packet in unicast traffic. However, with the development of diverse applications in CMPs, one-to-many (multicast) and one-to-all (broadcast) traffic are becoming more common. Current unicast routing cannot support multicast and broadcast traffic efficiently. In this paper, we propose Recursive Partitioning Multicast (RPM) routing and a detailed multicast wormhole router design for NOCs. RPM allows routers to select intermediate replication nodes based on the global distribution of destination nodes. This provides more path diversity, which improves bandwidth efficiency and, in turn, the performance of the whole network. Our simulation results using a detailed cycle-accurate simulator show that, compared with the most recent multicast scheme, RPM saves 25% of crossbar and link power and 33% of link utilization, with a 50% network performance improvement. RPM is also more scalable to large networks than the recently proposed VCTM.
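The inefficiency of unicast routing for multicast traffic can be illustrated with a minimal sketch. It assumes a 2D mesh with nodes addressed as (x, y) and X-then-Y dimension-order routing; the function names are hypothetical, and this is not the RPM algorithm itself, only the baseline it improves on. Sending one multicast as separate unicasts crosses shared links repeatedly, which is the waste that in-network replication avoids.

```python
def xy_route(src, dst):
    """Nodes visited from src to dst under X-then-Y dimension-order routing."""
    x, y = src
    dst_x, dst_y = dst
    path = [(x, y)]
    while x != dst_x:              # travel along X first
        x += 1 if dst_x > x else -1
        path.append((x, y))
    while y != dst_y:              # then along Y
        y += 1 if dst_y > y else -1
        path.append((x, y))
    return path


def links(path):
    """Directed links (node, next_node) crossed along a path."""
    return list(zip(path, path[1:]))


def multicast_as_unicasts(src, dsts):
    """Return (total link traversals, distinct links used) when a multicast
    is emulated by one unicast packet per destination."""
    traversals = []
    for dst in dsts:
        traversals.extend(links(xy_route(src, dst)))
    return len(traversals), len(set(traversals))


if __name__ == "__main__":
    total, distinct = multicast_as_unicasts((0, 0), [(2, 0), (2, 1), (2, 2)])
    print(total, distinct)
```

In this example the three unicasts cross more links in total than the number of distinct links involved; a router that replicates the packet along the way would need to cross each distinct link only once.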

high-performance computer architecture | 2007

A Domain-Specific On-Chip Network Design for Large Scale Cache Systems

Yuho Jin; Eun Jung Kim; Ki Hwan Yum

As circuit integration technology advances, the design of efficient interconnects has become critical. On-chip networks have been adopted to overcome the scalability and poor resource-sharing problems of shared buses or dedicated wires. However, using a general on-chip network for a specific domain may cause underutilization of the network resources and huge network delays because the interconnects are not optimized for the domain. Addressing these two issues is challenging because in-depth knowledge of interconnects and the specific domain is required. Non-uniform cache architectures (NUCAs) use wormhole-routed 2D mesh networks to improve the performance of on-chip L2 caches. We observe that network resources in NUCAs are underutilized and occupy considerable chip area (52% of cache area). The network delay is also significantly large (63% of cache access time). Motivated by our observations, we investigate how to optimize cache operations and design the network in large scale cache systems. We propose a single-cycle router architecture that can efficiently support multicasting in on-chip caches. Next, we present fast-LRU replacement, where cache replacement overlaps with data request delivery. Finally, we propose a deadlock-free XYX routing algorithm and a new halo network topology to minimize the number of links in the network. Simulation results show that our networked cache system improves the average IPC by 38% over the mesh network design with multicast promotion replacement while using only 23% of the interconnection area. Specifically, multicast fast-LRU replacement improves the average IPC by 20% compared with multicast promotion replacement. A halo topology design additionally improves the average IPC by 18% over a mesh topology.

simulation tools and techniques for communications networks and system | 2008

A framework for end-to-end simulation of high-performance computing systems

Wolfgang E. Denzel; Jian Li; Peter Walker; Yuho Jin

We present an end-to-end simulation framework that is capable of simulating High-Performance Computing (HPC) systems with hundreds of thousands of interconnected processors. The tool applies discrete event simulation and is driven by real-world application traces. We refer to it as MARS (MPI Application Replay network Simulator). It maintains reasonable simulation details of both the processors in general and specifically the interconnection network. Among other things, it features several network topologies, flexible routing schemes, arbitrary application task placement, point-to-point statistics collection, and data visualization. With a few case studies, we demonstrate the usefulness of this tool for assisting high-level system design as well as for performance projection and application tuning of future HPC systems.

international symposium on microarchitecture | 2008

Adaptive data compression for high-performance low-power on-chip networks

Yuho Jin; Ki Hwan Yum; Eun Jung Kim

With the recent design shift towards increasing the number of processing elements in a chip, high-bandwidth support in on-chip interconnect is essential for low-latency communication. Much of the previous work has focused on router architectures and network topologies using wide/long channels. However, such solutions may result in a complicated router design and a high interconnect cost. In this paper, we exploit a table-based data compression technique, relying on value patterns in cache traffic. Compressing a large packet into a small one can increase the effective bandwidth of routers and links, while saving power due to reduced operations. The main challenges are providing a scalable implementation of tables and minimizing the overhead of compression latency. First, we propose a shared table scheme that needs one encoding table and one decoding table for each processing element, and a management protocol that does not require in-order delivery. Next, we present streamlined encoding that combines flit injection and encoding in a pipeline. Furthermore, data compression can be selectively applied to communication on congested paths only if compression improves performance. Simulation results in a 16-core CMP show that our compression method improves the packet latency by up to 44% with an average of 36% and reduces the network power consumption by 36% on average.
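The general idea of table-based value compression can be sketched as follows. This is an illustration only, not the paper's shared table scheme: the class and function names, the 4-entry FIFO table, and the 32-bit word size are all assumptions, and unlike the management protocol described above, this sketch relies on in-order delivery to keep the encoder and decoder tables in sync.

```python
class ValueTable:
    """Small FIFO table of recently seen values (hypothetical design)."""

    def __init__(self, capacity=4):
        self.capacity = capacity
        self.values = []           # position in the list is the table index

    def lookup(self, value):
        return self.values.index(value) if value in self.values else None

    def insert(self, value):
        if len(self.values) == self.capacity:
            self.values.pop(0)     # evict the oldest entry
        self.values.append(value)


def compress(words, table):
    """Replace words found in the table with short index tokens."""
    tokens = []
    for word in words:
        idx = table.lookup(word)
        if idx is not None:
            tokens.append(("idx", idx))
        else:
            tokens.append(("raw", word))
            table.insert(word)     # the decoder performs the same insert
    return tokens


def decompress(tokens, table):
    """Rebuild the original words, mirroring the encoder's table updates."""
    words = []
    for kind, payload in tokens:
        if kind == "idx":
            words.append(table.values[payload])
        else:
            words.append(payload)
            table.insert(payload)
    return words


def encoded_bits(tokens, word_bits=32, index_bits=2):
    """Size of the token stream, with one flag bit per token."""
    return sum(1 + (index_bits if kind == "idx" else word_bits)
               for kind, _ in tokens)


if __name__ == "__main__":
    data = [0, 0, 0xFFFF, 0, 7, 0xFFFF]
    tokens = compress(data, ValueTable())
    print(encoded_bits(tokens), "bits instead of", 32 * len(data))
```

Repeated values shrink to a flag bit plus a table index, which is where the effective bandwidth gain comes from; values that miss the table cost one extra flag bit.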

IEEE Transactions on Parallel and Distributed Systems | 2012

Communication-Aware Globally-Coordinated On-Chip Networks

Yuho Jin; Eun Jung Kim; Timothy Mark Pinkston

With continued Moore's law scaling, multicore-based architectures are becoming the de facto design paradigm for achieving low-cost and performance/power-efficient processing systems through effective exploitation of available parallelism in software and hardware. A crucial subsystem within multicores is the on-chip interconnection network that orchestrates high-bandwidth, low-latency, and low-power communication of data. Much previous work has focused on improving the design of on-chip networks but without more fully taking into consideration the on-chip communication behavior of application workloads that can be exploited by the network design. A significant portion of this paper analyzes and models on-chip network traffic characteristics of representative application workloads. Leveraged by this, the notion of globally coordinated on-chip networks is proposed, in which application communication behavior (captured by traffic profiling) is utilized in the design and configuration of on-chip networks so as to support prevailing traffic flows well, in a globally coordinated manner. This is applied to the design of a hybrid network consisting of a mesh augmented with configurable multidrop (bus-like) spanning channels that serve as express paths for traffic flows benefiting from them, according to the characterized traffic profile. Evaluations reveal that network latency and energy consumption for a 64-core system running OpenMP benchmarks can be improved on average by 15 and 27 percent, respectively, with globally coordinated on-chip networks.

international conference on parallel processing | 2005

Peak power control for a QoS capable on-chip network

Yuho Jin; Eun Jung Kim; Ki Hwan Yum

In recent years, integrating multiprocessors on a single chip has emerged as a way to support various scientific and commercial applications, which place diverse demands on the underlying on-chip networks. Communication traffic of these applications makes routers greedy to acquire more power, such that the total consumed power of the network may exceed the supplied power and cause reliability problems. To ensure high performance and power constraint satisfaction, the on-chip network must have a peak power control mechanism. In this paper, we propose a credit-based peak power control scheme that keeps power consumption under the given peak power constraint without performance degradation. The peak power control scheme efficiently regulates each flow's injection rate at the sender to minimize the performance penalty. We use two different throttling schemes for real-time traffic and best-effort traffic: rate-based throttling and energy-budget-based throttling, respectively. The simulation results on mesh networks show that the credit-based peak power control effectively prevents performance degradation and meets the peak power constraint.
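Sender-side throttling with credits can be sketched in a few lines. This is a generic illustration under stated assumptions, not the scheme from the paper: the class name, the per-epoch replenishment, and the unit cost per injection are all hypothetical. A flow may inject only while it has credits left in the current epoch, which caps its injection rate and therefore the power it can draw from the network.

```python
class CreditThrottle:
    """Per-flow injection throttle with a fixed credit budget per epoch."""

    def __init__(self, credits_per_epoch):
        self.credits_per_epoch = credits_per_epoch
        self.credits = credits_per_epoch

    def new_epoch(self):
        """Replenish the budget at the start of each epoch."""
        self.credits = self.credits_per_epoch

    def try_inject(self, cost=1):
        """Spend credits for one injection; refuse if the budget is spent."""
        if self.credits >= cost:
            self.credits -= cost
            return True
        return False


if __name__ == "__main__":
    flow = CreditThrottle(credits_per_epoch=3)
    print([flow.try_inject() for _ in range(5)])  # later attempts are refused
    flow.new_epoch()
    print(flow.try_inject())
```

Setting `cost` per packet to an estimated energy figure rather than 1 would turn the same structure into an energy-budget throttle, while a fixed cost per packet gives rate-based behavior.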

IEEE Transactions on Computers | 2010

Design and Analysis of On-Chip Networks for Large-Scale Cache Systems

Yuho Jin; Eun Jung Kim; Ki Hwan Yum

Switched networks have been adopted in on-chip communication for their scalability and efficient resource sharing. However, using a general network for a specific domain may result in unnecessarily high cost and low performance when the interconnects are not optimized for the domain. Designing an optimal network for the specific domain is challenging because in-depth knowledge of interconnects and the application domain is required. Recently proposed Nonuniform Cache Architectures (NUCAs) use wormhole-routed 2D mesh networks in L2 caches. We observe that in NUCAs, network resources are underutilized at a considerable area cost (41 percent of cache) and the network delay is significantly large (63 percent of cache access time). Motivated by our observations, we investigate both router architecture and network topology for communication behaviors in large-scale cache systems. We present Fast-LRU replacement, where cache replacement overlaps with data request delivery. Next, we propose a deadlock-free XYX routing algorithm in a mesh network and present a new halo network topology to reduce the required links. Finally, we introduce a single-cycle multicast router that needs only a small modification of the unicast router design. Simulation results show that our design improves the average IPC by 38 percent over the mesh design with Multicast Promotion replacement and uses 12 percent of the interconnection area of the mesh network.

network on chip architectures | 2010

Thread criticality support in on-chip networks

Yuho Jin; Ruisheng Wang; Woojin Choi; Timothy Mark Pinkston

Multicore computing is becoming the mainstream approach in computer system designs to effectively use growing transistor budgets for harnessing performance and energy-efficiency. Increasing the parallelism with more cores requires careful management, allocation, or partitioning of shared resources to cope with varying resource demands from running threads. Predicting critical (or slowest) threads and accelerating execution of those threads can reduce execution time of parallel applications by balancing the execution of threads to synchronization points. The on-chip network is an increasingly important component that services communication of threads running on cores. As the communication latency of threads affects thread criticality, it should be considered and optimized. In this work, we explore thread criticality support in on-chip networks. We propose a flow control technique that reserves router resources to accelerate communication from critical threads. Furthermore, we present thread criticality support in arbiter designs. Our evaluation shows that implementing criticality awareness in an on-chip interconnect design reduces execution time by 22% and increases system throughput by 18% for a 64-core processor.

Journal of Parallel and Distributed Computing | 2010

Integration of admission, congestion, and peak power control in QoS-aware clusters

Ki Hwan Yum; Yuho Jin; Eun Jung Kim; Chita R. Das

Admission, congestion, and peak power control mechanisms are essential parts of a cluster network design for supporting integrated traffic. While an admission control algorithm helps in delivering the assured performance, a congestion control algorithm regulates traffic injection to avoid network saturation. Peak power control enforces pre-specified power constraints while maintaining the service quality by regulating the injection of packets. In this paper, we propose these control algorithms for clusters, which are increasingly being used in a diverse set of applications that require QoS guarantees. The uniqueness of our approach is that we develop these algorithms for wormhole-switched networks, which have been used in designing clusters. We use QoS-capable wormhole routers and QoS-capable network interface cards (NICs), referred to as Host Channel Adapters (HCAs) in InfiniBand(TM) Architecture (IBA), to evaluate the effectiveness of these algorithms. The admission control is applied at the HCAs and the routers, while the congestion control and the peak power control are deployed only at the HCAs. A mixed workload consisting of best-effort, real-time, and control traffic is used to investigate the effectiveness of the proposed schemes. Simulation results with a single router (8-port) cluster and a 2-D mesh network cluster indicate that the admission, congestion, and peak power control algorithms are quite effective in delivering the assured performance. The proposed credit-based congestion control algorithm is simple and practical in that it relies on hardware already available in the HCA/NIC to regulate traffic injection.

symposium on computer architecture and high performance computing | 2015

Intra-Clustering: Accelerating On-chip Communication for Data Parallel Architectures

Wen Yuan; Rahul Boyapati; Lei Wang; Hyunjun Jang; Yuho Jin; Ki Hwan Yum; Eun Jung Kim

Modern computation workloads contain abundant Data Level Parallelism (DLP), which requires specialized data parallel architectures, such as Graphics Processing Units (GPUs). With parallel programming models, such as CUDA and OpenCL, GPUs are easy to program for non-graphics applications, and have therefore become a cost-effective approach for data parallel architectures. The large quantity of available parallelism places a heavy stress on the memory system, as the limited number of pins confines the number of memory controllers on the chip. This creates a potential bottleneck for performance scalability of the GPUs. To accelerate communication with the memory system, we propose the Intra-Clustering on-chip network for data parallel architectures, which is built upon a traditional two-dimensional electrical mesh network with memory controllers connected through a nanophotonic ring and compute cores grouped into different clusters. Our evaluations with CUDA benchmarks show that the Intra-Clustering architecture can improve communication delay by an average of 17% (up to 32%) and IPC by an average of 5% (up to 11.5%).
