Is this you? Create Your Porfile

Yunsup Lee

University of California, Berkeley

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Yunsup Lee is active.

Explore More

Publication

Featured researches published by Yunsup Lee.

Nature | 2015

Single-chip microprocessor that communicates directly using light

Chen Sun; Mark T. Wade; Yunsup Lee; Jason S. Orcutt; Luca Alloatti; Michael Georgas; Andrew Waterman; Jeffrey M. Shainline; Rimas Avizienis; Sen Lin; Benjamin R. Moss; Rajesh Kumar; Fabio Pavanello; Amir H. Atabaki; Henry Cook; Albert J. Ou; Jonathan Leu; Yu-Hsin Chen; Krste Asanovic; Rajeev J. Ram; Miloš A. Popović; Vladimir Stojanovic

Data transport across short electrical wires is limited by both bandwidth and power density, which creates a performance bottleneck for semiconductor microchips in modern computer systems—from mobile phones to large-scale data centres. These limitations can be overcome by using optical communications based on chip-scale electronic–photonic systems enabled by silicon-based nanophotonic devices8. However, combining electronics and photonics on the same chip has proved challenging, owing to microchip manufacturing conflicts between electronics and photonics. Consequently, current electronic–photonic chips are limited to niche manufacturing processes and include only a few optical devices alongside simple circuits. Here we report an electronic–photonic system on a single chip integrating over 70 million transistors and 850 photonic components that work together to provide logic, memory, and interconnect functions. This system is a realization of a microprocessor that uses on-chip photonic devices to directly communicate with other chips using light. To integrate electronics and photonics at the scale of a microprocessor chip, we adopt a ‘zero-change’ approach to the integration of photonics. Instead of developing a custom process to enable the fabrication of photonics, which would complicate or eliminate the possibility of integration with state-of-the-art transistors at large scale and at high yield, we design optical devices using a standard microelectronics foundry process that is used for modern microprocessors. This demonstration could represent the beginning of an era of chip-scale electronic–photonic systems with the potential to transform computing system architectures, enabling more powerful computers, from network infrastructure to data centres and supercomputers.

parallel computing | 2012

PyCUDA and PyOpenCL: A scripting-based approach to GPU run-time code generation

Andreas Klöckner; Nicolas Pinto; Yunsup Lee; Bryan Catanzaro; Paul Ivanov; Ahmed R. Fasih

High-performance computing has recently seen a surge of interest in heterogeneous systems, with an emphasis on modern Graphics Processing Units (GPUs). These devices offer tremendous potential for performance and efficiency in important large-scale applications of computational science. However, exploiting this potential can be challenging, as one must adapt to the specialized and rapidly evolving computing environment currently exhibited by GPUs. One way of addressing this challenge is to embrace better techniques and develop tools tailored to their needs. This article presents one simple technique, GPU run-time code generation (RTCG), along with PyCUDA and PyOpenCL, two open-source toolkits that supports this technique. In introducing PyCUDA and PyOpenCL, this article proposes the combination of a dynamic, high-level scripting language with the massive performance of a GPU as a compelling two-tiered computing platform, potentially offering significant performance and productivity advantages over conventional single-tier, static systems. The concept of RTCG is simple and easily implemented using existing, robust infrastructure. Nonetheless it is powerful enough to support (and encourage) the creation of custom application-specific tools by its users. The premise of the paper is illustrated by a wide range of examples where the technique has been applied with considerable success.

design automation conference | 2012

Chisel: constructing hardware in a Scala embedded language

Jonathan Bachrach; Huy Vo; Brian C. Richards; Yunsup Lee; Andrew Waterman; Rimas Avizienis; John Wawrzynek; Krste Asanovic

In this paper we introduce Chisel, a new hardware construction language that supports advanced hardware design using highly parameterized generators and layered domain-specific hardware languages. By embedding Chisel in the Scala programming language, we raise the level of hardware design abstraction by providing concepts including object orientation, functional programming, parameterized types, and type inference. Chisel can generate a high-speed C++-based cycle-accurate software simulator, or low-level Verilog designed to map to either FPGAs or to a standard ASIC flow for synthesis. This paper presents Chisel, its embedding in Scala, hardware examples, and results for C++ simulation, Verilog emulation and ASIC synthesis.

international conference on computer vision | 2009

Efficient, high-quality image contour detection

Bryan Catanzaro; Bor-Yiing Su; Narayanan Sundaram; Yunsup Lee; Mark Murphy; Kurt Keutzer

Image contour detection is fundamental to many image analysis applications, including image segmentation, object recognition and classification. However, highly accurate image contour detection algorithms are also very computationally intensive, which limits their applicability, even for offline batch processing. In this work, we examine efficient parallel algorithms for performing image contour detection, with particular attention paid to local image analysis as well as the generalized eigensolver used in Normalized Cuts. Combining these algorithms into a contour detector, along with careful implementation on highly parallel, commodity processors from Nvidia, our contour detector provides uncompromised contour accuracy, with an F-metric of 0.70 on the Berkeley Segmentation Dataset. Runtime is reduced from 4 minutes to 1.8 seconds. The efficiency gains we realize enable high-quality image contour detection on much larger images than previously practical, and the algorithms we propose are applicable to several image segmentation approaches. Efficient, scalable, yet highly accurate image contour detection will facilitate increased performance in many computer vision applications.

design automation conference | 2010

RAMP gold: an FPGA-based architecture simulator for multiprocessors

Zhangxi Tan; Andrew Waterman; Rimas Avizienis; Yunsup Lee; Henry Cook; David A. Patterson; Krste Asanovica

We present RAMP Gold, an economical FPGA-based architecture simulator that allows rapid early design-space exploration of manycore systems. The RAMP Gold prototype is a high-throughput, cycle-accurate full-system simulator that runs on a single Xilinx Virtex-5 FPGA board, and which simulates a 64-core shared-memory target machine capable of booting real operating systems. To improve FPGA implementation efficiency, functionality and timing are modeled separately and host multithreading is used in both models. We evaluate the prototypes performance using a modern parallel benchmark suite running on our manycore research operating system, achieving two orders of magnitude speedup compared to a widely-used software-based architecture simulator.

symposium on code generation and optimization | 2013

Convergence and scalarization for data-parallel architectures

Krste Asanovic; Stephen W. Keckler; Yunsup Lee; Ronny Meir Krashinsky; Vinod Grover

Modern throughput processors such as GPUs achieve high performance and efficiency by exploiting data parallelism in application kernels expressed as threaded code. One draw-back of this approach compared to conventional vector architectures is redundant execution of instructions that are common across multiple threads, resulting in energy inefficiency due to excess instruction dispatch, register file accesses, and memory operations. This paper proposes to alleviate these overheads while retaining the threaded programming model by automatically detecting the scalar operations and factoring them out of the parallel code. We have developed a scalarizing compiler that employs convergence and variance analyses to statically identify values and instructions that are invariant across multiple threads. Our compiler algorithms are effective at identifying convergent execution even in programs with arbitrary control flow, identifying two-thirds of the opportunity captured by a dynamic oracle. The compile-time analysis leads to a reduction in instructions dispatched by 29%, register file reads and writes by 31% memory address counts by 47%, and data access counts by 38%.

european solid state circuits conference | 2014

A 45nm 1.3GHz 16.7 double-precision GFLOPS/W RISC-V processor with vector accelerators

Yunsup Lee; Andrew Waterman; Rimas Avizienis; Henry Cook; Chen Sun; Vladimir Stojanovic; Krste Asanovic

A 64-bit dual-core RISC-V processor with vector accelerators has been fabricated in a 45nm SOI process. This is the first dual-core processor to implement the open-source RISC-V ISA designed at the University of California, Berkeley. In a standard 40nm process, the RISC-V scalar core scores 10% higher in DMIPS/MHz than the Cortex-A5, ARMs comparable single-issue in-order scalar core, and is 49% more area-efficient. To demonstrate the extensibility of the RISC-V ISA, we integrate a custom vector accelerator alongside each single-issue in-order scalar core. The vector accelerator is 1.8× more energy-efficient than the IBM Blue Gene/Q processor, and 2.6× more than the IBM Cell processor, both fabricated in the same process. The dual-core RISC-V processor achieves maximum clock frequency of 1.3GHz at 1.2V and peak energy efficiency of 16.7 double-precision GFLOPS/W at 0.65V with an area of 3mm2.

IEEE Transactions on Circuits and Systems Ii-express Briefs | 2012

SRAM Assist Techniques for Operation in a Wide Voltage Range in 28-nm CMOS

Brian Zimmer; Seng Oon Toh; Huy Vo; Yunsup Lee; Olivier Thomas; Krste Asanovic; Borivoje Nikolic

Reducing static random-access memory (SRAM) operational voltage (Vmin) can greatly improve energy efficiency, yet SRAM Vmin does not scale with technology due to increased process variability. Assist techniques have been shown to improve the operation of SRAM, but previous investigations of assist techniques at design time have either relied on static metrics that do not account for important transient effects or make specific assumptions about failure distributions. This paper uses importance sampling of dynamic failure metrics to quantify and analyze the effect of different assist techniques, array organization, and timing on Vmin at design time. This approach demonstrates that the most effective technique for reducing SRAM Vmin is the negative bitline write assist, resulting in a Vmin of 600 mV for a 28-nm LP process in the typical corner.

symposium on vlsi circuits | 2015

A RISC-V vector processor with tightly-integrated switched-capacitor DC-DC converters in 28nm FDSOI

Brian Zimmer; Yunsup Lee; Alberto Puggelli; Jaehwa Kwak; Ruzica Jevtic; Ben Keller; Stevo Bailey; Milovan Blagojevic; Pi-Feng Chiu; Hanh-Phuc Le; Po-Hung Chen; Nicholas Sutardja; Rimas Avizienis; Andrew Waterman; Brian C. Richards; Philippe Flatresse; Elad Alon; Krste Asanovic; Borivoje Nikolic

This work demonstrates a RISC-V vector microprocessor implemented in 28nm FDSOI with fully-integrated non-interleaved switched-capacitor DCDC (SC-DCDC) converters and adaptive clocking that generates four on-chip voltages between 0.5V and 1V using only 1.0V core and 1.8V IO voltage inputs. The design pushes the capabilities of dynamic voltage scaling by enabling fast transitions (20ns), simple packaging (no off-chip passives), low area overhead (16%), high conversion efficiency (80-86%), and high energy efficiency (26.2 DP GFLOPS/W) for mobile devices.

international symposium on computer architecture | 2015

Flexible software profiling of GPU architectures

Mark Stephenson; Siva Kumar Sastry Hari; Yunsup Lee; Eiman Ebrahimi; Daniel R. Johnson; David W. Nellans; Mike O'Connor; Stephen W. Keckler

To aid application characterization and architecture design space exploration, researchers and engineers have developed a wide range of tools for CPUs, including simulators, profilers, and binary instrumentation tools. With the advent of GPU computing, GPU manufacturers have developed similar tools leveraging hardware profiling and debugging hooks. To date, these tools are largely limited by the fixed menu of options provided by the tool developer and do not offer the user the flexibility to observe or act on events not in the menu. This paper presents SASSI (NVIDIA assembly code “SASS” Instrumentor), a low-level assembly-language instrumentation tool for GPUs. Like CPU binary instrumentation tools, SASSI allows a user to specify instructions at which to inject user-provided instrumentation code. These facilities allow strategic placement of counters and code into GPU assembly code to collect user-directed, fine-grained statistics at hardware speeds. SASSI instrumentation is inherently parallel, leveraging the concurrency of the underlying hardware. In addition to the details of SASSI, this paper provides four case studies that show how SASSI can be used to characterize applications and explore the architecture design space along the dimensions of instruction control flow, memory systems, value similarity, and resilience.

Explore More