Ken Mai | Researchain

Publication

Featured research published by Ken Mai.

international symposium on computer architecture | 2000

Smart Memories: a modular reconfigurable architecture

Ken Mai; Tim Paaske; Nuwan Jayasena; Ron Ho; William J. Dally; Mark Horowitz

Trends in VLSI technology scaling demand that future computing devices be narrowly focused to achieve high performance and high efficiency, yet also target the high volumes and low costs of widely applicable general purpose designs. To address these conflicting requirements, we propose a modular reconfigurable architecture called Smart Memories, targeted at computing needs in the 0.1 μm technology generation. A Smart Memories chip is made up of many processing tiles, each containing local memory, local interconnect, and a processor core. For efficient computation under a wide class of possible applications, the memories, the wires, and the computational model can all be altered to match the applications. To show the applicability of this design, two very different machines at opposite ends of the architectural spectrum, the Imagine stream processor and the Hydra speculative multiprocessor, are mapped onto the Smart Memories computing substrate. Simulations of the mappings show that the Smart Memories architecture can successfully map these architectures with only modest performance degradation.

international symposium on microarchitecture | 2007

Multi-bit Error Tolerant Caches Using Two-Dimensional Error Coding

Jangwoo Kim; Nikos Hardavellas; Ken Mai; Babak Falsafi; James C. Hoe

In deep sub-micron ICs, growing amounts of on-die memory and scaling effects make embedded memories increasingly vulnerable to reliability and yield problems. As scaling progresses, soft and hard errors in the memory system will increase and single error events are more likely to cause large-scale multi-bit errors. However, conventional memory protection techniques can neither detect nor correct large-scale multi-bit errors without incurring large performance, area, and power overheads. We propose two-dimensional (2D) error coding in embedded memories, a scalable multi-bit error protection technique to improve memory reliability and yield. The key innovation is the use of vertical error coding across words that is used only for error correction in combination with conventional per-word horizontal error coding. We evaluate this scheme in the cache hierarchies of two representative chip multiprocessor designs and show that 2D error coding can correct clustered errors up to 32×32 bits with significantly smaller performance, area, and power overheads than conventional techniques.
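The combination of per-word horizontal coding for detection and vertical coding across words for correction can be illustrated with a toy model. This is only a sketch under stated assumptions, not the paper's implementation: the function names, the 32-bit word size, the 8-way interleaved parity used as the horizontal code, and the plain XOR used as the vertical code are all chosen here for illustration.

```python
from functools import reduce

WORD_BITS = 32  # assumed word width, for illustration only


def horizontal_code(word: int, ways: int = 8) -> int:
    """Interleaved parity: bit k of the code is the parity of word bits k, k+ways, ..."""
    code = 0
    for bit in range(WORD_BITS):
        if (word >> bit) & 1:
            code ^= 1 << (bit % ways)
    return code


def vertical_parity(words) -> int:
    """XOR of all words: one parity bit per bit column across the group."""
    return reduce(lambda a, b: a ^ b, words, 0)


def protect(words):
    """Return (per-word horizontal codes, vertical parity word) for a group of words."""
    return [horizontal_code(w) for w in words], vertical_parity(words)


def correct(words, h_codes, v_parity):
    """Detect a bad word with the horizontal codes, rebuild it from the vertical parity."""
    bad = [i for i, (w, h) in enumerate(zip(words, h_codes))
           if horizontal_code(w) != h]
    if not bad:
        return list(words)
    if len(bad) > 1:
        # This toy model keeps a single vertical parity word, so it can
        # rebuild only one word per group.
        raise ValueError("more than one corrupted word in the group")
    i = bad[0]
    others = vertical_parity(w for j, w in enumerate(words) if j != i)
    fixed = list(words)
    fixed[i] = v_parity ^ others
    return fixed
```

In this model the horizontal code never corrects anything; it only flags which word is wrong, and the vertical parity is read only when a word has been flagged. That mirrors the abstract's point that vertical coding is used only for error correction.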

Proceedings of the IEEE | 2008

Digital Circuit Design Challenges and Opportunities in the Era of Nanoscale CMOS

Benton H. Calhoun; Yu Cao; Xin Li; Ken Mai; Lawrence T. Pileggi; Rob A. Rutenbar; Kenneth L. Shepard

Well-designed circuits are one key "insulating" layer between the increasingly unruly behavior of scaled complementary metal-oxide-semiconductor devices and the systems we seek to construct from them. As we move forward into the nanoscale regime, circuit design is burdened to "hide" more of the problems intrinsic to deeply scaled devices. How this is being accomplished is the subject of this paper. We discuss new techniques for logic circuits and interconnect, for memory, and for clock and power distribution. We survey work to build accurate simulation models for nanoscale devices. We discuss the unique problems posed by nanoscale lithography and the role of geometrically regular circuits as one promising solution. Finally, we look at recent computer-aided design efforts in modeling, analysis, and optimization for nanoscale designs with ever increasing amounts of statistical variation.

symposium on vlsi circuits | 2003

Efficient on-chip global interconnects

Ron Ho; Ken Mai; Mark Horowitz

We present circuits for a high-efficiency low-swing interconnect scheme suitable for the Smart Memories reconfigurable architecture. By using a separate supply, global clocking, and differential signaling, we reduce design complexity; and by using overdrive circuits, equalization techniques, and sense-amplifiers we retain high performance. A testchip built in a 1.8 V 0.18-μm technology consumed <1 pJ/bit for a 10 mm bus at 1 GHz, a power savings over full-swing signaling of up to 10×, and demonstrated amplifier input offset voltages of under 100 mV.

IEEE Journal of Solid-state Circuits | 1998

Low-power SRAM design using half-swing pulse-mode techniques

Ken Mai; Toshihiko Mori; Bharadwaj Amrutur; Ron Ho; Bennett Wilburn; Mark Horowitz; Isao Fukushi; T. Izawa; Shin Mitarai

This paper describes a half-swing pulse-mode gate family that uses reduced input signal swing without sacrificing performance. These gates are well suited for decreasing the power in SRAM decoders and write circuits by reducing the signal swing on high-capacitance predecode lines, write bus lines, and bit lines. Charge recycling between positive and negative half-swing pulses further reduces the power dissipation. These techniques are demonstrated in a 2-K×16-b SRAM fabricated in a 0.25-μm dual-V_t CMOS technology that dissipates 0.9 mW operating at 1 V, 100 MHz, and room temperature. On-chip voltage samplers were used to probe internal nodes.

symposium on vlsi circuits | 1998

Applications of on-chip samplers for test and measurement of integrated circuits

Ron Ho; Bharadwaj Amrutur; Ken Mai; Bennett Wilburn; Toshihiko Mori; Mark Horowitz

Displaying the real-time behavior of critical signals on VLSI chips is difficult and can require expensive test equipment. We present a simple sampling technique to display the analog waveforms of high bandwidth on-chip signals on a laboratory oscilloscope. It is based on the subsampling of periodic signals. This circuit was used to verify the operation of a recent low-power SRAM design.
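Subsampling a periodic signal can be sketched numerically: if one sample is taken every period plus a small step, successive samples walk through the waveform at a much lower rate than the signal itself. The sketch below only illustrates that principle; the abstract does not describe the circuit, and the function name and parameters are assumptions made here.

```python
import math


def subsample(signal, period, step, count):
    """Take one sample every (period + step) seconds of a periodic signal.

    Returns (equivalent_time, value) pairs, where equivalent_time = n * step
    is the position inside one period that the n-th sample lands on.
    """
    samples = []
    for n in range(count):
        t = n * (period + step)  # actual (slow) sampling instant
        samples.append((n * step, signal(t)))
    return samples


def sine(t, period=1.0):
    """Example periodic signal with the given period."""
    return math.sin(2 * math.pi * t / period)


# 20 samples spaced 1.05 periods apart trace out one full period of the
# waveform at an effective resolution of 0.05 periods.
trace = subsample(sine, period=1.0, step=0.05, count=20)
```

Because the samples arrive once per period rather than many times within it, the reconstructed waveform can be shown on equipment with far less bandwidth than the on-chip signal.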

field programmable gate arrays | 2011

CoRAM: an in-fabric memory architecture for FPGA-based computing

Eric S. Chung; James C. Hoe; Ken Mai

FPGAs have been used in many applications to achieve orders-of-magnitude improvement in absolute performance and energy efficiency relative to conventional microprocessors. Despite their promise in both processing performance and efficiency, FPGAs have not yet gained widespread acceptance as mainstream computing devices. A fundamental obstacle to FPGA-based computing today is the FPGA's lack of a common, scalable memory architecture. When developing applications for FPGAs, designers are often directly responsible for crafting the application-specific infrastructure logic that manages and transports data to and from the processing kernels. This infrastructure not only increases design time and effort but will frequently lock a design to a particular FPGA product line, hindering scalability and portability. We propose a new FPGA memory architecture called Connected RAM (CoRAM) to serve as a portable bridge between the distributed computation kernels and the external memory interfaces. In addition to improving performance and efficiency, the CoRAM architecture provides a virtualized memory environment as seen by the hardware kernels to simplify development and to improve an application's portability and scalability.

ACM Transactions on Reconfigurable Technology and Systems | 2009

ProtoFlex: Towards Scalable, Full-System Multiprocessor Simulations Using FPGAs

Eric S. Chung; Michael K. Papamichael; Eriko Nurvitadhi; James C. Hoe; Ken Mai; Babak Falsafi

Functional full-system simulators are powerful and versatile research tools for accelerating architectural exploration and advanced software development. Their main shortcoming is limited throughput when simulating large multiprocessor systems with hundreds or thousands of processors or when instrumentation is introduced. We propose the ProtoFlex simulation architecture, which uses FPGAs to accelerate full-system multiprocessor simulation and to facilitate high-performance instrumentation. Prior FPGA approaches that prototype a complete system in hardware are either too complex when scaling to large-scale configurations or require significant effort to provide full-system support. In contrast, ProtoFlex virtualizes the execution of many logical processors onto a consolidated number of multiple-context execution engines on the FPGA. Through virtualization, the number of engines can be judiciously scaled, as needed, to deliver on necessary simulation performance at a large savings in complexity. Further, to achieve low-complexity full-system support, a hybrid simulation technique called transplanting allows implementing in the FPGA only the frequently encountered behaviors, while a software simulator preserves the abstraction of a complete system. We have created a first instance of the ProtoFlex simulation architecture, which is an FPGA-based, full-system functional simulator for a 16-way UltraSPARC III symmetric multiprocessor server, hosted on a single Xilinx Virtex-II XCV2P70 FPGA. On average, the simulator achieves a 38× speedup (and as high as 49×) over comparable software simulation across a suite of applications, including OLTP on a commercial database server. We also demonstrate the advantages of minimal-overhead FPGA-accelerated instrumentation through a CMP cache simulation technique that runs orders-of-magnitude faster than software.

field programmable gate arrays | 2008

A complexity-effective architecture for accelerating full-system multiprocessor simulations using FPGAs

Eric S. Chung; Eriko Nurvitadhi; James C. Hoe; Babak Falsafi; Ken Mai

Functional full-system simulators are powerful and versatile research tools for accelerating architectural exploration and advanced software development. Their main shortcoming is limited throughput when simulating systems with hundreds of processors or more. To overcome this bottleneck, we propose the PROTOFLEX simulation architecture, which uses FPGAs to accelerate simulation. Prior FPGA approaches that prototype a complete system in hardware are either too complex when scaling to large-scale configurations or require significant effort to provide full-system support. In contrast, PROTOFLEX reduces complexity by virtualizing the execution of many logical processors onto a consolidated set of multiple-context execution engines on the FPGA. Through virtualization, the number of engines can be judiciously scaled, as needed, to deliver on necessary simulation performance. To achieve low-complexity full-system support, a hybrid simulation technique called transplanting allows implementing in the FPGA only the frequently encountered behaviors, while a software simulator preserves the abstraction of a complete system. We have created a first instance of the PROTOFLEX simulation architecture, which is an FPGA-based, full-system functional simulator for a 16-way UltraSPARC III symmetric multiprocessor server hosted on a single Xilinx Virtex-II XCV2P70 FPGA. On average, the simulator achieves a 39× speedup (and as high as 49×) over comparable software simulation across a suite of applications, including OLTP on a commercial database server.

hardware oriented security and trust | 2012

Reliability enhancement of bi-stable PUFs in 65nm bulk CMOS

Mudit Bhargava; Cagla Cakir; Ken Mai

We demonstrate the efficacy and associated costs of three reliability enhancing techniques for bi-stable PUF designs (SRAM and sense amplifier-based): directed accelerated aging, multiple evaluations, and activation control. Measured results from a 65nm bulk CMOS full custom PUF testchip demonstrate that these techniques are able to reduce the percentage of unreliable bits by up to 40%, 83%, and 71%, respectively.
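The abstract does not say how the multiple evaluations are combined. One common way to use repeated PUF readouts, shown here purely as an assumed illustration and not as the authors' method, is to take a per-bit majority vote and to flag bits that disagree across readouts. The function names and the bit-string representation are hypothetical.

```python
def majority_vote(readouts):
    """Per-bit majority over an odd number of equal-length bit strings."""
    if len(readouts) % 2 == 0:
        raise ValueError("use an odd number of readouts to avoid ties")
    length = len(readouts[0])
    if any(len(r) != length for r in readouts):
        raise ValueError("all readouts must have the same length")
    result = []
    for column in zip(*readouts):
        ones = sum(bit == "1" for bit in column)
        result.append("1" if ones > len(readouts) // 2 else "0")
    return "".join(result)


def unstable_bits(readouts):
    """Indices of bit positions whose value differs between readouts."""
    return [i for i, column in enumerate(zip(*readouts))
            if len(set(column)) > 1]


# Three readouts of a hypothetical 4-bit PUF; the bit at index 2 flips once.
readouts = ["1011", "1001", "1011"]
stable_response = majority_vote(readouts)
flagged = unstable_bits(readouts)
```

Voting trades evaluation time for reliability, while flagging lets a design mask or discard bits that are known to flip.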
