Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Derek Chiou is active.

Publication


Featured research published by Derek Chiou.


International Symposium on Computer Architecture | 2014

A reconfigurable fabric for accelerating large-scale datacenter services

Andrew Putnam; Adrian M. Caulfield; Eric S. Chung; Derek Chiou; Kypros Constantinides; John Demme; Hadi Esmaeilzadeh; Jeremy Fowers; Gopi Prashanth Gopal; Jan Gray; Michael Haselman; Scott Hauck; Stephen Heil; Amir Hormati; Joo-Young Kim; Sitaram Lanka; James R. Larus; Eric C. Peterson; Simon Pope; Aaron Smith; Jason Thong; Phillip Yi Xiao; Doug Burger

Datacenter workloads demand high computational capabilities, flexibility, power efficiency, and low cost. It is challenging to improve all of these factors simultaneously. To advance datacenter capabilities beyond what commodity server designs can provide, we have designed and built a composable, reconfigurable fabric to accelerate portions of large-scale software services. Each instantiation of the fabric consists of a 6×8 2-D torus of high-end Stratix V FPGAs embedded into a half-rack of 48 machines. One FPGA is placed into each server, accessible through PCIe, and wired directly to other FPGAs with pairs of 10 Gb SAS cables. In this paper, we describe a medium-scale deployment of this fabric on a bed of 1,632 servers, and measure its efficacy in accelerating the Bing web search engine. We describe the requirements and architecture of the system, detail the critical engineering challenges and solutions needed to make the system robust in the presence of failures, and measure the performance, power, and resilience of the system when ranking candidate documents. Under high load, the large-scale reconfigurable fabric improves the ranking throughput of each server by a factor of 95% for a fixed latency distribution, or, while maintaining equivalent throughput, reduces the tail latency by 29%.
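To make the torus interconnect concrete, here is a minimal Python sketch of neighbor addressing in a 6×8 2-D torus; the coordinate scheme and function names are illustrative assumptions, not details from the paper.

# Illustrative sketch of neighbor addressing in a 6x8 2-D torus,
# as used by the Catapult fabric (the coordinate scheme is an
# assumption, not taken from the paper).
ROWS, COLS = 6, 8

def torus_neighbors(row, col):
    """Return the four neighbors of node (row, col); edges wrap around."""
    return [
        ((row - 1) % ROWS, col),  # north
        ((row + 1) % ROWS, col),  # south
        (row, (col - 1) % COLS),  # west
        (row, (col + 1) % COLS),  # east
    ]

# Example: a corner node still has four neighbors because the torus wraps.
print(torus_neighbors(0, 0))  # [(5, 0), (1, 0), (0, 7), (0, 1)]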


International Symposium on Microarchitecture | 2007

FPGA-Accelerated Simulation Technologies (FAST): Fast, Full-System, Cycle-Accurate Simulators

Derek Chiou; Dam Sunwoo; Joonsoo Kim; Nikhil A. Patil; William H. Reinhart; Darrel Eric Johnson; Jebediah Keefe; Hari Angepat

This paper describes FAST, a novel simulation methodology that can produce simulators that (i) are orders of magnitude faster than comparable simulators, (ii) are cycle-accurate, (iii) model the entire system running unmodified applications and operating systems, (iv) provide visibility with minimal simulation performance impact and (v) are capable of running current instruction sets such as x86. It achieves its capabilities by partitioning simulators into a speculative functional model component that simulates the instruction set architecture and a timing model component that predicts performance. The speculative functional model enables the simulator to be parallelized, implementing the timing model in FPGA hardware for speed and the functional model using a modified full-system simulator. We currently achieve an average simulation speed of 1.2 MIPS running x86 applications on x86 Linux and Windows XP and expect to achieve 10 MIPS over time. Such simulators are useful to virtually all computer system simulator users, ranging from architects, through RTL designers and verifiers, to software developers. Sharing a common simulation/design infrastructure could foster better communication between these groups, potentially resulting in better system designs.
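The core idea, splitting execution (functional model) from performance prediction (timing model), can be pictured as a producer/consumer loop. A minimal Python sketch follows; the toy trace format, the one-bit branch predictor, and the 10-cycle penalty are all illustrative assumptions, not FAST's actual design (and in the real system the timing model runs in FPGA hardware, not software).

# Minimal sketch of a functional/timing simulator partition.
def functional_model(program):
    """Execute the program and yield a trace of (op, taken) records.

    In FAST the functional model runs speculatively; here we simply
    emit the architecturally correct trace for brevity."""
    for op, taken in program:
        yield op, taken

def timing_model(record, predictor_state):
    """Predict cycles for one trace record; flag branch mispredictions."""
    op, taken = record
    cycles = 1
    if op == "branch":
        predicted = predictor_state["last_taken"]
        if predicted != taken:
            cycles += 10  # illustrative misprediction penalty
        predictor_state["last_taken"] = taken
    return cycles

program = [("add", None), ("branch", True), ("branch", False), ("load", None)]
state = {"last_taken": False}
total = sum(timing_model(r, state) for r in functional_model(program))
print(f"predicted cycles: {total}")  # 24 with this toy penalty model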


International Symposium on Microarchitecture | 2007

RAMP: Research Accelerator for Multiple Processors

John Wawrzynek; David A. Patterson; Mark Oskin; Shih-Lien Lu; Christoforos E. Kozyrakis; James C. Hoe; Derek Chiou; Krste Asanovic

The RAMP project's goal is to enable the intensive, multidisciplinary innovation that the computing industry will need to tackle the problems of parallel processing. RAMP itself is an open-source, community-developed, FPGA-based emulator of parallel architectures. Its design framework lets a large, collaborative community develop and contribute reusable, composable design modules. Three complete designs, for transactional memory, distributed systems, and distributed shared memory, demonstrate the platform's potential.


Design Automation Conference | 2000

Application-specific memory management for embedded systems using software-controlled caches

Derek Chiou; Prabhat Jain; Larry Rudolph; Srinivas Devadas

We propose a way to improve the performance of embedded processors running data-intensive applications by allowing software to allocate on-chip memory on an application-specific basis. On-chip memory in the form of cache can be made to act like scratch-pad memory via a novel hardware mechanism, which we call column caching. Column caching enables dynamic cache partitioning in software, by mapping data regions to specified sets of cache “columns” or “ways.” When a region of memory is exclusively mapped to an equivalent sized partition of cache, column caching provides the same functionality and predictability as a dedicated scratchpad memory for time-critical parts of a real-time application. The ratio between scratchpad size and cache size can be easily and quickly varied for each application, or each task within an application. Thus, software has much finer control of on-chip memory, providing the ability to dynamically trade off performance for on-chip memory.
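A minimal sketch of how way-based partitioning might be expressed in software, assuming an 8-way cache and a bit-vector encoding of permitted ways; the encoding and all names are illustrative, not the paper's actual interface.

# Illustrative sketch of column caching's way-based partitioning.
NUM_WAYS = 8

def make_column_mask(ways):
    """Build a replacement-candidate bitmask from a list of way indices."""
    mask = 0
    for w in ways:
        assert 0 <= w < NUM_WAYS
        mask |= 1 << w
    return mask

# Map a time-critical buffer exclusively to ways 0-3: those ways then
# behave like a dedicated scratchpad for that region, while the rest
# of memory competes only for ways 4-7.
scratchpad_mask = make_column_mask([0, 1, 2, 3])  # 0b00001111
general_mask    = make_column_mask([4, 5, 6, 7])  # 0b11110000

def replacement_candidates(region_mask):
    """Ways the replacement policy may victimize for this region."""
    return [w for w in range(NUM_WAYS) if region_mask & (1 << w)]

print(replacement_candidates(scratchpad_mask))  # [0, 1, 2, 3]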


European Conference on Parallel Processing | 1995

START-NG: Delivering Seamless Parallel Computing

Derek Chiou; Boon Seong Ang; Robert Greiner; Arvind; James C. Hoe; Michael J. Beckerle; James E. Hicks; G. Andrew Boughton

StarT-ng is a joint MIT-Motorola project to build a high-performance message passing machine from commercial systems. Each site of the machine consists of a PowerPC 620-based Motorola symmetric multiprocessor (SMP) running the AIX 4.1 operating system. Every processor is connected to a low-latency, high-bandwidth network that is directly accessible from user-level code. In addition to fast message passing capabilities, the machine has experimental support for cache-coherent shared memory across sites. When the machine requires memory to be kept globally coherent, one processor on each site is devoted to supporting shared memory. When globally coherent shared memory is not required, that processor can be used for normal computation tasks. StarT-ng will be delivered at about the time the base SMP is introduced into the marketplace. The ability to be both a collection of standard SMPs and an aggressive message passing machine with coherent shared memory makes StarT-ng a good building block for incrementally expandable parallel machines.


IEEE Hot Chips Symposium | 2006

Research Accelerator for Multiple Processors

David A. Patterson; Arvind; Krste Asanovic; Derek Chiou; James C. Hoe; Christos Kozyrakis; Shih-Lien Lu; Mark Oskin; Jan M. Rabaey; John Wawrzynek

This article consists of a collection of slides from the authors' conference presentation on RAMP, the Research Accelerator for Multiple Processors. Some of the specific topics discussed include: system specifications and architecture; uniprocessor performance capabilities; RAMP hardware and description language features; RAMP applications development; storage capabilities; and future areas of technological development.


High-Performance Computer Architecture | 2015

GPGPU performance and power estimation using machine learning

Gene Y. Wu; Joseph L. Greathouse; Alexander Lyashevsky; Nuwan Jayasena; Derek Chiou

Graphics Processing Units (GPUs) have numerous configuration and design options, including core frequency, number of parallel compute units (CUs), and available memory bandwidth. At many stages of the design process, it is important to estimate how application performance and power are impacted by these options. This paper describes a GPU performance and power estimation model that uses machine learning techniques on measurements from real GPU hardware. The model is trained on a collection of applications that are run at numerous different hardware configurations. From the measured performance and power data, the model learns how applications scale as the GPU's configuration is changed. Hardware performance counter values are then gathered when running a new application on a single GPU configuration. These dynamic counter values are fed into a neural network that predicts which scaling curve from the training data best represents this kernel. This scaling curve is then used to estimate the performance and power of the new application at different GPU configurations. Over an 8× range of the number of CUs, a 3.3× range of core frequencies, and a 2.9× range of memory bandwidth, our model's performance and power estimates are accurate to within 15% and 10% of real hardware, respectively. This is comparable to the accuracy of cycle-level simulators. However, after an initial training phase, our model runs as fast as, or faster than, the program running natively on real hardware.
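The scaling-curve idea can be sketched as follows. The paper uses a neural network over many hardware counters; this dependency-free Python sketch substitutes a nearest-centroid classifier over two made-up counters, so all data, names, and numbers are illustrative.

# Sketch: classify a new kernel's counter vector to one of the
# scaling curves learned from training applications, then read the
# estimate off that curve (nearest-centroid stands in for the
# paper's neural network; all data below is invented).
import math

scaling_curves = {
    "compute_bound": {"centroid": [0.9, 0.1], "perf": {8: 1.0, 16: 1.9, 32: 3.6}},
    "memory_bound":  {"centroid": [0.2, 0.8], "perf": {8: 1.0, 16: 1.3, 32: 1.4}},
}

def predict_scaling(counters):
    """Pick the training curve whose centroid is nearest the counters."""
    return min(
        scaling_curves.values(),
        key=lambda c: math.dist(c["centroid"], counters),
    )

# Counters measured on one base configuration of a new kernel,
# e.g. [ALU utilization, memory-unit utilization] (illustrative).
curve = predict_scaling([0.3, 0.7])
base_perf = 100.0  # measured at 8 CUs, arbitrary units
print({cus: base_perf * s for cus, s in curve["perf"].items()})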


Journal of Parallel and Distributed Computing | 1993

Performance studies of Id on the Monsoon dataflow system

James Edward Hicks; Derek Chiou; Boon Seong Ang; Arvind

Abstract In this paper, we examine the performance of Id, an implicitly parallel language, on Monsoon, an experimental dataflow machine. One of the precepts of our work is that the Id run-time system and compiled Id programs should run on any number of Monsoon processors without change. Our experiments running Id programs on Monsoon show that speedups of more than 7 are easily achieved on 8 processors for most of the applications that we studied. We explain the sources of overhead that limit the speedup of each of our benchmark programs. We also compare the performance of Id on a single Monsoon processor with C/Fortran on a DEC Station 5000 (MIPS R3000 processor), to establish a baseline for the efficiency of Id execution on Monsoon. We find that the execution of Id programs on one Monsoon processor takes up to three times as many cycles as the corresponding C or Fortran programs executing on a MIPS R3000 processor. We identify the sources of inefficiency on Monsoon and suggest improvements, where possible. In many cases, however, improving single processor performance will reduce parallel processor performance.
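Some back-of-the-envelope arithmetic on the abstract's two headline numbers (illustrative only, and ignoring clock-rate differences between the machines):

# Parallel efficiency implied by "speedups of more than 7 on 8
# processors", and the processor count at which that efficiency
# would recover the "up to 3x" single-processor cycle overhead.
speedup, processors = 7.0, 8
efficiency = speedup / processors
print(f"parallel efficiency: {efficiency:.0%}")  # ~88%

cycle_overhead = 3.0
breakeven = cycle_overhead / efficiency
print(f"break-even vs. sequential C/Fortran at ~{breakeven:.1f} processors")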


IEEE Computer Architecture Letters | 2014

An FPGA-based In-Line Accelerator for Memcached

Maysam Lavasani; Hari Angepat; Derek Chiou

We present a method for accelerating server applications using a hybrid CPU+FPGA architecture and demonstrate its advantages by accelerating Memcached, a distributed key-value system. The accelerator, implemented on the FPGA fabric, processes request packets directly from the network, avoiding the CPU in most cases. The accelerator is created by profiling the application to determine the most commonly executed trace of basic blocks which are then extracted. Traces are executed speculatively within the FPGA. If the control flow exits the trace prematurely, the side effects of the computation are rolled back and the request packet is passed to the CPU. When compared to the best reported software numbers, the Memcached accelerator is 9.15× more energy efficient for common case requests.
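The speculate-then-rollback dispatch can be sketched as follows; the packet format, the hot trace, and all names are illustrative assumptions, not the accelerator's actual implementation.

# Sketch: run the extracted hot trace speculatively; if control flow
# leaves the trace, discard buffered side effects and hand the
# request to the CPU slow path.
class TraceExit(Exception):
    """Control flow left the extracted trace prematurely."""

def run_hot_trace(packet, store):
    op, key, *val = packet.split()
    if op != "get" or key not in store:  # off-trace condition
        raise TraceExit
    return store[key]

def cpu_slow_path(packet, store):
    op, key, *val = packet.split()
    if op == "set":
        store[key] = val[0]
    return "handled on CPU"

def handle_request(packet, store):
    shadow = dict(store)  # buffered side effects (rollback copy)
    try:
        run_hot_trace(packet, shadow)  # e.g., the common GET path
    except TraceExit:
        return cpu_slow_path(packet, store)  # shadow copy discarded
    store.clear()
    store.update(shadow)  # commit speculative writes
    return "handled on FPGA"

store = {"k1": "v1"}
print(handle_request("get k1", store))     # handled on FPGA
print(handle_request("set k2 v2", store))  # handled on CPU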


International Conference on Computer-Aided Design | 2007

The FAST methodology for high-speed SoC/computer simulation

Derek Chiou; Dam Sunwoo; Joonsoo Kim; Nikhil A. Patil; William H. Reinhart; Darrel Eric Johnson; Zheng Xu

This paper describes the FAST methodology that enables a single FPGA to accelerate the performance of cycle-accurate computer system simulators modeling modern, realistic SoCs, embedded systems and standard desktop/laptop/server computer systems. The methodology partitions a simulator into (i) a functional model that simulates the functionality of the computer system and (ii) a predictive model that predicts performance and other metrics. The partitioning is crafted to map most of the parallel work onto a hardware-based predictive model, eliminating much of the complexity and difficulty of simulating parallel constructs on a sequential platform. FAST conventions and libraries have been designed to make creating, modifying, using and measuring such simulators straightforward. We describe a prototype FAST system: a full-system, RTL-level cycle-accurate-capable computer system simulator that executes the x86 ISA, boots unmodified Linux and executes unmodified x86 applications. The prototype runs two to three orders of magnitude faster than the fastest Intel and AMD RTL-level cycle-accurate x86 software-based simulators and about six to seven times faster than RTL simulation.

Collaboration


Dive into Derek Chiou's collaboration.

Top Co-Authors

Resit Sendag, University of Rhode Island
Joshua J. Yi, Freescale Semiconductor
Arvind, Massachusetts Institute of Technology
Hari Angepat, University of Texas at Austin
Larry Rudolph, Massachusetts Institute of Technology
Dam Sunwoo, University of Texas at Austin
Nikhil A. Patil, University of Texas at Austin
James C. Hoe, Carnegie Mellon University
Joel S. Emer, Massachusetts Institute of Technology