
Publication


Featured research published by Kevin T. Lim.


IEEE Micro | 2006

The M5 Simulator: Modeling Networked Systems

Nathan L. Binkert; Ronald G. Dreslinski; Lisa R. Hsu; Kevin T. Lim; Ali G. Saidi; Steven K. Reinhardt

The M5 simulator was developed specifically to enable research in TCP/IP networking. It provides the features necessary for simulating networked hosts, including full-system capability, a detailed I/O subsystem, and the ability to simulate multiple networked systems deterministically. M5's usefulness as a general-purpose architecture simulator and its liberal open-source license have led to its adoption by several academic and commercial groups.
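The determinism claim is the interesting part: if every simulated host pulls its events from a single global queue ordered by virtual time, a multi-host run replays identically no matter how the host machine schedules threads. A minimal sketch of that idea in Python (a conceptual illustration, not M5's actual API; the hosts and link latency are hypothetical):

```python
import heapq

# A single global event queue ordered by virtual time makes a multi-host
# simulation deterministic: ties break on insertion order, never on
# wall-clock thread scheduling.
class Simulator:
    def __init__(self):
        self.queue = []   # entries: (virtual_time, seq, callback, payload)
        self.seq = 0      # insertion counter for deterministic tie-breaking
        self.now = 0

    def schedule(self, delay, callback, payload):
        heapq.heappush(self.queue, (self.now + delay, self.seq, callback, payload))
        self.seq += 1

    def run(self):
        while self.queue:
            self.now, _, callback, payload = heapq.heappop(self.queue)
            callback(payload)

sim = Simulator()

# Two hypothetical hosts exchanging a packet over a simulated 10 us link.
def host_b_receive(pkt):
    print(f"t={sim.now}us host B received {pkt!r}")

def host_a_send(pkt):
    print(f"t={sim.now}us host A sent {pkt!r}")
    sim.schedule(10, host_b_receive, pkt)  # fixed link latency => reproducible run

sim.schedule(0, host_a_send, "SYN")
sim.run()
```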


international symposium on computer architecture | 2013

Thin servers with smart pipes: designing SoC accelerators for memcached

Kevin T. Lim; David Meisner; Ali G. Saidi; Parthasarathy Ranganathan; Thomas F. Wenisch

Distributed in-memory key-value stores, such as memcached, are central to the scalability of modern internet services. Current deployments use commodity servers with high-end processors. However, given the cost-sensitivity of internet services and the recent proliferation of volume low-power System-on-Chip (SoC) designs, we see an opportunity for alternative architectures. We undertake a detailed characterization of memcached to reveal performance and power inefficiencies. Our study considers both high-performance and low-power CPUs and NICs across a variety of carefully-designed benchmarks that exercise the range of memcached behavior. We discover that, regardless of CPU microarchitecture, memcached execution is remarkably inefficient, saturating neither network links nor available memory bandwidth. Instead, we find performance is typically limited by the per-packet processing overheads in the NIC and OS kernel---long code paths limit CPU performance due to poor branch predictability and instruction fetch bottlenecks. Our insights suggest that neither high-performance nor low-power cores provide a satisfactory power-performance trade-off, and point to a need for tighter integration of the network interface. Hence, we argue for an alternate architecture---Thin Servers with Smart Pipes (TSSP)---for cost-effective high-performance memcached deployment. TSSP couples an embedded-class low-power core to a memcached accelerator that can process GET requests entirely in hardware, offloading both network handling and data lookup. We demonstrate the potential benefits of our TSSP architecture through an FPGA prototyping platform, and show the potential for a 6X-16X power-performance improvement over conventional server baselines.
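To make the proposed division of labor concrete, here is a toy Python model of the GET path (all names are illustrative, not TSSP's interfaces): everything inside handle_get is what TSSP moves into hardware, leaving the low-power core to handle only the remaining commands.

```python
import hashlib

# Toy in-memory key-value table standing in for memcached's hash table.
store = {b"user:42": b"alice"}

def nic_parse(packet: bytes):
    """Stand-in for the per-packet processing the paper identifies as the
    bottleneck (NIC handling, kernel network stack, protocol parsing)."""
    cmd, _, key = packet.partition(b" ")
    return cmd, key

def handle_get(key: bytes):
    """The hot path TSSP offloads: hash the key and look up the value
    entirely in hardware, bypassing the OS kernel."""
    bucket = int.from_bytes(hashlib.md5(key).digest()[:4], "little")  # key hashing
    _ = bucket  # a real design indexes a bucket array; Python's dict hides that step
    return store.get(key)

def serve(packet: bytes):
    cmd, key = nic_parse(packet)
    if cmd == b"GET":
        return handle_get(key)   # accelerator path: no software on the GET hot path
    raise NotImplementedError("SET/DELETE etc. fall back to the software core")

print(serve(b"GET user:42"))  # b'alice'
```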


international symposium on microarchitecture | 2013

Meet the walkers: accelerating index traversals for in-memory databases

Onur Kocberber; Boris Grot; Javier Picorel; Babak Falsafi; Kevin T. Lim; Parthasarathy Ranganathan

The explosive growth in digital data and its growing role in real-time decision support motivate the design of high-performance database management systems (DBMSs). Meanwhile, slowdown in supply voltage scaling has stymied improvements in core performance and ushered in an era of power-limited chips. These developments motivate the design of DBMS accelerators that (a) maximize utility by accelerating the dominant operations, and (b) provide flexibility in the choice of DBMS, data layout, and data types. We study data analytics workloads on contemporary in-memory databases and find hash index lookups to be the largest single contributor to the overall execution time. The critical path in hash index lookups consists of ALU-intensive key hashing followed by pointer chasing through a node list. Based on these observations, we introduce Widx, an on-chip accelerator for database hash index lookups, which achieves both high performance and flexibility by (1) decoupling key hashing from the list traversal, and (2) processing multiple keys in parallel on a set of programmable walker units. Widx reduces design cost and complexity through its tight integration with a conventional core, thus eliminating the need for a dedicated TLB and cache. An evaluation of Widx on a set of modern data analytics workloads (TPC-H, TPC-DS) using full-system simulation shows an average speedup of 3.1× over an aggressive OoO core on bulk hash table operations, while reducing the OoO core energy by 83%.
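The decoupling Widx exploits can be sketched in a few lines of Python (a conceptual model, not the Widx microarchitecture): one stage hashes a batch of keys up front, then several walker units chase bucket chains concurrently, so a walker stalled on one long chain never blocks hashing or the other walkers.

```python
from collections import deque

NUM_BUCKETS = 8
buckets = [[] for _ in range(NUM_BUCKETS)]   # each bucket: node list of (key, value)
for k, v in [("a", 1), ("b", 2), ("i", 9)]:
    buckets[hash(k) % NUM_BUCKETS].append((k, v))

def hash_stage(keys):
    # Stage 1: ALU-intensive key hashing, decoupled from list traversal.
    return deque((k, hash(k) % NUM_BUCKETS) for k in keys)

def walker(key, bucket_id):
    # Stage 2: pointer chasing through the node list, one node per "cycle".
    for node_key, value in buckets[bucket_id]:
        yield None                           # simulated memory-access latency
        if node_key == key:
            yield (key, value)
            return
    yield (key, None)                        # key not found in this chain

def widx_lookup(keys, num_walkers=4):
    work, results, active = hash_stage(keys), [], []
    while work or active:
        while work and len(active) < num_walkers:   # dispatch to idle walker units
            active.append(walker(*work.popleft()))
        for w in active[:]:                         # advance every walker one step
            step = next(w)
            if step is not None:
                results.append(step)
                active.remove(w)
    return results

print(widx_lookup(["a", "b", "i", "z"]))
```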


field programmable gate arrays | 2013

An FPGA memcached appliance

Sai Rahul Chalamalasetti; Kevin T. Lim; Mitch Wright; Alvin AuYoung; Parthasarathy Ranganathan; Martin Margala

Providing low-latency access to large amounts of data is one of the foremost requirements for many web services. To address these needs, systems such as Memcached have been created which provide a distributed, entirely in-memory key-value store. These systems are critical and often deployed across hundreds or thousands of servers. However, these systems are not well matched for commodity servers, as they require significant CPU resources to achieve reasonable network bandwidth, yet the core Memcached functions do not benefit from the high performance of standard server CPUs. In this paper, we demonstrate the design of an FPGA-based Memcached appliance. We take Memcached, a complex software system, and implement its core functionality on an FPGA. By leveraging the FPGA's customizable logic to create a specialized appliance, we are able to tightly integrate networking, compute, and memory. This integration allows us to overcome many of the bottlenecks found in standard servers. Our design provides performance on par with baseline servers, but consumes only 9% of the power of the baseline. Scaled out, we see benefits at the data center level, substantially improving the performance-per-dollar while improving energy efficiency by 3.2X to 10.9X.
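A quick back-of-envelope check of those headline numbers (normalized, assumed values): on-par throughput at 9% of baseline power implies roughly an 11x performance-per-watt gain, consistent with the reported 10.9x upper bound.

```python
# Normalized sanity check of the appliance's headline numbers.
baseline_power = 1.0      # baseline server power (normalized)
appliance_power = 0.09    # "consumes only 9% of the power of the baseline"
throughput_ratio = 1.0    # "performance on par with baseline servers"

perf_per_watt_gain = throughput_ratio / appliance_power
print(f"perf/W improvement: {perf_per_watt_gain:.1f}x")  # 11.1x, near the reported 10.9x
```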


international symposium on microarchitecture | 2011

System-level integrated server architectures for scale-out datacenters

Sheng Li; Kevin T. Lim; Paolo Faraboschi; Jichuan Chang; Parthasarathy Ranganathan; Norman P. Jouppi

A System-on-Chip (SoC) integrates multiple discrete components into a single chip, for example by placing CPU cores, network interfaces and I/O controllers on the same die. While SoCs have dominated high-end embedded products for over a decade, system-level integration is a relatively new trend in servers, and is driven by the opportunity to lower cost (by reducing the number of discrete parts) and power (by reducing the pin crossings from the cores to the I/O). Today, the mounting cost pressures in scale-out datacenters demand technologies that can decrease the Total Cost of Ownership (TCO). At the same time, the diminishing returns of dedicating the increasing number of available transistors to more cores and caches are creating a stronger case for SoC-based servers.


Proceedings of the 2nd Workshop on Architectures and Systems for Big Data | 2012

Workload diversity and dynamics in big data analytics: implications to system designers

Jichuan Chang; Kevin T. Lim; John Byrne; Laura Ramirez; Parthasarathy Ranganathan

The emergence of big data analytics and the need for cost/energy efficient IT infrastructure motivate a new focus on data-centric designs. In this paper, we aim to better understand the design implications of data analytics systems by quantifying workload requirements and runtime dynamics. We examine four workloads representing big data analytics trends for fast decisions, total integration, deep analysis, and fresh insights: an archive store, a columnar database enhanced with table compression, an analytics engine with distributed R, and a transaction/analytics hybrid system. These applications demonstrate diverse resource requirements both within and across workloads, as well as load imbalance due to data skew. Our observations suggest several directions to design balanced data analytics systems, including tight integration of heterogeneous, active data stores, support for efficient communication, and data-centric load balancing.


workshop on power aware computing and systems | 2011

Power-efficient networking for balanced system designs: early experiences with PCIe

John Byrne; Jichuan Chang; Kevin T. Lim; Laura Ramirez; Parthasarathy Ranganathan

Recent proposals using low-power processors and Flash-based storage can dramatically improve the energy-efficiency of compute and storage subsystems in data-centric computing. However, in a balanced system design, these changes call for matching improvement in the network subsystem as well. Conventional Ethernet-based networks are a potential energy-efficiency bottleneck due to the limited performance of gigabit Ethernet and the high power overhead of 10-gigabit Ethernet. In this paper, we evaluate the benefits of using an alternative, high-bandwidth, low-power, interconnect---PCIe---for power-efficient networking. Our experiments using PCIe's Non-Transparent Bridging for data transfer demonstrate significant performance gains at lower power, leading to 60--124% better energy efficiency. Early experiences with PCIe clustering also point to several challenges of PCIe-based networks and new opportunities for low-latency power-efficient datacenter networking.
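Energy efficiency here means useful throughput per watt. The sketch below shows how a gain in that range falls out of the metric; the throughput and power figures are hypothetical placeholders chosen for illustration, not measurements from the paper.

```python
def efficiency_gain(tput_new, power_new, tput_old, power_old):
    """Percent improvement in throughput-per-watt of one interconnect over another."""
    return ((tput_new / power_new) / (tput_old / power_old) - 1.0) * 100.0

# Hypothetical per-endpoint figures (Gb/s, watts), purely for illustration:
gige = (0.94, 1.0)   # near line-rate gigabit Ethernet
pcie = (8.0, 5.3)    # a PCIe NTB link: higher bandwidth at higher power

print(f"{efficiency_gain(*pcie, *gige):.0f}% better throughput per watt")  # ~61%
```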


IEEE Micro | 2009

Server Designs for Warehouse-Computing Environments

Kevin T. Lim; Parthasarathy Ranganathan; Jichuan Chang; Chandrakant D. Patel; Trevor N. Mudge; Steven K. Reinhardt

The enormous scale of warehouse-computing environments leads to unique requirements in which cost and power figure prominently. Models and metrics quantifying these requirements, along with a benchmark suite to capture workload behavior, help identify bottlenecks and evaluate solutions. A holistic approach leads to a new system architecture incorporating volume non-server-class components in novel packaging solutions, with memory sharing and flash-based disk caching.


ACM Transactions on Architecture and Code Optimization | 2015

Buri: Scaling Big-Memory Computing with Hardware-Based Memory Expansion

Jishen Zhao; Sheng Li; Jichuan Chang; John Byrne; Laura Ramirez; Kevin T. Lim; Yuan Xie; Paolo Faraboschi

Motivated by the challenges of scaling up memory capacity and fully exploiting the benefits of memory compression, we propose Buri, a hardware-based memory compression scheme, which simultaneously achieves cost efficiency, high performance, and ease of adoption. Buri combines (1) a self-contained, ready-to-adopt hardware compression module, which manages metadata compression and memory allocation/relocation operations; (2) a set of hardware optimization mechanisms, which reduce the area and performance overheads in accommodating the address indirection required by memory compression; and (3) lightweight BIOS/OS extensions used to handle exceptions. Our evaluation with large memory workload traces shows that Buri can increase capacity by 70%, in addition to the compression ratio already provided by database software.
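The address indirection Buri must hide amounts to one extra table lookup on every memory access. A toy Python model under assumed parameters (the table layout and exception path are illustrative, not Buri's actual design):

```python
# Toy model of compressed-memory address indirection: each uncompressed page
# maps to a (compressed block, byte offset) pair, and a missing mapping models
# the exception that the BIOS/OS extensions would handle.

PAGE_SIZE = 4096

translation_table = {        # uncompressed page number -> (block, offset)
    0: (0, 0),
    1: (0, 1200),            # page 1 compressed into the tail of block 0
}

class CompressionFault(Exception):
    """Stand-in for the exception path handled by firmware/OS."""

def translate(addr: int):
    """Resolve an uncompressed address to its compressed location.

    A real design decompresses the block before applying the page offset;
    this toy shows only the extra indirection hop that Buri's hardware
    mechanisms work to make cheap.
    """
    page, page_off = divmod(addr, PAGE_SIZE)
    if page not in translation_table:
        raise CompressionFault(f"no mapping for page {page}")
    block, block_off = translation_table[page]
    return block, block_off, page_off

print(translate(4100))  # address in page 1 -> (block 0, offset 1200, +4 within page)
```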


field programmable logic and applications | 2014

A scalable, high-performance customized priority queue

Muhuan Huang; Kevin T. Lim; Jason Cong

Priority queues are abstract data structures where each element is associated with a priority, and the highest priority element is always retrieved first from the queue. The data structure is widely used within databases, including the last stage of a merge-sort, forecasting read-ahead I/O to stream data for the merge-sort, and replacement selection sort. Typical software implementations use a balanced binary tree-based structure, providing O(log N) time for both enqueue and dequeue operations. To improve the performance, we propose several scalable and high-speed FPGA-based implementations of a priority queue. Our insight is that the applications listed above primarily use priority queues through “replace” operations, which remove the highest priority element and place a new element into the queue. Thus, our designs are customized for this operation, allowing for a simple and scalable architecture. We implement three priority queue designs, including use of a register-based array, register-based tree, and BRAM-based tree, which have different benefits and trade-offs of throughput, frequency, and maximum size. More importantly, all designs achieve O(1) time between replace operations. To incorporate the best aspects of our designs, we propose a Hybrid Priority Queue (H-PQ), which combines a register-based array with multiple BRAM-based trees. This design provides, on average, very fast access times to the top items in the queue (through the register-based array), while scaling to large priority queue sizes (through the BRAM-based trees). In our evaluations, we find that H-PQ achieves a 4.3x speedup and 21.5x better energy efficiency compared with Xeon CPU implementations.
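Software already treats replace as one fused operation: Python's standard library exposes it as heapq.heapreplace, which pops the root and sifts the new element down in a single pass, the same access pattern these FPGA designs specialize for. A minimal merge-sort last stage built on it (heapq is a min-heap, whereas the paper's queues retrieve the highest-priority element first):

```python
import heapq

# Last stage of a merge-sort: repeatedly emit the smallest head element and
# replace it with the next item from the same sorted input run.
runs = [[1, 4, 7], [2, 5, 8], [3, 6, 9]]

heap = [(run[0], i, 1) for i, run in enumerate(runs)]  # (value, run id, next index)
heapq.heapify(heap)

merged = []
while heap:
    value, i, nxt = heap[0]
    merged.append(value)
    if nxt < len(runs[i]):
        # One fused replace: pop the root and sift the new element down in a
        # single traversal, instead of a separate dequeue plus enqueue.
        heapq.heapreplace(heap, (runs[i][nxt], i, nxt + 1))
    else:
        heapq.heappop(heap)   # run exhausted; shrink the queue

print(merged)  # [1, 2, 3, 4, 5, 6, 7, 8, 9]
```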

Collaboration


Dive into Kevin T. Lim's collaborations.

Top Co-Authors

Jishen Zhao (University of California)