
Publication


Featured research published by Kermin Fleming.


ACM Special Interest Group on Data Communication (SIGCOMM) | 2012

Spinal codes

Jonathan Perry; Peter Anthony Iannucci; Kermin Fleming; Hari Balakrishnan; Devavrat Shah

Spinal codes are a new class of rateless codes that enable wireless networks to cope with time-varying channel conditions in a natural way, without requiring any explicit bit rate selection. The key idea in the code is the sequential application of a pseudo-random hash function to the message bits to produce a sequence of coded symbols for transmission. This encoding ensures that two input messages that differ in even one bit lead to very different coded sequences after the point at which they differ, providing good resilience to noise and bit errors. To decode spinal codes, this paper develops an approximate maximum-likelihood decoder, called the bubble decoder, which runs in time polynomial in the message size and achieves the Shannon capacity over both additive white Gaussian noise (AWGN) and binary symmetric channel (BSC) models. Experimental results obtained from a software implementation of a linear-time decoder show that spinal codes achieve higher throughput than fixed-rate LDPC codes, rateless Raptor codes, and the layered rateless coding approach of Strider, across a range of channel conditions and message sizes. An early hardware prototype that can decode at 10 Mbits/s on an FPGA demonstrates that spinal codes are a practical construction.
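
The sequential hashing idea is easy to illustrate. Below is a minimal Python sketch of a spinal-style encoder under simplified assumptions: SHA-256 stands in for the pseudo-random hash function, and the chunk size, symbol width, and number of passes (k, symbol_bits, passes) are illustrative parameters, not the paper's choices.

import hashlib

def spinal_encode(message_bits, k=4, symbol_bits=8, passes=2):
    # Toy spinal-style encoder: hash successive k-bit chunks of the message,
    # chaining each hash with the previous spine value, then derive coded
    # symbols from the spine values (more passes -> more symbols: rateless).
    chunks = [message_bits[i:i + k] for i in range(0, len(message_bits), k)]
    spine = b"\x00"                                   # initial spine value s0
    spines = []
    for chunk in chunks:
        # s_i = h(s_{i-1}, m_i): sequential application of the hash function.
        data = spine + "".join(str(b) for b in chunk).encode()
        spine = hashlib.sha256(data).digest()
        spines.append(spine)
    symbols = []
    for p in range(passes):
        for s in spines:
            symbols.append(s[p % len(s)] % (1 << symbol_bits))
    return symbols

# Two messages differing in one bit diverge after the point of difference.
print(spinal_encode([0, 1, 1, 0, 1, 0, 0, 1]))
print(spinal_encode([0, 1, 1, 0, 1, 0, 0, 0]))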


Field Programmable Gate Arrays (FPGA) | 2011

Leap scratchpads: automatic memory and cache management for reconfigurable logic

Michael Adler; Kermin Fleming; Angshuman Parashar; Michael Pellauer; Joel S. Emer

Developers accelerating applications on FPGAs or other reconfigurable logic have nothing but raw memory devices in their standard toolkits. Each project typically includes tedious development of single-use memory management. Software developers expect a programming environment to include automatic memory management. Virtual memory provides the illusion of very large arrays and processor caches reduce access latency without explicit programmer instructions. LEAP scratchpads for reconfigurable logic dynamically allocate and manage multiple, independent, memory arrays in a large backing store. Scratchpad accesses are cached automatically in multiple levels, ranging from shared on-board, RAM-based, set-associative caches to private caches stored in FPGA RAM blocks. In the LEAP framework, scratchpads share the same interface as on-die RAM blocks and are plug-in replacements. Additional libraries support heap management within a storage set. Like software developers, accelerator authors using scratchpads may focus more on core algorithms and less on memory management.
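
As a rough intuition for what a scratchpad provides, the Python sketch below models a memory that exposes a plain read/write interface while transparently caching a small working set in front of a large backing store. The class name, sizes, and eviction policy are illustrative assumptions, not the LEAP API.

class Scratchpad:
    # Toy model of a scratchpad-style memory: a simple read/write interface
    # (like an on-die RAM block) backed by a large store, with a small
    # transparent cache in front of it. Sizes and policy are illustrative.

    def __init__(self, size, cache_lines=64):
        self.backing = [0] * size          # large backing store (e.g., board RAM)
        self.cache = {}                    # small private cache (e.g., block RAM)
        self.cache_lines = cache_lines

    def write(self, addr, value):
        self.cache[addr] = value           # update the cache first (write-back)
        if len(self.cache) > self.cache_lines:
            evict_addr, evict_value = self.cache.popitem()
            self.backing[evict_addr] = evict_value   # spill to the backing store

    def read(self, addr):
        if addr in self.cache:             # hit: served from on-chip storage
            return self.cache[addr]
        value = self.backing[addr]         # miss: fetch from the backing store
        self.write(addr, value)            # install in the cache
        return value

# Usage: looks like a plain RAM, regardless of where the data actually lives.
sp = Scratchpad(size=1 << 20)
sp.write(123456, 42)
print(sp.read(123456))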


Architectures for Networking and Communications Systems (ANCS) | 2010

Airblue: a system for cross-layer wireless protocol development

Man Cheuk Ng; Kermin Fleming; Mythili Vutukuru; Samuel Gross; Arvind; Hari Balakrishnan

Over the past few years, researchers have developed many cross-layer wireless protocols to improve the performance of wireless networks. Experimental evaluations of these protocols have been carried out mostly using software-defined radios, which are typically two to three orders of magnitude slower than commodity hardware. FPGA-based platforms provide much better speeds but are quite difficult to modify because of the way high-speed designs are typically implemented. Experimenting with cross-layer protocols requires a flexible way to convey information beyond the data itself from lower to higher layers, and a way for higher layers to configure lower layers dynamically and within some latency bounds. One also needs to be able to modify a layer's processing pipeline without triggering a cascade of changes. We have developed Airblue, an FPGA-based software radio platform that has all these properties and runs at speeds comparable to commodity hardware. We discuss the design philosophy underlying Airblue that makes it relatively easy to modify, and present early experimental results.


International Conference on Formal Methods and Models for Co-Design (MEMOCODE) | 2007

Hardware Acceleration of Matrix Multiplication on a Xilinx FPGA

Nirav Dave; Kermin Fleming; Myron King; Michael Pellauer; Muralidaran Vijayaraghavan

The first MEMOCODE hardware/software co-design contest posed the following problem: optimize matrix-matrix multiplication in such a way that it is split between the FPGA and the PowerPC on a Xilinx Virtex-II Pro 30. In this paper we discuss our solution, which we implemented on a Xilinx XUP development board with 256 MB of DRAM. The design was done by the five authors over a span of approximately three weeks, though of the 15 possible man-weeks, only about 9 were actually spent working on this problem. All hardware design was done using Bluespec SystemVerilog (BSV), with the exception of an imported Verilog multiplication unit, necessary only because of limitations in the Xilinx FPGA toolflow optimizations.
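
For intuition, here is a minimal Python sketch of the kind of blocked matrix-matrix multiplication such a split suggests: a host-side loop walks fixed-size tiles and a kernel function (standing in for the FPGA datapath) computes each tile product. The block size and function names are illustrative, not the contest implementation.

BLOCK = 4   # illustrative tile size

def tile_multiply_accumulate(C, A, B, i0, j0, k0, n):
    # Stand-in for the FPGA kernel: multiply one BLOCK x BLOCK tile of A and B
    # and accumulate the result into the corresponding tile of C.
    for i in range(i0, min(i0 + BLOCK, n)):
        for j in range(j0, min(j0 + BLOCK, n)):
            total = 0
            for k in range(k0, min(k0 + BLOCK, n)):
                total += A[i][k] * B[k][j]
            C[i][j] += total

def matmul_blocked(A, B, n):
    # Host-side loop (the PowerPC's role in the contest setup): walk the tiles
    # and dispatch each tile product to the kernel.
    C = [[0] * n for _ in range(n)]
    for i0 in range(0, n, BLOCK):
        for j0 in range(0, n, BLOCK):
            for k0 in range(0, n, BLOCK):
                tile_multiply_accumulate(C, A, B, i0, j0, k0, n)
    return C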


International Conference on Formal Methods and Models for Co-Design (MEMOCODE) | 2008

H.264 Decoder: A Case Study in Multiple Design Points

Kermin Fleming; Chun-Chieh Lin; Nirav Dave; Arvind; Gopal Raghavan; Jamey Hicks

H.264, a state-of-the-art video compression standard, is used across a range of products from cellphones to HDTV. These products have vastly different performance, power, and cost requirements, necessitating different hardware-software solutions for H.264 decoding. We show that a design methodology and associated tools which support synthesis from high-level descriptions and which allow modular refinement throughout the design cycle can share the majority of design effort across multiple design points. Using Bluespec SystemVerilog, we have created a variety of designs for the H.264 decoder tuned to support decoding at resolutions ranging from QCIF video (176×144 @ 15 frames/second) to 1080p video (1920×1080 @ 60 frames/second) in a 180 nm process. Some of these design points require major transformations of pipelining to increase performance or to reduce area. We also explore several common design issues surrounding memory structures, such as caches and on-chip vs. off-chip memories. We believe the design methodology used in this paper is directly applicable to many IP blocks involving algorithmic specifications. The same design capabilities also permit rapid microarchitecture exploration and changes in RTL late in the design process, even in non-algorithmic IP blocks.


Field Programmable Gate Arrays (FPGA) | 2012

Leveraging latency-insensitivity to ease multiple FPGA design

Kermin Fleming; Michael Adler; Michael Pellauer; Angshuman Parashar; Arvind Mithal; Joel S. Emer

Traditionally, hardware designs partitioned across multiple FPGAs have had low performance due to the inefficiency of maintaining cycle-by-cycle timing among discrete FPGAs. In this paper, we present a mechanism by which complex designs may be efficiently and automatically partitioned among multiple FPGAs using explicitly programmed latency-insensitive links. We describe the automatic synthesis of an area-efficient, high-performance network for routing these inter-FPGA links. By mapping a diverse set of large research prototypes onto a multiple-FPGA platform, we demonstrate that our tool obtains significant gains in design feasibility, compilation time, and even wall-clock performance.
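
The notion of a latency-insensitive link can be sketched in a few lines of Python: the producer only checks for back-pressure and the consumer only checks for data availability, so functional correctness does not depend on how many cycles a message spends crossing between FPGAs. The buffer depth and randomized delay below are illustrative assumptions, not the paper's network.

from collections import deque
import random

class LatencyInsensitiveChannel:
    # Toy model of a latency-insensitive link between two FPGA partitions:
    # the producer blocks only on back-pressure and the consumer only on
    # data availability, so correctness is independent of in-flight latency.

    def __init__(self, depth=8):
        self.fifo = deque()
        self.depth = depth

    def can_send(self):
        return len(self.fifo) < self.depth       # flow control (back-pressure)

    def send(self, value):
        assert self.can_send()
        # Model a variable inter-FPGA latency by tagging a random delay.
        self.fifo.append((value, random.randint(1, 5)))

    def tick(self):
        # Age every in-flight message by one cycle.
        self.fifo = deque((v, max(0, d - 1)) for v, d in self.fifo)

    def recv(self):
        if self.fifo and self.fifo[0][1] == 0:    # head-of-line message arrived
            return self.fifo.popleft()[0]
        return None                               # not ready yet

# Usage: the consumer simply retries until data arrives; no cycle counting.
ch = LatencyInsensitiveChannel()
if ch.can_send():
    ch.send("packet-0")
result = None
while result is None:
    ch.tick()
    result = ch.recv()
print(result)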


Field Programmable Gate Arrays (FPGA) | 2015

MATCHUP: Memory Abstractions for Heap Manipulating Programs

Felix Winterstein; Kermin Fleming; Hsin-Jung Yang; Samuel Bayliss; George A. Constantinides

Memory-intensive implementations often require access to an external, off-chip memory which can substantially slow down an FPGA accelerator due to memory bandwidth limitations. Buffering frequently reused data on chip is a common approach to address this problem and the optimization of the cache architecture introduces yet another complex design space. This paper presents a high-level synthesis (HLS) design aid that generates parallel application-specific multi-scratchpad architectures including on-chip caches. Our program analysis identifies non-overlapping memory regions, supported by private scratchpads, and regions which are shared by parallel units after parallelization and which are supported by coherent scratchpads and synchronization primitives. It also decides whether the parallelization is legal with respect to data dependencies. The novelty of this work is the focus on programs using dynamic, pointer-based data structures and dynamic memory allocation which, while common in software engineering, remain difficult to analyze and are beyond the scope of the overwhelming majority of HLS techniques to date. We demonstrate our technique with three case studies of applications using dynamically allocated data structures and use Xilinx Vivado HLS as an exemplary HLS tool. We show up to 10x speed-up after parallelization of the HLS implementations and the insertion of the application-specific distributed hybrid scratchpad architecture.
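
The region-classification step can be illustrated with a small Python sketch: given which parallel units access which heap regions, regions touched by a single unit map to private scratchpads and regions touched by several units map to coherent scratchpads with synchronization. The region and unit names below are hypothetical, and the function is a simplification of the program analysis described above.

def assign_scratchpads(accesses):
    # accesses: dict mapping a heap region name to the set of parallel units
    # that touch it after parallelization.
    assignment = {}
    for region, units in accesses.items():
        if len(units) == 1:
            # Non-overlapping region: a private scratchpad suffices.
            assignment[region] = ("private scratchpad", next(iter(units)))
        else:
            # Shared region: needs a coherent scratchpad plus synchronization.
            assignment[region] = ("coherent scratchpad", sorted(units))
    return assignment

print(assign_scratchpads({
    "tree_nodes_left":  {"unit0"},
    "tree_nodes_right": {"unit1"},
    "work_queue":       {"unit0", "unit1"},
}))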


IEEE Transactions on Consumer Electronics | 2011

Improving performance and lifetime of solid-state drives using hardware-accelerated compression

Sungjin Lee; Jihoon Park; Kermin Fleming; Arvind; Jihong Kim

The performance and lifetime of high-performance solid-state drives (SSDs) can be improved by data compression, which can reduce the amount of data physically transferred from/to flash memory. In this paper, we present our experience of building a high-performance solid-state drive using a hardware-accelerated compression module called BlueZIP. In order to fully exploit the BlueZIP module, we devise a compression-aware flash translation layer (FTL), called CaFTL, which supports compression-aware address mapping and garbage collection for BlueZIP. For poorly compressed pages, CaFTL supports selective compression so that unnecessary compression can be avoided. We have implemented a complete SSD prototype with BlueZIP on an FPGA-based custom SSD platform and evaluated its effectiveness using realistic workloads. Our evaluation results show that BlueZIP can increase the lifetime of the SSD prototype by 26% as well as improve read and write performance by 20% and 27%, respectively, on average.
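
Selective compression is easy to sketch: compress a page, and keep the compressed form only if it actually saves space. In the Python sketch below, zlib stands in for the BlueZIP hardware codec, and the page size and saving threshold are illustrative assumptions rather than CaFTL's actual policy.

import os
import zlib

PAGE_SIZE = 4096   # illustrative flash page size in bytes

def maybe_compress(page, min_saving=0.10):
    # Compress the page and keep the compressed form only if it saves at
    # least `min_saving` of the page; otherwise store the raw page and skip
    # the useless compression of poorly compressible data.
    compressed = zlib.compress(page)
    if len(compressed) <= len(page) * (1.0 - min_saving):
        return ("compressed", compressed)
    return ("raw", page)

print(maybe_compress(b"A" * PAGE_SIZE)[0])       # text-like data -> 'compressed'
print(maybe_compress(os.urandom(PAGE_SIZE))[0])  # random data    -> 'raw'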


Field Programmable Logic and Applications (FPL) | 2014

The LEAP FPGA operating system

Kermin Fleming; Hsin-Jung Yang; Michael Adler; Joel S. Emer

FPGAs offer attractive power and performance for many applications, especially relative to traditional sequential architectures. In spite of these advantages, FPGAs have been deployed in only a few, niche domains. We argue that the difficulty of programming FPGAs all but precludes their use in more general systems: FPGA programmers are currently exposed to all the gory system details that software operating systems long ago abstracted away. In this work, we present the Latency-insensitive Environment for Application Programming (LEAP), an FPGA operating system built around latency-insensitive communication channels. LEAP alleviates the FPGA programming problem by providing a rich set of portable, latency-insensitive abstraction layers for program development. Unlike software operating system services, which are generally dynamic, the nature of FPGAs requires that many configuration decisions be made at compile time. We present an extensible interface for compile-time management of resources. We demonstrate that LEAP provides design portability, while consuming as little as 3% of FPGA area, by mapping several designs onto various FPGA platforms.


Field-Programmable Custom Computing Machines (FCCM) | 2014

LEAP Shared Memories: Automating the Construction of FPGA Coherent Memories

Hsin-Jung Yang; Kermin Fleming; Michael Adler; Joel S. Emer

Parallel programming has been widely used in many scientific and technical areas to solve large problems. While general-purpose processors have rich infrastructure to support parallel programming on shared memory, such as coherent caches and synchronization libraries, parallel programming infrastructure for FPGAs is limited. Thus, development of FPGA-based parallel algorithms remains difficult. In this work, we seek to simplify parallel programming on FPGAs. We provide a set of easy-to-use declarative primitives to maintain coherency and consistency of accesses to shared memory resources. We propose a shared-memory service that automatically manages coherent caches on multiple FPGAs. Experimental results of a 2-dimensional heat transfer equation show that the shared-memory service with our distributed coherent caches outperforms a centralized cache by 2.6x. To handle synchronization, we provide new lock and barrier primitives that leverage native FPGA communication capabilities and outperform traditional through-memory primitives by 1.8x.
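
To make the synchronization idea concrete, the Python sketch below builds a barrier on top of message passing (queues) rather than shared-memory polling, loosely mirroring the idea of primitives that use native communication channels; the coordinator thread and queue protocol are illustrative, not the paper's implementation.

import queue
import threading

class MessageBarrier:
    # Toy barrier built on message passing rather than shared-memory polling:
    # parties send an "arrived" message and block on a per-party release
    # channel; a coordinator releases everyone once all arrivals are in.

    def __init__(self, n_parties):
        self.n = n_parties
        self.arrivals = queue.Queue()
        self.releases = [queue.Queue() for _ in range(n_parties)]
        threading.Thread(target=self._coordinate, daemon=True).start()

    def _coordinate(self):
        while True:
            for _ in range(self.n):        # collect an arrival from every party
                self.arrivals.get()
            for q in self.releases:        # broadcast the release message
                q.put(True)

    def wait(self, party):
        self.arrivals.put(party)           # "I have arrived"
        self.releases[party].get()         # block until released

# Usage: two workers meet at the barrier on every iteration.
barrier = MessageBarrier(2)
def worker(pid):
    for step in range(3):
        barrier.wait(pid)
threads = [threading.Thread(target=worker, args=(i,)) for i in range(2)]
for t in threads: t.start()
for t in threads: t.join()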

Collaboration


Dive into Kermin Fleming's collaborations.

Top Co-Authors

Hsin-Jung Yang, Massachusetts Institute of Technology

Arvind, Massachusetts Institute of Technology

Man Cheuk Ng, Massachusetts Institute of Technology

Michael Pellauer, Massachusetts Institute of Technology

Hari Balakrishnan, Massachusetts Institute of Technology

Arvind Mithal, Massachusetts Institute of Technology