
Publications


Featured research published by Martha A. Kim.


Architectural Support for Programming Languages and Operating Systems | 2014

Q100: the architecture and design of a database processing unit

Lisa Wu; Andrea Lottarini; Timothy K. Paine; Martha A. Kim; Kenneth A. Ross

In this paper, we propose Database Processing Units, or DPUs, a class of domain-specific database processors that can efficiently handle database applications. As a proof of concept, we present the instruction set architecture, microarchitecture, and hardware implementation of one DPU, called Q100. The Q100 has a collection of heterogeneous ASIC tiles that process relational tables and columns quickly and energy-efficiently. The architecture uses coarse-grained instructions that manipulate streams of data, thereby maximizing pipeline and data parallelism, and minimizing the need to time-multiplex the accelerator tiles and spill intermediate results to memory. This work explores a Q100 design space of 150 configurations, selecting three for further analysis: a small, power-conscious implementation, a high-performance implementation, and a balanced design that maximizes performance per Watt. We then demonstrate that the power-conscious Q100 handles the TPC-H queries with three orders of magnitude less energy than a state-of-the-art software DBMS, while the performance-oriented design outperforms the same DBMS by 70X.
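
The coarse-grained, stream-oriented execution style described above can be pictured in software. The sketch below is purely illustrative: the operator names and toy schema are hypothetical stand-ins, not the Q100 ISA; it only shows tiles passing record streams to one another so intermediates need not spill to memory.

```python
# Illustrative model of coarse-grained, stream-style query execution.
# These operators are hypothetical stand-ins for Q100 tiles, not its ISA.

def scan(table):
    """Source tile: emit rows of a relational table one at a time."""
    yield from table

def select(rows, predicate):
    """Filter tile: pass along only rows satisfying the predicate."""
    for row in rows:
        if predicate(row):
            yield row

def aggregate_sum(rows, column):
    """Aggregator tile: reduce an incoming stream to a single value."""
    return sum(row[column] for row in rows)

# Toy table of (quantity, price) rows.
lineitem = [(1, 10.0), (5, 3.0), (2, 7.5), (9, 1.0)]

# Tiles compose into one pipeline; intermediate results stream from
# stage to stage instead of being materialized ("spilled") to memory.
pipeline = select(scan(lineitem), lambda r: r[0] >= 2)
print(aggregate_sum(pipeline, 1))  # 3.0 + 7.5 + 1.0 = 11.5
```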


International Symposium on Computer Architecture | 2013

Navigating big data with high-throughput, energy-efficient data partitioning

Lisa Wu; Raymond J. Barker; Martha A. Kim; Kenneth A. Ross

The global pool of data is growing at 2.5 quintillion bytes per day, with 90% of it produced in the last two years alone [24]. There is no doubt the era of big data has arrived. This paper explores targeted deployment of hardware accelerators to improve the throughput and energy efficiency of large-scale data processing. In particular, data partitioning is a critical operation for manipulating large data sets. It is often the limiting factor in database performance and represents a significant fraction of the overall runtime of large data queries. To accelerate partitioning, this paper describes a hardware accelerator for range partitioning, or HARP, and a hardware-software data streaming framework. The streaming framework offers a seamless execution environment for streaming accelerators such as HARP. Together, HARP and the streaming framework provide an order of magnitude improvement in partitioning performance and energy. A detailed analysis of a 32nm physical design shows 7.8 times the throughput of a highly optimized and optimistic software implementation, while consuming just 6.9% of the area and 4.3% of the power of a single Xeon core in the same technology generation.
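
For readers unfamiliar with the operation HARP accelerates, the following minimal sketch shows software range partitioning: each key is routed, by comparison against ordered splitters, to the partition whose range contains it. The splitter values here are arbitrary example data.

```python
# Software sketch of range partitioning, the operation HARP performs
# with a dedicated comparator network in hardware.
from bisect import bisect_right

def range_partition(keys, splitters):
    """Route each key to its range's partition; k splitters give k+1 partitions."""
    parts = [[] for _ in range(len(splitters) + 1)]
    for key in keys:
        parts[bisect_right(splitters, key)].append(key)
    return parts

keys = [15, 3, 42, 27, 8, 99, 27]
splitters = [10, 30]  # ranges: <=10, 11..30, >30
print(range_partition(keys, splitters))
# [[3, 8], [15, 27, 27], [42, 99]]
```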


IEEE International Conference on High Performance Computing, Data, and Analytics | 2012

Measuring interference between live datacenter applications

Melanie Kambadur; Tipp Moseley; Rick Hank; Martha A. Kim

Application interference is prevalent in datacenters due to contention over shared hardware resources. Unfortunately, understanding interference in live datacenters is more difficult than in controlled environments or on simpler architectures. Most approaches to mitigating interference rely on data that cannot be collected efficiently in a production environment. This work exposes eight specific complexities of live datacenters that constrain measurement of interference. It then introduces new, generic measurement techniques for analyzing interference in the face of these challenges and restrictions. We use the measurement techniques to conduct the first large-scale study of application interference in live production datacenter workloads. Data is measured across 1000 12-core Google servers observed to be running 1102 unique applications. Finally, our work identifies several opportunities to improve performance that use only the available data; these opportunities are applicable to any datacenter.
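
One analysis in this spirit can be sketched as follows: compare an application's performance samples (here, cycles per instruction) when a suspected antagonist is co-located versus when it is not, using only per-machine data of the kind a production fleet already collects. The sample values and threshold below are hypothetical.

```python
# Hedged sketch of co-runner interference analysis; the data is made up.
from statistics import median

# (cpi_sample, set of co-runners observed on the same machine)
samples = [
    (1.10, {"websearch"}), (1.95, {"batch_compress"}),
    (1.05, set()),         (2.10, {"batch_compress", "websearch"}),
    (1.20, {"websearch"}), (1.90, {"batch_compress"}),
]

def relative_slowdown(samples, antagonist):
    """Median CPI with the antagonist co-running vs. without it."""
    with_a  = [c for c, co in samples if antagonist in co]
    without = [c for c, co in samples if antagonist not in co]
    return median(with_a) / median(without)

print(f"{relative_slowdown(samples, 'batch_compress'):.2f}x")  # 1.77x
```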


International Symposium on Computer Architecture | 2012

Harmony: collection and analysis of parallel block vectors

Melanie Kambadur; Kui Tang; Martha A. Kim

Efficient execution of well-parallelized applications is central to performance in the multicore era. Program analysis tools support the hardware and software sides of this effort by exposing relevant features of multithreaded applications. This paper describes parallel block vectors, which uncover previously unseen characteristics of parallel programs. Parallel block vectors provide block execution profiles per concurrency phase (e.g., the block execution profile of all serial regions of a program). This information provides a direct and fine-grained mapping between an application's runtime parallel phases and the static code that makes up those phases. This paper also demonstrates how to collect parallel block vectors with minimal application perturbation using Harmony. Harmony is an instrumentation pass for the LLVM compiler that introduces just 16-21% overhead on average across eight PARSEC benchmarks. We apply parallel block vectors to uncover several novel insights about parallel applications with direct consequences for architectural design. First, that the serial and parallel phases of execution used in Amdahl's Law are often composed of many of the same basic blocks. Second, that program features, such as instruction mix, vary based on the degree of parallelism, with serial phases in particular displaying different instruction mixes from the program as a whole. Third, that dynamic execution frequencies do not necessarily correlate with a block's parallelism.
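
A parallel block vector is easy to picture in miniature: for each basic block, count executions bucketed by the number of threads running at that moment. The sketch below is a conceptual simplification of that bookkeeping, not Harmony's actual LLVM pass; exact counts will vary with scheduling.

```python
# Conceptual model of parallel block vector collection.
from collections import defaultdict
import threading

active_threads = 0
lock = threading.Lock()
# pbv[block_id][thread_count] = executions at that concurrency level
pbv = defaultdict(lambda: defaultdict(int))

def block_executed(block_id):
    """Stub for instrumentation inserted at the top of each basic block."""
    with lock:
        pbv[block_id][active_threads] += 1

def worker():
    global active_threads
    with lock:
        active_threads += 1
    for _ in range(3):
        block_executed("loop_body")
    with lock:
        active_threads -= 1

block_executed("serial_setup")  # runs before any worker thread exists
threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()

print(dict(pbv["serial_setup"]))  # all executions at concurrency 0
print(dict(pbv["loop_body"]))     # counts spread across concurrency 1..4
```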


Conference on Object-Oriented Programming, Systems, Languages, and Applications | 2014

An experimental survey of energy management across the stack

Melanie Kambadur; Martha A. Kim

Modern demand for energy-efficient computation has spurred research at all levels of the stack, from devices to microarchitecture, operating systems, compilers, and languages. Unfortunately, this breadth has resulted in a disjointed space, with technologies at different levels of the system stack rarely compared, let alone coordinated. This work begins to remedy the problem, conducting an experimental survey of the present state of energy management across the stack. Focusing on settings that are exposed to software, we measure the total energy, average power, and execution time of 41 benchmark applications in 220 configurations, across a total of 200,000 program executions. Some of the more important findings of the survey include that effective parallelization and compiler optimizations have the potential to save far more energy than Linux's frequency tuning algorithms; that certain non-complementary energy strategies can undercut each other's savings by half when combined; and that while the power impacts of most strategies remain constant across applications, the runtime impacts vary, resulting in inconsistent energy impacts.
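
The basic measurement underlying such a survey is simple to sketch: read an energy counter before and after a run to derive total energy, average power, and execution time. The snippet below assumes Linux's RAPL powercap interface on an Intel machine (and sufficient permissions to read it); any hardware energy meter would serve.

```python
# Minimal energy/power/time measurement harness; the RAPL path is a
# Linux/Intel-specific assumption, and counter wraparound is ignored.
import time

RAPL = "/sys/class/powercap/intel-rapl:0/energy_uj"

def read_energy_uj():
    with open(RAPL) as f:
        return int(f.read())

def measure(run):
    e0, t0 = read_energy_uj(), time.time()
    run()
    e1, t1 = read_energy_uj(), time.time()
    energy_j, seconds = (e1 - e0) / 1e6, t1 - t0
    return energy_j, energy_j / seconds, seconds  # energy, avg power, time

energy, power, secs = measure(lambda: sum(i * i for i in range(10**7)))
print(f"{energy:.1f} J, {power:.1f} W, {secs:.2f} s")
```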


ACM Transactions on Computer Systems | 2014

Energy Analysis of Hardware and Software Range Partitioning

Lisa Wu; Orestis Polychroniou; Raymond J. Barker; Martha A. Kim; Kenneth A. Ross

Data partitioning is a critical operation for manipulating large datasets because it subdivides tasks into pieces that are more amenable to efficient processing. It is often the limiting factor in database performance and represents a significant fraction of the overall runtime of large data queries. This article measures the performance and energy of state-of-the-art software partitioners, and describes and evaluates a hardware range partitioner that further improves efficiency. The software implementation is broken into two phases, allowing separate analysis of the partition function computation and data shuffling costs. Although range partitioning is commonly thought to be more expensive than simpler strategies such as hash partitioning, our measurements indicate that careful data movement and optimization of the partition function can allow it to approach the throughput and energy consumption of hash or radix partitioning. For further acceleration, we describe a hardware range partitioner, or HARP; a streaming framework that offers a seamless execution environment for this and other streaming accelerators; and a detailed analysis of a 32nm physical design that matches the throughput of four to eight software threads while consuming just 6.9% of the area and 4.3% of the power of a Xeon core in the same technology generation.
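
The two-phase structure mentioned above can be sketched as follows: phase one computes only the partition function (plus a histogram), and phase two performs the data shuffle into preallocated output, so the two costs can be timed separately. The code is an illustrative decomposition under those assumptions, not the article's measured implementation.

```python
# Two-phase software range partitioner: partition-function computation
# (phase 1) separated from the memory-bound data shuffle (phase 2).
from bisect import bisect_right

def two_phase_partition(keys, splitters):
    n_parts = len(splitters) + 1

    # Phase 1: compute each key's destination and a partition histogram.
    dest = [bisect_right(splitters, k) for k in keys]
    hist = [0] * n_parts
    for d in dest:
        hist[d] += 1

    # Phase 2: scatter keys into a preallocated output using prefix sums.
    offsets = [0] * n_parts
    for p in range(1, n_parts):
        offsets[p] = offsets[p - 1] + hist[p - 1]
    out = [None] * len(keys)
    for k, d in zip(keys, dest):
        out[offsets[d]] = k
        offsets[d] += 1
    return out, hist

print(two_phase_partition([15, 3, 42, 27, 8], [10, 30]))
# ([3, 8, 15, 27, 42], [2, 2, 1])
```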


International Conference on Hardware/Software Codesign and System Synthesis | 2015

Hardware synthesis from a recursive functional language

Kuangya Zhai; Richard Morse Townsend; Lianne Elizabeth Lairmore; Martha A. Kim; Stephen A. Edwards

Abstraction in hardware description languages stalled at the register-transfer level decades ago, yet few alternatives have had much success, in part because they provide only modest gains in expressivity. We propose to make a much larger jump: a compiler that synthesizes hardware from behavioral functional specifications. Our compiler translates general Haskell programs into a restricted intermediate representation before applying a series of semantics-preserving transformations, concluding with a simple syntax-directed translation to SystemVerilog. Here, we present the overall framework for this compiler, focusing on the intermediate representations involved and our method for translating general recursive functions into equivalent hardware. We conclude with experimental results that depict the performance and resource usage of the circuitry generated with our compiler.
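
The central difficulty such a compiler faces is that hardware has no call stack, so general recursion must be rewritten into a state machine driving an explicit stack. The sketch below shows that shape for a non-tail-recursive function; it is a conceptual illustration only, not the compiler's actual intermediate representation.

```python
def fib(n):
    """Source form: general (non-tail) recursion, no direct hardware analogue."""
    return n if n < 2 else fib(n - 1) + fib(n - 2)

def fib_machine(n):
    """Equivalent dispatch loop over an explicit stack: the shape a
    syntax-directed translation can map onto registers plus a stack RAM.
    Summing base-case leaves works because addition is associative."""
    stack, acc = [n], 0
    while stack:
        v = stack.pop()
        if v < 2:
            acc += v                 # base case contributes its value
        else:
            stack.append(v - 1)      # recursive case pushes two subcalls
            stack.append(v - 2)
    return acc

assert all(fib(n) == fib_machine(n) for n in range(15))
print(fib_machine(10))  # 55
```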


European Conference on Parallel Processing | 2014

ParaShares: Finding the Important Basic Blocks in Multithreaded Programs

Melanie Kambadur; Kui Tang; Martha A. Kim

Understanding and optimizing multithreaded execution is a significant challenge. Numerous research and industrial tools debug parallel performance by combing through program source or thread traces for pathologies including communication overheads, data dependencies, and load imbalances. This work takes a new approach: it ignores any underlying pathologies, and focuses instead on pinpointing the exact locations in source code that consume the largest share of execution. Our new metric, ParaShares, scores and ranks all basic blocks in a program based on their share of parallel execution. For the eight benchmarks examined in this paper, ParaShare rankings point to just a few important blocks per application. The paper demonstrates two uses of this information, exploring how the important blocks vary across thread counts and input sizes, and making modest source code changes (fewer than 10 lines of code) that result in 14-92% savings in parallel program runtime.
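
Built on parallel block vectors, a ParaShare-style score can be sketched by weighting each block execution by the concurrency at the time it ran, so serial-phase executions "own" more of the runtime. The 1/threads weighting and the profile data below are illustrative assumptions, not necessarily the paper's exact formula.

```python
# ParaShare-style ranking over hypothetical parallel block vector data.
# pbv[block][threads_running] = execution count
pbv = {
    "lock_acquire":   {1: 9000, 8: 1000},  # mostly serial
    "compute_kernel": {8: 80000},          # fully parallel
    "init":           {1: 500},
}

def parashare(block_vec):
    """Weight each execution by 1/concurrency (an assumed weighting)."""
    return sum(count / threads for threads, count in block_vec.items())

for block in sorted(pbv, key=lambda b: parashare(pbv[b]), reverse=True):
    print(f"{block:15s} {parashare(pbv[block]):8.1f}")
# compute_kernel scores 10000.0 and lock_acquire 9125.0: the mostly
# serial lock path rivals the hot parallel kernel despite far fewer runs.
```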


Symposium on Code Generation and Optimization | 2016

NRG-loops: adjusting power from within applications

Melanie Kambadur; Martha A. Kim

NRG-Loops are source-level abstractions that allow an application to dynamically manage its power and energy through adjustments to functionality, performance, and accuracy. The adjustments, which come in the form of truncated, adapted, or perforated loops, are conditionally enabled as runtime power and energy constraints dictate. NRG-Loops are portable across different hardware platforms and operating systems and are complementary to existing system-level efficiency techniques, such as DVFS and idle states. Using a prototype C library supported by commodity hardware energy meters (and with no modifications to the compiler or operating system), this paper demonstrates four NRG-Loop applications that in 2-6 lines of source code changes can save up to 55% power and 90% energy, resulting in up to 12X better energy efficiency than system-level techniques.
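
The flavor of the abstraction can be sketched in a few lines: a loop consults an energy meter and, once over budget, perforates iterations to trade accuracy for energy. The paper's interface is a C library backed by hardware meters; read_energy_j() below is a hypothetical stand-in.

```python
# Sketch of a perforated NRG-Loop; the meter is simulated, not real.
import random

def read_energy_j():
    """Hypothetical meter: pretend each call observes ~1 J consumed."""
    read_energy_j.total += random.uniform(0.8, 1.2)
    return read_energy_j.total
read_energy_j.total = 0.0

ENERGY_BUDGET_J = 50.0
start = read_energy_j()
results, skip = [], 1                  # skip=1: full accuracy

for i in range(1000):
    if read_energy_j() - start > ENERGY_BUDGET_J:
        skip = 4                       # over budget: keep 1 iteration in 4
    if i % skip == 0:
        results.append(i * i)          # the loop's "real" work

print(f"kept {len(results)} of 1000 iterations within the budget")
```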


Formal Methods | 2015

Implementing latency-insensitive dataflow blocks

Bingyi Cao; Kenneth A. Ross; Martha A. Kim; Stephen A. Edwards

To simplify the implementation of dataflow systems in hardware, we present a technique for designing latency-insensitive dataflow blocks. We provide buffering with backpressure, resulting in blocks that compose into deep, high-speed pipelines without introducing long combinational paths. Our input and output buffers are easy to assemble into simple unit-rate dataflow blocks, arbiters, and blocks for Kahn networks. We prove the correctness of our buffers, illustrate how they can be used to assemble arbitrary dataflow blocks, discuss pitfalls, and present experimental results that suggest our pipelines can operate at a high clock rate independent of length.
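
The contract such blocks obey can be modeled in a few lines of software: tokens advance through buffered stages only when space exists downstream, so arbitrary sink stalls never lose, duplicate, or reorder data. This toy cycle-level model illustrates that handshaking behavior only, not the paper's buffer circuits.

```python
def step(pipe, sink_ready, out):
    """One clock cycle, evaluated sink-to-source so backpressure resolves."""
    if pipe[-1] is not None and sink_ready:   # sink consumes a token
        out.append(pipe[-1])
        pipe[-1] = None
    for i in reversed(range(len(pipe) - 1)):  # stages advance into free space
        if pipe[i] is not None and pipe[i + 1] is None:
            pipe[i + 1], pipe[i] = pipe[i], None

pipe = [None, None, None]                     # three buffered stages
source, out = list(range(6)), []
for cycle in range(20):
    step(pipe, cycle % 3 != 0, out)           # sink stalls one cycle in three
    if pipe[0] is None and source:            # source injects when space opens
        pipe[0] = source.pop(0)

assert out == list(range(6))                  # nothing lost, duplicated, reordered
print(out)
```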
