Ronny Krashinsky

Massachusetts Institute of Technology

Publication

Featured research published by Ronny Krashinsky.

International Symposium on Computer Architecture | 2004

The Vector-Thread Architecture

Ronny Krashinsky; Christopher Batten; Mark Hampton; Steve Gerding; Brian Pharris; Jared Casper; Krste Asanovic

The vector-thread (VT) architectural paradigm unifies the vector and multithreaded compute models. The VT abstraction provides the programmer with a control processor and a vector of virtual processors (VPs). The control processor can use vector-fetch commands to broadcast instructions to all the VPs or each VP can use thread-fetches to direct its own control flow. A seamless intermixing of the vector and threaded control mechanisms allows a VT architecture to flexibly and compactly encode application parallelism and locality, and a VT machine exploits these to improve performance and efficiency. We present SCALE, an instantiation of the VT architecture designed for low-power and high-performance embedded systems. We evaluate the SCALE prototype design using detailed simulation of a broad range of embedded applications and show that its performance is competitive with larger and more complex processors.
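The two fetch mechanisms described in this abstract can be illustrated with a toy model. The sketch below is not SCALE code: the class VirtualProcessor, the function vector_fetch, and the example blocks are hypothetical names, and the VPs run one after another here rather than in parallel. A block is a Python function that runs on one VP and may return another block, which stands in for a thread-fetch.

```python
from dataclasses import dataclass, field


@dataclass
class VirtualProcessor:
    """Hypothetical VP: an index plus a small private register set."""
    index: int
    regs: dict = field(default_factory=dict)


def vector_fetch(vps, block):
    """Control processor broadcasts one block to every VP.

    A block returns either None (the VP goes idle) or the next block that
    this VP fetches for itself, which models a thread-fetch.
    """
    for vp in vps:  # serial here; a real VT machine runs VPs in parallel
        next_block = block(vp)
        while next_block is not None:
            next_block = next_block(vp)


def init_block(vp):
    # Same code on every VP (vector-style), then data-dependent control flow.
    vp.regs["steps"] = 0
    return loop_block if vp.regs["x"] > 1 else None


def loop_block(vp):
    # One Collatz step; each VP decides on its own whether to fetch again.
    x = vp.regs["x"]
    vp.regs["x"] = x // 2 if x % 2 == 0 else 3 * x + 1
    vp.regs["steps"] += 1
    return loop_block if vp.regs["x"] != 1 else None


vps = [VirtualProcessor(i, {"x": x}) for i, x in enumerate([1, 2, 3, 6])]
vector_fetch(vps, init_block)
steps = [vp.regs["steps"] for vp in vps]
print(steps)
```

Each VP starts from the same broadcast block but runs its loop a different number of times, which is the kind of per-element control flow that a pure vector model encodes poorly.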

Wireless Networks | 2005

Minimizing energy for wireless web access with bounded slowdown

Ronny Krashinsky; Hari Balakrishnan

On many battery-powered mobile computing devices, the wireless network is a significant contributor to the total energy consumption. In this paper, we investigate the interaction between energy-saving protocols and TCP performance for Web-like transfers. We show that the popular IEEE 802.11 power-saving mode (PSM), a “static” protocol, can harm performance by increasing fast round trip times (RTTs) to 100 ms; and that under typical Web browsing workloads, current implementations will unnecessarily spend energy waking up during long idle periods. To overcome these problems, we present the Bounded-Slowdown (BSD) protocol, a PSM that dynamically adapts to network activity. BSD is an optimal solution to the problem of minimizing energy consumption while guaranteeing that a connection’s RTT does not increase by more than a factor p over its base RTT, where p is a protocol parameter that exposes the trade-off between minimizing energy and reducing latency. We present several trace-driven simulation results that show that, compared to a static PSM, the Bounded-Slowdown protocol reduces average Web page retrieval times by 5–64%, while simultaneously reducing energy consumption by 1–14% (and by 13× compared to no power management).
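A rough sketch of how a bounded-slowdown sleep schedule could be derived is given below. It rests on several assumptions that are not stated in the abstract: the bound is read as "a response with base RTT r is observed no later than (1 + p) * r", the beacon period is 100 ms, the request is sent on a beacon boundary, and sleep intervals are whole beacon periods. The function name bsd_wake_times is hypothetical. Under that reading, a sleep that starts t ms after the request may last at most p * t ms.

```python
import math
from fractions import Fraction


def bsd_wake_times(p, beacon_ms=100, horizon_ms=2000):
    """Wake-up times (ms after the request) for sleeps starting before horizon_ms.

    Assumed model, not taken from the paper: sleeping from time t for s ms
    delays a response by at most s, so the slowdown stays bounded if s <= p * t.
    """
    p = Fraction(p)  # exact arithmetic avoids float rounding at the boundaries
    if p <= 0:
        raise ValueError("p must be positive")
    # Stay awake until a single-beacon sleep no longer breaks the bound.
    t = math.ceil(beacon_ms / p)
    wakes = []
    while t < horizon_ms:
        # Longest whole number of beacon periods with sleep <= p * t.
        sleep = math.floor(p * t / beacon_ms) * beacon_ms
        t += sleep
        wakes.append(t)
    return wakes


wakes = bsd_wake_times(p=0.5)
print(wakes)
```

With p = 0.5 the radio stays awake for the first 200 ms and then sleeps for progressively longer intervals, so a long idle period costs few wake-ups while a fast response is never delayed.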

International Symposium on Microarchitecture | 2012

Unifying Primary Cache, Scratch, and Register File Memories in a Throughput Processor

Mark Gebhart; Stephen W. Keckler; Brucek Khailany; Ronny Krashinsky; William J. Dally

Modern throughput processors such as GPUs employ thousands of threads to drive high-bandwidth, long-latency memory systems. These threads require substantial on-chip storage for registers, cache, and scratchpad memory. Existing designs hard-partition this local storage, fixing the capacities of these structures at design time. We evaluate modern GPU workloads and find that they have widely varying capacity needs across these different functions. Therefore, we propose a unified local memory which can dynamically change the partitioning among registers, cache, and scratchpad on a per-application basis. The tuning that this flexibility enables improves both performance and energy consumption, and broadens the scope of applications that can be efficiently executed on GPUs. Compared to a hard-partitioned design, we show that unified local memory provides a performance benefit as high as 71% along with an energy reduction up to 33%.
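The per-application partitioning idea can be sketched with a simple policy: give registers and scratchpad what the kernel requests and leave the remainder to the cache. This is only an illustration, not necessarily the policy used in the paper; the function name, the 384 KB capacity, the 4-byte register size, and the two kernel profiles are all hypothetical.

```python
def partition_unified_memory(total_kb, regs_per_thread, threads, scratch_kb,
                             reg_bytes=4):
    """Split one unified local memory into registers, scratchpad, and cache.

    Assumed policy: registers and scratchpad are sized from the kernel's
    requirements and the cache receives whatever capacity is left over.
    """
    reg_kb = regs_per_thread * threads * reg_bytes / 1024
    cache_kb = total_kb - reg_kb - scratch_kb
    if cache_kb < 0:
        raise ValueError("kernel does not fit; run fewer threads")
    return {"registers_kb": reg_kb,
            "scratchpad_kb": scratch_kb,
            "cache_kb": cache_kb}


# Two hypothetical kernels sharing the same 384 KB of local storage.
scratch_heavy = partition_unified_memory(384, regs_per_thread=32,
                                         threads=1024, scratch_kb=16)
register_heavy = partition_unified_memory(384, regs_per_thread=63,
                                          threads=1024, scratch_kb=0)
print(scratch_heavy)
print(register_heavy)
```

A hard-partitioned design would have to fix all three capacities once, so at least one of these two kernels would leave storage unused or run with fewer threads.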

IEEE Transactions on Very Large Scale Integration Systems | 2007

Activity-Sensitive Flip-Flop and Latch Selection for Reduced Energy

Seongmoo Heo; Ronny Krashinsky; Krste Asanovic

International Symposium on Microarchitecture | 2004

Cache Refill/Access Decoupling for Vector Machines

Christopher Batten; Ronny Krashinsky; Steve Gerding; Krste Asanovic

Vector processors often use a cache to exploit temporal locality and reduce memory bandwidth demands, but then require expensive logic to track large numbers of outstanding cache misses to sustain peak bandwidth from memory. We present refill/access decoupling, which augments the vector processor with a Vector Refill Unit (VRU) to quickly pre-execute vector memory commands and issue any needed cache line refills ahead of regular execution. The VRU reduces costs by eliminating much of the outstanding miss state required in traditional vector architectures and by using the cache itself as a cost-effective prefetch buffer. We also introduce vector segment accesses, a new class of vector memory instructions that efficiently encode two-dimensional access patterns. Segments reduce address bandwidth demands and enable more efficient refill/access decoupling by increasing the information contained in each vector memory command. Our results show that refill/access decoupling is able to achieve better performance with fewer resources than more traditional decoupling methods. Even with a small cache and memory latencies as long as 800 cycles, refill/access decoupling can sustain several kilobytes of in-flight data with minimal access management state and no need for expensive reserved element buffering.
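The vector segment accesses mentioned in this abstract can be pictured as loading an array of small records so that each field lands in its own vector register. The sketch below is a functional model only, with a hypothetical function name and a flat Python list standing in for memory; it assumes segments are stored back to back.

```python
def segment_load(memory, base, num_elements, fields):
    """Load `fields` consecutive words per element, one vector register per field.

    Assumed layout: segment i starts at base + i * fields, so a single
    address per segment is enough to locate all of its words.
    """
    regs = [[] for _ in range(fields)]
    for i in range(num_elements):
        seg_addr = base + i * fields  # one address generated per segment
        for j in range(fields):
            regs[j].append(memory[seg_addr + j])
    return regs


# Interleaved RGB pixels: R0 G0 B0 R1 G1 B1 R2 G2 B2
memory = [10, 20, 30, 11, 21, 31, 12, 22, 32]
r, g, b = segment_load(memory, base=0, num_elements=3, fields=3)
print(r, g, b)
```

In this model the three-pixel load generates three segment addresses, whereas three separate strided loads would generate nine element addresses, which is one way to read the claim that segments reduce address bandwidth.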

ACM SIGARCH Computer Architecture News | 2001

Multithreading decoupled architectures for complexity-effective general purpose computing

Michael Sung; Ronny Krashinsky; Krste Asanovic

Decoupled architectures have not traditionally been used in the context of general purpose computing because of their inability to tolerate control-intensive code that exists across a wide range of applications. This work investigates the possibility of using multithreading to overcome the loss of decoupling dependencies that represent the cause of this main limitation in decoupled architectures. A proposal for a multithreaded decoupled control/access/execute architecture is presented as a platform for achieving high performance on general purpose workloads. It is argued that such a decoupled architecture is more complexity-effective and scalable than comparable superscalar processors, which incorporate enormous amounts of complexity for modest performance gains.

Conference on Advanced Research in VLSI | 2001

Activity-sensitive flip-flop and latch selection for reduced energy

Seongmoo Heo; Ronny Krashinsky; Krste Asanovic

This paper presents new techniques to evaluate the energy and delay of flip-flop and latch designs and shows that no single existing design performs well across the wide range of operating regimes present in complex systems. We propose the use of a selection of flip-flop and latch designs, each tuned for different activation patterns and speed requirements. We illustrate our technique on a pipelined MIPS processor datapath running SPECint95 benchmarks, where we reduce total flip-flop and latch energy by over 60% without increasing cycle time.
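The selection idea can be sketched as picking, for each register, the lowest-energy design that still meets its timing budget at the observed activity. The energy model (a per-cycle clock cost plus a data cost scaled by activity), the function name, and every number below are hypothetical and are not taken from the paper.

```python
def pick_design(designs, activity, max_delay_ps):
    """Return the lowest-energy design that meets the delay budget.

    Assumed model: energy per cycle = clock energy (paid every cycle)
    + activity * data energy, where activity is the fraction of cycles
    in which the input toggles.
    """
    feasible = [d for d in designs if d["delay_ps"] <= max_delay_ps]
    if not feasible:
        raise ValueError("no design meets the delay budget")
    return min(feasible,
               key=lambda d: d["clock_fj"] + activity * d["data_fj"])


# Hypothetical library of three flip-flop designs.
designs = [
    {"name": "balanced",  "delay_ps": 200, "clock_fj": 10, "data_fj": 20},
    {"name": "low_clock", "delay_ps": 300, "clock_fj": 2,  "data_fj": 40},
    {"name": "fast",      "delay_ps": 150, "clock_fj": 25, "data_fj": 15},
]

idle_reg = pick_design(designs, activity=0.05, max_delay_ps=350)["name"]
busy_reg = pick_design(designs, activity=0.9, max_delay_ps=350)["name"]
critical_reg = pick_design(designs, activity=0.05, max_delay_ps=180)["name"]
print(idle_reg, busy_reg, critical_reg)
```

Under these made-up numbers a rarely toggling register prefers the design with the cheapest clocking, a busy register prefers the balanced one, and a register on a tight path is forced onto the fast design, which mirrors the abstract's point that no single design wins everywhere.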

ACM Transactions on Design Automation of Electronic Systems | 2008

Implementing the Scale vector-thread processor

Ronny Krashinsky; Christopher Batten; Krste Asanovic

The Scale vector-thread processor is a complexity-effective solution for embedded computing which flexibly supports both vector and highly multithreaded processing. The 7.1-million transistor chip has 16 decoupled execution clusters, vector load and store units, and a nonblocking 32 KB cache. An automated and iterative design and verification flow enabled a performance-, power-, and area-efficient implementation with two person-years of development effort. Scale has a core area of 16.6 mm² in 180 nm technology, and it consumes 400 mW to 1.1 W while running at 260 MHz.

Power aware computing | 2002

Energy-exposed instruction sets

Krste Asanovic; Mark Hampton; Ronny Krashinsky; Emmett Witchel

Modern performance-oriented ISAs, such as RISC and VLIW, only expose to software features that impact the critical path through computation. Pipelined microprocessor implementations hide most of the microarchitectural work performed in executing instructions. Therefore, there is no incentive to expose these micro-operations, and their energy consumption is hidden from software. This work presents energy-exposed hardware-software interfaces to give software more fine-grain control over energy-consuming microarchitectural operations. We introduce software restart markers to make temporary processor state visible to software without complicating hardware exception management. This technique can enable a wide variety of energy optimizations. We implement exposed bypass latches which allow the compiler to eliminate register file traffic by directly targeting the processor bypass latches. Another technique, tag-unchecked loads and stores, allows software to access cache data without a hardware tag check when the compiler can guarantee an access will be to the same line as an earlier access.

Archive | 2007