Publication


Featured research published by Ofer Shacham.


International Symposium on Computer Architecture | 2013

Convolution engine: balancing efficiency & flexibility in specialized computing

Wajahat Qadeer; Rehan Hameed; Ofer Shacham; Preethi Venkatesan; Christos Kozyrakis; Mark Horowitz

This paper focuses on the trade-off between flexibility and efficiency in specialized computing. We observe that specialized units achieve most of their efficiency gains by tuning data storage and compute structures and their connectivity to the data-flow and data-locality patterns in the kernels. Hence, by identifying key data-flow patterns used in a domain, we can create efficient engines that can be programmed and reused across a wide range of applications. We present an example, the Convolution Engine (CE), specialized for the convolution-like data-flow that is common in computational photography, image processing, and video processing applications. CE achieves energy efficiency by capturing data reuse patterns, eliminating data transfer overheads, and enabling a large number of operations per memory access. We quantify the tradeoffs in efficiency and flexibility and demonstrate that CE is within a factor of 2-3x of the energy and area efficiency of custom units optimized for a single kernel. CE improves energy and area efficiency by 8-15x over a SIMD engine for most applications.
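The efficiency argument above hinges on data reuse: in a convolution-like kernel, each element fetched from memory participates in many multiply-accumulates. The Python sketch below only illustrates that data-flow pattern with a simple 1-D sliding window; it is not a model of the CE microarchitecture, and the function name and tap count are chosen for the example.

```python
# Sketch of the convolution-like data flow the abstract refers to, not the
# CE microarchitecture itself. A software model of a 1-D sliding window
# shows the data reuse a specialized engine exploits: each element is
# loaded from memory once but participates in `taps` multiply-accumulates.
# Function and variable names (convolve_1d, taps) are illustrative.

def convolve_1d(signal, kernel):
    """1-D convolution with an explicit sliding window (shift register)."""
    taps = len(kernel)
    window = [0] * taps            # models a register shift network
    out, loads, macs = [], 0, 0
    for x in signal:
        window = window[1:] + [x]  # one new load per output...
        loads += 1
        acc = sum(w * k for w, k in zip(window, kernel))
        macs += taps               # ...but `taps` MACs reuse buffered data
        out.append(acc)
    return out, macs / loads       # ops per memory access grows with taps

if __name__ == "__main__":
    y, ops_per_load = convolve_1d(list(range(64)), kernel=[1, 4, 6, 4, 1])
    print(f"{ops_per_load:.1f} multiply-accumulates per element loaded")
    # A general-purpose core that re-reads the full window from memory for
    # every output would instead perform roughly 1 MAC per load.
```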


IEEE Micro | 2010

Rethinking Digital Design: Why Design Must Change

Ofer Shacham; Omid Azizi; Megan Wachs; Wajahat Qadeer; Zain Asgar; Kyle Kelley; John P. Stevenson; Stephen Richardson; Mark Horowitz; Benjamin C. Lee; Alex Solomatnikov; Amin Firoozshahian

Because of technology scaling, power dissipation is today's major performance limiter. Moreover, the traditional way to achieve power efficiency, application-specific designs, is prohibitively expensive. These power and cost issues necessitate rethinking digital design. To reduce design costs, we need to stop building chip instances and start making chip generators instead. Domain-specific chip generators are templates that codify designer knowledge and design trade-offs to create different application-optimized chips.


Design Automation Conference | 2012

Avoiding game over: bringing design to the next level

Ofer Shacham; Megan Wachs; Andrew Danowitz; Sameh Galal; John S. Brunhaver; Wajahat Qadeer; Sabarish Sankaranarayanan; Artem Vassiliev; Stephen Richardson; Mark Horowitz

Technology scaling has created a catch-22: technology now can do almost anything we want, but NRE design costs are so high that almost no one can afford to use it. Our current situation is reminiscent of the 1980s, when only a few companies could afford to produce custom silicon. Synthesis and place-and-route tools changed this by providing modular tools with well-defined interfaces that codified designer knowledge about the physical design of chips. Now we need a new set of tools that can codify designer knowledge about how to construct software, hardware, and validation to again enable application designers to produce chips. Researchers are developing methodologies that allow users to create hardware constructors, or generators. These include Genesis 2, which extends SystemVerilog and enables the designer to encode hierarchical system construction procedurally. To demonstrate some of the capabilities that these languages and tools provide, we describe FPGen, a complete floating-point generator written in Genesis 2 that also generates the needed validation collateral and hints for the backend processes.
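Genesis 2 itself embeds elaboration code inside SystemVerilog; as a loose illustration of the generator idea only, the Python sketch below emits parameterized Verilog from a single template. The module and parameter names are hypothetical, the emitted RTL ignores carry growth for brevity, and this is not Genesis 2 syntax or FPGen.

```python
# Illustrative sketch only: the "generator" idea shown in plain Python that
# emits parameterized Verilog text. Names (gen_adder_tree, width, n_inputs)
# are hypothetical; carry growth is ignored to keep the template short.

def gen_adder_tree(n_inputs: int, width: int) -> str:
    """Emit Verilog for a balanced adder tree with n_inputs operands."""
    assert n_inputs >= 2 and (n_inputs & (n_inputs - 1)) == 0, "power of two"
    lines = [
        f"module adder_tree_{n_inputs}x{width} (",
        f"  input  wire [{n_inputs * width - 1}:0] in_flat,",
        f"  output wire [{width - 1}:0] sum",
        ");",
    ]
    # Unpack the flattened input bus into level-0 wires.
    for i in range(n_inputs):
        lines.append(f"  wire [{width-1}:0] l0_{i} = in_flat[{(i+1)*width-1}:{i*width}];")
    # Build log2(n_inputs) levels of pairwise adders.
    level, count = 0, n_inputs
    while count > 1:
        for i in range(count // 2):
            lines.append(
                f"  wire [{width-1}:0] l{level+1}_{i} = "
                f"l{level}_{2*i} + l{level}_{2*i+1};"
            )
        level, count = level + 1, count // 2
    lines += [f"  assign sum = l{level}_0;", "endmodule"]
    return "\n".join(lines)

if __name__ == "__main__":
    # Two different design instances from one template:
    print(gen_adder_tree(8, 16))
    print(gen_adder_tree(4, 32))
```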


International Symposium on Microarchitecture | 2008

Verification of chip multiprocessor memory systems using a relaxed scoreboard

Ofer Shacham; Megan Wachs; Alex Solomatnikov; Amin Firoozshahian; Stephen Richardson; Mark Horowitz

Verification of chip multiprocessor memory systems remains challenging. While formal methods have been used to validate protocols, simulation is still the dominant method used to validate memory system implementations. Having a memory scoreboard, a high-level model of the memory, greatly aids simulation-based validation, but accurate scoreboards are complex to create, since they often depend not only on the memory and consistency model but also on its specific implementation. This paper describes a methodology of using a relaxed scoreboard, which greatly reduces the complexity of creating these memory models. The relaxed scoreboard tracks the operations of the system to maintain a set of values that could possibly be valid for each memory location. By allowing multiple possible values, the model used in the scoreboard is only loosely coupled with the specific design. This decouples the construction of the checker from the implementation, allowing the checker to be used early in the design and built up incrementally, and greatly reduces the scoreboard design effort. We demonstrate the use of the relaxed scoreboard in verifying RTL implementations of two different memory models, Transactional Coherency and Consistency (TCC) and Relaxed Consistency, for up to 32 processors. The resulting checker has a performance slowdown of 19% for checking Relaxed Consistency, and less than 30% for TCC, allowing it to be used in all simulation runs.
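A minimal sketch of the relaxed-scoreboard idea as described in the abstract: each location carries a set of values that could legally be observed, so a load is checked for membership rather than equality. The class and method names below are hypothetical, and the commit rule is a simplification of whatever ordering events a real consistency model would use.

```python
# Minimal sketch of a relaxed scoreboard: per-location sets of possibly-valid
# values, so the checker need not model the implementation's exact ordering.
# Names (RelaxedScoreboard, on_store, commit_store, check_load) are
# hypothetical, not from the paper.

class RelaxedScoreboard:
    def __init__(self):
        # addr -> set of currently possible values; unwritten locations
        # are assumed to read 0 in this sketch.
        self.possible = {}

    def _vals(self, addr):
        return self.possible.setdefault(addr, {0})

    def on_store(self, addr, value):
        # A new store may or may not yet be visible to other processors,
        # so it is *added* to the possible set rather than replacing it.
        self._vals(addr).add(value)

    def commit_store(self, addr, value):
        # Once the store is known to be globally performed (e.g. after a
        # fence or commit event), older values can no longer be observed.
        self.possible[addr] = {value}

    def check_load(self, addr, observed):
        # A load is legal iff the design returned one of the possible values.
        assert observed in self._vals(addr), (
            f"load from {addr:#x} returned {observed}, "
            f"expected one of {sorted(self._vals(addr))}"
        )

if __name__ == "__main__":
    sb = RelaxedScoreboard()
    sb.on_store(0x40, 1)        # CPU0 stores 1 (not yet globally visible)
    sb.check_load(0x40, 0)      # CPU1 may still see the old value...
    sb.check_load(0x40, 1)      # ...or the new one: both are accepted
    sb.commit_store(0x40, 1)    # after commit, only 1 remains legal
    sb.check_load(0x40, 1)
```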


Application-Specific Systems, Architectures and Processors | 2012

Design Automation Framework for Application-Specific Logic-in-Memory Blocks

Qiuling Zhu; Kaushik Vaidyanathan; Ofer Shacham; Mark Horowitz; Lawrence T. Pileggi; Franz Franchetti

This paper presents a design methodology for hardware synthesis of application-specific logic-in-memory (LiM) blocks. Logic-in-memory designs tightly integrate specialized computation logic with embedded memory, enabling more localized computation and thus saving energy. As a demonstration, we present an end-to-end design framework to automatically synthesize an interpolation-based logic-in-memory block named interpolation memory, which combines a seed table with simple arithmetic logic to efficiently evaluate functions. To support the multiple consecutive seed-data accesses required by the interpolation operation, we synthesize the physical memory into novel rectangular-access smart memory blocks. We evaluated a large design space of interpolation memories in a sub-20 nm commercial CMOS technology using the proposed design framework. Furthermore, we implemented a logic-in-memory based computed tomography (CT) medical image reconstruction system, and our experimental results show that the logic-in-memory computing method achieves orders of magnitude of energy savings compared with traditional in-processor computing.
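As a software illustration of the "seed table plus simple arithmetic" evaluation that an interpolation memory performs in hardware, the sketch below approximates 1/x by piecewise-linear interpolation between consecutive seeds. The function choice, table size, and error check are assumptions for the example, not parameters from the paper.

```python
# Sketch of seed-table function evaluation in software. The function
# (reciprocal), table size, and range are assumptions for illustration.

N_SEEDS = 64                      # entries in the seed table
LO, HI = 1.0, 2.0                 # evaluate f(x) = 1/x on [1, 2)
STEP = (HI - LO) / N_SEEDS

# "Seed table": function samples at segment boundaries. Hardware would store
# these in SRAM; two consecutive seeds are read per lookup, which is why
# the paper needs rectangular (multi-word) memory access.
seeds = [1.0 / (LO + i * STEP) for i in range(N_SEEDS + 1)]

def interp_recip(x: float) -> float:
    """Piecewise-linear approximation of 1/x using the seed table."""
    assert LO <= x < HI
    idx, frac = divmod((x - LO) / STEP, 1.0)   # segment index + position in it
    i = int(idx)
    y0, y1 = seeds[i], seeds[i + 1]            # the two consecutive seeds
    return y0 + frac * (y1 - y0)               # one multiply, one add

if __name__ == "__main__":
    xs = [LO + k * 1e-4 for k in range(int((HI - LO) / 1e-4))]
    worst = max(abs(interp_recip(x) - 1.0 / x) for x in xs)
    print(f"max abs error over [1,2): {worst:.2e}")   # small for 64 segments
```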


Communications of the ACM | 2015

Convolution engine: balancing efficiency and flexibility in specialized computing

Wajahat Qadeer; Rehan Hameed; Ofer Shacham; Preethi Venkatesan; Christos Kozyrakis; Mark Horowitz

General-purpose processors, while tremendously versatile, pay a huge cost for their flexibility by wasting over 99% of the energy in programmability overheads. We observe that reducing this waste requires tuning data storage and compute structures and their connectivity to the data-flow and data-locality patterns in the algorithms. Hence, by backing off from full programmability and instead targeting key data-flow patterns used in a domain, we can create efficient engines that can be programmed and reused across a wide range of applications within that domain. We present the Convolution Engine (CE)---a programmable processor specialized for the convolution-like data-flow prevalent in computational photography, computer vision, and video processing. The CE achieves energy efficiency by capturing data-reuse patterns, eliminating data transfer overheads, and enabling a large number of operations per memory access. We demonstrate that the CE is within a factor of 2--3× of the energy and area efficiency of custom units optimized for a single kernel. The CE improves energy and area efficiency by 8--15× over data-parallel Single Instruction Multiple Data (SIMD) engines for most image processing applications.


International Symposium on Microarchitecture | 2009

Using a configurable processor generator for computer architecture prototyping

Alex Solomatnikov; Amin Firoozshahian; Ofer Shacham; Zain Asgar; Megan Wachs; Wajahat Qadeer; Stephen Richardson; Mark Horowitz

Building hardware prototypes for computer architecture research is challenging. Unfortunately, development of the required software tools (compilers, debuggers, runtime) is even more challenging, which means these systems rarely run real applications. To overcome this issue, when developing our prototype platform, we used the Tensilica processor generator to produce a customized processor and corresponding software tools and libraries. While this base processor was very different from the streamlined custom processor we initially imagined, it allowed us to focus on our main objective - the design of a reconfigurable CMP memory system - and to successfully tape out an 8-core CMP chip with only a small group of designers. One person was able to handle processor configuration and hardware generation, support of a complete software tool chain, as well as developing the custom runtime software to support three different programming models. Having a sophisticated software tool chain not only allowed us to run more applications on our machine, it once again pointed out the need to use optimized code to get an accurate evaluation of architectural features.


IFIP/IEEE International Conference on Very Large Scale Integration | 2013

An area-efficient minimum-time FFT schedule using single-ported memory

Stephen Richardson; Ofer Shacham; Dejan Markovic; Mark Horowitz

FFT design requires an exhaustive recoupling of data across successive stages of computation. The resulting memory access patterns have constantly changing strides, making it hard to interleave the data for reliable conflict-free access of operand pairs. We modify an existing method of “swizzling” data locations so as to guarantee conflict-free access within any given stage and, with minimal support for buffering, we provide conflict-free access across the boundaries of adjoining stages as well. As a result, implementations that would naively require either a fully associative or, at the very least, a multiported register file can instead use four single-ported banks of memory per butterfly unit, plus one bypass buffer. Because fewer ports mean less area, and given that a butterfly must read two inputs and write two results each cycle, this solution should represent the least-area memory configuration for a resource-constrained FFT. Using this scheme, we show examples including a minimal one-butterfly FFT having 9% less area than a competing equal-performance design and 20% better throughput than a competing equal-area design.
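The sketch below is not the paper's swizzle; it illustrates one well-known bank assignment (the base-B digit-sum mapping) under which the two operands of every radix-2 butterfly fall in different banks within a stage, verified by brute force. It does not model the two writes per cycle or the cross-stage bypass buffering that the paper adds.

```python
# Illustrative only: a classic way to keep the two operands of every radix-2
# butterfly in different single-ported banks is bank(i) = (sum of the base-B
# digits of i) mod B. The brute-force check below verifies this within each
# stage of an in-place DIT FFT; it does not cover stage-boundary conflicts.

def bank(index: int, n_banks: int) -> int:
    """Digit-sum bank assignment (n_banks assumed to be a power of two)."""
    s = 0
    while index:
        s += index % n_banks
        index //= n_banks
    return s % n_banks

def check_conflict_free(n: int, n_banks: int) -> None:
    """For an n-point in-place radix-2 FFT, butterflies at stage s pair
    indices that differ only in bit s; assert each pair maps to two banks."""
    stages = n.bit_length() - 1
    for s in range(stages):
        for i in range(n):
            if not (i >> s) & 1:            # i has bit s clear: lower operand
                j = i | (1 << s)            # partner differs only in bit s
                assert bank(i, n_banks) != bank(j, n_banks), (i, j, s)

if __name__ == "__main__":
    check_conflict_free(n=1024, n_banks=4)  # four banks, echoing the paper's
    print("no intra-butterfly bank conflicts in any stage")
```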


IEEE Design & Test of Computers | 2012

Bringing up a chip on the cheap

Megan Wachs; Ofer Shacham; Zain Asgar; Amin Firoozshahian; Stephen Richardson; Mark Horowitz

Booting and debugging the functionality of silicon samples are known to be challenging and time-consuming tasks, even more so in cost-constrained environments. The authors describe their creative solutions used to bring up Stanford Smart Memories (SSM), a 55-million transistor research chip.


International Symposium on Computer Architecture | 2009

A memory system design framework: creating smart memories

Amin Firoozshahian; Alex Solomatnikov; Ofer Shacham; Zain Asgar; Stephen Richardson; Christos Kozyrakis; Mark Horowitz

