Shai Rotem
Intel
Network
Latest external collaboration on country level. Dive into details by clicking on the dots.
Publication
Featured researches published by Shai Rotem.
IEEE Journal of Solid-state Circuits | 2001
Kenneth S. Stevens; Shai Rotem; Ran Ginosar; Peter A. Beerel; Chris J. Myers; Kenneth Y. Yun; R. Koi; Charles E. Dike; Marly Roncken
This paper describes an investigation of potential advantages and pitfalls of applying an asynchronous design methodology to an advanced microprocessor architecture. A prototype complex instruction set length decoding and steering unit was implemented using self-timed circuits. [The Revolving Asynchronous Pentium/sup (R)/ Processor Instruction Decoder (RAPPID) design implemented the complete Pentium II/sup (R)/ 32-bit MMX instruction set.] The prototype chip was fabricated on a 0.25 /spl mu/m CMOS process and tested successfully. Results show significant advantages - in particular, performance of 2.5-4.5 instructions per nanosecond - with manageable risks using this design technology. The prototype achieves three times the throughput and half the latency, dissipating only half the power and requiring about the same area as the fastest commercial 400 MHz clocked circuit fabricated on the same process.
IEEE Transactions on Very Large Scale Integration Systems | 2003
Kenneth S. Stevens; Ran Ginosar; Shai Rotem
Relative timing (RT) is introduced as a method for asynchronous design. Timing requirements of a circuit are made explicit using relative timing. Timing can be directly added, removed, and optimized using this style. RT synthesis and verification are demonstrated on three example circuits, facilitating transformations from speed-independent circuits to burst-mode and pulse-mode circuits. Relative timing enables improved performance, area, power, and functional testability of up to a factor of 3/spl times/ in all three cases. This method is the foundation of optimized timed circuit designs used in an industrial test chip, and may be formalized and automated.
international symposium on advanced research in asynchronous circuits and systems | 1998
Wei-Chun Chou; Peter A. Beerel; Ran Ginosar; Rakefet Kol; Chris J. Myers; Shai Rotem; Kenneth S. Stevens; Kenneth Y. Yun
This paper presents a technology mapping technique for optimizing the average-case delay of asynchronous combinational circuits implemented using domino logic and one-hot encoded outputs. The technique minimizes the critical path for common input patterns at the possible expense of making less common critical paths longer. To demonstrate the application of this technique, we present a case study of a combinational length decoding block, an integral component of an Asynchronous Instruction Length Decoder (AILD) which can be used in Pentium(R) processors. The experimental results demonstrate that the average-case delay of our mapped circuits can be dramatically lower than the worst-case delay of the circuits obtained using conventional worst-case mapping techniques.
symposium on code generation and optimization | 2010
Edson Borin; Youfeng Wu; Cheng Wang; Wei Liu; Mauricio Breternitz; Shiliang Hu; Esfir Natanzon; Shai Rotem; Roni Rosner
Dynamic binary translation is a key component of Hardware/Software (HW/SW) co-design, which is an enabling technology for processor microarchitecture innovation. There are two well-known dynamic binary optimization techniques based on atomic execution support. Frame-based optimizations leverage processor pipeline support to enable atomic execution of hot traces. Region level optimizations employ transactional-memory-like atomicity support to aggressively optimize large regions of code. In this paper we propose a two-level atomic optimization scheme which not only overcomes the limitations of the two approaches, but also boosts the benefits of the two approaches effectively. Our experiment shows that the combined approach can achieve a total of 21.5% performance improvement over an aggressive out-of-order baseline machine and improve the performance over the frame-based approach by an additional 5.3%.
international symposium on advanced research in asynchronous circuits and systems | 2000
Marly Roncken; Kenneth S. Stevens; Rajesh Pendurkar; Shai Rotem; Parimal Pal Chaudhuri
This paper presents a case study in low-cost noninvasive Built-in Self Test (BIST) for RAPPID, a large-scale 120,000-transistor asynchronous version of the Pentium(R) Pro Instruction Length Decoded which runs at 3.6 GHz. RAPPID uses a synchronous 0.25 micron CMOS library for static and domino logic, and has no Design-for-Test hooks other than some debug features. We explore the use of Cellular Automata (CA) for on-chip test pattern generation and response evaluation. More specifically, we look for fast ways to tune the CA-BIST to the RAPPID design, rather than using pseudo-random testing. The metric for tuning the CA-BIST pattern generation is based on an abstract hardware description model of the instruction length decodes which is independent of implementation details, and hence also independent of the asynchronous circuit style. Our CA-BIST solution uses a novel bootstrap procedure for generating the test patterns, which give complete coverage for this metric, and cover 94% of the testable stuck-at faults for the actual design at switch level. Analysis of the undetected and untestable faults shows that the same fault effects can be expected for a similar clocked circuit. This is encouraging evidence that testability is no excuse to avoid asynchronous design techniques in addition to high-performance synchronous solutions.
design automation conference | 2002
Sumit Gupta; Nick Savoiu; Nikil D. Dutt; Rajesh K. Gupta; Alexandru Nicolau; Timothy Kam; Michael Kishinevsky; Shai Rotem
High performance microprocessor designs are partially characterized by functional blocks consisting of a large number of operations that are packed into very few cycles (often single-cycle) with little or no resource constraints but tight bounds on the cycle time. Extreme parallelization, conditional and speculative execution of operations is essential to meet the processor performance goals. However, this is a tedious task for which classical high-level synthesis (HLS) formulations are inadequate and thus rarely used. In this paper, we present a new methodology for application of HLS targeted to such microprocessor functional blocks that can potentially speed up the design space exploration for microprocessor designs. Our methodology consists of a coordinated set of source-level and fine-grain parallelizing compiler transformations that targets these behavioral descriptions, specifically loop constructs in them and enables efficient chaining of operations and high-level synthesis of the functional blocks. As a case study in understanding the complexity and challenges in the use of HLS, we walk the reader through the detailed design of an instruction length decoder drawn from the Pentium®-family of processors. The chief contribution of this paper is formulation of a domain-specific methodology for application of high-level synthesis techniques to a domain that rarely, if ever, finds use for it.
international symposium on advanced research in asynchronous circuits and systems | 1999
Kenneth S. Stevens; Ran Ginosar; Shai Rotem
international symposium on advanced research in asynchronous circuits and systems | 1999
Shai Rotem; Kenneth S. Stevens; Ran Ginosar; Peter A. Beerel; Chris J. Myers; Kenneth Y. Yun; Rakefet Kol; Charles E. Dike; Marly Roncken; Boris Agapiev
Archive | 2015
Sanjeev Jahagirdar; Varghese George; John B. Conrad; Robert Milstrey; Stephen A. Fischer; Alon Naveh; Shai Rotem
Archive | 1993
Shai Rotem; Ze'ev Shtadler