Stephen Richardson | Researchain

Publication

Featured research published by Stephen Richardson.

Symposium on Computer Arithmetic | 1993

Exploiting trivial and redundant computation

Stephen Richardson

The notion of trivial computation, in which the appearance of simple operands renders potentially complex operations simple, is discussed. An example of a trivial operation is integer division where the divisor is two; the division becomes a simple shift operation. The concept of redundant computation, in which some operation repeatedly does the same function because it repeatedly sees the same operands, is also discussed. Experiments on two separate benchmark suites, the SPEC benchmarks and the Perfect Club, find a surprising amount of trivial and redundant operation. Various architectural means of exploiting this knowledge to improve computational efficiency include detection of trivial operands and the result cache. Further experimentation shows significant speedup from these techniques, as measured on three different styles of machine architecture.
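The two ideas in this abstract can be illustrated with a minimal software sketch. It is not the paper's hardware design: the function `divide`, the class `ResultCache`, its LRU replacement policy, and the extension from "divisor is two" to any power of two are all assumptions made for illustration.

```python
from collections import OrderedDict
from typing import Callable, Tuple


def is_power_of_two(n: int) -> bool:
    """True for 1, 2, 4, 8, ..."""
    return n > 0 and (n & (n - 1)) == 0


def divide(dividend: int, divisor: int) -> Tuple[int, str]:
    """Non-negative integer division that reports which path was taken.

    Assumption: any power-of-two divisor counts as "trivial" here; the
    abstract only names the divisor-is-two case.
    """
    if dividend < 0 or divisor <= 0:
        raise ValueError("sketch handles non-negative dividends and positive divisors only")
    if is_power_of_two(divisor):
        # Trivial operand detected: the division reduces to a right shift.
        return dividend >> (divisor.bit_length() - 1), "trivial"
    return dividend // divisor, "full"


class ResultCache:
    """Hypothetical result cache keyed by (operation, operand, operand)."""

    def __init__(self, capacity: int = 4) -> None:
        self.capacity = capacity
        self.entries: "OrderedDict[tuple, int]" = OrderedDict()
        self.hits = 0
        self.misses = 0

    def lookup_or_compute(self, op: str, a: int, b: int,
                          compute: Callable[[int, int], int]) -> int:
        key = (op, a, b)
        if key in self.entries:
            # Redundant computation: same operation, same operands as before.
            self.hits += 1
            self.entries.move_to_end(key)
            return self.entries[key]
        self.misses += 1
        result = compute(a, b)
        self.entries[key] = result
        if len(self.entries) > self.capacity:
            # Assumed policy: evict the least recently used entry.
            self.entries.popitem(last=False)
        return result
```

In this sketch the trivial-operand check runs before any expensive work, while the result cache only pays off when the same operand pair recurs within the cache's capacity.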

IEEE Micro | 2010

Rethinking Digital Design: Why Design Must Change

Ofer Shacham; Omid Azizi; Megan Wachs; Wajahat Qadeer; Zain Asgar; Kyle Kelley; John P. Stevenson; Stephen Richardson; Mark Horowitz; Benjamin C. Lee; Alex Solomatnikov; Amin Firoozshahian

Because of technology scaling, power dissipation is today's major performance limiter. Moreover, the traditional way to achieve power efficiency, application-specific designs, is prohibitively expensive. These power and cost issues necessitate rethinking digital design. To reduce design costs, we need to stop building chip instances, and start making chip generators instead. Domain-specific chip generators are templates that codify designer knowledge and design trade-offs to create different application-optimized chips.

Software - Practice and Experience | 1989

Interprocedural optimization: experimental results

Stephen Richardson; Mahadevan Ganapathi

The problem of tracking data flow across procedure boundaries has a long history of theoretical study by people who believed that such information would be useful for code optimization. Building upon previous work, an algorithm for interprocedural data flow analysis has been implemented. The algorithm produces three flow-insensitive summary sets: MOD, USE and ALIASES. The utility of the resulting information was investigated using an optimizing Pascal compiler. Over a sampling of 27 benchmarks, new optimizations performed as a result of interprocedural summary information contributed almost nothing to program execution speed. Finally, related optimization techniques of possibly greater potential are discussed.
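As a rough illustration of what flow-insensitive MOD and USE summary sets contain, the sketch below propagates per-procedure sets over a call graph until nothing changes. It is not the algorithm from the paper: the function `summarize`, the example procedures, and the restriction to global variables (no parameters, no ALIASES set) are assumptions made to keep the example small.

```python
from typing import Dict, List, Set


def summarize(direct: Dict[str, Set[str]],
              calls: Dict[str, List[str]]) -> Dict[str, Set[str]]:
    """Fold each callee's set into its callers until a fixed point is reached.

    `direct` maps a procedure to the variables it touches itself;
    `calls` maps a procedure to the procedures it may call.
    Statement order inside a procedure is ignored (flow-insensitive).
    """
    summary = {proc: set(names) for proc, names in direct.items()}
    changed = True
    while changed:
        changed = False
        for caller, callees in calls.items():
            for callee in callees:
                missing = summary[callee] - summary[caller]
                if missing:
                    summary[caller] |= missing
                    changed = True
    return summary


# Hypothetical three-procedure program: main calls update, update calls log.
calls = {"main": ["update"], "update": ["log"], "log": []}
direct_mod = {"main": set(), "update": {"total"}, "log": {"count"}}
direct_use = {"main": {"limit"}, "update": {"total", "step"}, "log": set()}

MOD = summarize(direct_mod, calls)
USE = summarize(direct_use, calls)
```

Here `MOD["main"]` ends up containing both `total` and `count`, so an optimizer in this toy setting would know that a call from `main` to `update` cannot modify `limit`.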

Information Processing Letters | 1989

Interprocedural analysis vs. procedure integration

Stephen Richardson; Mahadevan Ganapathi

A set of experimental results shows the exact run-time improvement due to both procedure integration and the use of interprocedural data-flow information, as well as the relative impact on compilation time and object code size.

IEEE Computer | 1989

Code optimization across procedures

Stephen Richardson; Mahadevan Ganapathi

Procedure calls can be a major obstacle to the analysis of computer programs, preventing significant improvements in program speed. A broad range of techniques, each of which is in some sense interprocedural by nature, is considered to overcome this obstacle. Some techniques rely on interprocedural dataflow in their analysis. Others require interprocedural information in the form of detailed profile data or information concerning the scope of a given procedure in relation to other procedures. These include procedure integration, interprocedural register allocation, pointer and alias tracking, and dependency analysis.

Design Automation Conference | 2012

Avoiding game over: bringing design to the next level

Ofer Shacham; Megan Wachs; Andrew Danowitz; Sameh Galal; John S. Brunhaver; Wajahat Qadeer; Sabarish Sankaranarayanan; Artem Vassiliev; Stephen Richardson; Mark Horowitz

Technology scaling has created a catch-22: technology now can do almost anything we want, but the NRE design costs are so high that almost no one can afford to use it. Our current situation is reminiscent of the 1980s, when only a few companies could afford to produce custom silicon. Synthesis and placement and routing tools changed this, by providing modular tools with well-defined interfaces that codified designer knowledge about the physical design of chips. Now we need a new set of tools that can codify designer knowledge about how to construct software, hardware, and validation to again enable application designers to produce chips. Researchers are developing methodologies that allow users to create hardware constructors, or generators. These include Genesis 2, which extends SystemVerilog and enables the designer to encode hierarchical system construction procedurally. To demonstrate some of the capabilities that these languages and tools provide, we describe FPGen, a complete floating point generator written in Genesis 2, that also generates the needed validation collateral and hints for the backend processes.

International Symposium on Microarchitecture | 2008

Verification of chip multiprocessor memory systems using a relaxed scoreboard

Ofer Shacham; Megan Wachs; Alex Solomatnikov; Amin Firoozshahian; Stephen Richardson; Mark Horowitz

Verification of chip multiprocessor memory systems remains challenging. While formal methods have been used to validate protocols, simulation is still the dominant method used to validate memory system implementation. Having a memory scoreboard, a high-level model of the memory, greatly aids simulation-based validation, but accurate scoreboards are complex to create since often they depend not only on the memory and consistency model but also on its specific implementation. This paper describes a methodology of using a relaxed scoreboard, which greatly reduces the complexity of creating these memory models. The relaxed scoreboard tracks the operations of the system to maintain a set of values that could possibly be valid for each memory location. By allowing multiple possible values, the model used in the scoreboard is only loosely coupled with the specific design, which decouples the construction of the checker from the implementation, allowing the checker to be used early in the design and to be built up incrementally, and greatly reduces the scoreboard design effort. We demonstrate the use of the relaxed scoreboard in verifying RTL implementations of two different memory models, Transactional Coherency and Consistency (TCC) and Relaxed Consistency, for up to 32 processors. The resulting checker has a performance slowdown of 19% for checking Relaxed Consistency, and less than 30% for TCC, allowing it to be used in all simulation runs.
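The core idea, keeping a set of possibly valid values per memory location instead of one golden value, can be sketched in a few lines. The abstract does not give the scoreboard's update rules, so the class `RelaxedScoreboard`, its `store`, `retire` and `load` events, and the assumption that memory starts at zero are all hypothetical.

```python
from typing import Dict, Set


class RelaxedScoreboard:
    """Toy checker: each address maps to the set of values a load may return."""

    def __init__(self, initial: int = 0) -> None:
        self.initial = initial  # assumed reset value of every location
        self.possible: Dict[int, Set[int]] = {}

    def _values(self, addr: int) -> Set[int]:
        return self.possible.setdefault(addr, {self.initial})

    def store(self, addr: int, value: int) -> None:
        # A store makes its value possible without invalidating older values,
        # because the checker does not know when the store becomes visible.
        self._values(addr).add(value)

    def retire(self, addr: int, value: int) -> None:
        # Hypothetical event: the testbench knows `value` can no longer be
        # observed at `addr`, so the set of legal values is tightened.
        self._values(addr).discard(value)

    def load(self, addr: int, value: int) -> bool:
        # A load passes the check if it returns any currently possible value.
        return value in self._values(addr)


sb = RelaxedScoreboard()
sb.store(0x10, 7)
old_value_ok = sb.load(0x10, 0)    # stale but still legal in this model
new_value_ok = sb.load(0x10, 7)
bogus_value_ok = sb.load(0x10, 9)  # never written, so the check fails
sb.retire(0x10, 0)
old_after_retire_ok = sb.load(0x10, 0)
```

Starting with only `store` and `load` gives a permissive checker that still catches values that were never written; adding `retire`-style rules later tightens it, which mirrors the incremental build-up the abstract describes.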

IEEE Design & Test of Computers | 2017

Dark Memory and Accelerator-Rich System Optimization in the Dark Silicon Era

Ardavan Pedram; Stephen Richardson; Mark Horowitz; Sameh Galal; Shahar Kvatinsky

Unlike traditional dark silicon works that attack the computing logic, this article puts a focus on the memory part, which dissipates most of the energy for memory-bound CPU applications. This article discusses the dark memory state, presents Pareto curves for compute units, accelerators, and on-chip memory, and motivates the need for HW/SW codesign for parallelism and locality. –Muhammad Shafique, Vienna University of Technology

International Symposium on Microarchitecture | 2009

Using a configurable processor generator for computer architecture prototyping

Alex Solomatnikov; Amin Firoozshahian; Ofer Shacham; Zain Asgar; Megan Wachs; Wajahat Qadeer; Stephen Richardson; Mark Horowitz

Building hardware prototypes for computer architecture research is challenging. Unfortunately, development of the required software tools (compilers, debuggers, runtime) is even more challenging, which means these systems rarely run real applications. To overcome this issue, when developing our prototype platform, we used the Tensilica processor generator to produce a customized processor and corresponding software tools and libraries. While this base processor was very different from the streamlined custom processor we initially imagined, it allowed us to focus on our main objective - the design of a reconfigurable CMP memory system - and to successfully tape out an 8-core CMP chip with only a small group of designers. One person was able to handle processor configuration and hardware generation, support of a complete software tool chain, as well as developing the custom runtime software to support three different programming models. Having a sophisticated software tool chain not only allowed us to run more applications on our machine, it once again pointed out the need to use optimized code to get an accurate evaluation of architectural features.

ACM SIGARCH Computer Architecture News | 2005

A chip prototyping substrate: the flexible architecture for simulation and testing (FAST)

John D. Davis; Stephen Richardson; Charis Charitsis; Kunle Olukotun

We describe a hybrid hardware emulation environment: the Flexible Architecture for Simulation and Testing (FAST). FAST integrates field-programmable gate arrays (FPGAs), microprocessors, and memory to enable rapid prototyping of chip multiprocessors, multithreaded architectures, or other novel computer architectures and chip-level memory systems. FAST combines configurable and fixed-function hardware and software to facilitate rapid prototyping by utilizing components optimized for their particular tasks: FPGAs for interconnect and glue logic; processors for rapid program execution; and SRAMs for fast memory. Unlike software simulators, FAST can simulate complex designs at multi-megahertz speeds regardless of the simulation detail. We illustrate FAST's utility by describing mappings of both a small-scale CMP with speculation support and a large-scale CMP connected using a network. We then show performance results from a very simple, decoupled 4-way CMP executing small test programs.