James C. Hoe | Researchain

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where James C. Hoe is active.

Explore More

Publication

Featured researches published by James C. Hoe.

international symposium on computer architecture | 2003

SMARTS: accelerating microarchitecture simulation via rigorous statistical sampling

Roland E. Wunderlich; Thomas F. Wenisch; Babak Falsafi; James C. Hoe

Current software-based microarchitecture simulators are many orders of magnitude slower than the hardware they simulate. Hence, most microarchitecture design studies draw their conclusions from drastically truncated benchmark simulations that are often inaccurate and misleading. This paper presents the Sampling Microarchitecture Simulation (SMARTS) framework as an approach to enable fast and accurate performance measurements of full-length benchmarks. SMARTS accelerates simulation by selectively measuring in detail only an appropriate benchmark subset. SMARTS prescribes a statistically sound procedure for configuring a systematic sampling simulation run to achieve a desired quantifiable confidence in estimates.Analysis of 41 of the 45 possible SPEC2K benchmark/input combinations show CPI and energy per instruction (EPI) can be estimated to within ±3% with 99.7% confidence by measuring fewer than 50 million instructions per benchmark. In practice, inaccuracy in microarchitectural state initialization introduces an additional uncertainty which we empirically bound to ∼2% for the tested benchmarks. Our implementation of SMARTS achieves an actual average error of only 0.64% on CPI and 0.59% on EPI for the tested benchmarks, running with average speedups of 35 and 60 over detailed simulation of 8-way and 16-way out-of-order processors, respectively.

international symposium on microarchitecture | 2007

Multi-bit Error Tolerant Caches Using Two-Dimensional Error Coding

Jangwoo Kim; Nikos Hardavellas; Ken Mai; Babak Falsafi; James C. Hoe

In deep sub-micron ICs, growing amounts of on-die memory and scaling effects make embedded memories increasingly vulnerable to reliability and yield problems. As scaling progresses, soft and hard errors in the memory system will increase and single error events are more likely to cause large-scale multi- bit errors. However, conventional memory protection techniques can neither detect nor correct large-scale multi-bit errors without incurring large performance, area, and power overheads. We propose two-dimensional (2D) error coding in embedded memories, a scalable multi-bit error protection technique to improve memory reliability and yield. The key innovation is the use of vertical error coding across words that is used only for error correction in combination with conventional per-word horizontal error coding. We evaluate this scheme in the cache hierarchies of two representative chip multiprocessor designs and show that 2D error coding can correct clustered errors up to 32times32 bits with significantly smaller performance, area, and power overheads than conventional techniques.

international symposium on microarchitecture | 2006

SimFlex: Statistical Sampling of Computer System Simulation

Thomas F. Wenisch; Roland E. Wunderlich; Michael Ferdman; Anastassia Ailamaki; Babak Falsafi; James C. Hoe

Timing-accurate full-system multiprocessor simulations can take years because of architecture and application complexity. Statistical sampling makes simulation-based studies feasible by providing ten-thousand-fold reductions in simulation runtime and enabling thousand-way simulation parallelism

international symposium on microarchitecture | 2001

Dual use of superscalar datapath for transient-fault detection and recovery

Joydeep Ray; James C. Hoe; Babak Falsafi

Diminutive devices and high clock frequency of future microprocessor generations are causing increased concerns for transient soft failures in hardware, necessitating fault detection and recovery mechanisms even in commodity processors. In this paper, we propose a fault-tolerant extension for modern superscalar out-of-order datapath that can be supported by only modest additional hardware. In the proposed extensions, error-detection is achieved by verifying the redundant results of dynamically replicated threads of executions, while the error-recovery scheme employs the instruction-rewind mechanism to restart at a failed instruction. We study the performance impact of augmenting superscalar microarchitectures with this fault tolerance mechanism. An analytical performance model is used in conjunction with a performance simulator The simulation results of 11 SPEC95 and SPEC2000 benchmarks show that in the absence of faults, error detection causes a 2% to 45% reduction in throughput, which is in line with other proposed detection schemes. In the presence of transient faults, the fast error recovery scheme contributes very little additional slowdown.

international symposium on microarchitecture | 2007

RAMP: Research Accelerator for Multiple Processors

John Wawrzynek; David A. Patterson; Mark Oskin; Shih-Lien Lu; Christoforos E. Kozyrakis; James C. Hoe; Derek Chiou; Krste Asanovic

The RAMP projects goal is to enable the intensive, multidisciplinary innovation that the computing industry will need to tackle the problems of parallel processing. RAMP itself is an open-source, community-developed, FPGA-based emulator of parallel architectures. its design framework lets a large, collaborative community develop and contribute reusable, composable design modules. three complete designs - for transactional memory, distributed systems, and distributed-shared memory - demonstrate the platforms potential.

measurement and modeling of computer systems | 2004

SimFlex: a fast, accurate, flexible full-system simulation framework for performance evaluation of server architecture

Nikolaos Hardavellas; Stephen Somogyi; Thomas F. Wenisch; Roland E. Wunderlich; Shelley Chen; Jangwoo Kim; Babak Falsafi; James C. Hoe; Andreas G. Nowatzyk

The new focus on commercial workloads in simulation studies of server systems has caused a drastic increase in the complexity and decrease in the speed of simulation tools. The complexity of a large-scale full-system model makes development of a monolithic simulation tool a prohibitively difficult task. Furthermore, detailed full-system models simulate so slowly that experimental results must be based on simulations of only fractions of a second of execution of the modelled system.This paper presents SIMFLEX, a simulation framework which uses component-based design and rigorous statistical sampling to enable development of complex models and ensure representative measurement results with fast simulation turnaround. The novelty of SIMFLEX lies in its combination of a unique, compile-time approach to component interconnection and a methodology for obtaining accurate results from sampled simulations on a platform capable of evaluating unmodified commercial workloads.

international symposium on microarchitecture | 2006

Reunion: Complexity-Effective Multicore Redundancy

Jared C. Smolens; Brian T. Gold; Babak Falsafi; James C. Hoe

To protect processor logic from soft errors, multicore redundant architectures execute two copies of a program on separate cores of a chip multiprocessor (CMP). Maintaining identical instruction streams is challenging because redundant cores operate independently, yet must still receive the same inputs (e.g., load values and shared-memory invalidations). Past proposals strictly replicate load values across two cores, requiring significant changes to the highly-optimized core. We make the key observation that, in the common case, both cores load identical values without special hardware. When the cores do receive different load values (e.g., due to a data race), the same mechanisms employed for soft error detection and recovery can correct the difference. This observation permits designs that relax input replication, while still providing correct redundant execution. In this paper, we present Reunion, an execution model that provides relaxed input replication and preserves the existing memory interface, coherence protocols, and consistency models. We evaluate a CMP-based implementation of the Reunion execution model with full-system, cycle-accurate simulation. We show that the performance overhead of relaxed input replication is only 5% and 6% for commercial and scientific workloads, respectively

field programmable gate arrays | 2011

CoRAM: an in-fabric memory architecture for FPGA-based computing

Eric S. Chung; James C. Hoe; Ken Mai

FPGAs have been used in many applications to achieve orders-of-magnitude improvement in absolute performance and energy efficiency relative to conventional microprocessors. Despite their promise in both processing performance and efficiency, FPGAs have not yet gained widespread acceptance as mainstream computing devices. A fundamental obstacle to FPGA-based computing today is the FPGAs lack of a common, scalable memory architecture. When developing applications for FPGAs, designers are often directly responsible for crafting the application-specific infrastructure logic that manages and transports data to and from the processing kernels. This infrastructure not only increases design time and effort but will frequently lock a design to a particular FPGA product line, hindering scalability and portability. We propose a new FPGA memory architecture called Connected RAM (CoRAM) to serve as a portable bridge between the distributed computation kernels and the external memory interfaces. In addition to improving performance and efficiency, the CoRAM architecture provides a virtualized memory environment as seen by the hardware kernels to simplify development and to improve an applications portability and scalability.

ACM Transactions on Reconfigurable Technology and Systems | 2009

ProtoFlex: Towards Scalable, Full-System Multiprocessor Simulations Using FPGAs

Eric S. Chung; Michael K. Papamichael; Eriko Nurvitadhi; James C. Hoe; Ken Mai; Babak Falsafi

Functional full-system simulators are powerful and versatile research tools for accelerating architectural exploration and advanced software development. Their main shortcoming is limited throughput when simulating large multiprocessor systems with hundreds or thousands of processors or when instrumentation is introduced. We propose the ProtoFlex simulation architecture, which uses FPGAs to accelerate full-system multiprocessor simulation and to facilitate high-performance instrumentation. Prior FPGA approaches that prototype a complete system in hardware are either too complex when scaling to large-scale configurations or require significant effort to provide full-system support. In contrast, ProtoFlex virtualizes the execution of many logical processors onto a consolidated number of multiple-context execution engines on the FPGA. Through virtualization, the number of engines can be judiciously scaled, as needed, to deliver on necessary simulation performance at a large savings in complexity. Further, to achieve low-complexity full-system support, a hybrid simulation technique called transplanting allows implementing in the FPGA only the frequently encountered behaviors, while a software simulator preserves the abstraction of a complete system. We have created a first instance of the ProtoFlex simulation architecture, which is an FPGA-based, full-system functional simulator for a 16-way UltraSPARC III symmetric multiprocessor server, hosted on a single Xilinx Virtex-II XCV2P70 FPGA. On average, the simulator achieves a 38x speedup (and as high as 49×) over comparable software simulation across a suite of applications, including OLTP on a commercial database server. We also demonstrate the advantages of minimal-overhead FPGA-accelerated instrumentation through a CMP cache simulation technique that runs orders-of-magnitude faster than software.

IEEE Transactions on Very Large Scale Integration Systems | 1999

Hardware Synthesis from Term Rewriting Systems

James C. Hoe; Arvind

Term Rewriting System (TRS) is a good formalism for describing concurrent systems that embody asynchronous and nondeterministic behavior in their specifications. Elsewhere, we have used TRS’s to describe speculative micro-architectures and complex cache-coherence protocols, and proven the correctness of these systems. In this paper, we describe the compilation of TRS’s into a subset of Verilog that can be simulated and synthesized using commercial tools. TRAC, Term Rewriting Architecture Compiler, enables a new hardware development framework that can match the ease of today’s software programming environment. TRAC reduces the time and effort in developing and debugging hardware. For several examples, we compare TRAC-generated RTL’s with hand-coded RTL’s after they are both compiled for Field Programmable Gate Arrays by Xilinx tools. The circuits generated from TRS are competitive with those described using Verilog RTL, especially for larger designs.

Explore More