Network


Latest external collaboration on country level. Dive into details by clicking on the dots.

Hotspot


Dive into the research topics where Jose Renau is active.

Publication


Featured researches published by Jose Renau.


international symposium on microarchitecture | 2002

Cherry: Checkpointed early resource recycling in out-of-order microprocessors

Jose F. Martinez; Jose Renau; Michael C. Huang; Milos Prvulovic; Josep Torrellas

This paper presents checkpointed early resource recycling (Cherry), a hybrid mode of execution based on ROB and checkpointing that decouples resource recycling and instruction retirement. Resources are recycled early, resulting in a more efficient utilization. Cherry relies on state checkpointing and rollback to service exceptions for instructions whose resources have been recycled. Cherry leverages the ROB to (1) not require in-order execution as a fallback mechanism, (2) allow memory replay traps and branch mispredictions without rolling back to the Cherry checkpoint, and (3) quickly fall back to conventional out-of-order execution without rolling back to the checkpoint or flushing the pipeline. We present a Cherry implementation with early recycling at three different points of the execution engine: the load queue, the store queue, and the register file. We report average speedups of 1.06 and 1.26 in SPECint and SPECfp applications, respectively, relative to an aggressive conventional architecture. We also describe how Cherry and speculative multithreading can be combined and complement each other.


acm sigplan symposium on principles and practice of parallel programming | 2006

POSH: a TLS compiler that exploits program structure

Wei Liu; James Tuck; Luis Ceze; Wonsun Ahn; Karin Strauss; Jose Renau; Josep Torrellas

As multi-core architectures with Thread-Level Speculation (TLS) are becoming better understood, it is important to focus on TLS compilation. TLS compilers are interesting in that, while they do not need to fully prove the independence of concurrent tasks, they make choices of where and when to generate speculative tasks that are crucial to overall TLS performance.This paper presents POSH, a new, fully automated TLS compiler built on top of gcc. POSH is based on two design decisions. First, to partition the code into tasks, it leverages the code structures created by the programmer, namely subroutines and loops. Second, it uses a simple profiling pass to discard ineffective tasks. With the code generated by POSH, a simulated TLS chip multiprocessor with 4 superscalar cores delivers an average speedup of 1.30 for the SPECint 2000 applications. Moreover, an estimated 26% of this speedup is a result of the implicit data prefetching provided by squashed tasks.


international symposium on computer architecture | 2003

Positional adaptation of processors: application to energy reduction

Michael C. Huang; Jose Renau; Josep Torrellas

Although adaptive processors can exploit application variability to improve performance or save energy, effectively managing their adaptivity is challenging. To address this problem, we introduce a new approach to adaptivity: the Positional approach. In this approach, both the testing of configurations and the application of the chosen configurations are associated with particular code sections. This is in contrast to the currently-used Temporal approach to adaptation, where both the testing and application of configurations are tied to successive intervals in time.We propose to use subroutines as the granularity of code sections in positional adaptation. Moreover, we design three implementations of subroutine-based positional adaptation that target energy reduction in three different workload environments: embedded or specialized server, general purpose, and highly dynamic. All three implementations of positional adaptation are much more effective than temporal schemes. On average, they boost the energy savings of applications by 50% and 84% over temporal schemes in two experiments.


international symposium on microarchitecture | 2000

A framework for dynamic energy efficiency and temperature management

Michael C. Huang; Jose Renau; Seung-Moon Yoo; Josep Torrellas

While technology is delivering increasingly sophisticated and powerful chip designs, it is also imposing alarmingly high energy requirements on the chips. One way to address this problem is to manage the energy dynamically. Unfortunately, current dynamic schemes for energy management are relatively limited. In addition, they manage energy either for energy efficiency or for temperature control, but not for both simultaneously. In this paper, we design and evaluate for the first time an energy-management framework that tackles both energy efficiency and temperature control in a unified manner. We call this general approach Dynamic Energy Efficiency and Temperature Management (DEETM). Our framework combines several energy-management techniques and can activate them individually or in groups in a fine-grained manner according to a given policy. The goal of the framework is two-fold: maximize energy savings without extending application execution time beyond a given tolerable limit, and guarantee that the temperature remains below a given limit while minimizing any resulting slowdown. The framework successfully meets these goals. For example, it delivers a 40% energy reduction with a 10% application slowdown.


international conference on supercomputing | 2005

Tasking with out-of-order spawn in TLS chip multiprocessors: microarchitecture and compilation

Jose Renau; James Tuck; Wei Liu; Luis Ceze; Karin Strauss; Josep Torrellas

Chip Multiprocessors (CMPs) are flexible, high-frequency platforms on which to support Thread-Level Speculation (TLS). However, for TLS to deliver on its promise, CMPs must exploit multiple sources of speculative task-level parallelism, including any nesting levels of both subroutines and loop iterations. Unfortunately, these environments are hard to support in decentralized CMP hardware: since tasks are spawned out-of-order and unpredictably, maintaining key TLS basics such as task ordering and efficient resource allocation is challenging.While the concept of out-of-order spawning is not new, this paper is the first to propose a set of microarchitectural mechanisms that, altogether, fundamentally enable fast TLS with out-of-order spawn in a CMP. Moreover, we develop a fully-automated TLS compiler for aggressive out-of-order spawn. With our mechanisms, a TLS CMP with four 4-issue cores achieves an average speedup of 1.30 for full SPECint 2000 applications; the corresponding speedup for in-order only spawn is 1.04. Overall, our mechanisms unlock the potential of TLS for the toughest applications.


international symposium on low power electronics and design | 2001

L1 data cache decomposition for energy efficiency

Michael C. Huang; Jose Renau; Seung-Moon Yoo; Josep Torrellas

The L1 data cache is a time-critical module and, at the same time a major consumer of energy. To reduce its energy-delay product, we apply two principles of low-power design: specialize part of the cache structure and break the cache down into smaller caches. To this end, we propose a new L1 data cache structure that combines a Specialized Stack Cache (SSC) and a Pseudo Set-Associative Cache (PSAC). Individually, our SSC and PSAC designs have a lower energy-delay product than previously-proposed related designs. In addition, their combined operation is very effective. Relative to a conventional 2-way 32 KB data cache, a design containing a 4-way 32 KB PSAC and a 512 B SSC reduces the energy-delay product of several applications by an average of 44%.


high-performance computer architecture | 2013

ESESC: A fast multicore simulator using Time-Based Sampling

Ehsan K. Ardestani; Jose Renau

Architects rely on simulation in their exploration of the design space. However, slow simulation speed caps their productivity and limits the depth of their exploration. Sampling has been a commonly used remedy. While sampling is shown to be an effective technique for single core processors, its application has been limited to simulation of multi-program, throughput applications only. This work presents Time-Based Sampling (TBS), a framework that is the first to enable sampling in simulation of multicore processors with virtually no limitation in terms of application type (multi-programmed or multithreaded), number of cores, homogeneity or heterogeneity of the simulated configuration (4.99% error averaged across all the evaluated configurations). TBS also is the first to enable integrated power and temperature evaluation in statistically sampled simulation of multicore systems (with 5.5% and 2.4% error on average, respectively). We implement an architectural simulator based on TBS, called ESESC, that provides a holistic set of tools for a fair evaluation of different architectures.


international symposium on computer architecture | 2007

Power model validation through thermal measurements

Francisco J. Mesa-Martinez; Joseph Nayfach-Battilana; Jose Renau

Simulation environments are an indispensable tool in the design, prototyping, performance evaluation, and analysis of computer systems. Simulator must beable to faithfully reflect the behavior of the system being analyzed. To ensure the accuracy of the simulator, it must be verified and determined to closely match empirical data. Modern processors provide enough performance counters to validate the majority of the performance models; nevertheless, the information provided is not enough to validate power and thermal models. In order to address some of the difficulties associated with the validation of power andthermal models, this paper proposes an infrared measurement setup to capture run-time power consumption and thermal characteristics of modern chips. We use infrared cameras with high spatial resolution (10x10μm) and high frame rate (125fps) to capture thermal maps. To generate a detailed power breakdown (leakage and dynamic) for each processor floorplan unit, we employ genetic algorithms. The genetic algorithm finds a power equation for each floorplan block that produces the measured temperature for a given thermal package. The difference between the predicted power and the externally measured power consumption for an AMD Athlon analyzed in this paper has less than 1% discrepancy. As an example of applicability, we compare the obtained measurements with CACTI power models, and propose extensions to existing thermal models to increase accuracy.


architectural support for programming languages and operating systems | 2010

Characterizing processor thermal behavior

Francisco J. Mesa-Martinez; Ehsan K. Ardestani; Jose Renau

Temperature is a dominant factor in the performance, reliability, and leakage power consumption of modern processors. As a result, increasing numbers of researchers evaluate thermal characteristics in their proposals. In this paper, we measure a real processor focusing on its thermal characterization executing diverse workloads. Our results show that in real designs, thermal transients operate at larger scales than their performance and power counterparts. Conventional thermal simulation methodologies based on profile-based simulation or statistical sampling, such as Simpoint, tend to explore very limited execution spans. Short simulation times can lead to reduced matchings between performance and thermal phases. To illustrate these issues we characterize and classify from a thermal standpoint SPEC00 and SPEC06 applications, which are traditionally used in the evaluation of architectural proposals. This paper concludes with a list of recommendations regarding thermal modeling considerations based on our experimental insights.


international symposium on low power electronics and design | 2002

Energy-efficient hybrid wakeup logic

Michael C. Huang; Jose Renau; Josep Torrellas

The instruction window is a critical component and a major energy consumer in out-of-order superscalar processors. An important source of energy consumption in the instruction window is the instruction wakeup: a completing instruction broadcasts its result register tag and an associative comparison is performed with all the entries in the window.This paper shows that a very large fraction of the completing instructions have to wake up no more than a single instruction currently in the window. Consequently, we propose to save energy by using indexing to only enable the comparator at the single instruction to wake up. Only in the rare case when more than one instruction needs to wake up, our scheme reverts to enabling all the comparators or a subset of them. For this reason, we call our scheme Hybrid. Overall, our scheme is very effective: for a processor with a 96-entry window, the number of comparisons performed by the average completing instruction with a destination register is reduced to 0.8. The exact magnitude of the energy savings will depend on the specific instruction window implementation. Furthermore, the application suffers no performance penalty.

Collaboration


Dive into the Jose Renau's collaboration.

Top Co-Authors

Avatar
Top Co-Authors

Avatar
Top Co-Authors

Avatar

Elnaz Ebrahimi

University of California

View shared research outputs
Top Co-Authors

Avatar
Top Co-Authors

Avatar
Top Co-Authors

Avatar
Top Co-Authors

Avatar

Luis Ceze

University of Washington

View shared research outputs
Top Co-Authors

Avatar

Cyrus Bazeghi

University of California

View shared research outputs
Top Co-Authors

Avatar
Top Co-Authors

Avatar

Michael C. Huang

University of Illinois at Urbana–Champaign

View shared research outputs
Researchain Logo
Decentralizing Knowledge