Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Daniel A. Jiménez is active.

Publication


Featured research published by Daniel A. Jiménez.


high-performance computer architecture | 2001

Dynamic branch prediction with perceptrons

Daniel A. Jiménez; Calvin Lin

This paper presents a new method for branch prediction. The key idea is to use one of the simplest possible neural networks, the perceptron, as an alternative to the commonly used two-bit counters. Our predictor achieves increased accuracy by making use of long branch histories, which are possible because the hardware resources for our method scale linearly with the history length. By contrast, other purely dynamic schemes require exponential resources. We describe our design and evaluate it with respect to two well-known predictors. We show that for a 4KB hardware budget our method improves misprediction rates for the SPEC 2000 benchmarks by 10.1% over the gshare predictor. Our experiments also provide a better understanding of the situations in which traditional predictors do and do not perform well. Finally, we describe techniques that allow our complex predictor to operate in one cycle.
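The mechanics are compact enough to sketch in code. The following is a minimal illustration of the idea, not the paper's hardware design: the table size and history length are placeholder values, and only the training threshold formula is taken from the paper.

/* Minimal sketch of perceptron branch prediction (Jimenez & Lin).
   Table size and history length are illustrative placeholders;
   the threshold formula 1.93h + 14 is the one given in the paper. */
#include <stdio.h>
#include <stdlib.h>

#define HIST_LEN  32                             /* global history bits (h) */
#define NUM_PERC  512                            /* perceptron table entries */
#define THRESHOLD ((int)(1.93 * HIST_LEN + 14))  /* training threshold */

static int weights[NUM_PERC][HIST_LEN + 1];      /* w[0] is the bias weight */
static int history[HIST_LEN];                    /* +1 taken, -1 not taken */

/* Dot product of weights and history; the sign gives the prediction. */
static int perceptron_output(unsigned pc) {
    int *w = weights[pc % NUM_PERC];
    int y = w[0];
    for (int i = 0; i < HIST_LEN; i++)
        y += w[i + 1] * history[i];
    return y;
}

/* Train on a misprediction, or when |y| is below the threshold. */
static void perceptron_train(unsigned pc, int y, int taken) {
    int t = taken ? 1 : -1;
    int *w = weights[pc % NUM_PERC];
    if ((y >= 0) != taken || abs(y) <= THRESHOLD) {
        w[0] += t;
        for (int i = 0; i < HIST_LEN; i++)
            w[i + 1] += t * history[i];
    }
    /* Shift the outcome into the global history register. */
    for (int i = HIST_LEN - 1; i > 0; i--)
        history[i] = history[i - 1];
    history[0] = t;
}

int main(void) {
    /* Toy trace: a branch at PC 0x400 alternates taken / not taken. */
    for (int n = 0; n < 64; n++) {
        int y = perceptron_output(0x400);
        int predicted = (y >= 0), actual = (n & 1);
        printf("pred=%d actual=%d\n", predicted, actual);
        perceptron_train(0x400, y, actual);
    }
    return 0;
}

Training only when the output's magnitude falls below the threshold keeps the weights from growing without bound once the predictor is already confident.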


international symposium on microarchitecture | 2000

The impact of delay on the design of branch predictors

Daniel A. Jiménez; Stephen W. Keckler; Calvin Lin

Modern microprocessors employ increasingly complicated branch predictors to achieve instruction fetch bandwidth that is sufficient for wide out-of-order execution cores. While existing predictors can still be accessed in a single clock cycle, recent studies show that slower wires and faster clock rates will require multi-cycle access times to large on-chip structures, such as branch prediction tables. Thus, future branch predictors must consider not only area and accuracy, but also delay. The paper explores these tradeoffs in designing branch predictors and shows that increased accuracy alone cannot overcome the penalties in delay that arise with larger predictor structures. We evaluate three schemes for accommodating delay: a caching approach, an overriding approach, and a cascading lookahead approach. While we use a common branch predictor, gshare, as the prediction component, these schemes can be constructed using most types of predictors.
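Of the three schemes, the overriding approach lends itself to a short sketch: a small predictor answers in a single cycle so fetch can proceed, and the large predictor's later answer overrides it when the two disagree, at a cost much smaller than a full misprediction. The component predictors, function names, and bookkeeping below are illustrative stand-ins, not the paper's implementation.

/* Sketch of an overriding predictor: a fast, small predictor answers
   immediately; a slower, more accurate one can override it later.
   The component predictors here are dummy stand-ins. */
#include <stdio.h>
#include <stdbool.h>

/* Stand-ins for the two components (e.g. a bimodal table + gshare). */
static bool fast_predict(unsigned pc) { return (pc >> 2) & 1; }
static bool slow_predict(unsigned pc) { return (pc >> 3) & 1; }

typedef struct { int predictions; int overridden; } stats_t;

static bool predict_with_override(unsigned pc, stats_t *s) {
    bool fast = fast_predict(pc);   /* available in cycle 1; fetch    */
    s->predictions++;               /* proceeds down this path        */
    bool slow = slow_predict(pc);   /* arrives a few cycles later     */
    if (slow != fast) {
        /* Squash the few instructions fetched on the fast path and
           refetch; this costs far less than a full misprediction.    */
        s->overridden++;
        return slow;
    }
    return fast;
}

int main(void) {
    stats_t s = {0, 0};
    for (unsigned pc = 0x1000; pc < 0x1100; pc += 4)
        predict_with_override(pc, &s);
    printf("predictions: %d, overridden: %d\n", s.predictions, s.overridden);
    return 0;
}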


international symposium on microarchitecture | 2010

Sampling Dead Block Prediction for Last-Level Caches

Samira Manabi Khan; Yingying Tian; Daniel A. Jiménez

Last-level caches (LLCs) are large structures with significant power requirements. They can be quite inefficient. On average, a cache block in a 2MB LRU-managed LLC is dead 86% of the time, i.e., it will not be referenced again before it is evicted. This paper introduces sampling dead block prediction, a technique that samples program counters (PCs) to determine when a cache block is likely to be dead. Rather than learning from accesses and evictions from every set in the cache, a sampling predictor keeps track of a small number of sets using partial tags. Sampling allows the predictor to use far less state than previous predictors to make predictions with superior accuracy. Dead block prediction can be used to drive a dead block replacement and bypass optimization. A sampling predictor can reduce the number of LLC misses over LRU by 11.7% for memory-intensive single-thread benchmarks and 23% for multi-core workloads. The reduction in misses yields a geometric mean speedup of 5.9% for single-thread benchmarks and a geometric mean normalized weighted speedup of 12.5% for multi-core workloads. Due to the reduced state and number of accesses, the sampling predictor consumes only 3.1% of the dynamic power and 1.2% of the leakage power of a baseline 2MB LLC, comparing favorably with more costly techniques. The sampling predictor can even be used to significantly improve a cache with a default random replacement policy.
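The predictor's core loop can be sketched as follows. This is a loose paraphrase under assumed parameters: a table of saturating counters, indexed by a hash of the PC that last touched a block in one of the sampled sets, learns whether that PC's touches tend to be last touches. The table size, counter width, and hash are assumptions, not the paper's values.

/* Sketch of sampling dead block prediction (Khan, Tian & Jimenez).
   Structure sizes, thresholds, and the hash are assumed values. */
#include <stdio.h>

#define PRED_ENTRIES 4096      /* PC-indexed saturating counters     */
#define CTR_MAX      3
#define DEAD_THRESH  2         /* predict dead at or above this      */

static unsigned char ctr[PRED_ENTRIES];

static unsigned hash_pc(unsigned pc) { return (pc ^ (pc >> 12)) % PRED_ENTRIES; }

/* Called only for the small number of sampled sets: when a sampled
   block is referenced again, the PC of its previous access was NOT a
   last touch; when it is evicted unreferenced, it was. */
static void sampler_hit(unsigned last_pc) {
    if (ctr[hash_pc(last_pc)] > 0) ctr[hash_pc(last_pc)]--;
}
static void sampler_evict(unsigned last_pc) {
    if (ctr[hash_pc(last_pc)] < CTR_MAX) ctr[hash_pc(last_pc)]++;
}

/* Consulted on every LLC access: is the block touched by this PC
   likely dead, and thus a candidate for replacement or bypass? */
static int predict_dead(unsigned pc) { return ctr[hash_pc(pc)] >= DEAD_THRESH; }

int main(void) {
    /* Toy example: PC 0xA0 streams data that is never reused,
       while blocks touched by PC 0xB4 get reused. */
    for (int i = 0; i < 8; i++) sampler_evict(0xA0);
    sampler_hit(0xB4);
    printf("0xA0 dead? %d, 0xB4 dead? %d\n",
           predict_dead(0xA0), predict_dead(0xB4));
    return 0;
}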


international symposium on microarchitecture | 2003

Fast path-based neural branch prediction

Daniel A. Jiménez

Microarchitectural prediction based on neural learning has received increasing attention in recent years. However, neural prediction remains impractical because its superior accuracy over conventional predictors is not enough to offset the cost imposed by its high latency. We present a new neural branch predictor that solves the problem from both directions: it is both more accurate and much faster than previous neural predictors. Our predictor improves accuracy by combining path and pattern history to overcome limitations inherent to previous predictors. It also has much lower latency than previous neural predictors. The result is a predictor with accuracy far superior to conventional predictors but with latency comparable to predictors from industrial designs. Our simulations show that a path-based neural predictor improves the instructions-per-cycle (IPC) rate of an aggressively clocked microarchitecture by 16% over the original perceptron predictor.
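The key structural change over the original perceptron predictor is that weight i is selected by the address of the i-th most recent branch on the path rather than by the predicted branch alone. This both adds path information and lets hardware build the sum incrementally, one small addition per branch, instead of one wide dot product at prediction time. A rough sketch with assumed table dimensions (the sum is computed all at once here rather than pipelined):

/* Sketch of path-based neural prediction: each weight is chosen by
   the address of an earlier branch on the path, and in hardware the
   dot product is accumulated as those branches are encountered.
   Table dimensions and the threshold are assumptions. */
#include <stdio.h>

#define H    16                /* path/history length                */
#define ROWS 256               /* weight table rows                  */

static int weights[ROWS][H + 1];
static unsigned path[H];       /* addresses of the last H branches   */
static int      hist[H];       /* their outcomes, +1 / -1            */

static int predict(unsigned pc) {
    /* Weight i is indexed by the i-th address on the path, not pc. */
    int y = weights[pc % ROWS][0];
    for (int i = 0; i < H; i++)
        y += weights[path[i] % ROWS][i + 1] * hist[i];
    return y;
}

static void update(unsigned pc, int y, int taken, int threshold) {
    int t = taken ? 1 : -1;
    if ((y >= 0) != taken || (y < 0 ? -y : y) <= threshold) {
        weights[pc % ROWS][0] += t;
        for (int i = 0; i < H; i++)
            weights[path[i] % ROWS][i + 1] += t * hist[i];
    }
    /* Shift this branch's address and outcome into the path/history. */
    for (int i = H - 1; i > 0; i--) { path[i] = path[i - 1]; hist[i] = hist[i - 1]; }
    path[0] = pc; hist[0] = t;
}

int main(void) {
    for (int n = 0; n < 32; n++) {
        unsigned pc = 0x40 + 4 * (n % 4);
        int y = predict(pc);
        update(pc, y, n % 3 != 0, 30);
    }
    printf("trained: y(0x40) = %d\n", predict(0x40));
    return 0;
}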


international symposium on neural networks | 1998

Dynamically weighted ensemble neural networks for classification

Daniel A. Jiménez

Combining the outputs of several neural networks into an aggregate output often gives improved accuracy over any individual output. The set of networks is known as an ensemble or committee. This paper presents an ensemble method for classification that has advantages over other techniques for linear combining. Normally, the output of an ensemble is a weighted sum whose weights are fixed, having been determined from the training or validation data. Our ensembles are weighted dynamically, the weights determined from the respective certainties of the network outputs. The more certain a network seems to be of its decision, the higher the weight.
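A small sketch makes the contrast with fixed-weight combining concrete. Here certainty is measured as each output's distance from the 0.5 decision boundary, and the weights are those certainties normalized; this particular certainty measure is an illustrative assumption rather than necessarily the paper's.

/* Sketch of a dynamically weighted ensemble for binary classification.
   The certainty measure (distance from the 0.5 decision boundary) is
   an illustrative assumption. */
#include <stdio.h>

#define N_NETS 3

/* Combine member outputs (each in [0,1]) with weights proportional
   to how far each output is from 0.5, i.e. how certain it is. */
static double ensemble_output(const double out[N_NETS]) {
    double cert[N_NETS], total = 0.0, combined = 0.0;
    for (int i = 0; i < N_NETS; i++) {
        cert[i] = out[i] > 0.5 ? out[i] - 0.5 : 0.5 - out[i];
        total += cert[i];
    }
    if (total == 0.0)                  /* all members undecided       */
        return 0.5;
    for (int i = 0; i < N_NETS; i++)
        combined += (cert[i] / total) * out[i];
    return combined;
}

int main(void) {
    /* Two mildly negative members, one very confident positive one. */
    double out[N_NETS] = { 0.45, 0.40, 0.95 };
    double y = ensemble_output(out);
    printf("ensemble output: %.3f -> class %d\n", y, y > 0.5);
    return 0;
}

In this example a fixed equal-weight average would sit at 0.60, barely positive, while the dynamic weighting lets the single confident member dominate and yields about 0.82.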


international symposium on computer architecture | 2005

Piecewise Linear Branch Prediction

Daniel A. Jiménez

Improved branch prediction accuracy is essential to sustaining instruction throughput with today's deep pipelines. We introduce piecewise linear branch prediction, an idealized branch predictor that develops a set of linear functions, one for each program path to the branch to be predicted, that separate predicted taken from predicted not taken branches. Taken together, all of these linear functions form a piecewise linear decision surface. We present a limit study of this predictor showing its potential to greatly improve predictor accuracy. We then introduce a practical, implementable branch predictor based on piecewise linear branch prediction. In making our predictor practical, we show how a parameterized version of it unifies the previously distinct concepts of perceptron prediction and path-based neural prediction. Our new branch predictor has implementation costs comparable to current prominent predictors in the literature while significantly improving accuracy. For a deeply pipelined simulated microarchitecture our predictor with a 256 KB hardware budget improves the harmonic mean normalized instructions-per-cycle rate by 8% over both the original path-based neural predictor and 2Bc-gskew. The average misprediction rate is decreased by 16% over the path-based neural predictor and by 22% over 2Bc-gskew.
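The parameterization is easy to state in code: keep a weight for every (branch address, path address, history position) triple, so each path leading to a branch contributes its own linear function. In the sketch below the table dimensions N, M, and H are assumptions chosen for brevity; roughly speaking, shrinking the path dimension to 1 degenerates to perceptron-style prediction and shrinking the branch dimension to 1 to path-based prediction, which is the unification the abstract mentions.

/* Sketch of piecewise linear branch prediction: weights are indexed
   by the predicted branch, an address on the path to it, and the
   history position. Dimensions and the threshold are assumptions. */
#include <stdio.h>

#define H  8                   /* history/path length                */
#define N 64                   /* branch-address dimension           */
#define M 64                   /* path-address dimension             */

static int W[N][M][H + 1];
static unsigned path[H];
static int      hist[H];

static int predict(unsigned pc) {
    int y = W[pc % N][0][0];                     /* bias weight       */
    for (int i = 0; i < H; i++)
        y += W[pc % N][path[i] % M][i + 1] * hist[i];
    return y;
}

static void train(unsigned pc, int y, int taken, int theta) {
    int t = taken ? 1 : -1;
    if ((y >= 0) != taken || (y < 0 ? -y : y) <= theta) {
        W[pc % N][0][0] += t;
        for (int i = 0; i < H; i++)
            W[pc % N][path[i] % M][i + 1] += t * hist[i];
    }
    for (int i = H - 1; i > 0; i--) { path[i] = path[i - 1]; hist[i] = hist[i - 1]; }
    path[0] = pc; hist[0] = t;
}

int main(void) {
    /* Branch 0x80's outcome correlates with which path led to it.   */
    for (int n = 0; n < 200; n++) {
        unsigned prev = (n & 1) ? 0x10 : 0x20;
        train(prev, predict(prev), 1, 20);       /* branch on the path */
        train(0x80, predict(0x80), n & 1, 20);   /* outcome follows path */
    }
    printf("y(0x80) given current path = %d\n", predict(0x80));
    return 0;
}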


high-performance computer architecture | 2003

Reconsidering complex branch predictors

Daniel A. Jiménez

To sustain instruction throughput rates in more aggressively clocked microarchitectures, microarchitects have incorporated larger and more complex branch predictors into their designs, taking advantage of the increasing numbers of transistors available on a chip. Unfortunately, because of penalties associated with their implementations, the extra accuracy provided by many branch predictors does not produce a proportionate increase in performance. Specifically, we show that the techniques used to hide the latency of a large and complex branch predictor do not scale well and will be unable to sustain IPC for deeper pipelines. We investigate a different way to build large branch predictors. We propose an alternative predictor design that completely hides predictor latency so that accuracy and hardware budget are the only factors that affect the efficiency of the predictor. Our simple design allows the predictor to be pipelined efficiently by avoiding difficulties introduced by complex predictors. Because this predictor eliminates the penalties associated with complex predictors, overall performance exceeds that of even the most accurate known branch predictors in the literature at large hardware budgets. We conclude that as chip densities increase in the next several years, the accuracy of complex branch predictors must be weighed against the performance benefits of simple branch predictors.


international conference on parallel architectures and compilation techniques | 2010

Using dead blocks as a virtual victim cache

Samira Manabi Khan; Doug Burger; Daniel A. Jiménez; Babak Falsafi

Caches mitigate the long memory latency that limits the performance of modern processors. However, caches can be quite inefficient. On average, a cache block in a 2MB L2 cache is dead 59% of the time, i.e., it will not be referenced again before it is evicted. Increasing cache efficiency can improve performance by reducing the miss rate or, alternatively, improve power and energy by allowing a smaller cache with the same miss rate.


high-performance computer architecture | 2014

Adaptive placement and migration policy for an STT-RAM-based hybrid cache

Zhe Wang; Daniel A. Jiménez; Cong Xu; Guangyu Sun; Yuan Xie

Emerging Non-Volatile Memories (NVM) such as Spin-Torque Transfer RAM (STT-RAM) and Resistive RAM (RRAM) have been explored as potential alternatives for traditional SRAM-based Last-Level Caches (LLCs) due to the benefits of higher density and lower leakage power. However, NVM technologies have long latency and high energy overhead associated with write operations. Consequently, a hybrid STT-RAM and SRAM based LLC architecture has been proposed in the hope of exploiting the high density and low leakage power of STT-RAM and the low write overhead of SRAM. Such a hybrid cache design relies on an intelligent block placement policy that makes good use of the characteristics of both STT-RAM and SRAM technology. In this paper, we propose an adaptive block placement and migration policy (APM) for hybrid caches. LLC write accesses are categorized into three classes: prefetch-write, demand-write, and core-write. Our proposed technique places a block into either STT-RAM lines or SRAM lines by adapting to the access pattern of each class. An access pattern predictor is proposed to direct block placement and migration, which can benefit from the high density and low leakage power of STT-RAM lines as well as the low write overhead of SRAM lines. Our evaluation shows that the technique can improve performance and reduce LLC power consumption compared to both an SRAM-based LLC and an STT-RAM-based LLC with the same area footprint. It outperforms the SRAM-based LLC on average by 8.0% for single-thread workloads and 20.5% for multi-core workloads. The technique reduces power consumption in the LLC by 18.9% and 19.3% for single-thread and multi-core workloads, respectively.
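The dispatch at the heart of such a policy is simple to outline. The three write classes below are the paper's; the specific routing rules and the stand-in pattern predictor are simplified illustrations of the adaptive policy, not its actual logic.

/* Sketch of access-class-based block placement in a hybrid
   STT-RAM/SRAM LLC. The three write classes follow the paper; the
   routing rules and the pattern predictor are simplified stand-ins. */
#include <stdio.h>

typedef enum { PREFETCH_WRITE, DEMAND_WRITE, CORE_WRITE } write_class_t;
typedef enum { STT_RAM, SRAM } region_t;

/* Stand-in for an access pattern predictor: returns nonzero if
   blocks brought in by this PC tend to be written repeatedly. */
static int predicted_write_intensive(unsigned pc) { return (pc & 0x10) != 0; }

static region_t place_block(write_class_t cls, unsigned pc) {
    switch (cls) {
    case CORE_WRITE:
        /* Writebacks from the core tend to arrive in write bursts:
           keep them in SRAM lines to avoid STT-RAM write penalties. */
        return SRAM;
    case PREFETCH_WRITE:
    case DEMAND_WRITE:
        /* Fills: steer write-intensive blocks to SRAM, the rest to
           the dense, low-leakage STT-RAM lines. */
        return predicted_write_intensive(pc) ? SRAM : STT_RAM;
    }
    return STT_RAM;
}

int main(void) {
    printf("core-write   @0x100 -> %s\n", place_block(CORE_WRITE, 0x100) == SRAM ? "SRAM" : "STT-RAM");
    printf("demand-write @0x110 -> %s\n", place_block(DEMAND_WRITE, 0x110) == SRAM ? "SRAM" : "STT-RAM");
    printf("demand-write @0x100 -> %s\n", place_block(DEMAND_WRITE, 0x100) == SRAM ? "SRAM" : "STT-RAM");
    return 0;
}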


international conference on parallel architectures and compilation techniques | 2007

A Flexible Heterogeneous Multi-Core Architecture

Miquel Pericàs; Adrián Cristal; Francisco J. Cazorla; Ruben Gonzalez; Daniel A. Jiménez; Mateo Valero

Multi-core processors naturally exploit thread-level parallelism (TLP). However, extracting instruction-level parallelism (ILP) from individual applications or threads is still a challenge, as application mixes in this environment are nonuniform. Thus, multi-core processors should be flexible enough to provide high throughput for uniform parallel applications as well as high performance for more general workloads. Heterogeneous architectures are a first step in this direction, but partitioning remains static and only roughly fits application requirements. This paper proposes the Flexible Heterogeneous MultiCore processor (FMC), the first dynamic heterogeneous multi-core architecture capable of reconfiguring itself to fit application requirements without programmer intervention. The basic building block of this microarchitecture is a scalable, variable-size window microarchitecture that exploits the concept of Execution Locality to provide large-window capabilities. This allows it to overcome the memory wall for applications with high memory-level parallelism (MLP). The microarchitecture contains a set of small and fast cache processors that execute high-locality code and a network of small in-order memory engines that together exploit low-locality code. Single-threaded applications can use the entire network of cores while multi-threaded applications can efficiently share the resources. The sizing of critical structures remains small enough to handle current power envelopes. In single-threaded mode this processor is able to outperform previous state-of-the-art high-performance processor research by 12% on SPECfp. We show how, in a quad-threaded/quad-core environment, the processor outperforms a statically allocated configuration in both throughput and harmonic mean, two commonly used metrics to evaluate SMT performance, by around 2-4%. This is achieved while using a very simple sharing algorithm.

Collaboration


Dive into Daniel A. Jiménez's collaborations.

Top Co-Authors

Calvin Lin (University of Texas at Austin)

Yingying Tian (University of Texas at San Antonio)

Mateo Valero (Polytechnic University of Catalonia)

Miquel Pericàs (Polytechnic University of Catalonia)