
Hotspot


Dive into the research topics where Ayose Falcón is active.

Publication


Featured research published by Ayose Falcón.


ACM SIGARCH Computer Architecture News | 2009

How to simulate 1000 cores

Matteo Monchiero; Jung Ho Ahn; Ayose Falcón; Daniel Ortega; Paolo Faraboschi

This paper proposes a novel methodology to efficiently simulate shared-memory multiprocessors composed of hundreds of cores. The basic idea is to use thread-level parallelism in the software system and translate it into core-level parallelism in the simulated world. To achieve this, we first augment an existing full-system simulator to identify and separate the instruction streams belonging to the different software threads. Then, the simulator dynamically maps each instruction flow to the corresponding core of the target multi-core architecture, taking into account the inherent thread synchronization of the running applications. Our simulator allows a user to execute any multithreaded application in a conventional full-system simulator and evaluate the performance of the application on many-core hardware. We carried out extensive simulations on the SPLASH-2 benchmark suite and demonstrated scalability up to 1024 cores with limited simulation speed degradation vs. the single-core case on a fixed workload. The results also show that the proposed technique captures the intrinsic behavior of the SPLASH-2 suite, even when we scale up the number of shared-memory cores beyond the thousand-core limit.
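The core idea above can be sketched as demultiplexing a single interleaved instruction stream into per-core streams by software thread ID. This is a hypothetical simplification, not the paper's simulator code; the function name, round-robin mapping policy, and trace format are illustrative assumptions.

```python
from collections import defaultdict

def map_threads_to_cores(instruction_trace, num_cores):
    """Assign each software thread to a simulated core (round-robin here,
    for illustration) and split the interleaved full-system trace into
    per-core instruction streams."""
    thread_to_core = {}                 # software thread -> simulated core
    core_streams = defaultdict(list)    # simulated core -> its instructions
    next_core = 0
    for thread_id, instr in instruction_trace:
        if thread_id not in thread_to_core:
            thread_to_core[thread_id] = next_core % num_cores
            next_core += 1
        core_streams[thread_to_core[thread_id]].append(instr)
    return dict(core_streams)

# Example: an interleaved trace produced by three software threads
trace = [(0, "ld"), (1, "add"), (0, "st"), (2, "mul"), (1, "br")]
streams = map_threads_to_cores(trace, num_cores=4)
```

Each resulting stream can then be fed to a separate simulated core, turning the application's thread-level parallelism into core-level parallelism in the simulated machine.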


International Symposium on Computer Architecture | 2004

Prophet/Critic Hybrid Branch Prediction

Ayose Falcón; Jared Stark; Alex Ramirez; Konrad K. Lai; Mateo Valero

This paper introduces the prophet/critic hybrid conditional branch predictor, which has two component predictors that play the role of either prophet or critic. The prophet is a conventional predictor that uses branch history to predict the direction of the current branch. Further accesses of the prophet yield predictions for the branches following the current one. Predictions for the current branch and the ones that follow are collectively known as the branch's future. They are actually a prophecy, or predicted branch future. The critic uses both the branch's history and future to give a critique of the prophet's prediction for the current branch. The critique, either agree or disagree, is used to generate the final prediction for the branch. Our results show an 8K + 8K byte prophet/critic hybrid has 39% fewer mispredicts than a 16K byte 2Bc-gskew predictor (a predictor similar to that of the proposed Compaq Alpha EV8 processor) across a wide range of applications. The distance between pipeline flushes due to mispredicts increases from one flush per 418 micro-operations (uops) to one per 680 uops. For gcc, the percentage of mispredicted branches drops from 3.11% to 1.23%. On a machine based on the Intel Pentium 4 processor, this improves uPC (uops per cycle) by 7.8% (18% for gcc) and reduces the number of uops fetched (along both correct and incorrect paths) by 8.6%.
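The final-prediction step described above reduces to a simple rule: the critic's critique either confirms or overturns the prophet's direction. The sketch below is illustrative only (not the paper's hardware), and the function name is an assumption.

```python
def overall_prediction(prophet_taken, critic_agrees):
    """Combine the prophet's direction with the critic's critique:
    on 'agree' the prophet's prediction stands, on 'disagree' the
    overall prediction is the opposite direction."""
    return prophet_taken if critic_agrees else not prophet_taken

# The critic sees both the branch's history and its predicted future,
# so it can overturn cases where the history-only prophet is known to
# mislead.
final = overall_prediction(True, False)   # critic disagrees: overturned
```

The interesting part of the design is that the critic is trained on the outcomes of the prophet's predictions, so it learns specifically where the prophet goes wrong rather than predicting branches from scratch.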


IEEE International Conference on High Performance Computing, Data and Analytics | 2003

Tolerating Branch Predictor Latency on SMT

Ayose Falcón; Oliverio J. Santana; Alex Ramirez; Mateo Valero

Simultaneous multithreading (SMT) tolerates latency by executing instructions from multiple threads. If a thread is stalled, its resources can be used by other threads. However, fetch stall conditions caused by multi-cycle branch predictors prevent SMT from achieving its full potential performance, since the flow of fetched instructions is halted.


IEEE International Conference on High Performance Computing, Data and Analytics | 2002

A Comprehensive Analysis of Indirect Branch Prediction

Oliverio J. Santana; Ayose Falcón; Enrique Fernández; Pedro Medina; Alex Ramirez; Mateo Valero

Indirect branch prediction is a performance-limiting factor for current computer systems, preventing superscalar processors from exploiting the available ILP. Indirect branches are responsible for 55.7% of mispredictions in our benchmark set, although they account for only 15.5% of dynamic branches. Moreover, a 10.8% average IPC speedup is achievable by perfectly predicting all indirect branches. The Multi-Stage Cascaded Predictor (MSCP) is a mechanism proposed for improving indirect branch prediction. In this paper, we show that an MSCP can replace a BTB and accurately predict the target address of both indirect and non-indirect branches. We do a detailed analysis of MSCP behaviour and evaluate it in a realistic setup, showing that a 5.7% average IPC speedup is achievable.
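A cascaded predictor of this kind can be sketched as a series of tables indexed with progressively longer history, where the longest-history stage that hits overrides the shorter ones and stage 0 behaves like a plain BTB. The data structures and history encoding below are hypothetical simplifications, not the MSCP's actual organization.

```python
def mscp_predict(stages, pc, history):
    """stages: list of (hist_len, table) pairs ordered from shortest to
    longest history; each table maps (pc, history_suffix) to a predicted
    target address. The last (longest-history) stage that hits wins."""
    target = None
    for hist_len, table in stages:
        key = (pc, history[-hist_len:] if hist_len else "")
        if key in table:
            target = table[key]   # longer-history stages override shorter
    return target

stages = [
    (0, {(0x40, ""): 0x100}),        # stage 0: BTB-like, no history
    (4, {(0x40, "1011"): 0x200}),    # stage 1: 4 bits of branch history
]
```

With the recent history ending in `1011`, the long-history stage hits and its target wins; with any other history the lookup falls back to the BTB-like stage, which is how the paper argues an MSCP can replace the BTB outright.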


International Symposium on Microarchitecture | 2005

Better branch prediction through prophet/critic hybrids

Ayose Falcón; Jared Stark; Alex Ramirez; Konrad K. Lai; Mateo Valero

The prophet/critic hybrid conditional branch predictor has two component predictors. The prophet uses a branch's history to predict its direction. We call this prediction and the ones for the branches following it the branch's future. The critic uses the branch's history and future to critique the prophet's prediction. The hybrid combines the prophet's prediction with the critique, either agree or disagree, to form the branch's overall prediction. Results show these hybrids can reduce mispredicts by 39 percent and improve processor performance by 7.8 percent.


IEEE International Symposium on Workload Characterization | 2009

High-speed network modeling for full system simulation

Diego Lugones; Daniel Franco; Dolores Rexachs; Juan C. Moure; Emilio Luque; Eduardo Argollo; Ayose Falcón; Daniel Ortega; Paolo Faraboschi

The widespread adoption of cluster computing systems has shifted the modeling focus from synthetic traffic to realistic workloads to better capture the complex interactions between applications and architecture. In this context, a full-system simulation environment also needs to model the networking component, but the simulation duration that is practically affordable is too short to appropriately stress the networking bottlenecks. In this paper, we present a methodology that overcomes this problem and enables the modeling of interconnection networks while ensuring representative results with fast simulation turnaround. We use standard network tools to extract simplified models that are statistically validated and at the same time compatible with a full system simulation environment. We propose three models with different accuracy vs. speed ratios that compute network latency times according to the estimated traffic and measure them on a real-world parallel scientific application.
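A simplified network model of the kind described above can be as small as a closed-form latency estimate driven by measured traffic. The M/M/1-style formula below is an illustrative assumption chosen for the sketch, not one of the paper's three validated models, and all names are hypothetical.

```python
def link_latency(base_latency, offered_load, capacity):
    """Estimate link latency from offered traffic: latency inflates as
    the estimated load approaches link capacity (queueing effect)."""
    utilization = min(offered_load / capacity, 0.99)  # clamp near saturation
    return base_latency / (1.0 - utilization)

# An idle link costs the base latency; a half-loaded link doubles it.
idle = link_latency(1.0, 0.0, 10.0)
half = link_latency(1.0, 5.0, 10.0)
```

The appeal of such a model in full-system simulation is exactly the trade-off the abstract names: it is statistically plausible for realistic workloads while costing almost nothing per simulated message, keeping simulation turnaround fast.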


International Conference on Parallel Architectures and Compilation Techniques | 2006

Branch predictor guided instruction decoding

Oliverio J. Santana; Ayose Falcón; Alex Ramirez; Mateo Valero

Fast instruction decoding is a challenge for the design of CISC microprocessors. A well-known solution to overcome this problem is using a trace cache. It stores and fetches already decoded instructions, avoiding the need to decode them again. However, implementing a trace cache involves a significant increase in fetch architecture complexity. In this paper, we propose a novel decoding architecture that reduces the fetch engine implementation cost. Instead of using a special-purpose buffer like the trace cache, our proposal stores frequently decoded instructions in the memory hierarchy. The address where the decoded instructions are stored is kept in the branch prediction mechanism, enabling it to guide our decoding architecture. This makes it possible for the processor front-end to fetch already decoded instructions from memory instead of the original non-decoded instructions. Our results show that an 8-wide superscalar processor achieves an average 14% performance improvement by using our decoding architecture. This improvement is comparable to the one achieved by using the more complex trace cache, while requiring 16% less chip area and 21% less energy consumption in the fetch architecture.
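The guiding mechanism can be sketched as a branch-predictor entry that carries, alongside the normal target, an optional address of a pre-decoded copy in the memory hierarchy; the front-end prefers that copy when it exists. The dictionary layout and function below are hypothetical simplifications for illustration, not the paper's hardware structures.

```python
def next_fetch(btb_entry, fall_through):
    """Return (fetch_address, already_decoded) for the next fetch.
    If the predictor entry records where a decoded copy lives, fetch
    it directly and bypass the decoder."""
    if btb_entry is None:
        return fall_through, False              # no prediction: raw fetch
    if btb_entry.get("decoded_addr") is not None:
        return btb_entry["decoded_addr"], True  # fetch decoded form
    return btb_entry["target"], False           # predicted, but not decoded

entry = {"target": 0x4000, "decoded_addr": 0x9000}
```

The design choice mirrored here is the paper's key point: reusing the ordinary memory hierarchy plus a field in the predictor avoids the dedicated trace-cache array, which is where the area and energy savings come from.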


International Parallel and Distributed Processing Symposium | 2005

Effective instruction prefetching via fetch prestaging

Ayose Falcón; Alex Ramirez; Mateo Valero

As the technology process shrinks and clock rates increase, instruction caches can no longer be accessed in one cycle. The alternatives are implementing smaller caches (with a higher miss rate) or large caches with a pipelined access (with a higher branch misprediction penalty). In both cases, the performance obtained is far from that obtained by an ideal large cache with one-cycle access. In this paper we present cache line guided prestaging (CLGP), a novel mechanism that overcomes the limitations of current instruction cache implementations. CLGP employs prefetching to load future cache lines into a set of fast prestage buffers. These buffers are managed efficiently by the CLGP algorithm, which tries to serve fetches from them as much as possible. Therefore, the number of fetches served by the main instruction cache is greatly reduced, and so is the negative impact of its access latency on overall performance. With the best CLGP configuration using a 4 KB I-cache, speedups of 3.5% (at 0.09 µm) and 12.5% (at 0.045 µm) are obtained over an equivalent fetch directed prefetching configuration, and 39% (at 0.09 µm) and 48% (at 0.045 µm) over using a pipelined instruction cache without prefetching. Moreover, our results show that CLGP with 2.5 KB of total cache budget can obtain performance similar to using a 64 KB pipelined I-cache without prefetching, that is, equivalent performance at 6.4X our hardware budget.
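The prestage-buffer idea can be sketched as a small fast FIFO of prefetched lines consulted before the slow pipelined I-cache. The class below is a toy model with assumed latencies (1 cycle for a buffer hit, 3 for the I-cache) and a FIFO replacement policy; the real CLGP management algorithm is more sophisticated.

```python
class PrestageBuffers:
    """Toy model of CLGP-style prestage buffers (illustrative only)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = []                # FIFO of prestaged line addresses

    def prestage(self, line_addr):
        """Prefetch a predicted-upcoming cache line into the buffers."""
        if line_addr in self.lines:
            return
        if len(self.lines) >= self.capacity:
            self.lines.pop(0)          # evict the oldest prestaged line
        self.lines.append(line_addr)

    def fetch(self, line_addr):
        """Return (latency_cycles, source): a buffer hit is fast, a miss
        falls back to the multi-cycle pipelined instruction cache."""
        if line_addr in self.lines:
            return 1, "prestage"
        return 3, "icache"
```

The more fetches the buffers capture, the less the I-cache's multi-cycle latency shows up on the critical fetch path, which is the effect the speedup numbers above quantify.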


IEEE International Conference on High Performance Computing, Data and Analytics | 2004

A latency-conscious SMT branch prediction architecture

Ayose Falcón; Oliverio J. Santana; Alex Ramirez; Mateo Valero

Executing multiple threads has proved to be an effective solution for partially hiding the latencies that appear in a processor. When a thread is stalled because a long-latency operation is being processed, such as a memory access or a floating-point calculation, the processor can switch to another context so that another thread can take advantage of the idle resources. However, fetch stall conditions caused by branch predictor delay are not hidden by current simultaneous multithreading (SMT) fetch designs, causing a performance drop due to the absence of instructions to execute. In this paper, we propose several solutions to reduce the effect of branch predictor delay on the performance of SMT processors. First, we analyse the impact of varying the number of access ports. Second, we describe a decoupled implementation of an SMT fetch unit that helps to tolerate the predictor delay. Finally, we present an interthread pipelined branch predictor, based on creating a pipeline of interleaved predictions from different threads. Our results show that, combining all the proposed techniques, the performance obtained is similar to that obtained using an ideal, 1-cycle-access branch predictor.
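The interthread pipelining idea can be sketched as starting one predictor access per cycle for a different thread in round-robin order, so that once the predictor's pipeline fills, one completed prediction becomes available every cycle. The scheduling function below is an illustrative assumption, not the paper's design.

```python
def schedule_predictions(num_threads, cycles, latency):
    """For a predictor whose access takes `latency` cycles, start one
    access per cycle round-robin across threads. Returns, per cycle,
    the thread whose prediction completes (None while the pipeline
    is still filling)."""
    completed = []
    for cycle in range(cycles):
        start_cycle = cycle - (latency - 1)   # when this result was started
        if start_cycle < 0:
            completed.append(None)            # pipeline not yet full
        else:
            completed.append(start_cycle % num_threads)
    return completed
```

With four threads and a 3-cycle predictor, no single thread needs a prediction every cycle, so interleaving hides the multi-cycle access almost entirely, approaching the ideal 1-cycle predictor the abstract compares against.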


IEEE International Conference on High Performance Computing, Data and Analytics | 2009

Studying New Ways for Improving Adaptive History Length Branch Predictors

Ayose Falcón; Oliverio J. Santana; Pedro Medina; Enrique Fernández; Alex Ramirez; Mateo Valero

Pipeline stalls due to branches limit processor performance significantly. This paper provides an in-depth evaluation of Dynamic History Length Fitting (DHLF), a technique that changes the history length of a two-level branch predictor during execution, trying to adapt to the program's different phases. We analyse the behaviour of DHLF compared with fixed-history-length gshare predictors, and identify two factors that explain DHLF behaviour: opportunity cost and warm-up cost. Additionally, we evaluate the use of profiling for detecting future improvements. Using this information, we show that new heuristics that minimise both opportunity cost and warm-up cost could significantly outperform current variable history length techniques. Especially at program start-up, where the algorithm tries to learn the behaviour of the program to better predict future branches, profiling considerably reduces the cost produced by continuous history length changes.
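The variable-history mechanism underlying DHLF can be sketched as a gshare-style index whose history length is a runtime parameter. The function below is a minimal illustration with assumed table sizes, not the paper's configuration; it also shows why each length change pays a warm-up cost, since branches re-map to different table entries.

```python
def gshare_index(pc, history, hist_len, table_bits=12):
    """XOR the branch PC with the `hist_len` most recent global-history
    bits to index the pattern history table. hist_len is the knob a
    DHLF-style scheme adjusts between program phases."""
    hist = history & ((1 << hist_len) - 1)   # keep only hist_len bits
    return (pc ^ hist) & ((1 << table_bits) - 1)

# The same branch maps to different entries under different history
# lengths, so after a length change the two-bit counters must retrain.
idx_short = gshare_index(0b1010, 0b1111, hist_len=2)
idx_none = gshare_index(0b1010, 0b1111, hist_len=0)
```

An adaptive scheme then monitors mispredictions per interval and moves `hist_len` toward whichever length mispredicted less, trading this warm-up cost against the opportunity cost of a poorly fitted fixed length.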

Collaboration


Dive into Ayose Falcón's collaborations.

Top Co-Authors


Mateo Valero

Polytechnic University of Catalonia

View shared research outputs
Top Co-Authors


Alex Ramirez

Polytechnic University of Catalonia

View shared research outputs
Top Co-Authors


Oliverio J. Santana

University of Las Palmas de Gran Canaria

View shared research outputs