Publications


Featured research published by Daniel Ortega.


International Symposium on High-Performance Computer Architecture | 2004

Out-of-order commit processors

Adrian Cristal; Daniel Ortega; Josep Llosa; Mateo Valero

Modern out-of-order processors tolerate long-latency memory operations by supporting a large number of in-flight instructions. This is particularly useful in numerical applications, where branch speculation is normally not a problem and where the cache hierarchy cannot deliver the data soon enough. Supporting more in-flight instructions requires up-sizing several resources, such as the reorder buffer (ROB), the general-purpose instruction queues, the load/store queue, and the number of physical registers. However, scaling up the number of entries in these resources is impractical because of area, cycle time, and power consumption constraints. We propose to increase the capacity of future processors by augmenting the number of in-flight instructions: instead of simply up-sizing resources, we introduce novel microarchitectural structures that achieve the same performance benefits with a much lower need for resources. Our main contribution is a new checkpointing mechanism that is capable of keeping thousands of in-flight instructions at a practically constant cost. We also propose a queuing mechanism that takes advantage of the differences in waiting time among the instructions in flight. Using these two mechanisms, our processor suffers a performance degradation of only 10% on SPEC2000fp relative to a conventional processor requiring more than an order of magnitude more entries in the ROB and instruction queues, and achieves about a 200% improvement over a current processor with a similar number of entries.
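
As a rough sketch of how checkpointing can replace per-instruction ROB tracking (an illustration under our own assumptions, not the authors' design): state is snapshotted every N retired instructions, so recovery storage grows with the number of checkpoints rather than with the number of in-flight instructions.

    #include <array>
    #include <cstddef>
    #include <cstdint>
    #include <deque>

    // Architectural state saved at a checkpoint (illustrative layout).
    struct ArchState {
        std::array<std::uint64_t, 32> regs{};  // architectural register values
        std::uint64_t pc = 0;                  // program counter at the checkpoint
    };

    class CheckpointManager {
        std::deque<ArchState> checkpoints_;  // oldest first
        const std::size_t interval_;         // instructions between checkpoints
        std::size_t since_last_ = 0;
    public:
        explicit CheckpointManager(std::size_t interval) : interval_(interval) {}

        // Called as instructions retire; takes a snapshot every interval_ instructions.
        void on_retire(const ArchState& s) {
            if (++since_last_ >= interval_) {
                checkpoints_.push_back(s);
                since_last_ = 0;
            }
        }

        // On a misspeculation or exception, discard the youngest snapshot and
        // resume execution from the most recent surviving checkpoint.
        ArchState rollback() {
            ArchState s = checkpoints_.back();
            checkpoints_.pop_back();
            return s;
        }
    };

The trade-off, in general, is that recovery re-executes the instructions issued since the last checkpoint instead of being exact at instruction granularity.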


ACM SIGARCH Computer Architecture News | 2009

How to simulate 1000 cores

Matteo Monchiero; Jung Ho Ahn; Ayose Falcón; Daniel Ortega; Paolo Faraboschi

This paper proposes a novel methodology to efficiently simulate shared-memory multiprocessors composed of hundreds of cores. The basic idea is to use thread-level parallelism in the software system and translate it into core-level parallelism in the simulated world. To achieve this, we first augment an existing full-system simulator to identify and separate the instruction streams belonging to the different software threads. The simulator then dynamically maps each instruction flow to the corresponding core of the target multi-core architecture, taking into account the inherent thread synchronization of the running applications. Our simulator allows a user to execute any multithreaded application in a conventional full-system simulator and evaluate the application's performance on many-core hardware. We carried out extensive simulations on the SPLASH-2 benchmark suite and demonstrated scalability up to 1024 cores with limited simulation-speed degradation versus the single-core case on a fixed workload. The results also show that the proposed technique captures the intrinsic behavior of the SPLASH-2 suite, even when we scale the number of shared-memory cores beyond one thousand.
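
A minimal sketch of this thread-to-core demultiplexing, with invented names (Instr, CoreModel, ManyCoreSim) and a deliberately trivial one-instruction-per-cycle timing model; the real simulator's interfaces are certainly richer.

    #include <cstddef>
    #include <cstdint>
    #include <unordered_map>
    #include <vector>

    struct Instr { std::uint64_t pc; int thread_id; };  // one decoded instruction

    struct CoreModel {
        std::uint64_t cycles = 0;
        void consume(const Instr&) { cycles += 1; }  // toy 1-IPC timing model
    };

    // Demultiplexes a full-system instruction stream by software thread and
    // feeds each thread's instructions to its own simulated core.
    class ManyCoreSim {
        std::unordered_map<int, std::size_t> thread_to_core_;
        std::vector<CoreModel> cores_;
    public:
        explicit ManyCoreSim(std::size_t n_cores) : cores_(n_cores) {}

        void dispatch(const Instr& i) {
            // The first time a thread appears, bind it to a core round-robin.
            auto it = thread_to_core_
                          .try_emplace(i.thread_id,
                                       thread_to_core_.size() % cores_.size())
                          .first;
            cores_[it->second].consume(i);  // thread-level -> core-level parallelism
        }
    };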


International Symposium on Computer Architecture | 2004

A Content Aware Integer Register File Organization

Ruben Gonzalez; Adrian Cristal; Daniel Ortega; Alexander V. Veidenbaum; Mateo Valero

A register file is a critical component of a modern superscalar processor. It has a large number of entries and read/write ports in order to enable high levels of instruction parallelism. As a result, the register file's area, access time, and energy consumption increase dramatically, significantly affecting the processor's overall performance and energy consumption. This is especially true in 64-bit processors. This paper presents a new integer register file organization that reduces the energy consumption, area, and access time of the register file with minimal effect on overall IPC. This is accomplished by exploiting a new concept, partial value locality, defined as the occurrence of multiple live value instances that are identical in a subset of their bits. A possible implementation of the new register file is described and used to derive the proposed optimized register file designs. Overall, the new register file achieves an energy reduction of over 50%, an 18% decrease in area, and a 15% reduction in access time. The energy and area savings come at the cost of a 1.7% reduction in IPC for integer applications and a negligible 0.3% for numerical applications, assuming the same clock frequency. A performance increase of up to 13% is possible if the clock frequency can be raised due to the reduced register file access time. This approach enables other, very promising optimizations, three of which are outlined in the paper.
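
A rough sketch of how partial value locality could be exploited, assuming a 32/32-bit split and a tiny shared table for upper halves (often sign extensions); the sizes and names are ours, not the paper's design.

    #include <cstddef>
    #include <cstdint>
    #include <optional>
    #include <vector>

    // A register entry keeps its low 32 bits privately and points into a
    // small shared table for the upper 32 bits.
    struct PackedReg {
        std::uint32_t low;
        int high_idx;  // index into the shared table, or -1 if stored full-width
    };

    class SharedHighTable {
        std::vector<std::uint32_t> entries_;
        static constexpr std::size_t kMaxEntries = 8;  // assumed tiny table
    public:
        // Returns the slot holding `high`, adding it if there is room.
        std::optional<int> find_or_add(std::uint32_t high) {
            for (std::size_t i = 0; i < entries_.size(); ++i)
                if (entries_[i] == high) return static_cast<int>(i);
            if (entries_.size() < kMaxEntries) {
                entries_.push_back(high);
                return static_cast<int>(entries_.size() - 1);
            }
            return std::nullopt;  // no sharing; caller falls back to a full-width entry
        }
    };

    PackedReg write_value(SharedHighTable& t, std::uint64_t v) {
        auto idx = t.find_or_add(static_cast<std::uint32_t>(v >> 32));
        return {static_cast<std::uint32_t>(v), idx.value_or(-1)};
    }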


International Conference on Parallel Architectures and Compilation Techniques | 2002

Cost-effective compiler directed memory prefetching and bypassing

Daniel Ortega; Eduard Ayguadé; Jean-Loup Baer; Mateo Valero

Ever-increasing memory latencies and deeper pipelines push memory farther from the processor. Prefetching techniques aim to bridge these gaps by fetching data in advance into both the L1 cache and the register file. Our main contribution in this paper is a hybrid approach to the prefetching problem that combines software and hardware prefetching in a cost-effective way, requiring very little hardware support and minimally impacting the design of the processor pipeline. The prefetcher is built on top of a static memory-instruction bypassing mechanism, which is in charge of bringing prefetched values into the register file. We also present a thorough analysis of the limits of both prefetching and memory-instruction bypassing. Finally, we compare our prefetching technique with a prior speculative proposal that attacked the same problem and show that, at much lower cost, our hybrid solution outperforms a realistic implementation of speculative prefetching and bypassing. On average, our hybrid implementation achieves a 13% speedup over a version with software prefetching on a subset of numerical applications, and an average of 43% over a version without software prefetching (up to 102% for specific benchmarks).
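
As a small concrete illustration of the software half of this approach, the sketch below issues non-binding cache prefetches a fixed distance ahead of use via the GCC/Clang __builtin_prefetch intrinsic. The prefetch distance is an assumed, tunable figure, and the register-file bypassing half of the paper's proposal has no portable C++ equivalent.

    #include <cstddef>

    double sum_with_prefetch(const double* a, std::size_t n) {
        const std::size_t dist = 64;  // assumed prefetch distance, tuned in practice
        double s = 0.0;
        for (std::size_t i = 0; i < n; ++i) {
            if (i + dist < n)
                __builtin_prefetch(&a[i + dist]);  // non-binding prefetch into the cache
            s += a[i];
        }
        return s;
    }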


IEEE International Conference on High Performance Computing, Data and Analytics | 2003

Kilo-instruction Processors

Adrian Cristal; Daniel Ortega; Josep Llosa; Mateo Valero

Due to the growing difference between processor speed and memory speed, memory has steadily moved further away from the processor in terms of cycles. Superscalar out-of-order processors cope with these increasing latencies by keeping more instructions in flight from which to extract ILP. With latencies of 500 cycles and more on the horizon, this trend will eventually lead to what we have called Kilo-Instruction Processors, which will have to handle thousands of in-flight instructions. Managing such a large number of in-flight instructions requires a microarchitectural change in the way the reorder buffer, the instruction queues, and the physical registers are handled, since simply up-sizing these resources is technologically infeasible. In this paper we present a survey of several techniques that try to solve the problems caused by thousands of in-flight instructions.
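
To see why such latencies imply thousands of in-flight instructions, here is a back-of-the-envelope Little's law estimate (ours, not the paper's) with assumed figures: in-flight instructions needed is roughly issue width times memory latency.

    #include <iostream>

    int main() {
        const int issue_width = 4;    // assumed 4-wide superscalar core
        const int mem_latency = 500;  // cycles, the figure cited above
        std::cout << issue_width * mem_latency
                  << " instructions in flight\n";  // prints 2000
    }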


ACM Symposium on Document Engineering | 2005

Document digitization lifecycle for complex magazine collection

Sherif Yacoub; John Burns; Paolo Faraboschi; Daniel Ortega; Jose Abad Peiro; Vinay Saxena

The conversion of large collections of documents from paper to digital formats suitable for electronic archival is a complex, multi-phase process. The creation of good-quality images from paper documents is just one phase. To extract the relevant information they contain, with an accuracy that fits the purpose of the target applications, an automated document analysis system and a manual verification/review process are needed. The automated system must perform a variety of analysis and recognition tasks in order to reach an accuracy level that minimizes the manual correction effort downstream. This paper describes the complete process and the associated technologies, tools, and systems needed to convert a large collection of complex documents and deploy online web access to its information-rich content. We used this process to recapture 80 years of Time magazine. The historical collection is scanned, automatically processed by advanced document analysis components to extract articles, manually verified for accuracy, and converted into a form suitable for web access. We discuss the major phases of the conversion lifecycle and the technology developed and tools used for each phase. We also discuss results in terms of recognition accuracy.
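
The phases described above amount to a linear pipeline; the skeleton below is purely illustrative, and the stage names are ours rather than the system's actual components.

    #include <string>
    #include <vector>

    struct Page { std::string image_path; std::string text; bool verified = false; };

    Page scan(const std::string& path) { return {path, "", false}; }  // image capture
    void analyze(Page& p) { p.text = "<extracted article text>"; }    // layout + OCR
    void verify(Page& p) { p.verified = true; }                       // manual review
    std::string publish(const Page& p) { return "<html>" + p.text + "</html>"; }

    std::vector<std::string> convert(const std::vector<std::string>& scans) {
        std::vector<std::string> out;
        for (const auto& s : scans) {
            Page p = scan(s);   // each page flows through every phase in order
            analyze(p);
            verify(p);
            out.push_back(publish(p));
        }
        return out;
    }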


International Conference on Supercomputing | 1999

Increasing effective IPC by exploiting distant parallelism

Iván Martel; Daniel Ortega; Eduard Ayguadé; Mateo Valero

The main objective of compiler and processor designers is to effectively exploit the instruction-level parallelism (ILP) available in applications. Although their research activities have mostly been conducted separately, we believe that stronger cooperation between them will be needed to make the increased potential ILP of future architectures effective. Nowadays, most computer architecture advances aim at overcoming the hurdle imposed by dependencies in the code by extracting parallelism from large instruction windows. However, implementation constraints limit the size of this window and therefore the visibility of the program structure at run time. In this paper we show the existence of distant parallelism that future compilers could detect. By distant parallelism we mean parallelism that cannot be captured by the processor instruction window and that can produce threads suitable for parallel execution in a multithreaded processor. Although this parallelism also exists in numerical applications (going far beyond classical loop parallelism and usually known as task parallelism), we focus on non-numerical applications, where the data and computation structures make the detection of concurrent threads of execution difficult. Some preliminary but encouraging results are presented in the paper, reporting speedups in the range of 1.2 to 2.65. These results seem promising and offer new insight into the detection of threads for current and future multithreaded architectures. It is important to note that the benefits described herein are totally orthogonal to any other architectural techniques targeting a single thread.
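
As a toy illustration of the concept (our example, not the paper's methodology): two independent program regions separated by millions of dynamic instructions can run as concurrent threads, parallelism that no realistic instruction window could reach across.

    #include <numeric>
    #include <thread>
    #include <vector>

    int main() {
        std::vector<int> a(1 << 20, 1), b(1 << 20, 2);
        long sa = 0, sb = 0;
        // The two accumulations are "distant": a window of a few hundred
        // entries would never see both regions at once, but threads can.
        std::thread t1([&] { sa = std::accumulate(a.begin(), a.end(), 0L); });
        std::thread t2([&] { sb = std::accumulate(b.begin(), b.end(), 0L); });
        t1.join(); t2.join();
        return static_cast<int>((sa + sb) & 0xff);
    }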


International Conference on Supercomputing | 2001

A novel renaming mechanism that boosts software prefetching

Daniel Ortega; Mateo Valero; Eduard Ayguadé

The detection and correct handling of data and control dependencies constitutes one of the biggest obstacles to exposing ILP in current architectures. Ever-increasing memory latencies and the growing working sets of programs are making prefetching techniques crucial for sustained high performance. Software prefetching allows the compiler to use information discovered at compile time to bring needed data in before it is used, thus hiding all or part of the latency to main memory. Renaming, on the other hand, is a technique that allows the hardware to break register naming dependencies, thus exposing more parallelism to the hardware. In this paper we present a new compiler-directed renaming mechanism focused on prefetch instructions. The compiler informs the hardware of the association between prefetch and load instructions, making it possible for the hardware to convert non-binding prefetches into binding prefetches without any of the compile-time limitations that the latter kind of prefetching may have. The mechanism can be implemented at a very low cost in terms of area, and we believe it will not impact cycle time. The research presented in this paper is at an early stage; nevertheless, our results for a set of numerical applications show speedups of 5% to 22%, and in no case was performance degradation observed.
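
A sketch of the association idea under assumed names: the compiler gives a prefetch and its consuming load a common tag, and a small hardware table remembers which physical register received the prefetched value, so the later load becomes a register read rather than a cache access.

    #include <cstdint>
    #include <unordered_map>

    struct AssocTable {
        std::unordered_map<std::uint16_t, std::uint8_t> id_to_phys_;  // tag -> phys reg

        // The prefetch writes its destination register under the compiler's tag.
        void on_prefetch(std::uint16_t id, std::uint8_t phys_reg) {
            id_to_phys_[id] = phys_reg;  // the value already resides here
        }

        // The load with the same tag reads the register directly; -1 means the
        // association was lost and the load falls back to a normal cache access.
        int on_load(std::uint16_t id) {
            auto it = id_to_phys_.find(id);
            return it == id_to_phys_.end() ? -1 : it->second;
        }
    };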


IEEE International Symposium on Workload Characterization | 2009

High-speed network modeling for full system simulation

Diego Lugones; Daniel Franco; Dolores Rexachs; Juan C. Moure; Emilio Luque; Eduardo Argollo; Ayose Falcón; Daniel Ortega; Paolo Faraboschi

The widespread adoption of cluster computing systems has shifted the modeling focus from synthetic traffic to realistic workloads, in order to better capture the complex interactions between applications and architecture. In this context, a full-system simulation environment also needs to model the networking component, but the simulation duration that is practically affordable is too short to appropriately stress the networking bottlenecks. In this paper, we present a methodology that overcomes this problem and enables the modeling of interconnection networks while ensuring representative results with fast simulation turnaround. We use standard network tools to extract simplified models that are statistically validated and at the same time compatible with a full-system simulation environment. We propose three models with different accuracy-versus-speed trade-offs that compute network latency according to the estimated traffic, and we evaluate them on a real-world parallel scientific application.
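
For flavor, here is a one-formula analytical latency model of the general kind such methodologies produce; the M/M/1-style queueing term and the constants are our assumptions, not the paper's validated models.

    #include <algorithm>

    // Base (unloaded) latency inflated by a contention term that grows with the
    // estimated offered load and saturates as the link approaches capacity.
    double network_latency_ns(double base_ns, double load /* 0..1 */) {
        const double util = std::min(load, 0.99);  // clamp to keep the model finite
        return base_ns / (1.0 - util);             // simple M/M/1-style queueing term
    }
    // Example: a 500 ns base latency at 80% load gives a modeled 2500 ns.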


International Conference on Parallel Architectures and Compilation Techniques | 1999

Quantifying the Benefits of SPECint Distant Parallelism in Simultaneous Multi-Threading Architectures

Daniel Ortega; Iván Martel; Eduard Ayguadé; Mateo Valero; Venkata Krishnan

We exploit the existence of distant parallelism that future compilers could detect, and we characterize its performance under simultaneous multithreading architectures. By distant parallelism we mean parallelism that cannot be captured by the processor instruction window and that can produce threads suitable for parallel execution in a multithreaded processor. We show that distant parallelism can make wider-issue processors feasible by supplying more instructions from the distant threads, thus better exploiting the processor's resources when speeding up single integer applications. We also investigate the necessity of out-of-order execution in the presence of multiple threads of the same program. It is important to note that the benefits described are totally orthogonal to any other architectural techniques targeting a single thread.

Collaboration


Dive into Daniel Ortega's collaborations.

Top Co-Authors

Mateo Valero (Polytechnic University of Catalonia)
Eduard Ayguadé (Barcelona Supercomputing Center)
Adrian Cristal (Spanish National Research Council)
Josep Llosa (Polytechnic University of Catalonia)