Jesus Corbal

Polytechnic University of Catalonia

Publication

Featured research published by Jesus Corbal.

IEEE Transactions on Computers | 2005

Fuzzy memoization for floating-point multimedia applications

Carlos Álvarez; Jesus Corbal; Mateo Valero

Instruction memoization is a promising technique to reduce the power consumption and increase the performance of future low-end/mobile multimedia systems. Power and performance efficiency can be improved by reusing instances of an already executed operation. Unfortunately, this technique may not always be worth the effort due to the power consumption and area impact of the tables required to achieve an adequate level of reuse. In this paper, we introduce and evaluate a novel way of understanding multimedia floating-point operations based on the fuzzy computation paradigm: performance and power consumption can be improved at the cost of small precision losses in computation. By exploiting this implicit characteristic of multimedia applications, we propose a new technique called tolerant memoization. This technique expands the capabilities of classic memoization by associating entries with similar inputs to the same output. We evaluate this new technique by measuring the effect of tolerant memoization for floating-point operations in a low-power multimedia processor and discuss the trade-offs between performance and quality of the media outputs. We report energy improvements of 12 percent for a set of key multimedia applications with a small LUT of 6 Kbytes, compared to 3 percent obtained using previously proposed techniques.
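The idea of mapping similar inputs to the same table entry can be illustrated with a minimal Python sketch. It is not the hardware design from the paper: the class name `TolerantMemo`, the choice of clearing low-order float32 mantissa bits to form the lookup key, the FIFO eviction policy, and the default parameters are all assumptions made for illustration.

```python
import struct


def truncate_mantissa(x: float, dropped_bits: int) -> int:
    """Return the float32 bit pattern of x with the low mantissa bits cleared."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    mask = (0xFFFFFFFF << dropped_bits) & 0xFFFFFFFF
    return bits & mask


class TolerantMemo:
    """Hypothetical lookup table that reuses results for nearby operands."""

    def __init__(self, op, dropped_bits=12, capacity=256):
        self.op = op                      # the two-operand operation to memoize
        self.dropped_bits = dropped_bits  # assumed tolerance knob
        self.capacity = capacity          # assumed table size, in entries
        self.table = {}
        self.hits = 0
        self.misses = 0

    def __call__(self, a: float, b: float) -> float:
        # Operands that differ only in the dropped bits share one key.
        key = (truncate_mantissa(a, self.dropped_bits),
               truncate_mantissa(b, self.dropped_bits))
        if key in self.table:
            self.hits += 1
            return self.table[key]
        self.misses += 1
        result = self.op(a, b)
        if len(self.table) >= self.capacity:
            # dicts keep insertion order, so this evicts the oldest entry (FIFO).
            self.table.pop(next(iter(self.table)))
        self.table[key] = result
        return result
```

With `dropped_bits=0` this degenerates to classic exact-match memoization; increasing it trades precision for a higher hit rate, which is the trade-off the abstract describes.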

International Symposium on Microarchitecture | 1999

Exploiting a new level of DLP in multimedia applications

Jesus Corbal; Mateo Valero; Roger Espasa

This paper proposes and evaluates MOM, a novel ISA paradigm targeted at multimedia applications. By fusing conventional vector ISA approaches with more recent SIMD-like (Single Instruction Multiple Data) ISAs (such as MMX), we have developed a new matrix-oriented ISA which efficiently deals with the small matrix structures typically found in multimedia applications. MOM exploits a level of DLP that neither conventional vector ISAs nor SIMD-like media ISA extensions can reach. Our results show that MOM provides a 1.3x to 4x performance improvement over two different multimedia extensions (MMX and MDMX) on several kernels, which translates into a performance gain of up to 50% on full applications (20% on average). Furthermore, the streaming nature of MOM provides additional advantages for executing multimedia applications, such as very low fetch pressure and high tolerance to memory latency, making MOM an ideal candidate for the embedded domain.

International Conference on Parallel Architectures and Compilation Techniques | 1998

Command vector memory systems: high performance at low cost

Jesus Corbal; Roger Espasa; Mateo Valero

The focus of this paper is on designing a low-cost, high-performance, high-bandwidth vector memory system that takes advantage of modern commodity SDRAM memory chips. To successfully extract the full bandwidth from SDRAM parts, we propose a new memory system organization based on sending commands to the memory system as opposed to sending individual addresses. A command specifies, in a few bytes, a request for multiple independent memory words. A command is similar to a burst found in DRAM memories, but does not require the memory words to be consecutive. The command is sent to all sections of the memory array simultaneously, thus not requiring a crossbar in the proper sense. Our simulations show that this command-based memory system can improve performance over a traditional SDRAM-based memory system by factors ranging from 1.15 to 1.54. Moreover, in many cases, the command memory system outperforms even the best SRAM memory system under consideration. Overall, the command-based memory system achieves similar or better results than a 10 ns SRAM memory system (a) using fewer banks and (b) using memory devices that are 15 to 60 times cheaper.

International Conference on Supercomputing | 1999

Adding a vector unit to a superscalar processor

Francisca Quintana; Jesus Corbal; Roger Espasa; Mateo Valero

The focus of this paper is on adding a vector unit to a superscalar core, as a way to scale current state-of-the-art superscalar processors. The proposed architecture has a vector register file that shares functional units both with the integer datapath and with the floating-point datapath. A key point in our proposal is the design of a high-performance cache interface that delivers high bandwidth to the vector unit at a low cost and low latency. We propose a double-banked cache with alignment circuitry to serve vector accesses and we study two cache hierarchies: one feeds the vector unit from the L1; the other from the L2. Our results show that large IPC values (higher than 10 in some cases) can be achieved. Moreover, the scalability of our architecture simply requires addition of functional units, without requiring more issue bandwidth. As a consequence, the proposed vector unit achieves high performance for numerical and multimedia codes with minimal impact on the cycle time of the processor or on the performance of integer codes.

International Symposium on Microarchitecture | 2003

Design and implementation of high-performance memory systems for future packet buffers

Jorge García; Jesus Corbal; Llorenç Cerdà; Mateo Valero

In this paper, we address the design of a future high-speed router that supports line rates as high as OC-3072 (160 Gb/s), around one hundred ports and several service classes. Building such a high-speed router would raise many technological problems, one of them being the packet buffer design, mainly because in router design it is important to provide worst-case bandwidth guarantees and not just average-case optimizations. A previous packet buffer design provides worst-case bandwidth guarantees by using a hybrid SRAM/DRAM approach. Next-generation routers need to support hundreds of interfaces (i.e., ports and service classes). Unfortunately, high bandwidth for hundreds of interfaces requires the previous design to use large SRAMs which become a bandwidth bottleneck. The key observation we make is that the SRAM size is proportional to the DRAM access time, but we can reduce the effective DRAM access time by overlapping multiple accesses to different banks, allowing us to reduce the SRAM size. The key challenge is that, to keep the worst-case bandwidth guarantees, we need to guarantee that there are no bank conflicts while the accesses are in flight. We guarantee the absence of bank conflicts by reordering the DRAM requests using a modern issue-queue-like mechanism. Because our design may lead to fragmentation of memory across packet buffer queues, we propose to share the DRAM space among multiple queues by renaming the queue slots. To the best of our knowledge, the design proposed in this paper is the fastest buffer design using commodity DRAM to be published to date.
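The reordering step can be pictured with a small Python sketch. It is only a toy model under stated assumptions, not the mechanism evaluated in the paper: the function name `schedule`, the one-request-per-cycle issue rate, and the fixed `busy_cycles` bank occupancy are hypothetical, and the sketch makes no attempt to prove worst-case guarantees.

```python
def schedule(requests, num_banks, busy_cycles):
    """Issue at most one DRAM request per cycle, skipping requests whose bank is busy.

    requests    -- bank id of each request, in arrival order
    num_banks   -- number of DRAM banks (assumed parameter)
    busy_cycles -- cycles a bank stays occupied after an access (assumed parameter)
    Returns a list of (cycle, request_index) pairs in issue order.
    """
    pending = list(enumerate(requests))  # (arrival index, bank id)
    free_at = [0] * num_banks            # first cycle at which each bank is free
    issued = []
    cycle = 0
    while pending:
        # Scan from oldest to youngest, like an issue queue picking a ready entry.
        for pos, (idx, bank) in enumerate(pending):
            if free_at[bank] <= cycle:
                free_at[bank] = cycle + busy_cycles
                issued.append((cycle, idx))
                del pending[pos]
                break
        cycle += 1
    return issued
```

For the arrival order `[0, 0, 1, 2]` with a three-cycle bank occupancy, the second request to bank 0 is deferred while the requests to banks 1 and 2 issue in its place, so the accesses overlap instead of stalling behind the busy bank.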

IEEE Transactions on Computers | 2006

A DRAM/SRAM memory scheme for fast packet buffers

Jorge García-Vidal; Maribel March; Llorenç Cerdà; Jesus Corbal; Mateo Valero

We address the design of high-speed packet buffers for Internet routers. We use a general DRAM/SRAM architecture for which previous proposals can be seen as particular cases. For this architecture, large SRAMs are needed to sustain high line rates and a large number of interfaces. A novel algorithm for DRAM bank allocation is presented that reduces the SRAM size requirements of previously proposed schemes by almost an order of magnitude, without having memory fragmentation problems. A technological evaluation shows that our design can support thousands of queues for line rates up to 160 Gbps.

International Symposium on Microarchitecture | 2002

Three-dimensional memory vectorization for high bandwidth media memory systems

Jesus Corbal; Roger Espasa; Mateo Valero

Vector processors have good performance, cost and adaptability when targeting multimedia applications. However, for a significant number of media programs, conventional memory configurations fail to deliver enough memory references per cycle to feed the SIMD functional units. This paper addresses the problem of memory bandwidth. We propose a novel mechanism suitable for 2-dimensional vector architectures and targeted at providing high effective bandwidth for SIMD memory instructions. The basis of this mechanism is the extension of the scope of vectorization at the memory level, so that 3-dimensional memory patterns can be fetched into a second-level register file. By fetching long blocks of data and by reusing 2-dimensional memory streams at this second-level register file, we obtain a significant increase in the effective memory bandwidth. As side benefits, the new 3-dimensional load instructions provide high robustness to memory latency and a significant reduction in cache activity, thus reducing power and energy requirements. At the cost of 50% more area than a regular SIMD register file, we have measured an average speed-up of 13% and potential power savings of 30% in the L2 cache.

International Conference on Supercomputing | 2001

On the potential of tolerant region reuse for multimedia applications

Carlos Álvarez; Jesus Corbal; Esther Salamí; Mateo Valero

Recent years have shown an interesting evolution in the mid-end to low-end embedded domain. Portable systems are growing in importance as they improve in storage capacity and in interaction capabilities with general-purpose systems. Furthermore, media processing is changing the way embedded processors are designed, keeping in mind the emergence of new application domains such as those for PDA systems or for the third generation of mobile digital phones (UMTS). The performance requirements of this new kind of device are not those of the general-purpose domain, where traditionally the primary goal is the highest performance. Embedded systems must face ever-increasing real-time requirements as well as power consumption constraints. Under this special scenario, instruction/region reuse arises as a promising way of increasing the performance of media embedded processors and, at the same time, reducing the power consumption. Furthermore, media and signal processing applications are a suitable target for instruction/region reuse, given the large amount of redundancy found in media data working sets. In this paper we propose a novel region reuse mechanism that takes advantage of the tolerance of media algorithms to losses in the precision of computation. By identifying regions of code where an input data set is processed into an output data set, we can reuse computational instances using the result of previous ones with a similar input data set (hence the term tolerant reuse). We will show that conventional region reuse is barely able to provide more than an 8% reduction in executed instructions (even with significantly big tables) in a typical JPEG encoder application. On the other hand, when applying the concept of tolerance, we are able to provide a reduction of more than 25% in the number of executed instructions with tables smaller than 1 KB (with only small degradations in the quality of the output image), and a reduction of up to 40% (with no visually perceptible differences) with bigger tables.

International Conference on Parallel Architectures and Compilation Techniques | 2001

On the efficiency of reductions in μ-SIMD media extensions

Jesus Corbal; Roger Espasa; Mateo Valero

Many important multimedia applications contain a significant fraction of reduction operations. Although, in general, multimedia applications are characterized by high amounts of Data Level Parallelism, reductions and accumulations are difficult to parallelize and show a poor tolerance to increases in the latency of the instructions. This is especially significant for μ-SIMD extensions such as MMX or AltiVec. To overcome the problem of reductions in μ-SIMD ISAs, designers tend to include more and more complex instructions able to deal with the most common forms of reductions in multimedia. As the number of processor pipeline stages grows, the number of cycles needed to execute these multimedia instructions increases with every processor generation, severely compromising performance. The paper presents an in-depth discussion of how reductions/accumulations are performed in current μ-SIMD architectures and evaluates the performance trade-offs for near-future highly aggressive superscalar processors with three different styles of μ-SIMD extensions. We compare an MMX-like alternative to an MDMX-like extension that has packed accumulators to attack the reduction problem, and we also compare it to MOM, a matrix register ISA. We show that while packed accumulators present several advantages, they introduce artificial recurrences that severely degrade performance for processors with a high number of registers and long-latency operations. On the other hand, the paper demonstrates that longer SIMD media extensions such as MOM can take great advantage of accumulators by exploiting the associative parallelism implicit in reductions.
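The associative parallelism mentioned above can be shown in a few lines of Python. This is a generic sketch of splitting a reduction across independent partial accumulators, not the MOM or MDMX instruction semantics; the function name `reduce_with_lanes` and the `lanes` parameter are assumptions for illustration.

```python
def reduce_with_lanes(values, lanes=4):
    """Sum values using several independent accumulators, then combine them.

    A single accumulator forms one long dependency chain (a recurrence).
    Because addition is associative, the work can be split into `lanes`
    shorter chains that do not depend on each other.
    """
    acc = [0] * lanes
    for i, v in enumerate(values):
        acc[i % lanes] += v  # each lane only depends on its own previous value
    # Combine the partial sums pairwise, as a tree, until one value remains.
    while len(acc) > 1:
        half = (len(acc) + 1) // 2
        acc = [acc[i] + (acc[i + half] if i + half < len(acc) else 0)
               for i in range(half)]
    return acc[0]
```

The example uses integers on purpose: with floating-point data, reassociating the additions can change the rounding of the final result.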

Global Communications Conference | 2003

A conflict-free memory banking architecture for fast VOQ packet buffers

Jorge García; Llorenç Cerdà; Jesus Corbal; Mateo Valero

In order to support the enormous growth of the Internet, innovative research in every router subsystem is needed. We focus our attention on packet buffer design for routers supporting high-speed line rates. More specifically, we address the design of packet buffers using virtual output queuing (VOQ), which is used in most modern router architectures. The design is based on a previously proposed scheme that uses a combination of SRAM and DRAM modules. We propose a storage scheme that achieves a conflict-free memory bank organization. This leads to a reduction of the granularity of DRAM accesses, resulting in a decrease of the storage capacity needed by the SRAM. In the DRAM/SRAM scheme, SRAM memory bandwidth needs to fit the line rate. Since memory bandwidth is limited by its size, finding memory schemes with a small SRAM size becomes essential for high-speed line rates (e.g. OC768, 40 Gbps and OC3072, 160 Gbps).
