Publication


Featured research published by Javier Zalamea.


International Symposium on Microarchitecture | 2000

Two-level hierarchical register file organization for VLIW processors

Javier Zalamea; Josep Llosa; Eduard Ayguadé; Mateo Valero

High-performance microprocessors are currently designed to exploit the inherent instruction level parallelism (ILP) available in most applications. The techniques used in their design and the aggressive scheduling techniques used to exploit this ILP tend to increase the register requirements of the loops. If more registers than those available in the architecture are required, some actions (such as spill code insertion) have to be applied to reduce this pressure, at the expense of some performance degradation. This degradation could be avoided if a high-capacity register file were included without causing a negative impact on the cycle time of the processor. The authors propose a two-level hierarchical register file organization for VLIW architectures that combines high capacity and low access time. For the configuration proposed in the paper, the new organization achieves a speed-up of 10-14% over a monolithic organization with 64 registers; it is obtained with a 43% (40%) reduction in area (peak power dissipation). Compared to a monolithic file with 32 registers, the speed-up is as much as 38% with just a 14% (4%) increase in area (peak power dissipation).
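
As a rough illustration of the organization (a toy Python model, not the paper's hardware design; names and sizes are invented), the sketch below keeps a small, fast first-level file next to the functional units and uses explicit move operations against a larger second-level file:

    class TwoLevelRegisterFile:
        """Toy model: a small fast first level backed by a large second level."""
        def __init__(self, n_local=16, n_global=64):
            self.local = [0] * n_local      # low-latency registers seen by the functional units
            self.glob = [0] * n_global      # high-capacity second-level registers
            self.moves = 0                  # explicit inter-level move operations issued

        def promote(self, g, l):
            # copy a value from the second level into the first level before it is used
            self.local[l] = self.glob[g]
            self.moves += 1

        def demote(self, l, g):
            # move a long-lived value out of the small first-level file
            self.glob[g] = self.local[l]
            self.moves += 1

    rf = TwoLevelRegisterFile()
    rf.glob[5] = 42
    rf.promote(5, 0)                        # bring the value close to the functional units
    print(rf.local[0], rf.moves)            # 42 1

The point of the organization is that only the small first level sits on the critical path, so access time (and hence cycle time) stays low even though total register capacity is large.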


International Symposium on Microarchitecture | 2001

Modulo scheduling with integrated register spilling for clustered VLIW architectures

Javier Zalamea; Josep Llosa; Eduard Ayguadé; Mateo Valero

Clustering is a technique to decentralize the design of future wide-issue VLIW cores and enable them to meet the technology constraints in terms of cycle time, area and power dissipation. In a clustered design, registers and functional units are grouped into clusters, and new instructions are needed to move data between them. New aggressive instruction scheduling techniques are required to minimize the negative effect of resource clustering and of the delays in moving data around. In this paper we present a novel software pipelining technique that performs instruction scheduling with reduced register requirements, register allocation, register spilling and inter-cluster communication in a single step. The algorithm uses limited backtracking to reconsider previously taken decisions. This backtracking provides the algorithm with additional possibilities for obtaining high-throughput schedules with low spill code requirements for clustered architectures. We show that the proposed approach outperforms previously proposed techniques and that it scales well regardless of the number of clusters, the number of communication buses and the communication latency. The paper also includes an exploration of some parameters in the design of future clustered VLIW cores.
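
The control structure of such a scheduler can be sketched as follows (a deliberately simplified toy, not the authors' algorithm; dependences, latencies, spilling and cluster assignment are omitted): operations are placed into issue slots modulo the initiation interval (II), and when no slot is free a previously placed operation is ejected and retried, up to a fixed backtracking budget.

    def modulo_schedule(ops, slots_per_cycle, ii, budget=10):
        """Place each op in a cycle modulo ii; eject an earlier op when stuck."""
        schedule = {}                       # op -> cycle (modulo ii)
        used = [0] * ii                     # issue slots consumed per modulo cycle
        pending = list(ops)
        while pending:
            op = pending.pop(0)
            free = [c for c in range(ii) if used[c] < slots_per_cycle]
            if free:
                schedule[op] = free[0]
                used[free[0]] += 1
            elif budget > 0:
                budget -= 1                 # limited backtracking: undo one earlier decision
                victim = next(iter(schedule))
                used[schedule.pop(victim)] -= 1
                pending += [op, victim]     # retry both operations later
            else:
                return None                 # caller would increase ii or insert more spill code
        return schedule

    print(modulo_schedule(list("abcdef"), slots_per_cycle=2, ii=3))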


International Parallel and Distributed Processing Symposium | 2003

Hierarchical clustered register file organization for VLIW processors

Javier Zalamea; Josep Llosa; Eduard Ayguadé; Mateo Valero

Technology projections indicate that wire delays will become one of the biggest constraints in future microprocessor designs. To avoid long wire delays and therefore long cycle times, processor cores must be partitioned into components so that most of the communication is done locally. In this paper, we propose a novel register file organization for VLIW cores that combines clustering with a hierarchical register file organization. Functional units are organized in clusters, each one with a local first-level register file. The local register files are connected to a global second-level register file, which provides access to memory. All intercluster communications are done through the second-level register file. This paper also proposes MIRS-HC, a novel modulo scheduling technique that simultaneously performs instruction scheduling, cluster selection, communication insertion, register allocation and spill code insertion for the proposed organization. The results show that although more cycles are required to execute applications, the execution time is reduced due to a shorter cycle time. In addition, the combination of clustering and hierarchy provides a larger design exploration space that trades off performance and technology requirements.
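
The trade-off claimed in the results (more cycles, but a shorter cycle time) is easy to make concrete; the numbers below are invented for illustration and are not results from the paper:

    # execution time = cycle count x cycle time
    cycles_mono, t_mono = 1_000_000, 1.00   # hypothetical monolithic register file
    cycles_hier, t_hier = 1_100_000, 0.80   # hypothetical hierarchy: +10% cycles, -20% cycle time

    speedup = (cycles_mono * t_mono) / (cycles_hier * t_hier)
    print(f"speedup = {speedup:.2f}x")      # ~1.14x despite executing more cycles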


International Journal of Parallel Programming | 2004

Software and hardware techniques to optimize register file utilization in VLIW architectures

Javier Zalamea; Josep Llosa; Eduard Ayguadé; Mateo Valero

High-performance microprocessors are currently designed with the purpose of exploiting instruction level parallelism (ILP). The techniques used in their design and the aggressive scheduling techniques used to exploit this ILP tend to increase the register requirements of the loops. This paper reviews hardware and software techniques that alleviate the high register demands of aggressive scheduling heuristics on VLIW cores. From the software point of view, instruction scheduling can stretch lifetimes and reduce the register pressure. If more registers than those available in the architecture are required, some actions (such as the injection of spill code) have to be applied to reduce this pressure, at the expense of some performance degradation. From the hardware point of view, this degradation could be reduced if a high-capacity register file were included without causing a negative impact on the design of the processor (cycle time, area and power dissipation). Novel organizations for the register file, based on clustering and hierarchy, are necessary to meet the technology constraints. This paper proposes the use of a clustered organization together with an aggressive instruction scheduling technique that minimizes the negative effect of the limitations imposed by the register file organization.


Programming Language Design and Implementation | 2000

Improved spill code generation for software pipelined loops

Javier Zalamea; Josep Llosa; Eduard Ayguadé; Mateo Valero

Software pipelining is a loop scheduling technique that extracts parallelism out of loops by overlapping the execution of several consecutive iterations. Due to the overlapping of iterations, schedules impose high register requirements during their execution. A schedule is valid if it requires at most the number of registers available in the target architecture. If not, its register requirements have to be reduced either by decreasing the iteration overlapping or by spilling registers to memory. In this paper we describe a set of heuristics to increase the quality of register-constrained modulo schedules. The heuristics decide between the two previous alternatives and define criteria for effectively selecting spilling candidates. The heuristics proposed for reducing the register pressure can be applied to any software pipelining technique. The proposals are evaluated using a register-conscious software pipeliner on a workbench composed of a large set of loops from the Perfect Club benchmark and a set of processor configurations. Proposals in this paper are compared against a previous proposal already described in the literature. For one of these processor configurations and the set of loops that do not fit in the available registers (32), a speed-up of 1.68 and a reduction of the memory traffic by a factor of 0.57 are achieved with an affordable increase in compilation time. For all the loops, this represents a speed-up of 1.38 and a reduction of the memory traffic by a factor of 0.7.
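
A minimal sketch of the candidate-selection step, assuming one common heuristic shape (not necessarily the exact criteria used in the paper): prefer lifetimes that tie up a register for many cycles but would add little memory traffic when spilled.

    def pick_spill_candidates(lifetimes, regs_needed, regs_available):
        """lifetimes: list of (name, cycles alive, number of defs and uses)."""
        excess = max(regs_needed - regs_available, 0)
        ranked = sorted(lifetimes,
                        key=lambda lt: lt[1] / lt[2],   # cycles freed per memory access added
                        reverse=True)
        return [name for name, _, _ in ranked[:excess]]

    lifetimes = [("t1", 40, 2), ("t2", 12, 6), ("t3", 25, 3)]
    print(pick_spill_candidates(lifetimes, regs_needed=34, regs_available=32))   # ['t1', 't3']

When spilling the best candidates is still not enough, the other alternative weighed by the heuristics is to decrease the iteration overlapping (a larger initiation interval), trading throughput for register pressure.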


IEEE Transactions on Parallel and Distributed Systems | 2004

Register constrained modulo scheduling

Javier Zalamea; Josep Llosa; Eduard Ayguadé; Mateo Valero

Software pipelining is an instruction scheduling technique that exploits the instruction level parallelism (ILP) available in loops by overlapping operations from various successive loop iterations. The main drawback of aggressive software pipelining techniques is their high register requirements. If the requirements exceed the number of registers available in the target architecture, some steps need to be applied to reduce the register pressure (incurring some performance degradation): reducing the iteration overlapping or spilling some lifetimes to memory. In the first part, we propose a set of heuristics to improve the spilling process and to better decide between adding spill code or directly decreasing the execution rate of iterations. The experimental evaluation, over a large number of representative loops and for a processor configuration, reports an increase in performance by a factor of 1.29 and a reduction of memory traffic by a factor of 1.36. In the second part, we analyze the use of backtracking and propose a novel approach for simultaneous instruction scheduling and register spilling in modulo scheduling: MIPS (modulo scheduling with integrated register spilling). The experimental evaluation reports an increase in performance by a factor of 1.46 and a reduction of the memory traffic by a factor of 1.66 (or an additional 1.13 and 1.22 with regard to the proposal in the first part). These improvements are achieved at the expense of a reasonable increase in the compilation time.
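
The two sets of factors quoted above are mutually consistent; the "additional" improvements are simply the ratios between the second part's factors and the first part's:

    speedup_part1, traffic_part1 = 1.29, 1.36
    speedup_mips,  traffic_mips  = 1.46, 1.66

    print(round(speedup_mips / speedup_part1, 2))   # 1.13 additional speed-up
    print(round(traffic_mips / traffic_part1, 2))   # 1.22 additional traffic reduction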


Languages and Compilers for Parallel Computing | 2001

MIRS: modulo scheduling with integrated register spilling

Javier Zalamea; Josep Llosa; Eduard Ayguadé; Mateo Valero

The overlapping of loop iterations in software pipelining techniques imposes high register requirements. The schedule for a loop is valid if it requires at most the number of registers available in the target architecture. Otherwise its register requirements have to be reduced by spilling registers to memory. Previous proposals for spilling in software pipelined loops require a two-step process. The first step performs the actual instruction scheduling without register constraints. The second step adds (if required) spill code and reschedules the modified loop. The process is repeated until a valid schedule, requiring no more registers than those available, is found. The paper presents MIRS (Modulo scheduling with Integrated Register Spilling), a novel register-constrained modulo scheduler that performs modulo scheduling and register spilling simultaneously in a single step. The algorithm is iterative and uses backtracking to undo previous scheduling decisions whenever resource or dependence conflicts appear. MIRS is compared against a state-of-the-art two-step approach already described in the literature. For this purpose, a workbench composed of a large set of loops from the Perfect Club and a set of processor configurations are used. On average, for the loops that require spill code, a speed-up in the range of 14-31% and a reduction of the memory traffic by a factor in the range of 0.90-0.72 are achieved.


International Journal of Embedded Systems | 2008

Power-efficient VLIW design using clustering and widening

Miquel Pericàs; Eduard Ayguadé; Javier Zalamea; Josep Llosa; Mateo Valero

Media applications exhibit large quantities of Instruction-Level Parallelism (ILP), particularly inside loops. However, the exploited ILP is limited by the available resources and by loop recurrences. To overcome this, current designs replicate memory ports and functional units. But as the number of units grows, efficiency drops dramatically. Clustering and widening are two techniques for enabling wide-issue cores to meet technology constraints in terms of cycle time, area and power. In this paper we evaluate several VLIW designs that use these techniques. From the study we conclude that clustering, widening or both can yield power-efficient configurations with modest area requirements.


IEEE International Conference on High Performance Computing, Data and Analytics | 2004

High-performance and low-power VLIW cores for numerical computations

Miquel Pericàs; Eduard Ayguadé; Javier Zalamea; Josep Llosa; Mateo Valero

Issue logic is among the worst-scaling structures in a modern microprocessor. Increasing the issue width increases the processor area exponentially. Bigger processors will have inherently larger wire delays. In this scenario, technology scaling will yield smaller performance improvements as the wire delays do not decrease. Instead, they start to dominate the clock cycle. In order to offer higher performance, the wire problem needs to be tackled. This paper discusses two methods that attempt to move the wire problem out of the critical path. The first method is the clustering technique, which directly approaches the wire problem by combining several smaller execution cores in the processor backend to perform the computations. Each core has a smaller issue width and a much smaller area. The second technique we study is the widening technique. This technique consists of reducing the issue width of the processor while giving the instructions SIMD capabilities. The parallelism here is small (normally two to four) and does not resemble multimedia or vector extensions. Wide processors use wide functional units that compute the same operation on multiple words. The rationale behind this idea is that by reducing the issue width (but not the computational bandwidth), we also reduce the issue logic circuitry and the complexity of structures such as the register file and the cache memory. When compared with a centralised core with 128 registers, 8 FPUs and 4 memory ports, our approach, using an equivalent amount of hardware units, achieves speedups of up to 1.7.
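
The widening idea can be illustrated with a toy width-2 operation (illustration only, with made-up data, not the evaluated hardware): one issue slot produces two results, so the computational bandwidth is preserved while the issue width, and with it the issue logic, shrinks.

    def wide_add(pair_a, pair_b):
        # one "wide" functional unit: the same operation applied to a pair of words per issue
        return [x + y for x, y in zip(pair_a, pair_b)]

    a = [1, 2, 3, 4]
    b = [10, 20, 30, 40]
    out = []
    for i in range(0, len(a), 2):           # width-2: half as many issued operations
        out += wide_add(a[i:i+2], b[i:i+2])
    print(out)                              # [11, 22, 33, 44]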


IEEE International Conference on High Performance Computing, Data and Analytics | 2003

Power-Performance Trade-Offs in Wide and Clustered VLIW Cores for Numerical Codes

Miquel Pericàs; Eduard Ayguadé; Javier Zalamea; Josep Llosa; Mateo Valero

Instruction-Level Parallelism (ILP) is the main source of performance achievable in numerical applications. Architectural resources and program recurrences are the main limitations to the amount of ILP exploitable from loops, the most time-consuming part in numerical computations. In order to increase the issue rate, current designs use growing degrees of resource replication for memory ports and functional units. But the high costs in terms of power, area and clock cycle of this technique are making it less attractive.
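
These two limits are commonly captured by the minimum initiation interval bound used in modulo scheduling, MII = max(ResMII, RecMII); the sketch below uses invented numbers, not data from the paper.

    from math import ceil

    def res_mii(ops_per_resource, units_per_resource):
        # resource bound: the busiest resource class limits how often an iteration can start
        return max(ceil(o / u) for o, u in zip(ops_per_resource, units_per_resource))

    def rec_mii(recurrences):
        # recurrence bound: each dependence cycle needs latency/distance cycles per iteration
        return max(ceil(lat / dist) for lat, dist in recurrences)

    mii = max(res_mii([6, 4], [2, 2]),      # e.g. 6 FP ops on 2 FPUs, 4 memory ops on 2 ports
              rec_mii([(5, 1), (8, 2)]))    # e.g. a 5-cycle dependence cycle with distance 1
    print(mii)                              # the initiation interval can be no smaller than 5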

Collaboration


Dive into Javier Zalamea's collaboration.

Top Co-Authors

Josep Llosa
Polytechnic University of Catalonia

Mateo Valero
Polytechnic University of Catalonia

Eduard Ayguadé
Barcelona Supercomputing Center

Miquel Pericàs
Polytechnic University of Catalonia