Josep Llosa

Polytechnic University of Catalonia

Publication

Featured research published by Josep Llosa.

high-performance computer architecture | 2004

Out-of-order commit processors

Adrian Cristal; Daniel Ortega; Josep Llosa; Mateo Valero

Modern out-of-order processors tolerate long-latency memory operations by supporting a large number of in-flight instructions. This is particularly useful in numerical applications where branch speculation is normally not a problem and where the cache hierarchy is not capable of delivering the data soon enough. In order to support more in-flight instructions, several resources have to be up-sized, such as the reorder buffer (ROB), the general-purpose instruction queues, the load/store queue and the number of physical registers in the processor. However, scaling up the number of entries in these resources is impractical because of area, cycle time, and power consumption constraints. We propose to increase the capacity of future processors by augmenting the number of in-flight instructions. Instead of simply up-sizing resources, we push for novel microarchitectural structures that achieve the same performance benefits but with a much lower need for resources. Our main contribution is a new checkpointing mechanism that is capable of keeping thousands of in-flight instructions at a practically constant cost. We also propose a queuing mechanism that takes advantage of the differences in waiting time of the instructions in the flow. Using these two mechanisms our processor has a performance degradation of only 10% for SPEC2000fp over a conventional processor requiring more than an order of magnitude additional entries in the ROB and instruction queues, and about a 200% improvement over a current processor with a similar number of entries.

international conference on parallel architectures and compilation techniques | 1996

Swing modulo scheduling: a lifetime-sensitive approach

Josep Llosa; Antonio González; Eduard Ayguadé; Mateo Valero

This paper presents a novel software pipelining approach, which is called Swing Modulo Scheduling (SMS). It generates schedules that are near optimal in terms of initiation interval, register requirements and stage count. Swing Modulo Scheduling is a heuristic approach that has a low computational cost. The paper describes the technique and evaluates it for the Perfect Club benchmark suite. SMS is compared with other heuristic methods, showing that it outperforms them in terms of the quality of the obtained schedules and compilation time. SMS is also compared with an integer linear programming approach that generates optimum schedules but with a huge computational cost, which makes it feasible only for very small loops. For a set of small loops, SMS obtained the optimum initiation interval in all the cases and its schedules required only 5% more registers and a 1% higher stage count than the optimum.
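For readers unfamiliar with the term, the initiation interval that modulo schedulers try to minimize is commonly bounded from below by resource usage and by loop-carried recurrences. The following is a minimal sketch of that standard lower bound, not of the SMS heuristic itself. All function names, and the representation of recurrences as (total latency, total iteration distance) pairs, are assumptions made for illustration.

```python
def res_mii(uses, units):
    """Resource-constrained bound: for each resource class, the number of
    operations using it divided by the available units, rounded up."""
    return max(-(-uses[r] // units[r]) for r in uses)


def rec_mii(recurrences):
    """Recurrence-constrained bound: for each dependence cycle, the total
    latency divided by the total iteration distance, rounded up."""
    return max((-(-lat // dist) for lat, dist in recurrences), default=0)


def min_initiation_interval(uses, units, recurrences):
    """Lower bound on the initiation interval (at least 1 cycle)."""
    return max(res_mii(uses, units), rec_mii(recurrences), 1)


# Hypothetical loop body: 5 ALU ops on 2 ALUs, 3 memory ops on 1 port,
# and two recurrences with (latency, distance) of (4, 1) and (6, 2).
uses = {"alu": 5, "mem": 3}
units = {"alu": 2, "mem": 1}
recurrences = [(4, 1), (6, 2)]
mii = min_initiation_interval(uses, units, recurrences)
```

In this hypothetical loop the recurrence with latency 4 and distance 1 dominates, so no schedule can start a new iteration more often than once every 4 cycles.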

international symposium on microarchitecture | 2000

Two-level hierarchical register file organization for VLIW processors

Javier Zalamea; Josep Llosa; Eduard Ayguadé; Mateo Valero

High-performance microprocessors are currently designed to exploit the inherent instruction level parallelism (ILP) available in most applications. The techniques used in their design and the aggressive scheduling techniques used to exploit this ILP tend to increase the register requirements of the loops. If more registers than those available in the architecture are required, some actions (such as spill code insertion) have to be applied to reduce this pressure, at the expense of some performance degradation. This degradation could be avoided if a high-capacity register file were included without causing a negative impact on the cycle time of the processor. The authors propose a two-level hierarchical register file organization for VLIW architectures that combines high capacity and low access time. For the configuration proposed in the paper, the new organization achieves a speed-up of 10-14% over a monolithic organization with 64 registers; it is obtained with a 43% (40%) reduction in area (peak power dissipation). Compared to a monolithic file with 32 registers, the speed-up is as much as 38% with just a 14% (4%) increase in area (peak power dissipation).

IEEE Transactions on Computers | 2001

Lifetime-sensitive modulo scheduling in a production environment

Josep Llosa; Eduard Ayguadé; Antonio González; Mateo Valero; Jason Eckhardt

This paper presents a novel software pipelining approach, which is called Swing Modulo Scheduling (SMS). It generates schedules that are near optimal in terms of initiation interval, register requirements, and stage count. Swing Modulo Scheduling is a heuristic approach that has a low computational cost. This paper first describes the technique and evaluates it for the Perfect Club benchmark suite on a generic VLIW architecture. SMS is compared with other heuristic methods, showing that it outperforms them in terms of the quality of the obtained schedules and compilation time. To further explore the effectiveness of SMS, the experience of incorporating it into a production quality compiler for the Equator MAP1000 processor is described; implementation issues are discussed, as well as modifications and improvements to the original algorithm. Finally, experimental results from using a set of industrial multimedia applications are presented.

Physical Review D | 2001

Hamiltonian formalism for space-time noncommutative theories

Joaquim Gomis; Kiyoshi Kamimura; Josep Llosa

Departament de Física Fonamental, Universitat de Barcelona, Diagonal 647, E-08028 Barcelona, Spain

Space-time non-commutative theories are non-local in time. We develop the Hamiltonian formalism for non-local field theories in d space-time dimensions by considering auxiliary d+1 dimensional field theories which are local with respect to the evolution time. The Hamiltonian path integral quantization is considered and the Feynman rules in the Lagrangian formalism are derived. The case of non-commutative φ

international symposium on microarchitecture | 2001

Modulo scheduling with integrated register spilling for clustered VLIW architectures

Javier Zalamea; Josep Llosa; Eduard Ayguadé; Mateo Valero

Clustering is a technique to decentralize the design of future wide issue VLIW cores and enable them to meet the technology constraints in terms of cycle time, area and power dissipation. In a clustered design, registers and functional units are grouped in clusters so that new instructions are needed to move data between them. New aggressive instruction scheduling techniques are required to minimize the negative effect of resource clustering and delays in moving data around. In this paper we present a novel software pipelining technique that performs instruction scheduling with reduced register requirements, register allocation, register spilling and inter-cluster communication in a single step. The algorithm uses limited backtracking to reconsider previously taken decisions. This backtracking provides the algorithm with additional possibilities for obtaining high throughput schedules with low spill code requirements for clustered architectures. We show that the proposed approach outperforms previously proposed techniques and that it is very scalable independently of the number of clusters, the number of communication buses and communication latency. The paper also includes an exploration of some parameters in the design of future clustered VLIW cores.

high-performance computer architecture | 1999

Distributed modulo scheduling

Marcio Merino Fernandes; Josep Llosa; Nigel P. Topham

Wide-issue ILP machines can be built using the VLIW approach as many of the hardware complexities found in superscalar processors can be transferred to the compiler. However, the scalability of VLIW architectures is still constrained by the size and number of ports of the register file required by a large number of functional units. Organizations composed of clusters of a few functional units and small private register files have been proposed to deal with this problem; an approach highly dependent on scheduling and partitioning strategies. The paper presents DMS, an algorithm that integrates modulo scheduling and code partitioning in a single procedure. Experimental results have shown that the algorithm is effective for configurations up to 8 clusters, or even more when targeting vectorizable loops.

international conference on supercomputing | 2002

A comparative study of modulo scheduling techniques

Josep M. Codina; Josep Llosa; Antonio González

Modulo Scheduling is an instruction scheduling technique that is used by many current compilers. Different approaches have been proposed in the past, but there is no quantitative comparison among them using the same compiling platform, benchmarks and architectures. This paper presents a performance comparison of the most relevant Modulo Scheduling techniques, based on a detailed quantitative evaluation of them. The results point out which are the most effective techniques for different architectures, which is useful for compiler designers when choosing the most appropriate technique for a particular processor architecture.

ACM Transactions on Programming Languages and Systems | 2004

A fast and accurate framework to analyze and optimize cache memory behavior

Xavier Vera; Nerina Bermudo; Josep Llosa; Antonio González

The gap between processor and main memory performance increases every year. In order to overcome this problem, cache memories are widely used. However, they are only effective when programs exhibit sufficient data locality. Compile-time program transformations can significantly improve the performance of the cache. To apply most of these transformations, the compiler requires a precise knowledge of the locality of the different sections of the code, both before and after being transformed. Cache miss equations (CMEs) allow us to obtain an analytical and precise description of the cache memory behavior for loop-oriented codes. Unfortunately, a direct solution of the CMEs is computationally intractable due to its NP-complete nature. This article proposes a fast and accurate approach to estimate the solution of the CMEs. We use sampling techniques to approximate the absolute miss ratio of each reference by analyzing a small subset of the iteration space. The size of the subset, and therefore the analysis time, is determined by the accuracy selected by the user. In order to reduce the complexity of the algorithm to solve CMEs, effective mathematical techniques have been developed to analyze the subset of the iteration space that is being considered. These techniques exploit some properties of the particular polyhedra represented by CMEs.
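The sampling idea in this abstract can be illustrated with a minimal sketch. It assumes a rectangular iteration space, uniform sampling with replacement, and the textbook worst-case sample-size formula for estimating a proportion. The predicate `toy_is_miss` is a hypothetical stand-in for evaluating the CMEs at one iteration point, and none of the names below come from the article.

```python
import math
import random


def sample_size(half_width, z=1.96):
    """Samples needed for a proportion estimate with the given confidence
    interval half-width, using the worst-case variance p(1-p) <= 0.25.
    z=1.96 corresponds to roughly 95% confidence."""
    return math.ceil(z * z * 0.25 / (half_width * half_width))


def estimate_miss_ratio(bounds, is_miss, half_width=0.05, z=1.96, seed=0):
    """Estimate one reference's miss ratio from randomly sampled points.

    bounds:  list of inclusive (lo, hi) loop bounds, outermost loop first.
    is_miss: hypothetical predicate deciding whether the reference misses
             at a given iteration point (stands in for solving the CMEs).
    """
    rng = random.Random(seed)
    n = sample_size(half_width, z)
    misses = 0
    for _ in range(n):
        point = tuple(rng.randint(lo, hi) for lo, hi in bounds)
        if is_miss(point):
            misses += 1
    return misses / n


def toy_is_miss(point):
    """Toy model: a stride-1 access misses once per 8-element cache line."""
    _, j = point
    return j % 8 == 0


# 100 x 800 iteration space; the exact miss ratio of the toy model is 0.125.
ratio = estimate_miss_ratio([(0, 99), (0, 799)], toy_is_miss)
```

The number of sampled points depends only on the requested accuracy, not on the size of the iteration space, which matches the abstract's statement that the analysis time is determined by the accuracy the user selects.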

Tropical Medicine & International Health | 2008

Dynamics of dengue epidemics in urban contexts

Puntani Pongsumpun; D. Garcia Lopez; Charly Favier; L. Torres; Josep Llosa; Marc A. Dubois

Dengue, similar to other arboviral diseases, exhibits complex spatiotemporal dynamics. Even at town or village level, individual‐based spatially explicit models are required to correctly reproduce epidemic curves. This makes modelling at the regional level (province, country or continent) very difficult and cumbersome. We propose here a first step to build a hierarchized model by constructing a simple analytical expression which reproduces the model output from macroscopic parameters describing each ‘village’. It also turns out to be a good approximation of real urban epidemic outbreaks. Subsequently, a regional model could be built by coupling these equations on a lattice.
