Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Juan J. Navarro is active.

Publication


Featured research published by Juan J. Navarro.


international symposium on computer architecture | 1998

Dynamic history-length fitting: a third level of adaptivity for branch prediction

Toni Juan; Sanji Sanjeevan; Juan J. Navarro

Accurate branch prediction is essential for obtaining high performance in pipelined superscalar processors that execute instructions speculatively. Some of the best current predictors combine a part of the branch address with a fixed amount of global history of branch outcomes in order to make a prediction. These predictors cannot perform uniformly well across all workloads because the best amount of history to be used depends on the code, the input data and the frequency of context switches. Consequently, all predictors that use a fixed history length are unable to perform up to their maximum potential. We introduce a method, called DHLF, that dynamically determines the optimum history length during execution, adapting to the specific requirements of any code, input data and system workload. Our proposal adds an extra level of adaptivity to two-level adaptive branch predictors. The DHLF method can be applied to any predictor that combines global branch history with the branch address. We apply the DHLF method to gshare (dhlf-gshare) and obtain near-optimal results for all SPECint95 benchmarks, with and without context switches. Some results are also presented for gskewed (dhlf-gskewed), confirming that other predictors can benefit from our proposal.
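As a rough illustration of the mechanism, the sketch below implements a gshare-style predictor in which the history length is a run-time variable rather than a fixed constant; that variable is the knob a DHLF-style outer loop would tune by tracking mispredictions at different lengths. All names and parameters are illustrative, not taken from the paper.

```c
#include <stdint.h>
#include <stdio.h>

#define TABLE_BITS 12                      /* 4096-entry pattern history table */
#define TABLE_SIZE (1u << TABLE_BITS)

static uint8_t  pht[TABLE_SIZE];           /* 2-bit saturating counters */
static uint32_t ghist = 0;                 /* global branch-outcome history */
static unsigned hlen  = 8;                 /* history length: the knob DHLF adapts */

/* gshare index: branch PC XORed with the low 'hlen' bits of global history. */
static uint32_t index_of(uint32_t pc) {
    uint32_t hmask = (hlen >= 32) ? ~0u : ((1u << hlen) - 1);
    return (pc ^ (ghist & hmask)) & (TABLE_SIZE - 1);
}

static int predict(uint32_t pc) {
    return pht[index_of(pc)] >= 2;         /* taken if counter in upper half */
}

static void update(uint32_t pc, int taken) {
    uint8_t *c = &pht[index_of(pc)];
    if (taken  && *c < 3) (*c)++;
    if (!taken && *c > 0) (*c)--;
    ghist = (ghist << 1) | (taken ? 1u : 0u);
    /* A DHLF-style outer loop would periodically compare misprediction
       rates observed at different 'hlen' values and switch to the best. */
}

int main(void) {
    /* Toy trace: a branch at pc=0x40 that alternates taken / not-taken. */
    int mispredicts = 0;
    for (int i = 0; i < 1000; i++) {
        int outcome = i & 1;
        if (predict(0x40) != outcome) mispredicts++;
        update(0x40, outcome);
    }
    printf("mispredictions: %d / 1000\n", mispredicts);
    return 0;
}
```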


international symposium on low power electronics and design | 1997

Reducing TLB power requirements

Toni Juan; Tomás Lang; Juan J. Navarro

Translation look-aside buffers (TLBs) are small caches that speed up address translation in processors with virtual memory. This paper considers two issues: (1) a comparison of the power consumption of fully-associative, set-associative, and direct-mapped TLBs for the same miss rate and (2) the proposal of modifications of the basic cells and of the structure of set-associative TLBs to reduce power. The power evaluation is done using a model, and the miss rates are obtained from simulations of the SPEC92 benchmarks. With respect to (1), we conclude that for small TLBs (high miss rates) fully-associative TLBs consume less power, but for larger TLBs (low miss rates) set-associative TLBs are better. Moreover, the proposed modifications produce significant reductions in power consumption: our evaluations show a reduction of 40 to 60% compared to the best traditional TLB. The proposed TLB implementation produces an increase in delay and in area. However, these increases are tolerable because the cycle time is determined by the slower cache and because the TLB corresponds to only a small portion of the chip area.
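To make the associativity trade-off concrete, here is a toy first-order lookup-energy model (our illustration, not the paper's model): a fully-associative TLB activates a comparison on every entry for each access, while an A-way set-associative TLB activates only the A entries of one set, at the price of a higher miss rate for the same capacity.

```c
#include <stdio.h>

/* Toy first-order model: lookup energy grows with the number of tag
 * comparisons and entry reads activated per access. Unit costs are
 * arbitrary illustrative values, not measured figures. */
static double lookup_energy(int entries, int assoc) {
    const double e_cmp  = 1.0;  /* energy per tag comparison (arbitrary unit) */
    const double e_read = 2.0;  /* energy per entry read */
    int ways = (assoc == 0) ? entries : assoc;  /* assoc==0: fully associative */
    return ways * e_cmp + ways * e_read;
}

int main(void) {
    printf("64-entry fully assoc : %.0f units/access\n", lookup_energy(64, 0));
    printf("64-entry 2-way       : %.0f units/access\n", lookup_energy(64, 2));
    /* The set-associative design wins per access; whether it wins overall
     * depends on its higher miss rate, which the paper quantifies. */
    return 0;
}
```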


international conference on supercomputing | 1997

Data caches for superscalar processors

Toni Juan; Juan J. Navarro; Olivier Temam

As the number of instructions executed in parallel increases, superscalar processors will require higher bandwidth from data caches. Because of the high cost of true multi-ported caches, alternative cache designs must be evaluated. The purpose of this study is to examine the data cache bandwidth requirements of high-degree superscalar processors, and investigate alternative solutions. The designs studied range from classic solutions like multi-banked caches to more complex solutions recently proposed in the literature. The performance tradeoffs of these different cache designs are examined in detail. Then, using a chip area cost model, all solutions are compared with respect to both cost and performance. While many cache designs seem capable of achieving high cache bandwidth, the best cost/performance tradeoff varies significantly depending on the dedicated area cost, ranging from multi-banked cache designs to hybrid multi-banked/multi-ported caches or even true multi-ported caches. For instance, we find that an 8-bank cache with minor optimizations performs 10% better than a true 2-port cache at half the cost, or that a 4-bank cache with 2 ports per bank performs better than a true 4-port cache and uses 45% less chip area.
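A minimal sketch of the multi-banked alternative, with illustrative parameters: low-order line-address bits select the bank, and two same-cycle accesses that map to the same single-ported bank must serialize.

```c
#include <stdint.h>
#include <stdio.h>

#define NBANKS    8   /* 8-bank interleaved cache, one port per bank */
#define LINE_BITS 5   /* 32-byte lines; banks interleaved on line address */

/* Bank selection: low-order line-address bits pick the bank. */
static unsigned bank_of(uint32_t addr) {
    return (addr >> LINE_BITS) & (NBANKS - 1);
}

/* Count extra cycles caused by bank conflicts when 'n' loads issue together. */
static int conflict_cycles(const uint32_t *addrs, int n) {
    int use[NBANKS] = {0}, extra = 0;
    for (int i = 0; i < n; i++) {
        unsigned b = bank_of(addrs[i]);
        if (use[b]++) extra++;   /* a second access to the same bank waits */
    }
    return extra;
}

int main(void) {
    /* Four loads issued in one cycle; the first and last hit bank 0. */
    uint32_t group[4] = {0x1000, 0x1020, 0x1040, 0x1100};
    printf("extra cycles from bank conflicts: %d\n", conflict_cycles(group, 4));
    return 0;
}
```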


international symposium on computer architecture | 1996

The Difference-Bit Cache

Juan J. Navarro; Tomás Lang; Toni Juan

The difference-bit cache is a two-way set-associative cache with an access time that is smaller than that of a conventional one and close to or equal to that of a direct-mapped cache. This is achieved by noticing that the two tags for a set have to differ in at least one bit and by using this bit to select the way. In contrast with previous approaches that predict the way and have two types of hits (primary hits of one cycle and secondary hits of two to four cycles), all hits of the difference-bit cache take one cycle. The evaluation of the access time of our cache organization has been performed using a recently proposed on-chip cache access model.
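A simplified sketch of the way-selection mechanism for a single set, assuming 2-way associativity as in the paper; the structure and identifiers are illustrative, not the paper's implementation.

```c
#include <stdint.h>
#include <stdio.h>

/* One set of a 2-way difference-bit cache (illustrative, simplified).
 * The two resident tags must differ in at least one bit; 'diff' records
 * one such position, and the value of that bit selects the way. */
typedef struct {
    uint32_t tag[2];
    unsigned diff;      /* bit position where tag[0] and tag[1] differ */
    int      valid[2];
} set2_t;

/* Way selection uses a single bit of the incoming tag, like a
 * direct-mapped lookup; a full tag compare then confirms the hit. */
static int lookup(const set2_t *s, uint32_t tag, int *hit) {
    int way = ((tag >> s->diff) & 1) == ((s->tag[0] >> s->diff) & 1) ? 0 : 1;
    *hit = s->valid[way] && s->tag[way] == tag;
    return way;
}

/* After a fill, recompute a differing bit (lowest set bit of the XOR). */
static void set_diff(set2_t *s) {
    uint32_t x = s->tag[0] ^ s->tag[1];
    s->diff = 0;
    while (x && !(x & 1)) { x >>= 1; s->diff++; }
}

int main(void) {
    set2_t s = { {0x12, 0x13}, 0, {1, 1} };
    set_diff(&s);                       /* tags differ in bit 0 */
    int hit, way = lookup(&s, 0x13, &hit);
    printf("way %d, hit %d\n", way, hit);
    return 0;
}
```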


international conference on supercomputing | 1994

MOB forms: a class of multilevel block algorithms for dense linear algebra operations

Juan J. Navarro; Toni Juan; Tomás Lang

Multilevel block algorithms exploit the data locality of linear algebra operations when executed on machines with several levels in the memory hierarchy. It is shown that the family we call Multilevel Orthogonal Block (MOB) algorithms is optimal and easy to design, and that the multilevel approach produces significant performance improvements. The effects of cache interference, TLB misses, and page faults are also considered. The multilevel block algorithms are evaluated analytically for an ideal memory system with M cache levels and no interference. Moreover, experimental results for the MOB forms on several current high-performance workstations are presented.
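As an illustration of the multilevel idea (a generic two-level tiling, not the specific MOB forms), the following matrix multiply blocks once for a hypothetical L2 and again for L1; all sizes are placeholders.

```c
#include <stdio.h>

#define N  256   /* problem size (illustrative) */
#define B1  16   /* inner block: sized for the L1 cache */
#define B2  64   /* outer block: sized for the L2 cache; B2 % B1 == 0 */

static double A[N][N], Bm[N][N], C[N][N];

/* Two-level blocked matrix multiply: the outer loops tile for L2, the
 * middle loops tile for L1, so each level of the hierarchy is reused
 * before its working set is evicted. */
static void matmul_2level(void) {
    for (int jj = 0; jj < N; jj += B2)
    for (int kk = 0; kk < N; kk += B2)
    for (int ii = 0; ii < N; ii += B2)
        for (int j = jj; j < jj + B2; j += B1)
        for (int k = kk; k < kk + B2; k += B1)
        for (int i = ii; i < ii + B2; i += B1)
            /* B1 x B1 x B1 kernel: working set fits in L1 */
            for (int j1 = j; j1 < j + B1; j1++)
            for (int k1 = k; k1 < k + B1; k1++)
            for (int i1 = i; i1 < i + B1; i1++)
                C[i1][j1] += A[i1][k1] * Bm[k1][j1];
}

int main(void) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) { A[i][j] = 1.0; Bm[i][j] = 1.0; }
    matmul_2level();
    printf("C[0][0] = %.0f (expect %d)\n", C[0][0], N);
    return 0;
}
```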


international conference on supercomputing | 1996

Block algorithms for sparse matrix computations on high performance workstations

Juan J. Navarro; Elena García-Diego; Josep-Lluís Larriba-Pey; Toni Juan

In this paper we analyze the use of Blocking (tiling), Data Precopying and Software Pipelining to improve the performance of sparse matrix computations on superscalar workstations. In particular, we analyze the case of the sparse matrix by dense matrix operation. The analysis focuses on the practical aspects that can be observed when programming such problems on present workstations with several memory levels. The problem is studied on the Alpha 21064 based workstation DEC 3000/800. Simulations of the memory hierarchy are also used to understand the behaviour of the algorithms. The results obtained show that there is a clear difference between the dense case and the sparse case in terms of the compromises to be adopted to optimize the algorithms. The analysis can be of interest to numerical library and compiler designers.
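A baseline sketch of the sparse matrix by dense matrix operation using CSR storage; the optimizations studied in the paper (blocking, precopying, software pipelining) would be layered on top of this loop nest. Types and sizes are illustrative.

```c
#include <stdio.h>

/* Sparse matrix (CSR) times dense matrix: C = A * B, where A is n x n
 * sparse and B, C are n x m dense, stored row-major. */
typedef struct {
    int n;                  /* rows */
    const int *rowptr;      /* size n+1 */
    const int *col;         /* column index per nonzero */
    const double *val;      /* value per nonzero */
} csr_t;

static void spmm(const csr_t *A, const double *B, double *C, int m) {
    for (int i = 0; i < A->n; i++)
        for (int p = A->rowptr[i]; p < A->rowptr[i + 1]; p++) {
            double a = A->val[p];
            const double *brow = B + A->col[p] * m;
            double *crow = C + i * m;
            for (int j = 0; j < m; j++)   /* dense inner loop: unit stride */
                crow[j] += a * brow[j];
        }
}

int main(void) {
    /* 2x2 sparse A = [[2, 0], [0, 3]], B = identity (m = 2). */
    int rowptr[] = {0, 1, 2}, col[] = {0, 1};
    double val[] = {2.0, 3.0};
    double B[4] = {1, 0, 0, 1}, C[4] = {0};
    csr_t A = {2, rowptr, col, val};
    spmm(&A, B, C, 2);
    printf("C = [%g %g; %g %g]\n", C[0], C[1], C[2], C[3]);
    return 0;
}
```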


parallel, distributed and network-based processing | 2003

CC-Radix: a cache conscious sorting based on Radix sort

Daniel Jiménez-González; Juan J. Navarro; Josep-Lluís Larriba-Pey

We focus on the improvement of data locality for the in-core sequential Radix sort algorithm for 32-bit positive integer keys. We propose a new algorithm that we call Cache Conscious Radix sort, CC-Radix. CC-Radix improves data locality by dynamically partitioning the data set into subsets that fit in cache level L2. Once in that cache level, each subset is sorted with Radix sort. In order to obtain the best implementations, we analyse the algorithms and obtain the algorithmic parameters that minimize the number of misses in cache levels L1 and L2 and in the TLB. Here, we present results for a MIPS R10000 based computer, the SGI Origin 2000. Our results show that our algorithm is about 2 and 1.4 times faster than Quicksort and Explicit Block Transfer Radix sort, respectively; to our knowledge, the latter was previously the fastest sorting algorithm.
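A sketch of the partition-until-it-fits strategy, under stated simplifications: this version falls back on qsort for in-cache subsets and uses a toy capacity threshold, whereas CC-Radix itself sorts those subsets with Radix sort and chooses digit widths from the L1, L2 and TLB parameters.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define L2_KEYS    (1 << 7)  /* toy "fits in L2" threshold, in keys */
#define RADIX_BITS 8
#define BUCKETS    (1 << RADIX_BITS)

static int cmp(const void *a, const void *b) {
    uint32_t x = *(const uint32_t *)a, y = *(const uint32_t *)b;
    return (x > y) - (x < y);
}

/* Partition by the most significant remaining digit until the subset
 * fits under the cache threshold, then sort it while it is in cache. */
static void ccradix(uint32_t *a, size_t n, int shift) {
    if (n <= L2_KEYS || shift < 0) { qsort(a, n, sizeof *a, cmp); return; }
    size_t count[BUCKETS] = {0}, start[BUCKETS], pos[BUCKETS];
    for (size_t i = 0; i < n; i++) count[(a[i] >> shift) & (BUCKETS - 1)]++;
    start[0] = 0;
    for (int b = 1; b < BUCKETS; b++) start[b] = start[b - 1] + count[b - 1];
    memcpy(pos, start, sizeof pos);
    uint32_t *tmp = malloc(n * sizeof *tmp);
    for (size_t i = 0; i < n; i++)         /* stable scatter by digit */
        tmp[pos[(a[i] >> shift) & (BUCKETS - 1)]++] = a[i];
    memcpy(a, tmp, n * sizeof *a);
    free(tmp);
    for (int b = 0; b < BUCKETS; b++)      /* recurse per bucket */
        ccradix(a + start[b], count[b], shift - RADIX_BITS);
}

int main(void) {
    enum { NKEYS = 1000 };
    uint32_t a[NKEYS];
    for (int i = 0; i < NKEYS; i++) a[i] = (uint32_t)(NKEYS - i) * 2654435761u;
    ccradix(a, NKEYS, 32 - RADIX_BITS);
    for (int i = 1; i < NKEYS; i++)
        if (a[i - 1] > a[i]) { puts("not sorted"); return 1; }
    puts("sorted");
    return 0;
}
```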


Applicable Algebra in Engineering, Communication and Computing | 2007

Analysis of a sparse hypermatrix Cholesky with fixed-sized blocking

José R. Herrero; Juan J. Navarro

We present the way in which we have constructed an implementation of a sparse Cholesky factorization based on a hypermatrix data structure. This data structure is a storage scheme which produces a recursive 2D partitioning of a sparse matrix. It can be useful on some large sparse matrices. Subblocks are stored as dense matrices. Thus, efficient BLAS3 routines can be used. However, since we are dealing with sparse matrices some zeros may be stored in those dense blocks. The overhead introduced by the operations on zeros can become large and considerably degrade performance. We present the ways in which we deal with this overhead. Using matrices from different areas (Interior Point Methods of linear programming and Finite Element Methods), we evaluate our sequential in-core hypermatrix sparse Cholesky implementation. We compare its performance with several other codes and analyze the results. In spite of using a simple fixed-size partitioning of the matrix our code obtains competitive performance.
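An illustrative sketch of a hypermatrix-style storage scheme (only two levels shown; the paper's structure is recursive with more levels): pointer blocks subdivide the matrix in 2D, NULL entries stand for all-zero submatrices, and leaves are fixed-size dense blocks on which BLAS3 kernels can operate. Some stored zeros inside a leaf are the price of the dense-block format.

```c
#include <stdio.h>
#include <stdlib.h>

#define LEAF 8                      /* fixed leaf block size (8 x 8) */

/* A node is either a 2x2 pointer block or a dense leaf submatrix. */
typedef struct node {
    int is_leaf;
    union {
        struct node *child[2][2];   /* pointer block; NULL = all-zero */
        double dense[LEAF][LEAF];   /* dense leaf: BLAS3-friendly */
    } u;
} hm_t;

/* Allocate a zeroed leaf lazily: only touched submatrices exist. */
static hm_t *hm_leaf(void) {
    hm_t *n = calloc(1, sizeof *n);
    n->is_leaf = 1;
    return n;
}

int main(void) {
    hm_t root = {0};                      /* 16x16 matrix as 2x2 of leaves */
    root.u.child[0][0] = hm_leaf();       /* only the (0,0) block is nonzero */
    root.u.child[0][0]->u.dense[3][3] = 5.0;
    printf("entry (3,3) of block (0,0): %.1f\n",
           root.u.child[0][0]->u.dense[3][3]);
    free(root.u.child[0][0]);
    return 0;
}
```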


european conference on parallel processing | 2003

Improving Performance of Hypermatrix Cholesky Factorization

José R. Herrero; Juan J. Navarro

This paper shows how a sparse hypermatrix Cholesky factorization can be improved. This is accomplished by means of efficient codes which operate on very small dense matrices. Different matrix sizes or target platforms may require different codes to obtain good performance. We write a set of codes for each matrix operation using different loop orders and unroll factors. Then, for each matrix size, we automatically compile each code with matrix leading dimensions and loop sizes fixed, run the resulting executable and record its Mflops. The best combination is then used to produce the object code included in a library. Thus, a routine for each desired matrix size is available from the library. The large overhead incurred by the hypermatrix Cholesky factorization of sparse matrices can therefore be lessened by reducing the block size when those routines are used. Using the routines in our small matrix library, e.g. matrix multiplication, produced important speed-ups in our sparse Cholesky code.
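A minimal sketch of the generate-and-measure selection loop, with two hypothetical kernel variants for one fixed matrix size; a real generator would emit many loop orders and unroll factors, compile each with fixed leading dimensions, and keep the fastest.

```c
#include <stdio.h>
#include <time.h>

#define N 4   /* matrix size fixed at compile time, as in the small-matrix library */

typedef void (*kernel_t)(double A[N][N], double B[N][N], double C[N][N]);

/* Candidate 1: ijk loop order. */
static void mm_ijk(double A[N][N], double B[N][N], double C[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            for (int k = 0; k < N; k++)
                C[i][j] += A[i][k] * B[k][j];
}

/* Candidate 2: ikj loop order (unit-stride inner loop). */
static void mm_ikj(double A[N][N], double B[N][N], double C[N][N]) {
    for (int i = 0; i < N; i++)
        for (int k = 0; k < N; k++)
            for (int j = 0; j < N; j++)
                C[i][j] += A[i][k] * B[k][j];
}

/* Empirical selection: time each variant and keep the fastest, mimicking
 * the generate-compile-measure process described in the abstract. */
int main(void) {
    static double A[N][N], B[N][N], C[N][N];
    kernel_t cand[] = { mm_ijk, mm_ikj };
    const char *name[] = { "ijk", "ikj" };
    int best = 0; double best_t = 1e30;
    for (int v = 0; v < 2; v++) {
        clock_t t0 = clock();
        for (int r = 0; r < 100000; r++) cand[v](A, B, C);
        double t = (double)(clock() - t0) / CLOCKS_PER_SEC;
        printf("%s: %.3f s\n", name[v], t);
        if (t < best_t) { best_t = t; best = v; }
    }
    printf("selected kernel: %s\n", name[best]);
    return 0;
}
```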


international conference of the chilean computer science society | 1997

An analysis of superscalar sorting algorithms on an R8000 processor

Josep-Lluís Larriba-Pey; Daniel Jiménez; Juan J. Navarro

We compare and analyze different in-memory sorting algorithms to understand their behavior on a superscalar MIPS R8000 processor. We explore Quicksort, Heapsort and an implementation variant of Radix sort that we propose. We compare the methods in isolation and combined with Multiway merge and Bucket sort; combining methods helps check for potential exploitation of locality. We describe and analyze models of the most significant algorithms. Some conclusions can be drawn from this work. First, Radix sort is the fastest algorithm. Second, the use of combined methods does not help to exploit locality. Third, with the help of the models and an analysis of the codes, it is possible to see that Radix sort is the most promising of the methods studied here for future superscalar architectures.

Collaboration


Dive into Juan J. Navarro's collaborations.

Top Co-Authors (all at the Polytechnic University of Catalonia)

José R. Herrero
Josep-Lluís Larriba-Pey
José M. Llabería
Miguel Valero-García
Mateo Valero
Àngel Jorba
Daniel Jiménez-González
Toni Juan
Tomás Lang
Elena García-Diego