Publication


Featured research published by Jakub Kurzak.


Journal of Physics: Conference Series | 2009

Numerical linear algebra on emerging architectures: The PLASMA and MAGMA projects

Emmanuel Agullo; James W. Demmel; Jack J. Dongarra; Bilel Hadri; Jakub Kurzak; Julien Langou; Hatem Ltaief; Piotr Luszczek; Stanimire Tomov

The emergence and continuing use of multi-core architectures and graphics processing units require changes in existing software, and sometimes even a redesign of established algorithms, in order to take advantage of the now prevailing parallelism. Parallel Linear Algebra for Scalable Multi-core Architectures (PLASMA) and Matrix Algebra on GPU and Multicore Architectures (MAGMA) are two projects that aim to achieve high performance and portability across a wide range of multi-core architectures and hybrid systems, respectively. In this document we present a comparative study of PLASMA's performance against established linear algebra packages, together with some preliminary results of MAGMA on hybrid multi-core and GPU systems.


Parallel Computing | 2006

The impact of multicore on math software

Alfredo Buttari; Jack J. Dongarra; Jakub Kurzak; Julien Langou; Piotr Luszczek; Stanimire Tomov

Power consumption and heat dissipation issues are pushing the microprocessor industry towards multicore design patterns. Given the roughly cubic dependence of power consumption on core frequency, multicore technologies leverage the idea that doubling the number of cores while halving the core frequency gives roughly the same performance at about one quarter of the power consumption. With the number of cores on multicore chips expected to reach tens within a few years, efficient implementations of numerical libraries using shared-memory programming models are of high interest. The message passing paradigm currently used in ScaLAPACK and elsewhere introduces unnecessary memory overhead and memory copy operations, which degrade performance and make it harder to schedule operations that could be done in parallel. Limiting the use of shared memory to fork-join parallelism (perhaps with OpenMP) or to its use within the BLAS does not address all of these issues.
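
As a back-of-the-envelope sketch of the cubic-scaling argument above (not from the paper; it assumes dynamic power dominates and that the workload parallelizes perfectly):

    P \propto f^3
    \;\Longrightarrow\;
    P_{\text{2 cores at } f/2} = 2\left(\tfrac{f}{2}\right)^3 = \tfrac{f^3}{4},
    \qquad
    \text{throughput} = 2 \cdot \tfrac{f}{2} = f .

So performance is roughly preserved while power drops by a factor of four, which is exactly the trade-off multicore designs exploit.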


IEEE Transactions on Parallel and Distributed Systems | 2008

Solving Systems of Linear Equations on the CELL Processor Using Cholesky Factorization

Jakub Kurzak; Alfredo Buttari; Jack J. Dongarra

The Sony/Toshiba/IBM (STI) CELL processor introduces pioneering solutions in processor architecture. At the same time it presents new challenges for the development of numerical algorithms. One is effective exploitation of the differential between the speed of single and double precision arithmetic; the other is efficient parallelization across the short-vector SIMD cores. The first challenge is addressed by utilizing the well-known technique of iterative refinement for the solution of a dense symmetric positive definite system of linear equations, resulting in a mixed-precision algorithm which delivers double precision accuracy while performing the bulk of the work in single precision. The main contribution of this paper lies in addressing the second challenge by successful thread-level parallelization, exploiting fine-grained task granularity and lightweight decentralized synchronization. The implementation of the computationally intensive sections gets within 90 percent of peak floating point performance, while the implementation of the memory intensive sections reaches within 90 percent of peak memory bandwidth. On a single CELL processor, the algorithm achieves over 170 Gflop/s when solving a symmetric positive definite system of linear equations in single precision and over 150 Gflop/s when delivering the result in double precision accuracy.
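
For illustration, here is a minimal sketch of the mixed-precision iterative-refinement scheme described above, written with NumPy/SciPy for a conventional CPU; the function name and tolerances are ours, and the paper's actual implementation is hand-tuned code for the Cell SPEs.

    import numpy as np
    from scipy.linalg import cho_factor, cho_solve

    def spd_solve_mixed(A, b, tol=1e-12, max_iter=30):
        """Solve A x = b for symmetric positive definite A: factor in single
        precision, then refine the solution in double precision."""
        A32 = A.astype(np.float32)
        factor = cho_factor(A32)                    # O(n^3) Cholesky work in single precision
        x = cho_solve(factor, b.astype(np.float32)).astype(np.float64)
        for _ in range(max_iter):
            r = b - A @ x                           # residual in double precision, O(n^2)
            if np.linalg.norm(r) <= tol * np.linalg.norm(b):
                break
            d = cho_solve(factor, r.astype(np.float32)).astype(np.float64)
            x += d                                  # correction reuses the cheap factorization
        return x

For well-conditioned systems the loop converges in a handful of iterations, so nearly all of the O(n^3) work runs at single-precision speed while the answer retains double-precision accuracy.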


IEEE International Conference on High Performance Computing, Data, and Analytics | 2007

Mixed Precision Iterative Refinement Techniques for the Solution of Dense Linear Systems

Alfredo Buttari; Jack J. Dongarra; Julie Langou; Julien Langou; Piotr Luszczek; Jakub Kurzak

By using a combination of 32-bit and 64-bit floating point arithmetic, the performance of many dense and sparse linear algebra algorithms can be significantly enhanced while maintaining the 64-bit accuracy of the resulting solution. The approach presented here can apply not only to conventional processors but also to exotic technologies such as Field Programmable Gate Arrays (FPGA), Graphical Processing Units (GPU), and the Cell BE processor. Results on modern processor architectures and the Cell BE are presented.


Computer Physics Communications | 2009

Accelerating scientific computations with mixed precision algorithms

Marc Baboulin; Alfredo Buttari; Jack J. Dongarra; Jakub Kurzak; Julie Langou; Julien Langou; Piotr Luszczek; Stanimire Tomov

On modern architectures, 32-bit floating point operations often run at least twice as fast as 64-bit operations. By using a combination of 32-bit and 64-bit floating point arithmetic, the performance of many dense and sparse linear algebra algorithms can be significantly enhanced while maintaining the 64-bit accuracy of the resulting solution. The approach presented here can apply not only to conventional processors but also to other technologies such as Field Programmable Gate Arrays (FPGA), Graphical Processing Units (GPU), and the STI Cell BE processor. Results on modern processor architectures and the STI Cell BE are presented.
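
One quick way to observe this precision gap on a given machine is to time the same matrix product in both precisions; the sketch below is illustrative only, and the measured ratio depends on the BLAS backend and the hardware.

    import time
    import numpy as np

    n = 4096
    A64 = np.random.rand(n, n)
    B64 = np.random.rand(n, n)
    A32, B32 = A64.astype(np.float32), B64.astype(np.float32)

    def best_time(f, repeats=3):
        """Return the best wall-clock time of several runs of f()."""
        best = float("inf")
        for _ in range(repeats):
            t0 = time.perf_counter()
            f()
            best = min(best, time.perf_counter() - t0)
        return best

    t64 = best_time(lambda: A64 @ B64)    # double-precision GEMM
    t32 = best_time(lambda: A32 @ B32)    # single-precision GEMM
    print(f"float64: {t64:.3f} s   float32: {t32:.3f} s   ratio: {t64 / t32:.2f}x")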


IEEE International Symposium on Parallel & Distributed Processing, Workshops and PhD Forum | 2011

Flexible Development of Dense Linear Algebra Algorithms on Massively Parallel Architectures with DPLASMA

George Bosilca; Aurelien Bouteiller; Anthony Danalis; Mathieu Faverge; Azzam Haidar; Thomas Herault; Jakub Kurzak; Julien Langou; Pierre Lemarinier; Hatem Ltaief; Piotr Luszczek; Asim YarKhan; Jack J. Dongarra

We present a method for developing dense linear algebra algorithms that seamlessly scales to thousands of cores. This is achieved with our project DPLASMA (Distributed PLASMA), which uses a novel generic distributed Directed Acyclic Graph Engine (DAGuE). The engine has been designed for high performance computing, and it enables the scaling of tile algorithms, originating in PLASMA, to large distributed memory systems. The underlying DAGuE framework has many appealing features when considering distributed-memory platforms with heterogeneous multicore nodes: a DAG representation that is independent of the problem size, automatic extraction of the communication from the dependencies, overlapping of communication and computation, task prioritization, and architecture-aware scheduling and management of tasks. The originality of this engine lies in its capacity to translate a sequential code with nested loops into a concise and synthetic format which can then be interpreted and executed in a distributed environment. We examine three common dense linear algebra algorithms from PLASMA (Parallel Linear Algebra for Scalable Multi-core Architectures), namely the Cholesky, LU, and QR factorizations, to investigate their data-driven expression and execution in a distributed system. We demonstrate through experimental results on the Cray XT5 Kraken system that our DAG-based approach has the potential to achieve a sizable fraction of peak performance, which is characteristic of state-of-the-art distributed numerical software on current and emerging architectures.
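
To make the DAG-of-tasks idea concrete, here is a small, hypothetical Python sketch that records a few tile tasks of a tiled Cholesky factorization together with their dependencies, then releases each task as soon as its inputs are complete. It is not the DAGuE/DPLASMA API (which is distributed and expresses the DAG symbolically); it only illustrates the data-driven scheduling principle.

    import concurrent.futures as cf

    # A hand-picked subset of tile tasks from a small tiled Cholesky,
    # mapped to the tasks they depend on (names are illustrative).
    deps = {
        "POTRF(0)":  [],
        "TRSM(1,0)": ["POTRF(0)"],
        "TRSM(2,0)": ["POTRF(0)"],
        "SYRK(1)":   ["TRSM(1,0)"],
        "GEMM(2,1)": ["TRSM(1,0)", "TRSM(2,0)"],
        "POTRF(1)":  ["SYRK(1)"],
    }
    children = {t: [] for t in deps}
    for task, ds in deps.items():
        for d in ds:
            children[d].append(task)
    remaining = {t: len(ds) for t, ds in deps.items()}

    def run_kernel(name):
        print("executing", name)           # stand-in for the real tile kernel
        return name

    with cf.ThreadPoolExecutor(max_workers=4) as pool:
        ready = [t for t, ds in deps.items() if not ds]
        in_flight = {}
        finished = 0
        while finished < len(deps):
            for t in ready:                 # submit every task whose inputs are complete
                in_flight[pool.submit(run_kernel, t)] = t
            ready = []
            done, _ = cf.wait(list(in_flight), return_when=cf.FIRST_COMPLETED)
            for fut in done:
                name = in_flight.pop(fut)
                finished += 1
                for child in children[name]:
                    remaining[child] -= 1   # one dependency of 'child' just completed
                    if remaining[child] == 0:
                        ready.append(child)

Scaling this principle to distributed memory is where DAGuE's problem-size-independent DAG representation and automatic communication extraction come in; the toy above enumerates the DAG explicitly, which would not scale.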


ACM Transactions on Mathematical Software | 2008

Using Mixed Precision for Sparse Matrix Computations to Enhance the Performance while Achieving 64-bit Accuracy

Alfredo Buttari; Jack J. Dongarra; Jakub Kurzak; Piotr Luszczek; Stanimire Tomov

By using a combination of 32-bit and 64-bit floating point arithmetic, the performance of many sparse linear algebra algorithms can be significantly enhanced while maintaining the 64-bit accuracy of the resulting solution. These ideas can be applied to sparse multifrontal and supernodal direct techniques and sparse iterative techniques such as Krylov subspace methods. The approach presented here can apply not only to conventional processors but also to exotic technologies such as Field Programmable Gate Arrays (FPGA), Graphical Processing Units (GPU), and the Cell BE processor.
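
A sparse analogue of the same idea can be sketched with SciPy's SuperLU wrapper, assuming it is built with single-precision support: factor the sparse matrix in 32-bit arithmetic and refine the solution with 64-bit residuals. The function name and tolerances here are illustrative, not from the paper.

    import numpy as np
    import scipy.sparse as sp
    from scipy.sparse.linalg import splu

    def sparse_solve_mixed(A, b, tol=1e-12, max_iter=30):
        """Sparse direct solve: single-precision LU factorization,
        double-precision iterative refinement."""
        lu32 = splu(sp.csc_matrix(A, dtype=np.float32))     # 32-bit sparse factorization
        x = lu32.solve(b.astype(np.float32)).astype(np.float64)
        for _ in range(max_iter):
            r = b - A @ x                                    # 64-bit residual
            if np.linalg.norm(r) <= tol * np.linalg.norm(b):
                break
            x += lu32.solve(r.astype(np.float32)).astype(np.float64)
        return x

The same outer refinement loop wraps equally well around an iterative inner solver, such as a Krylov method run in single precision.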


Conference on High Performance Computing (Supercomputing) | 2006

Exploiting the performance of 32 bit floating point arithmetic in obtaining 64 bit accuracy (revisiting iterative refinement for linear systems)

Julie Langou; Julien Langou; Piotr Luszczek; Jakub Kurzak; Alfredo Buttari; Jack J. Dongarra

Recent versions of microprocessors exhibit performance characteristics for 32-bit floating point arithmetic (single precision) that are substantially higher than for 64-bit floating point arithmetic (double precision). Examples include Intel's Pentium IV and M processors, AMD's Opteron architectures, and IBM's Cell Broadband Engine processor. When working in single precision, floating point operations can be performed up to two times faster on the Pentium and up to ten times faster on the Cell than in double precision. The performance enhancements in these architectures are derived by accessing extensions to the basic architecture, such as SSE2 in the case of the Pentium and the vector functions on the IBM Cell. The motivation for this paper is to exploit single precision operations whenever possible and resort to double precision at critical stages, while attempting to provide the full double precision results. The results described here are fairly general and can be applied to various problems in linear algebra, such as solving large sparse systems using direct or iterative methods, and some eigenvalue problems. There are limitations to the success of this process, such as when the conditioning of the problem exceeds the reciprocal of the accuracy of the single precision computations; in that case the double precision algorithm should be used.
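
The refinement loop and its limitation can be written compactly as follows (a standard formulation consistent with the abstract, not quoted from the paper):

    r_k = b - A x_k \ \text{(double precision)}, \qquad
    A d_k \approx r_k \ \text{(solved with the single-precision factorization)}, \qquad
    x_{k+1} = x_k + d_k,

    \text{which converges to double-precision accuracy roughly when } \kappa(A)\,\varepsilon_{32} \lesssim 1.

Here \varepsilon_{32} is the single-precision unit roundoff, so the condition matches the limitation stated above.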


Concurrency and Computation: Practice and Experience | 2007

Implementation of mixed precision in solving systems of linear equations on the Cell processor

Jakub Kurzak; Jack J. Dongarra

This paper describes the design concepts behind implementations of mixed-precision linear algebra routines targeted for the Cell processor. It describes in detail the implementation of code to solve a linear system of equations using Gaussian elimination in single precision with iterative refinement of the solution to full double-precision accuracy. By utilizing this approach the algorithm achieves close to an order of magnitude higher performance on the Cell processor than the performance offered by the standard double-precision algorithm. The code is effectively an implementation of the high-performance LINPACK benchmark, as it meets all of the requirements concerning the problem being solved and the numerical properties of the solution.


IEEE Transactions on Parallel and Distributed Systems | 2012

Autotuning GEMM Kernels for the Fermi GPU

Jakub Kurzak; Stanimire Tomov; Jack J. Dongarra

In recent years, the use of graphics chips has been recognized as a viable way of accelerating scientific and engineering applications, even more so since the introduction of the Fermi architecture by NVIDIA, with features essential to numerical computing, such as fast double precision arithmetic and memory protected with error correction codes. Being a crucial component of numerical software packages such as LAPACK and ScaLAPACK, the general dense matrix multiplication routine is one of the most important workloads to be implemented on these devices. This paper presents a methodology for producing matrix multiplication kernels tuned for a specific architecture through a canonical process of heuristic autotuning, based on generating multiple code variants and selecting the fastest ones through benchmarking. The key contribution of this work is the method for generating the search space and, specifically, for pruning it to a manageable size. Performance numbers match or exceed other available implementations.
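
The overall autotuning methodology (generate a parameter space of code variants, prune it with a simple feasibility model, benchmark the survivors) can be illustrated on the CPU with a toy blocked matrix multiply; the block sizes, cache budget, and problem size below are hypothetical stand-ins for the kernel parameters tuned on the Fermi GPU.

    import itertools
    import time
    import numpy as np

    def blocked_gemm(A, B, mb, nb, kb):
        """Naive blocked matrix multiply; the block sizes are the tunable parameters."""
        m, k = A.shape
        _, n = B.shape
        C = np.zeros((m, n), dtype=A.dtype)
        for i in range(0, m, mb):
            for j in range(0, n, nb):
                for p in range(0, k, kb):
                    C[i:i+mb, j:j+nb] += A[i:i+mb, p:p+kb] @ B[p:p+kb, j:j+nb]
        return C

    # 1. Generate the search space of variant parameters.
    candidates = list(itertools.product([32, 64, 128, 256], repeat=3))

    # 2. Prune it: keep variants whose working set fits a hypothetical 256 KiB cache budget.
    def fits(mb, nb, kb, budget=256 * 1024):
        return 8 * (mb * kb + kb * nb + mb * nb) <= budget

    pruned = [c for c in candidates if fits(*c)]

    # 3. Benchmark the surviving variants and keep the fastest one.
    n = 1024
    A, B = np.random.rand(n, n), np.random.rand(n, n)
    best_time, best_params = float("inf"), None
    for mb, nb, kb in pruned:
        t0 = time.perf_counter()
        blocked_gemm(A, B, mb, nb, kb)
        elapsed = time.perf_counter() - t0
        if elapsed < best_time:
            best_time, best_params = elapsed, (mb, nb, kb)
    print(f"fastest variant: {best_params} ({best_time:.3f} s)")

The paper's contribution lies precisely in step 2: a pruning strategy that shrinks the space of kernel variants to something that can be exhaustively benchmarked.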

Collaboration


Dive into Jakub Kurzak's collaborations.

Top Co-Authors

Mark Gates, University of Tennessee
Alfredo Buttari, Centre national de la recherche scientifique
Julien Langou, University of Colorado Denver
Azzam Haidar, University of Tennessee
Asim YarKhan, University of Tennessee