Jerzy Waśniewski
Technical University of Denmark
Publications
Featured research published by Jerzy Waśniewski.
ACM Transactions on Mathematical Software | 2001
Bjarne Stig Andersen; Jerzy Waśniewski; Fred G. Gustavson
A new compact way to store a symmetric or triangular matrix, called RPF for Recursive Packed Format, is fully described. Novel ways to transform RPF to and from standard packed format are included. A new algorithm, called RPC for Recursive Packed Cholesky, that operates on the RPF format is presented. Algorithm RPC is based on level-3 BLAS and requires variants of the algorithms TRSM and SYRK that work on RPF. We call these RP_TRSM and RP_SYRK and find that they do most of their work by calling GEMM. It follows that most of the execution time of RPC lies in GEMM. The advantage of this storage scheme compared to traditional packed and full storage is demonstrated. First, the RPF storage format uses the minimal amount of storage for the symmetric or triangular matrix. Second, RPC gives a level-3 implementation of Cholesky factorization, whereas standard packed implementations are only level 2. Hence, the performance of our RPC implementation is decidedly superior. Third, unlike fixed-block-size algorithms, RPC requires no block-size tuning parameter. We present performance measurements on several current architectures that demonstrate improvements over the traditional packed routines. SMP parallel computations on an IBM SMP computer are also reported. The graphs in Section 7 show that the RPC algorithms are faster than the traditional packed algorithms by a factor of 1.6 to 7.4 for order around 1000 and by a factor of 1.9 to 10.3 for order around 3000. For some architectures, the RPC performance results are almost the same as, or even better than, those of the traditional full-storage algorithms.
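To make the level-3 structure concrete, here is a minimal sketch of the recursive splitting that RPC-style algorithms use, written in NumPy/SciPy rather than the authors' Fortran and operating on full storage for readability (RPF would pack the two triangles and the square block instead): after the leading block is factored, the remaining work is one triangular solve (the TRSM role) and one rank-k update (the SYRK/GEMM role).

```python
# A minimal sketch (NumPy/SciPy, not the authors' Fortran RPF code) of the
# recursive splitting used by RPC.  For clarity the matrix is held in full
# storage; RPF instead packs the two triangles and the square block.
import numpy as np
from scipy.linalg import solve_triangular   # plays the role of TRSM

def recursive_cholesky(A, blk=64):
    """In-place lower Cholesky of a symmetric positive-definite matrix A."""
    n = A.shape[0]
    if n <= blk:                                   # small case: unblocked kernel
        A[:] = np.linalg.cholesky(A)
        return A
    k = n // 2
    recursive_cholesky(A[:k, :k], blk)             # factor A11 -> L11
    # TRSM-like step: L21 = A21 * L11^{-T}
    A[k:, :k] = solve_triangular(A[:k, :k], A[k:, :k].T, lower=True).T
    # SYRK/GEMM-like rank-k update: A22 <- A22 - L21 * L21^T
    A[k:, k:] -= A[k:, :k] @ A[k:, :k].T
    recursive_cholesky(A[k:, k:], blk)             # factor the updated A22
    return A

# quick check against NumPy's Cholesky
rng = np.random.default_rng(0)
M = rng.standard_normal((300, 300))
A = M @ M.T + 300 * np.eye(300)
L = np.tril(recursive_cholesky(A.copy()))
assert np.allclose(L, np.linalg.cholesky(A))
```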
ACM Transactions on Mathematical Software | 2005
Bjarne Stig Andersen; John A. Gunnels; Fred G. Gustavson; J. K. Reid; Jerzy Waśniewski
We consider the efficient implementation of the Cholesky solution of symmetric positive-definite dense linear systems of equations using packed storage. We take the same starting point as LINPACK and LAPACK, with the upper (or lower) triangular part of the matrix stored by columns. Following LINPACK and LAPACK, we overwrite the given matrix by its Cholesky factor. We consider the use of a hybrid format in which blocks of the matrix are held contiguously and compare this to the present LAPACK code. Code based on this format has the storage advantages of the present code but substantially outperforms it. Furthermore, it compares favorably with conventional full format (LAPACK) and with the recursive format of Andersen et al. [2001].
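As a rough illustration of what such a hybrid layout looks like, the sketch below (a toy NumPy version under simplifying assumptions: lower triangle, n a multiple of nb, blocks ordered by block column, not the authors' exact Fortran ordering) stores each off-diagonal block as a contiguous nb-by-nb tile and each diagonal block in packed form, so the total storage is exactly n(n+1)/2.

```python
# Toy illustration of a blocked hybrid layout: off-diagonal blocks are full
# contiguous nb x nb tiles, diagonal blocks are packed, and the total is
# exactly n(n+1)/2 numbers.  The authors' Fortran ordering may differ.
import numpy as np

def pack_block_hybrid(A, nb):
    n = A.shape[0]
    assert n % nb == 0, "toy version: n must be a multiple of nb"
    out = []
    for jb in range(0, n, nb):                 # block columns, left to right
        d = A[jb:jb + nb, jb:jb + nb]
        out.append(d[np.tril_indices(nb)])     # diagonal block: packed lower triangle
        for ib in range(jb + nb, n, nb):       # blocks below the diagonal
            out.append(A[ib:ib + nb, jb:jb + nb].ravel())  # full contiguous tile
    return np.concatenate(out)

n, nb = 12, 4
A = np.arange(n * n, dtype=float).reshape(n, n)
packed = pack_block_hybrid(A, nb)
assert packed.size == n * (n + 1) // 2         # same storage as packed format
```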
Parallel Computing | 1995
John Brown; Jerzy Waśniewski; Zahari Zlatev
The concentrations of air pollutants have in general been increasing steadily during the last three decades. Correctly gauging the impact of various sources of pollutants requires careful modelling of the complex physical processes associated with the chemistry and transport of air pollution. These models are computationally demanding and require today's fastest high-performance computers for practical implementations. The damaging effects are normally due to the combined action of several air pollutants, so a reliable mathematical model must treat all relevant air pollutants simultaneously. This requirement increases the size of the air pollution models. The discretization of models that contain many air pollutants leads to huge computational problems: systems of several hundred thousand (or even several million) equations arise, and these have to be treated numerically over many time-steps (typically several thousand). Such large computational problems can be treated successfully only on modern vector and/or parallel computers. However, access to a high-speed computer is not sufficient; one must also ensure that the great potential power of the computer is correctly exploited, which is often a rather difficult task. This paper discusses the effort required when the computer under consideration is a massively parallel machine and presents tests performed on several such computers.
ACM Transactions on Mathematical Software | 2010
Fred G. Gustavson; Jerzy Waśniewski; Jack J. Dongarra; Julien Langou
We describe a new data format for storing triangular, symmetric, and Hermitian matrices called Rectangular Full Packed Format (RFPF). The standard two-dimensional arrays of Fortran and C (also known as full format) that are used to represent triangular and symmetric matrices waste nearly half of the storage space but provide high performance via the use of Level 3 BLAS. Standard packed-format arrays fully utilize storage (array space) but provide low performance, as there is no Level 3 packed BLAS. We combine the good features of packed and full storage using RFPF to obtain high performance via Level 3 BLAS, since RFPF is a standard full-format representation. Also, RFPF requires exactly the same minimal storage as the packed format. Each LAPACK full and/or packed triangular, symmetric, and Hermitian routine becomes a single new RFPF routine based on eight possible data layouts of RFPF. This new RFPF routine usually consists of two calls to the corresponding LAPACK full-format routine and two calls to Level 3 BLAS routines. This means no new software is required. As examples, we present LAPACK routines for Cholesky factorization, Cholesky solution, and Cholesky inverse computation in RFPF to illustrate this new work and to describe its performance on several commonly used computer platforms. Performance of LAPACK full routines using RFPF versus LAPACK full routines using the standard format is about the same for both serial and SMP parallel processing, while using half the storage. Using vendor LAPACK full routines with RFPF is from roughly as fast to 43 times faster in serial, and up to 97 times faster in SMP parallel runs, than using vendor and/or reference packed routines.
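The repacking itself is straightforward. The sketch below illustrates one of the eight RFPF variants (lower triangle, odd n, no transposition) in NumPy; LAPACK's conversion routines (e.g. DTRTTF/DTFTTR) handle all variants and the even-n case. The leading square block of the rectangle holds the lower triangle of A11 together with A22 stored transposed, the rows below it hold A21, so the rectangle contains exactly n(n+1)/2 entries while remaining an ordinary full-format array.

```python
# One RFP-style repacking (assumptions: lower triangle, odd n, the
# "normal" variant).  The rectangle is a plain full-format array of
# exactly n(n+1)/2 entries.
import numpy as np

def lower_to_rfp(A):
    """Pack the lower triangle of an odd-order matrix A into an
    n x (n+1)//2 rectangle."""
    n = A.shape[0]
    assert n % 2 == 1, "this sketch covers odd n only"
    n1, n2 = (n + 1) // 2, n // 2
    AR = np.zeros((n, n1))
    for j in range(n1):                        # A11: lower triangle, in place
        AR[j:n1, j] = A[j:n1, j]
    AR[n1:, :] = A[n1:, :n1]                   # A21: the rectangular block
    for i in range(n2):                        # A22, transposed, fills the
        for j in range(i, n2):                 # otherwise unused upper corner
            AR[i, j + 1] = A[n1 + j, n1 + i]
    return AR

n = 7
A = np.tril(np.arange(1, n * n + 1, dtype=float).reshape(n, n))
AR = lower_to_rfp(A)
assert AR.size == n * (n + 1) // 2             # minimal (packed) storage
```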
Parallel Computing | 2006
Fred G. Gustavson; Jerzy Waśniewski
We describe a new data format for storing triangular and symmetric matrices called RFP (Rectangular Full Packed). The standard two-dimensional arrays of Fortran and C (also known as full format) that are used to store triangular and symmetric matrices waste nearly half the storage space but provide high performance via the use of level 3 BLAS. Standard packed-format arrays fully utilize storage (array space) but provide low performance, as there are no level 3 packed BLAS. We combine the good features of packed and full storage using the RFP format to obtain high performance using L3 (level 3) BLAS, since RFP is a full-format representation. Also, the RFP format requires exactly the same minimal storage as packed format. Each full and/or packed symmetric/triangular routine becomes a single new RFP routine. We present LAPACK routines for Cholesky factorization, inverse, and solution computation in RFP format to illustrate this new work and to describe its performance on the IBM, Itanium, NEC, and SUN platforms. Performance of RFP versus LAPACK full routines is about the same for both serial and SMP parallel processing, while using half the storage. Performance ranges from roughly as fast to 33 times faster in serial, and up to 100 times faster in SMP parallel runs, than the LAPACK packed routines. Existing LAPACK routines and vendor LAPACK routines were used in the serial and the SMP parallel study, respectively. In both studies vendor L3 BLAS were used.
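For contrast, the standard packed format mentioned above stores the lower triangle column by column in a one-dimensional array, so element (i, j) with i >= j sits at index i + j*(2n - j - 1)/2 (0-based). Because every column has a different length, there is no contiguous square submatrix to hand to a level-3 kernel such as GEMM, which is why no level-3 packed BLAS exists. A tiny check of this index map (an illustration, not library code):

```python
# Column-major packed-lower index map: element (i, j), i >= j, of an n x n
# lower triangle lives at i + j*(2n - j - 1)/2 in the packed array.
import numpy as np

n = 6
A = np.tril(np.arange(1, n * n + 1, dtype=float).reshape(n, n))
ap = np.concatenate([A[j:, j] for j in range(n)])     # packed by columns

def packed_index(i, j, n):
    return i + j * (2 * n - j - 1) // 2

for j in range(n):
    for i in range(j, n):
        assert ap[packed_index(i, j, n)] == A[i, j]
```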
ACM Transactions on Mathematical Software | 2007
Fred G. Gustavson; J. K. Reid; Jerzy Waśniewski
We present subroutines for the Cholesky factorization of a positive-definite symmetric matrix and for solving corresponding sets of linear equations. They exploit cache memory by using the block hybrid format proposed by the authors in a companion article. The matrix is packed into n(n + 1)/2 real variables, and the speed is usually better than that of the LAPACK algorithm that uses full storage (n^2 variables). Included are subroutines for rearranging a matrix whose upper- or lower-triangular part is packed by columns into this format, and for the inverse rearrangement. Also included is a kernel subroutine used for the Cholesky factorization of the diagonal blocks, since it is suitable for any positive-definite symmetric matrix small enough to be held in cache. We provide a comprehensive test program and simple example programs.
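The factorization itself follows the usual right-looking blocked pattern: a small kernel factors each cache-resident diagonal block, and everything else is TRSM- and SYRK/GEMM-like block work. The sketch below shows that structure in NumPy on full storage; it is an illustration of the blocked pattern, not the authors' packed-format Fortran code.

```python
# Right-looking blocked Cholesky (illustration of the blocked structure,
# on full storage; the paper's routines work on the packed hybrid layout).
import numpy as np
from scipy.linalg import solve_triangular

def blocked_cholesky(A, nb=4):
    """Lower Cholesky of SPD A; toy restriction: n is a multiple of nb."""
    n = A.shape[0]
    A = A.copy()                                   # lower part will hold L
    for k in range(0, n, nb):
        d = slice(k, k + nb)
        # kernel: factor the cache-resident nb x nb diagonal block
        A[d, d] = np.linalg.cholesky(A[d, d])
        if k + nb < n:
            r = slice(k + nb, n)
            # TRSM: block column below the diagonal block
            A[r, d] = solve_triangular(A[d, d], A[r, d].T, lower=True).T
            # SYRK/GEMM: rank-nb update of the trailing submatrix
            A[r, r] -= A[r, d] @ A[r, d].T
    return np.tril(A)

rng = np.random.default_rng(1)
M = rng.standard_normal((16, 16))
S = M @ M.T + 16 * np.eye(16)
assert np.allclose(blocked_cholesky(S, nb=4), np.linalg.cholesky(S))
```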
Archive | 1996
Jack J. Dongarra; Kaj Madsen; Jerzy Waśniewski
- A high performance matrix multiplication algorithm for MPPs.
- Iterative moment method for electromagnetic transients in grounding systems on CRAY T3D.
- Analysis of crystalline solids by means of a parallel FEM method.
- Parallelization strategies for Tree N-body codes.
- Numerical solution of stochastic differential equations on transputer network.
- Development of a stencil compiler for one-dimensional convolution operators on the CM-5.
- Automatic parallelization of the AVL FIRE benchmark for a distributed-memory system.
- 2-D cellular automata and short range molecular dynamics programs for simulations on networked workstations and parallel computers.
- Pablo-based performance monitoring tool for PVM applications.
- Linear algebra computation on parallel machines.
- A neural classifier for radar images.
- ScaLAPACK: A portable linear algebra library for distributed memory computers - Design issues and performance.
- A proposal for a set of parallel basic linear algebra subprograms.
- Parallel implementation of a Lagrangian stochastic particle model of turbulent dispersion in fluids.
- Reduction of a regular matrix pair (A, B) to block Hessenberg-triangular form.
- Parallelization of algorithms for neural networks.
- Paradigms for the parallelization of Branch&Bound algorithms.
- Three-dimensional version of the Danish Eulerian Model.
- A proposal for a Fortran 90 interface for LAPACK.
- ScaLAPACK tutorial.
- Highly parallel concentrated heterogeneous computing.
- Adaptive polynomial preconditioning for the conjugate gradient algorithm.
- The IBM parallel engineering and scientific subroutine library.
- Some preliminary experiences with sparse BLAS in parallel iterative solvers.
- Load balancing in a Network Flow Optimization code.
- User-level VSM optimization and its application.
- Benchmarking the cache memory effect.
- Efficient Jacobi algorithms on multicomputers.
- Front tracking: A parallelized approach for internal boundaries and interfaces.
- Program generation techniques for the development and maintenance of numerical weather forecast Grid models.
- High performance computational chemistry: NWChem and fully distributed parallel applications.
- Parallel ab-initio molecular dynamics.
- Dynamic domain decomposition and load balancing for parallel simulations of long-chained molecules.
- Concurrency in feature analysis.
- A parallel iterative solver for almost block-diagonal linear systems.
- Distributed general matrix multiply and add for a 2D mesh processor network.
- Distributed and parallel computing of short-range molecular dynamics.
- Lattice field theory in a parallel environment.
- Parallel time independent quantum calculations of atom diatom reactivity.
- Parallel oil reservoir simulation.
- Formal specification of multicomputers.
- Multi-million particle molecular dynamics on MPPs.
- Wave propagation in urban microcells: a massively parallel approach using the TLM method.
- The NAG Numerical PVM Library.
- Cellular automata modeling of snow transport by wind.
- Parallel algorithm for mapping of parallel programs into pyramidal multiprocessor.
- Data-parallel molecular dynamics with neighbor-lists.
- Visualizing astrophysical 3D MHD turbulence.
- A parallel sparse QR-factorization algorithm.
- Decomposing linear programs for parallel solution.
- A parallel computation of the Navier-Stokes equation for the simulation of free surface flows with the volume of fluid method.
- Improving the performance of parallel triangularization of a sparse matrix using a reconfigurable multicomputer.
- Comparison of two image-space subdivision algorithms for Direct Volume Rendering on distributed-memory multicomputers.
- Communication harnesses for transputer systems with tree structure and cube structure.
- A thorough investigation of the projector quantum Monte Carlo method using MPP technologies.
- Distributed simulation of a set of elastic macro objects.
- Parallelization of ab initio molecular dynamics method.
- Parallel computations with large atmospheric models.
ACM Transactions on Mathematical Software | 2013
Fred G. Gustavson; Jerzy Waśniewski; Jack J. Dongarra; José R. Herrero; Julien Langou
Four routines called DPOTF3i, i = a,b,c,d, are presented. The DPOTF3i routines are a novel type of level-3 BLAS for use by BPF (Blocked Packed Format) Cholesky factorization and by LAPACK routine DPOTRF. The performance of the DPOTF3i routines is still increasing when the performance of the Level-2 LAPACK routine DPOTF2 starts decreasing. This is our main result, and it implies, because a larger block size nb can be used, that DGEMM, DSYRK, and DTRSM performance also increases. The four DPOTF3i routines use simple register blocking. Different platforms have different numbers of registers, so our four routines have different register-blocking sizes. BPF is introduced. LAPACK routines for POTRF and PPTRF using BPF instead of full and packed format are shown to be trivial modifications of the LAPACK POTRF source code. We call these codes BPTRF. There are two variants of BPF: lower and upper. Upper BPF is "identical" to Square Block Packed Format (SBPF). "LAPACK" implementations on multicore processors use SBPF. Lower BPF is less efficient than upper BPF. Vector in-place transposition converts lower BPF to upper BPF very efficiently. Corroborating performance results for DPOTF3i versus DPOTF2 on a variety of common platforms are given for n ≈ nb, as well as results for large n comparing DBPTRF versus DPOTRF.
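A back-of-the-envelope flop count (a rough model, not figures from the article) shows why a kernel that stays fast at larger nb matters: with block size nb the diagonal-block kernel handles only about a (nb/n)^2 fraction of the roughly n^3/3 Cholesky flops, so letting nb grow pushes almost all of the work into DGEMM, DSYRK, and DTRSM, which run more efficiently on larger blocks.

```python
# Rough flop split of a blocked Cholesky of order n with block size nb
# (back-of-the-envelope model only, not measurements from the article).
def flop_split(n, nb):
    m = n // nb
    kernel = m * nb**3 / 3                                      # diagonal-block factorizations
    trsm = sum((n - (j + 1) * nb) * nb**2 for j in range(m))    # panel solves
    total = n**3 / 3
    syrk_gemm = total - kernel - trsm                           # trailing updates
    return kernel / total, trsm / total, syrk_gemm / total

for nb in (40, 100, 200):
    k, t, s = flop_split(2000, nb)
    print(f"nb={nb:4d}: kernel {k:.1%}, TRSM {t:.1%}, SYRK/GEMM {s:.1%}")
```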
Archive | 1994
Jack J. Dongarra; Jerzy Waśniewski
Computationally complex problems cannot be solved on a single computer; they need to be run in an environment of 100 to 1000 processors or more. Designing algorithms to execute efficiently in such a parallel environment requires a different mindset than designing algorithms for single-processor computers. This course is designed to give students the parallel-computation perspective using the MPI framework.
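For a flavor of the programming model, here is a minimal MPI program of the kind such a course might start from (an illustrative sketch using mpi4py, not actual course material): each rank sums its slice of a vector and a reduction on rank 0 combines the partial sums.

```python
# Minimal MPI sketch (mpi4py; run with e.g. `mpiexec -n 4 python sum.py`):
# every rank sums its own slice of the data and rank 0 collects the total.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

n = 1_000_000
lo = rank * n // size                      # this rank's slice of 0..n-1
hi = (rank + 1) * n // size
local = np.arange(lo, hi, dtype=np.float64).sum()

total = comm.reduce(local, op=MPI.SUM, root=0)   # combine partial sums
if rank == 0:
    print("sum =", total, "expected", n * (n - 1) / 2)
```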
Parallel Processing and Applied Mathematics | 2011
Jerzy Waśniewski
This document outlines my 57-year career in computational mathematics, a career that took me from Poland to Canada and finally to Denmark. It spans, of course, a period in which both hardware and software developed enormously. Along the way I was fortunate to be faced with fascinating technical challenges and privileged to be able to share them with inspiring colleagues. From the beginning, my work was to a great extent concerned, directly or indirectly, with computational linear algebra, an interest I maintain even today.